Tag Archives: scholarsphere

Long-term Retention Policies in ScholarSphere and Other Institutional Repositories

For this blog post, I’ve asked Patricia Gael, graduate assistant in Publishing and Curation Services, to write about her recent survey of retention policies in institutional repositories – a benchmarking exercise toward understanding what should inform our own retention policies for ScholarSphere. Comments and questions are welcome!
~ Patricia Hswe, co-lead, Publishing and Curation Services

I’m happy to have been invited to write a guest post for the Content Stewardship blog! As a graduate assistant in Publishing and Curation Services and a doctoral candidate amassing large amounts of data while writing my dissertation, I’ve been watching ScholarSphere’s development with interest. My current use of ScholarSphere could be called “experimental”: I’ve deposited a few trial items, but I haven’t yet uploaded any of my research. As I consider my future use of the repository, I’ve been thinking about the longevity of my data. This post is written from my viewpoint as a prospective ScholarSphere user. 

One of the questions Patricia Hswe and Linda Friend have been receiving at their ScholarSphere demonstrations is, “how long will my content remain accessible in ScholarSphere?” The short answer, and the one they’ve been giving, is, “as long as you leave it there”; the ScholarSphere team is committed to archiving and preserving deposited data, and, unless a user chooses to delete his or her own content, all items will remain safely in the repository. For all current, practical purposes this is true. But a longer and more detailed response to the “how long?” question would need to include phrases like “for the foreseeable future” and “as far as we know.” 

ScholarSphere is a new service still in development, and Penn State is still working out what it will look like in the future. We are not alone in facing the challenges of long-term data archiving; many other universities will be making similar decisions about the retention of the data in their repositories. However, a recent quick survey of repository preservation policies suggests that retention strategies remain indefinite and unstandardized.
Many preservation policies focus on the types of files uploaded and the likelihood that those files will be usable in the future. The University of Michigan’s DeepBlue, for example, provides “three levels of preservation support for specific file formats” that are determined by “a set of evaluation criteria including prevalence of the file format in the marketplace, whether the format is proprietary, the availability of tools for emulation or migration and the availability of local resources to take specific preservation actions.” These filetype-based preservation policies are helpful. Users should be aware of the technical reasons their data might not remain usable so that they can decide whether to adjust the formats in which they’re storing their data. But even data stored in the most secure formats cannot be guaranteed forever. 
Most institutional repositories are not clear about just how long data stored in their repositories will remain usable and searchable. Many use phrases like “persistent access” (the University of Pennsylvania’s ScholarlyCommons); “long-term preservation” (University of California’s e-Scholarship, Texas A&M’s Digital Repository, and the University of Michigan’s DeepBlue); “continuing access” (the University of Kansas’s KU ScholarWorks); or they state that the information will be held “indefinitely” (as at the University of Maryland’s DRUM and the University of Florida’s Institutional Repository). Penn State’s policy is in line with those offered elsewhere: we assert that “Penn State Libraries and Information Technology Services are committed to providing long-term access to all material submitted to ScholarSphere.” One can see why users might be concerned about what statements like these really promise. Very few repositories discuss a timeline for storage (one exception is Purdue’s PURR, which allocates repository space based on the nature of the stored data and grant funding, with storage timelines from three years to ten years or the length of the grant).
Electronic storage is limited. For ScholarSphere, these limitations are not an immediate concern, but just as librarians need to cull their physical collections, we know that we might eventually need criteria for determining which items should be preservation priorities. An item’s popularity might be one measure of its utility. How many times has the file been downloaded? When was the most recent download? But the utility of a file’s content can be just as significant. How can we determine what is obsolete and what will continue to be useful? A dataset downloaded by one person who uses it to publish a new article might be as valuable as, or more valuable than, a document downloaded by one hundred people who only read it once. The availability of the data is also a concern. How can we know whether ScholarSphere content exists elsewhere? Removing an old file from ScholarSphere might wipe it from the Internet entirely, or it might eliminate just one of many instances.
For now we have no straightforward solutions to these long-term problems, but we will continue to analyze and clarify our policies as ScholarSphere evolves. We would love to hear any suggestions or questions you might have!

Digital Preservation and ScholarSphere

When I served as Institutional Repository Coordinator at Duke University, one frequently asked question I received was “What is an Institutional Repository?” My stock answer was that it was an access and discovery platform for Duke faculty and student scholarship as well as born-digital institutional records. The follow-up question almost always had to do with preservation of the content; that answer was usually a referral to a list of preferred formats for deposit.

As we head towards the launch of Penn State’s IR, ScholarSphere, these questions now loom large for us. My stock answer at Duke also applies for ScholarSphere as it will offer access and discovery for faculty and student scholarship. ScholarSphere is also built on a robust platform that allows for flexible preservation services. So what is the baseline for content preservation offered by ScholarSphere?

First, all content made available on ScholarSphere will have redundant backup. All files deposited will get a SHA-1 (Secure Hash Algorithm) checksum, which is essentially a digital “fingerprint” in the form of a string of characters that can be generated for any digital file. If the file changes in any way, that fingerprint will change, indicating the alteration. In addition, ScholarSphere uses FITS (File Information Tool Set) to identify, validate, and extract technical (and some descriptive) metadata from the file, identifying the file type, version, and other information that helps us manage the file. Regular fixity checks will be run against the files to check for changes, such as file corruption. Beyond this initial level of preserving the file for access and discovery, additional preservation services are in the planning stages.
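
For readers curious about what a checksum-based fixity check looks like in practice, here is a minimal sketch in Python. It is not ScholarSphere's actual code: the function names sha1_checksum and fixity_check are hypothetical, and the only assumption is that a checksum is recorded when a file is deposited and compared against a freshly computed one later.

```python
import hashlib


def sha1_checksum(path, chunk_size=65536):
    """Return the SHA-1 hex digest of the file at `path`.

    The file is read in chunks so that large deposits do not have
    to fit in memory all at once.
    """
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_check(path, recorded_checksum):
    """Return True if the file still matches the checksum recorded at deposit.

    `recorded_checksum` is assumed to be the hex digest stored when the
    file was first deposited (a hypothetical piece of stored metadata).
    """
    return sha1_checksum(path) == recorded_checksum
```

A scheduled job could call fixity_check for each stored file and flag any that return False for review; even a single changed byte produces a different digest, which is what makes the checksum useful as a fingerprint.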

What might these additional preservation services entail? Depending on the Library’s commitment to the files submitted, we may look at normalizing files into standard formats to facilitate migration as formats become obsolete; for example, migrating all Word files (such as .docx) to a format like PDF/A, the ISO-standardized version of Portable Document Format (PDF). A higher level of preservation would be to preserve both the source file and the normalized copy. For some scholarly works, such as certain types of data sets, preservation or emulation of the software used to create the files may also be needed to carry the content forward through time.

The main drivers for the adoption of additional preservation services such as these will be policy and resources. Each of the services listed above requires increasing amounts of resources (staff, expertise, and IT tools) to accomplish. We have policies that guide us in building and preserving analog collections, along with limited resources to implement those policies; the same is true of the digital content collected for ScholarSphere. Policy can also help creators make informed decisions with regard to technologies and formats used for their work, which could potentially reduce the amount of resources required and enhance the longevity of scholarly content. As ScholarSphere evolves, the Library will be prepared to suggest best practices with regard to different documentary types and file formats.



Launching ScholarSphere

The repository services project to which other posts have alluded now has a name: ScholarSphere.

Penn State ScholarSphere is a new research repository service offered by the University Libraries and Information Technology Services, enabling Penn State faculty, staff, and students to share their scholarly works such as research datasets, working papers, research reports, and image collections, to name a few examples. ScholarSphere will make these works more discoverable, accessible, usable, and thus broadly recognized and known. 

The ScholarSphere service will help researchers actively manage stored versions of their research and preserve it, ensuring its longevity over time for future generations of scholars to find, use, and build on. The preservation functions include scheduled and on-demand verifications of deposited works, characterization of files to mitigate future format obsolescence, regular file backups, and replication to disaster recovery sites.
The repository renders research works immediately citable via stable, short URLs, and metadata about research is immediately exportable to citation managers. ScholarSphere enables documentation and description of research data for optimal discovery and curation of data through their lifecycle of use and reuse.
Researchers will be able to share works stored in ScholarSphere with the Penn State community either by sharing directly with specified individuals or with established groups. Researchers will also be able to share each of their files at different access levels including read-only and edit modes, allowing full control over who can view and edit deposited works. 
A trusted institutional service, ScholarSphere has safeguards in place for keeping private research secure and unchanged over time, as researchers warrant, as well as for keeping access restricted to the individual researcher. 
ScholarSphere will be undergoing usability and accessibility testing throughout the summer for a beta release in September 2012. Stay tuned for more information about the ScholarSphere launch and about the technologies underlying ScholarSphere.