Long-term Retention Policies in ScholarSphere and Other Institutional Repositories

For this blog post, I’ve asked Patricia Gael, graduate assistant in Publishing and Curation Services, to write about her recent survey of retention policies in institutional repositories – a benchmarking exercise toward understanding what should inform our own retention policies for ScholarSphere. Comments and questions are welcome!
~ Patricia Hswe, co-lead, Publishing and Curation Services

I’m happy to have been invited to write a guest post for the Content Stewardship blog! As a graduate assistant in Publishing and Curation Services and a doctoral candidate amassing large amounts of data while writing my dissertation, I’ve been watching ScholarSphere’s development with interest. My current use of ScholarSphere could be called “experimental”: I’ve deposited a few trial items, but I haven’t yet uploaded any of my research. As I consider my future use of the repository, I’ve been thinking about the longevity of my data. This post is written from my viewpoint as a prospective ScholarSphere user. 

One of the questions Patricia Hswe and Linda Friend have been receiving at their ScholarSphere demonstrations is, “How long will my content remain accessible in ScholarSphere?” The short answer, and the one they’ve been giving, is, “As long as you leave it there”; the ScholarSphere team is committed to archiving and preserving deposited data, and, unless a user chooses to delete his or her own content, all items will remain safely in the repository. For all current, practical purposes this is true. But a longer and more detailed response to the “how long?” question would need to include phrases like “for the foreseeable future” and “as far as we know.”

ScholarSphere is a new service, still in development, and Penn State is still working out what it will look like in the future. We are not alone in facing the challenges of long-term data archiving; many other universities will be making similar decisions about the retention of the data in their repositories. However, a recent quick survey of repository preservation policies suggests that retention strategies remain indefinite and unstandardized.
Many preservation policies focus on the types of files uploaded and the likelihood that those files will be usable in the future. The University of Michigan’s DeepBlue, for example, provides “three levels of preservation support for specific file formats” that are determined by “a set of evaluation criteria including prevalence of the file format in the marketplace, whether the format is proprietary, the availability of tools for emulation or migration and the availability of local resources to take specific preservation actions.” These filetype-based preservation policies are helpful. Users should be aware of the technical reasons their data might not remain usable so that they can decide whether to adjust the formats in which they’re storing their data. But even data stored in the most secure formats cannot be guaranteed forever. 
Most institutional repositories are not clear about just how long data stored in their repositories will remain usable and searchable. Many use phrases like “persistent access” (the University of Pennsylvania’s ScholarlyCommons); “long-term preservation” (University of California’s e-Scholarship, Texas A&M’s Digital Repository, and the University of Michigan’s DeepBlue); “continuing access” (the University of Kansas’s KU ScholarWorks); or they state that the information will be held “indefinitely” (as at the University of Maryland’s DRUM and the University of Florida’s Institutional Repository). Penn State’s policy is in line with those offered elsewhere: we assert that “Penn State Libraries and Information Technology Services are committed to providing long-term access to all material submitted to ScholarSphere.” One can see why users might be concerned about what statements like these really promise. Very few repositories discuss a timeline for storage (one exception is Purdue’s PURR, which allocates repository space based on the nature of the stored data and grant funding, with storage timelines from three to ten years, or the length of the grant).
Electronic storage is limited. For ScholarSphere, these limitations are not an immediate concern, but just as librarians need to cull their physical collections, we know that we might eventually need criteria for determining which items should be preservation priorities. An item’s popularity might be one measure of its utility. How many times has the file been downloaded? When was the most recent download? But the utility of a file’s content can be just as significant. How can we determine what is obsolete and what will continue to be useful? A dataset downloaded by one person who uses it to publish a new article might be as valuable as, or more valuable than, a document downloaded by one hundred people who only read it once. The availability of the data is also a concern. How can we know whether ScholarSphere content exists elsewhere? Removing an old file from ScholarSphere might wipe it from the Internet entirely, or it might eliminate just one of many instances.
For now we have no straightforward solutions to these long-term problems, but we will continue to analyze and clarify our policies as ScholarSphere evolves. We would love to hear any suggestions or questions you might have!