Digital Preservation and ScholarSphere

When I served as Institutional Repository Coordinator at Duke University, one question I was frequently asked was “What is an Institutional Repository?” My stock answer was that it was an access and discovery platform for Duke faculty and student scholarship as well as born-digital institutional records. The follow-up question almost always had to do with preservation of the content; my answer to that was usually a referral to a list of preferred formats for deposit.

As we head towards the launch of Penn State’s IR, ScholarSphere, these questions now loom large for us. My stock answer at Duke also applies to ScholarSphere, which will offer access and discovery for faculty and student scholarship. ScholarSphere is also built on a robust platform that allows for flexible preservation services. So what is the baseline for content preservation offered by ScholarSphere?

First, all content made available on ScholarSphere will have redundant backup. Every file deposited will get a SHA-1 (Secure Hash Algorithm 1) checksum, which is essentially a digital “fingerprint”: a string of characters that can be generated for any digital file. If the file changes in any way, recomputing the checksum will produce a different value, indicating the alteration. In addition, ScholarSphere uses FITS (File Information Tool Set) to identify, validate, and extract technical (and some descriptive) metadata from the file, such as the file type, version, and other information that helps us manage the file. Regular fixity checks, which recompute each file's checksum and compare it with the value recorded at deposit, will be run to detect changes such as file corruption. Beyond this initial level of preserving the file for access and discovery, additional preservation services are in the planning stages.
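To make the idea of a checksum and a fixity check more concrete, here is a minimal sketch in Python. It is only an illustration, not ScholarSphere's actual implementation; the function names `sha1_checksum` and `fixity_ok` and the sample file name are hypothetical, and it assumes the checksum recorded at deposit is stored as a hexadecimal string.

```python
import hashlib
from pathlib import Path


def sha1_checksum(path, chunk_size=65536):
    """Return the SHA-1 checksum of a file as a hexadecimal string."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read in chunks so that large deposits need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path, recorded_checksum):
    """Hypothetical fixity check: recompute the checksum and compare it
    with the value recorded when the file was deposited."""
    return sha1_checksum(path) == recorded_checksum


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as workdir:
        deposit = Path(workdir) / "thesis.txt"  # illustrative file name
        deposit.write_bytes(b"original deposit")
        recorded = sha1_checksum(deposit)       # stored at deposit time

        print(fixity_ok(deposit, recorded))     # file unchanged

        deposit.write_bytes(b"original deposiT")  # simulate corruption
        print(fixity_ok(deposit, recorded))     # file altered
```

A repository would run a check like this on a schedule against every stored file and flag any mismatch so that the file can be restored from a backup copy.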

What might these additional preservation services entail? Depending on the Library’s commitment to the files submitted, we may look at normalizing files into standard formats to facilitate migration as formats become obsolete, for example converting all Word files (.doc and .docx) to a format like PDF/A, the ISO-standardized archival profile of the Portable Document Format (PDF). A higher level of preservation would be to preserve both the source file and the normalized copy. For some scholarly works, such as certain types of data sets, preservation or emulation of the software used to create the files may also be needed to carry the content forward through time.

The main drivers for the adoption of additional preservation services such as these will be policy and resources. Each of the services listed above requires increasing amounts of resources (staff, expertise, and IT tools) to accomplish. Just as we have policies that guide us in building and preserving analog collections, and limited resources with which to implement those policies, the same is true of the digital content collected for ScholarSphere. Policy can also help creators make informed decisions about the technologies and formats they use for their work, which could reduce the resources required and enhance the longevity of scholarly content. As ScholarSphere evolves, the Library will be prepared to suggest best practices for different documentary types and file formats.