ArchiveSphere FAQs

1. What are the main ways in which the architecture of ArchiveSphere will differ from that of ScholarSphere? 

In terms of system architecture, ArchiveSphere and ScholarSphere are identical, though they will live on different machines because of the extra level of protection we need for the data in ArchiveSphere. Both are Rails web applications that speak to Fedora (an asset management system with preservation functions) and Solr (a search index) via a suite of community-developed Ruby components. That is, they’re both Hydra applications.

In terms of software architecture, there’s a lot of overlap.  Both are based on a gem called Sufia, which was originally developed as the “guts” of ScholarSphere, and is now used by nearly a dozen institutions to power their own repository applications.  And both use core Hydra components such as hydra-head, active-fedora, and blacklight.  We have also been working on two new community components called hydra-collections and hydra-derivatives, which will again be used across both of our spheres and within the Hydra community.

2. What metadata schemas will be used with ArchiveSphere?

We’re looking at PREMIS implementation for preservation metadata — and, in particular, the RDF-based version — which is an exciting challenge to consider. Descriptive metadata needs a little more fleshing out. At the object level in ScholarSphere we primarily use the RDF-based Dublin Core terms vocabulary (with a couple of other elements thrown in where DC had gaps), but ArchiveSphere is an archival repository, and one of the aspects that sets it apart from our work with ScholarSphere is the need to consider aggregate metadata — the description we assign to collections, series, boxes, etc., all of which apply to the individual objects as well.

Obviously, EAD is the standard we have to work with here, but we want to look at ways to make it less EAD-ish on the public side, and more integrated with ArchivesSpace on the administrative end.

3. Is the team considering building in support for forensic processing? What about automated metadata extraction?

Forensic processing is very much on the radar of our group, but it’s not a high priority. Why? For one, Penn State is still in the process of hashing out its forensic workflows, yet development on ArchiveSphere has already started. Furthermore, we feel that other Hydra partners with more mature forensic workflows may be better positioned to take what we’re doing with ArchiveSphere and work in their digital forensic concerns (UVa, Stanford, etc.), which we can then adopt/adapt locally (the joys of community development!). Finally, we have a lot of material that has come to us (and material that continues to be acquired) virtually, without any use of physical media at all, or using physical media supplied by the archives. These constitute our largest born-digital collections.

For now, we plan to create disk images using local workflows, while keeping an eye on how our development might evolve to include disk images down the line. 

We are planning to use FITS (File Information Tool Set, an open source characterization tool from Harvard U. Libraries) to extract metadata and file format features from deposited files, and Tika for full-text extraction and search.

4. Is the team considering incorporating any of the functionality of tools like Archivematica and/or BitCurator?

ArchiveSphere draws a lot of its influence from the work of Archivematica, and there is some overlap in functionality between the two. Still, we feel that there are certain benefits to developing Archivematica-like tools within the Hydra framework and community rather than seriously considering integration between Archivematica and ArchiveSphere, which run on different technologies and have different architectural and workflow assumptions.

BitCurator: TBD. Penn State is moving toward adoption of the BitCurator tools for its developing forensic workflows, so opportunities may arise. Archivematica will have support for disk images soon, but it won’t actually wrap the BitCurator tools into its interface (as far as we know). There might be some future development opportunities with Hydra here, but we feel this level of forensic tool integration with repository apps is still a ways off in the profession.

5. Are you planning to facilitate various layers of access? For example, items restricted to Penn State users, items restricted to a group defined by the archivist, items kept dark for a period of time, only metadata (not bitstream) accessible to public, certain administrative metadata fields hidden from public, etc.?

Yes, ALL of the above.

Nuanced access options are highly desirable, and one of the primary motivations for this project.  The first phase of the project, however, is a back-office tool only, so we may not build out this level of access controls in our first release.  It should be noted that Hydra already provides tooling for most of this out of the box, so it’s not “hard,” it’s mostly that we have lots of other priorities for the first phase of the project.
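
To make the access layers in the question concrete, here is a minimal sketch in Python. It is not Hydra's access-control API and not the ArchiveSphere implementation; the `User` and `Item` classes, the visibility levels ("public", "institution", "group"), and the function names are all hypothetical, chosen only to illustrate how separate metadata and file visibility plus a dark period could combine.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional, Set


@dataclass
class User:
    """Hypothetical user record; an anonymous visitor is represented by None."""
    name: str
    is_penn_state: bool = False
    groups: Set[str] = field(default_factory=set)


@dataclass
class Item:
    """Hypothetical item carrying the access settings named in the question."""
    metadata_visibility: str = "public"   # "public", "institution", or "group"
    file_visibility: str = "public"       # same levels, applied to the bitstream
    allowed_groups: Set[str] = field(default_factory=set)
    dark_until: Optional[date] = None     # item is dark before this date


def _allowed(level: str, item: Item, user: Optional[User]) -> bool:
    """Check one visibility level against one (possibly anonymous) user."""
    if level == "public":
        return True
    if user is None:
        return False
    if level == "institution":
        return user.is_penn_state
    if level == "group":
        return bool(user.groups & item.allowed_groups)
    raise ValueError(f"unknown visibility level: {level}")


def _is_dark(item: Item, today: date) -> bool:
    return item.dark_until is not None and today < item.dark_until


def can_view_metadata(item: Item, user: Optional[User], today: date) -> bool:
    return not _is_dark(item, today) and _allowed(item.metadata_visibility, item, user)


def can_download(item: Item, user: Optional[User], today: date) -> bool:
    return not _is_dark(item, today) and _allowed(item.file_visibility, item, user)
```

In this sketch, "only metadata accessible to public" is simply an item whose `metadata_visibility` is "public" and whose `file_visibility` is something narrower. Hiding individual administrative metadata fields would need field-level rules, which are left out here.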

6. Are you planning to facilitate the rendering of various file types within the user interface? For example, video, audio, CAD?

Our collection development priorities and use cases will drive these decisions. For instance, we don’t have CAD files, so no, but we do have design files, so we will need to accommodate formats from QuarkXPress and InDesign. We do have audio and video to consider, but it’s an open question whether we’ll develop custom functionality for this or build on tools developed by other Hydra partners (such as Indiana’s and Northwestern’s work on Avalon, and WGBH’s work on their audiovisual repository).

7. Are you planning to incorporate automated derivative creation for access copies?

Yep. We realize there is some debate about the best time to do this kind of transformation, and sorting out the costs and benefits is murky. For now we’re operating on the assumption that normalization for access will occur at the point of ingest, at the same time as normalization for preservation. (We are using a brand new Ruby component for this called hydra-derivatives.)

8. Are you planning to facilitate reuse of the collection material by classes and faculty members? For example, in the way that Northwestern’s Digital Image Library proposes to enable users to create their own image galleries. What about support for user-generated remixing and analysis of data, such as visualizations, data mining, mapping, timeline creation, etc.?

Right now, our delivery and access plans are archivally-focused. But once we can deliver born-digital collection materials, wrap them up with metadata about both digital and analog materials, and even begin to incorporate digitized material, we wonder: what is the utility of the traditional finding aid format? We’ll take it as our starting point and then try to disrupt it in strategic ways, and some of these might/should include visualization tools, or interfaces matched to the particular characteristics and usage needs of a particular genre (think email). But still TBD, as it’s not part of current planning cycles. 

Related: we’ve had requests for corpus data sets (e.g. digital newspapers) from remote researchers, so accommodating such needs is definitely on our radar, but as you can see, a lot of features are on our radar and we’re working on multiple Hydra apps concurrently, so we have to push some off for now.

9. Are you planning to incorporate support for web archiving and access to web archive records?

Penn State is an Archive-It partner, but web captures are archived remotely on Archive-It’s servers. Archive-It does provide WARC export of web captures at the end of the partnership (and we are exploring the use of tools like WARCreate to generate local WARC files), but we have not seen many impressive exemplars of how to deliver WARC for access in locally developed systems. At this point, because of our partnership with Archive-It, it’s not a high priority use case.

10. Are you planning to facilitate the collection of content as it is created? For example, collecting materials from faculty members as they create their born-digital manuscripts, rather than waiting until they retire to accession them.

Absolutely! We need to work out how material deposited in ScholarSphere ultimately flows into a formal archival repository (university archives/ArchiveSphere), but at a much higher level, we simply need to work out the organizational collaboration that gets everyone on the same page about this. Furthermore, our collecting from faculty members would have to become more active and less reactive (which is not a knock on our fantastic university archives program but an acknowledgement of a problem all archives currently face). We also want to enable this kind of functionality for offices to deposit files into the institutional/university archives.

Introducing ArchiveSphere

ArchiveSphere is the given name for a project between Penn State University Libraries and Information Technology Services. “Sphere” brands the project as being part of a set of repository services built using Project Hydra technologies; our first such service was ScholarSphere. “Archive” conveys that the project will create services for preserving, managing, and providing access to digital objects, in a way that is informed by archival thinking and practices.

Just what does this mean? We’ve spent the first few months of the project figuring that out ourselves. Many institutions have tested the utility of repository applications like DSpace or Fedora to store and deliver digital objects acquired as part of larger (and largely analog) archival collections. But what are the characteristics of storage and delivery?

With ArchiveSphere, we will provide a platform that archivists can use to deposit hierarchies of digital material from legacy media. The system will preserve the relational and hierarchical connections between files, while also providing archivists with tools that permit rearrangement and classification. Preservation actions like file characterization and normalization will be automated, as will virus checking and provenance event logging. We will leverage existing collection-level description found in collection management tools in a way that makes explicit the connections between digital and analog materials in hybrid collections. Access mechanisms will be provided that build on some of the great features found in ScholarSphere such as persistent unique identifiers, full-text indexing, integrity checking, and simple deposit, while also leveraging traditional archival discovery mechanisms like finding aids.
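
As an illustration of what automated integrity checking and provenance event logging can look like, here is a minimal sketch in Python. It is not ArchiveSphere code (which is built on Ruby components); the function names, the choice of SHA-256, and the fields of the event record are assumptions made for the example, loosely modeled on the idea of a preservation event.

```python
import hashlib
from datetime import datetime, timezone


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fixity(path, expected_digest, event_log):
    """Compare a file against its recorded digest and log the outcome.

    event_log is a plain list here; the record fields are hypothetical.
    Returns True when the file still matches the recorded digest.
    """
    actual = sha256_of(path)
    outcome = "pass" if actual == expected_digest else "fail"
    event_log.append({
        "event_type": "fixity check",
        "object": str(path),
        "outcome": outcome,
        "date_time": datetime.now(timezone.utc).isoformat(),
    })
    return outcome == "pass"
```

A digest would be recorded once at ingest with `sha256_of`, and `check_fixity` would then be run on a schedule, so that every check leaves a provenance event whether it passes or fails.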

We’ve identified four main phases for development: 1) ingest and preservation services for archives staff, 2) administrative tools for managing, arranging, and describing submissions for public access and discovery interfaces, 3) integration with ArchivesSpace for holistic management of archival context around repository materials, and 4) alternative submission tools, including self-deposit options for institutional records.

Requirements for phases 2-4 are still in development, and development on phase 1 will begin this summer. (Note that work on phase 1 is focused on an administrative interface rather than a public interface.)

Posted on behalf of the ArchiveSphere project team.

Long-term Retention Policies in ScholarSphere and Other Institutional Repositories

For this blog post, I’ve asked Patricia Gael, graduate assistant in Publishing and Curation Services, to write about her recent survey of retention policies in institutional repositories – a benchmarking exercise toward understanding what should inform our own retention policies for ScholarSphere. Comments and questions are welcome!
~ Patricia Hswe, co-lead, Publishing and Curation Services

I’m happy to have been invited to write a guest post for the Content Stewardship blog! As a graduate assistant in Publishing and Curation Services and a doctoral candidate amassing large amounts of data while writing my dissertation, I’ve been watching ScholarSphere’s development with interest. My current use of ScholarSphere could be called “experimental”: I’ve deposited a few trial items, but I haven’t yet uploaded any of my research. As I consider my future use of the repository, I’ve been thinking about the longevity of my data. This post is written from my viewpoint as a prospective ScholarSphere user. 

One of the questions Patricia Hswe and Linda Friend have been receiving at their ScholarSphere demonstrations is, “how long will my content remain accessible in ScholarSphere?” The short answer, and the one they’ve been giving, is, “as long as you leave it there”; the ScholarSphere team is committed to archiving and preserving deposited data, and, unless a user chooses to delete his or her own content, all items will remain safely in the repository. For all current, practical purposes this is true. But a longer and more detailed response to the “how long?” question would need to include phrases like “for the foreseeable future” and “as far as we know.” 

ScholarSphere is a new and still-in-development service and Penn State is still working to figure out what it will look like in the future. We are not alone in facing the challenges of long-term data archiving; many other universities will be making similar decisions about the retention of the data in their repositories. However, a recent quick survey of repository preservation policies suggests that retention strategies remain indefinite and unstandardized. 
Many preservation policies focus on the types of files uploaded and the likelihood that those files will be usable in the future. The University of Michigan’s DeepBlue, for example, provides “three levels of preservation support for specific file formats” that are determined by “a set of evaluation criteria including prevalence of the file format in the marketplace, whether the format is proprietary, the availability of tools for emulation or migration and the availability of local resources to take specific preservation actions.” These filetype-based preservation policies are helpful. Users should be aware of the technical reasons their data might not remain usable so that they can decide whether to adjust the formats in which they’re storing their data. But even data stored in the most secure formats cannot be guaranteed forever. 
Most institutional repositories are not clear about just how long data stored in their repositories will remain usable and searchable. Many use phrases like “persistent access” (the University of Pennsylvania’s ScholarlyCommons); “long-term preservation” (University of California’s e-Scholarship, Texas A&M’s Digital Repository, and the University of Michigan’s DeepBlue); “continuing access” (the University of Kansas’s KU ScholarWorks); or they state that the information will be held “indefinitely” (as at the University of Maryland’s DRUM and the University of Florida’s Institutional Repository). Penn State’s policy is in line with those offered elsewhere: we assert that “Penn State Libraries and Information Technology Services are committed to providing long-term access to all material submitted to ScholarSphere.” One can see why users might be concerned about what statements like these really promise. Very few repositories discuss a timeline for storage (one exception is Purdue’s PURR, which allocates repository space based on the nature of the stored data and grant funding, with storage timelines from three years to ten years or the length of the grant).
Electronic storage is limited. For ScholarSphere, these limitations are not an immediate concern, but just as librarians need to cull their physical collections, we know that we might eventually need criteria for determining which items should be preservation priorities. An item’s popularity might be one measure of its utility. How many times has the file been downloaded? When was the most recent download? But the utility of a file’s content can be just as significant. How can we determine what is obsolete and what will continue to be useful? A dataset downloaded by one person who uses it to publish a new article might be as valuable as, or more valuable than, a document downloaded by one hundred people who only read it once. The availability of the data is also a concern. How can we know whether ScholarSphere content exists elsewhere? Removing an old file from ScholarSphere might wipe it from the Internet entirely, or it might eliminate just one of many instances.
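
The two popularity questions above (how many downloads, and how recent) are simple to compute once download events are recorded. The Python sketch below assumes a hypothetical log of `(file_id, download_date)` pairs; it is not how ScholarSphere tracks usage, and the function name is invented for the example.

```python
from datetime import date


def summarize_downloads(events):
    """Return {file_id: {"count": n, "latest": date}} for a download log.

    events is an iterable of (file_id, download_date) pairs; this log
    format is an assumption made for the illustration.
    """
    summary = {}
    for file_id, day in events:
        entry = summary.setdefault(file_id, {"count": 0, "latest": day})
        entry["count"] += 1
        if day > entry["latest"]:
            entry["latest"] = day
    return summary


log = [
    ("dataset-1", date(2013, 3, 2)),
    ("report-7", date(2013, 1, 15)),
    ("report-7", date(2013, 4, 9)),
    ("report-7", date(2013, 2, 20)),
]
stats = summarize_downloads(log)
```

Counts like these say nothing about what a download led to, which is exactly the dataset-versus-document problem described above, so they could only ever be one input among several.
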
For now we have no straightforward solutions to these long-term problems, but we will continue to analyze and clarify our policies as ScholarSphere evolves. We would love to hear any suggestions or questions you might have!