1. What are the main ways in which the architecture of ArchiveSphere will differ from that of ScholarSphere?
In terms of system architecture, ArchiveSphere and ScholarSphere, though they will live on different machines due to the extra level of protection we need for the data in ArchiveSphere, are identical: both are Rails web applications that speak to Fedora, as an asset management system with preservation functions, and Solr, as a search index, via a suite of community-developed Ruby components. That is, they’re both Hydra applications.
In terms of software architecture, there’s a lot of overlap. Both are based on a gem called Sufia, which was originally developed as the “guts” of ScholarSphere, and is now used by nearly a dozen institutions to power their own repository applications. And both use core Hydra components such as hydra-head, active-fedora, and blacklight. We have also been working on two new community components called hydra-collections and hydra-derivatives, which will again be used across both of our spheres and within the Hydra community.
2. What metadata schemas will be used with ArchiveSphere?
We’re looking at PREMIS implementation for preservation metadata — and, in particular, the RDF-based version — which is an exciting challenge to consider. Descriptive metadata needs a little more fleshing out. At the object level in ScholarSphere we primarily use the RDF-based Dublin Core terms vocabulary (with a couple of other elements thrown in where DC had gaps), but ArchiveSphere is an archival repository, and one of the aspects that sets it apart from our work with ScholarSphere is the need to consider aggregate metadata — the description we assign to collections, series, boxes, etc., all of which apply to the individual objects as well.
Obviously, EAD is the standard we have to work with here, but we want to look at ways to make it less EAD-ish on the public side, and more integrated with ArchivesSpace on the administrative end.
3. Is the team considering building in support for forensic processing? What about automated metadata extraction?
Forensic processing is very much on the radar of our group, but it’s not a high priority. Why? For one, Penn State is still in the process of hashing out its forensic workflows, yet development on ArchiveSphere has already started. Furthermore, we feel that other Hydra partners with more mature forensic workflows may be better positioned to take what we’re doing with ArchiveSphere and work in their digital forensic concerns (UVa, Stanford, etc.), which we can then adopt/adapt locally (the joys of community development!). Finally, we have a lot of material that has come to us (and material that continues to be acquired) virtually, without any use of physical media at all, or using physical media supplied by the archives. These constitute our largest born-digital collections.
For now, we plan to create disk images using local workflows, while keeping an eye on how our development might evolve to include disk images down the line.
We are planning to use FITS (File Information Tool Set, an open source characterization tool from Harvard U. Libraries) to extract metadata and file format features from deposited files, and Tika for full-text extraction and search.
4. Is the team considering incorporating any of the functionality of tools like Archivematica and/or BitCurator?
ArchiveSphere draws a lot of its influence from the work of Archivematica, and while there is some overlap in the functionality between the two, we also feel that there are certain benefits to developing Amatica-like tools within the Hydra framework and community rather than seriously considering integration between Archivematica and ArchiveSphere, which run on different technologies and have different architectural and workflow assumptions.
BitCurator: TBD. Penn State is moving toward adoption of the BitCurator tools for its developing forensic workflows, so opportunities may arise. Archivematica will have support for disk images soon, but it won’t actually wrap the BC tools into their interface (as far as we know). There might be some future development opportunities with Hydra here, but we feel this level of forensic tool integration with repository apps is still a ways off in the profession.
5. Are you planning to facilitate various layers of access? For example, items restricted to Penn State users, items restricted to a group defined by the archivist, items kept dark for a period of time, only metadata (not bitstream) accessible to public, certain administrative metadata fields hidden from public, etc.?
Yes, ALL of the above.
Nuanced access options are highly desirable, and one of the primary motivations for this project. The first phase of the project, however, is a back-office tool only, so we may not build out this level of access controls in our first release. It should be noted that Hydra already provides tooling for most of this out of the box, so it’s not “hard,” it’s mostly that we have lots of other priorities for the first phase of the project.
6. Are you planning to facilitate the rendering of various file types within the user interface? For example, video, audio, CAD?
Our collection development priorities and use cases will drive these decisions. For instance, we don’t have CAD files, so no, but we do have design files, so we will need to accommodate formats from QuarkXPress and InDesign. We do have audio and video to consider, but it’s an open question whether we’ll develop custom functionality for this or build on tools developed by other Hydra partners (such as Indiana’s and Northwestern’s work on Avalon, and WGBH’s work on their audiovisual repository).
7. Are you planning to incorporate automated derivative creation for access copies?
Yep. We realize there is some debate about when the best time to do this kind of transformation is, but sorting out the costs and benefits is murky, and for now we’re operating on the assumption that normalization for access will occur at the point of ingest at the same time as normalization for preservation. (We are using a brand new Ruby component for this called hydra-derivatives.)
8. Are you planning to facilitate reuse of the collection material by classes and faculty members? For example, in the way that Northwestern’s Digital Image Library proposes to enable users to create their own image galleries. What about support for user-generated remixing and analysis of data, such as visualizations, data mining, mapping, timeline creation, etc.?
Right now, our delivery and access plans are archivally-focused. But once we can deliver born-digital collection materials, wrap them up with metadata about both digital and analog materials, and even begin to incorporate digitized material, we wonder: what is the utility of the traditional finding aid format? We’ll take it as our starting point and then try to disrupt it in strategic ways, and some of these might/should include visualization tools, or interfaces matched to the particular characteristics and usage needs of a particular genre (think email). But still TBD, as it’s not part of current planning cycles.
Related: we’ve had requests for corpus data sets (e.g. digital newspapers) from remote researchers, so accommodating such needs is definitely on our radar, but as you can see, a lot of features are on our radar and we’re working on multiple Hydra apps concurrently, so we have to push some off for now.
9. Are you planning to incorporate support for web archiving and access to web archive records?
Penn State is an Archive-It partner, but web captures are archived remotely on Archive-It’s servers. Archive-It does provide WARC export of web captures at the end of the partnership (and we are exploring the use of tools like WARCreate to generate local WARC files), but we am have not seen many impressive exemplars of how to deliver WARC for access in locally developed systems. At this point, because of our partnership with A-It, it’s not a high priority use case.
10. Are you planning to facilitate the collection of content as it is created? For example, collecting materials from faculty members as they create their born-digital manuscripts, rather than waiting until they retire to accession them.
Absolutely! We need to work out how material deposited in ScholarSphere ultimately flows into a formal archival repository (university archives/ArchiveSphere), but at a much higher level, we simply need to work out the organizational collaboration that gets everyone on the same page about this. Furthermore, our collecting of faculty members would have to become more active and less reactive (which is not a knock on our fantastic university archives program but an acknowledgement of a problem all archives currently face). We also want to enable this kind of functionality for offices to deposit files into the institutional/university archives.