ArchiveSphere FAQs

1. What are the main ways in which the architecture of ArchiveSphere will differ from that of ScholarSphere? 

In terms of system architecture, ArchiveSphere and ScholarSphere, though they will live on different machines due to the extra level of protection we need for the data in ArchiveSphere, are identical: both are Rails web applications that speak to Fedora, as an asset management system with preservation functions, and Solr, as a search index, via a suite of community-developed Ruby components. That is, they’re both Hydra applications.  

In terms of software architecture, there’s a lot of overlap.  Both are based on a gem called Sufia, which was originally developed as the “guts” of ScholarSphere, and is now used by nearly a dozen institutions to power their own repository applications.  And both use core Hydra components such as hydra-head, active-fedora, and blacklight.  We have also been working on two new community components called hydra-collections and hydra-derivatives, which will again be used across both of our spheres and within the Hydra community.

2. What metadata schemas will be used with ArchiveSphere?

We’re looking at PREMIS implementation for preservation metadata — and, in particular, the RDF-based version — which is an exciting challenge to consider. Descriptive metadata needs a little more fleshing out. At the object level in ScholarSphere we primarily use the RDF-based Dublin Core terms vocabulary (with a couple of other elements thrown in where DC had gaps), but ArchiveSphere is an archival repository, and one of the aspects that sets it apart from our work with ScholarSphere is the need to consider aggregate metadata — the description we assign to collections, series, boxes, etc., all of which apply to the individual objects as well.

Obviously, EAD is the standard we have to work with here, but we want to look at ways to make it less EAD-ish on the public side, and more integrated with ArchivesSpace on the administrative end.

3. Is the team considering building in support for forensic processing? What about automated metadata extraction?

Forensic processing is very much on the radar of our group, but it’s not a high priority. Why? For one, Penn State is still in the process of hashing out its forensic workflows, yet development on ArchiveSphere has already started. Furthermore, we feel that other Hydra partners with more mature forensic workflows (UVa, Stanford, etc.) may be better positioned to take what we’re doing with ArchiveSphere and work in their digital forensic concerns, which we can then adopt/adapt locally (the joys of community development!). Finally, we have a lot of material that has come to us (and material that continues to be acquired) virtually, without any use of physical media at all, or using physical media supplied by the archives. These constitute our largest born-digital collections.

For now, we plan to create disk images using local workflows, while keeping an eye on how our development might evolve to include disk images down the line. 

We are planning to use FITS (File Information Tool Set, an open source characterization tool from Harvard U. Libraries) to extract metadata and file format features from deposited files, and Tika for full-text extraction and search.

4. Is the team considering incorporating any of the functionality of tools like Archivematica and/or BitCurator?

ArchiveSphere draws a lot of its influence from the work of Archivematica, and there is some overlap in functionality between the two. Still, we feel that there are certain benefits to developing Archivematica-like tools within the Hydra framework and community rather than seriously considering integration between Archivematica and ArchiveSphere, which run on different technologies and have different architectural and workflow assumptions.

BitCurator: TBD. Penn State is moving toward adoption of the BitCurator tools for its developing forensic workflows, so opportunities may arise. Archivematica will have support for disk images soon, but it won’t actually wrap the BitCurator tools into its interface (as far as we know). There might be some future development opportunities with Hydra here, but we feel this level of forensic tool integration with repository apps is still a ways off in the profession.

5. Are you planning to facilitate various layers of access? For example, items restricted to Penn State users, items restricted to a group defined by the archivist, items kept dark for a period of time, only metadata (not bitstream) accessible to public, certain administrative metadata fields hidden from public, etc.?

Yes, ALL of the above.

Nuanced access options are highly desirable, and one of the primary motivations for this project.  The first phase of the project, however, is a back-office tool only, so we may not build out this level of access controls in our first release.  It should be noted that Hydra already provides tooling for most of this out of the box, so it’s not “hard,” it’s mostly that we have lots of other priorities for the first phase of the project.

6. Are you planning to facilitate the rendering of various file types within the user interface? For example, video, audio, CAD?

Our collection development priorities and use cases will drive these decisions. For instance, we don’t have CAD files, so no, but we do have design files, so we will need to accommodate formats from QuarkXPress and InDesign. We do have audio and video to consider, but it’s an open question whether we’ll develop custom functionality for this or build on tools developed by other Hydra partners (such as Indiana’s and Northwestern’s work on Avalon, and WGBH’s work on their audiovisual repository).

7. Are you planning to incorporate automated derivative creation for access copies?

Yep. We realize there is some debate about when the best time to do this kind of transformation is, but sorting out the costs and benefits is murky, and for now we’re operating on the assumption that normalization for access will occur at the point of ingest at the same time as normalization for preservation.   (We are using a brand new Ruby component for this called hydra-derivatives.) 

8. Are you planning to facilitate reuse of the collection material by classes and faculty members? For example, in the way that Northwestern’s Digital Image Library proposes to enable users to create their own image galleries. What about support for user-generated remixing and analysis of data, such as visualizations, data mining, mapping, timeline creation, etc.?

Right now, our delivery and access plans are archivally-focused. But once we can deliver born-digital collection materials, wrap them up with metadata about both digital and analog materials, and even begin to incorporate digitized material, we wonder: what is the utility of the traditional finding aid format? We’ll take it as our starting point and then try to disrupt it in strategic ways, and some of these might/should include visualization tools, or interfaces matched to the particular characteristics and usage needs of a particular genre (think email). But still TBD, as it’s not part of current planning cycles. 

Related: we’ve had requests for corpus data sets (e.g. digital newspapers) from remote researchers, so accommodating such needs is definitely on our radar, but as you can see, a lot of features are on our radar and we’re working on multiple Hydra apps concurrently, so we have to push some off for now.

9. Are you planning to incorporate support for web archiving and access to web archive records?

Penn State is an Archive-It partner, but web captures are archived remotely on Archive-It’s servers. Archive-It does provide WARC export of web captures at the end of the partnership (and we are exploring the use of tools like WARCreate to generate local WARC files), but we have not seen many impressive exemplars of how to deliver WARC for access in locally developed systems. At this point, because of our partnership with Archive-It, it’s not a high priority use case.

10. Are you planning to facilitate the collection of content as it is created? For example, collecting materials from faculty members as they create their born-digital manuscripts, rather than waiting until they retire to accession them.

Absolutely! We need to work out how material deposited in ScholarSphere ultimately flows into a formal archival repository (university archives/ArchiveSphere), but at a much higher level, we simply need to work out the organizational collaboration that gets everyone on the same page about this. Furthermore, our collecting of faculty members’ materials would have to become more active and less reactive (which is not a knock on our fantastic university archives program but an acknowledgement of a problem all archives currently face). We also want to enable this kind of functionality for offices to deposit files into the institutional/university archives.

Introducing ArchiveSphere

ArchiveSphere is the given name for a project between Penn State University Libraries and Information Technology Services. “Sphere” brands the project as being part of a set of repository services built using Project Hydra technologies; our first such service was ScholarSphere. “Archive” conveys that the project will create services for preserving, managing, and providing access to digital objects, in a way that is informed by archival thinking and practices.
Just what does this mean? We’ve spent the first few months of the project figuring that out ourselves. Many institutions have tested the utility of repository applications like DSpace or Fedora to store and deliver digital objects acquired as part of larger (and largely analog) archival collections. But what are the characteristics of storage and delivery?
With ArchiveSphere, we will provide a platform that archivists can use to deposit hierarchies of digital material from legacy media. The system will preserve the relational and hierarchical connections between files, while also providing archivists with tools that permit rearrangement and classification. Preservation actions like file characterization and normalization will be automated, as will virus checking and provenance event logging. We will leverage existing collection-level description found in collection management tools in a way that makes explicit the connections between digital and analog materials in hybrid collections. Access mechanisms will be provided that build on some of the great features found in ScholarSphere such as persistent unique identifiers, full-text indexing, integrity checking, and simple deposit, while also leveraging traditional archival discovery mechanisms like finding aids.
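
As a rough illustration of what preserving hierarchy alongside integrity checking can look like, here is a minimal Python sketch. It is not ArchiveSphere code: the function names, the sample folder layout, and the choice of SHA-256 are all assumptions made for the example. It records each file's path relative to the deposit root (so the folder structure survives) together with a checksum, then reports files whose checksum no longer matches.

```python
import hashlib
import tempfile
from pathlib import Path


def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            manifest[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest


def changed_files(root: Path, manifest: dict) -> list:
    """Return recorded paths whose current digest differs or that are missing."""
    current = build_manifest(root)
    return sorted(p for p in manifest if current.get(p) != manifest[p])


# Demo on a throwaway directory standing in for a deposit (hypothetical layout).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "series1" / "box1").mkdir(parents=True)
    (root / "series1" / "box1" / "letter.txt").write_text("Dear colleague")
    (root / "readme.txt").write_text("accession notes")

    manifest = build_manifest(root)

    # Simulate an unexpected change after ingest.
    (root / "readme.txt").write_text("accession notes, edited")
    changed = changed_files(root, manifest)

print(sorted(manifest))
print(changed)
```

Keeping relative paths as the manifest keys means the original arrangement can be reconstructed even if the files are later rearranged for access.
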
We’ve identified four main phases for development: 1) ingest and preservation services for archives staff, 2) administrative tools for managing, arranging, and describing submissions for public access and discovery interfaces, 3) integration with ArchivesSpace for holistic management of archival context around repository materials, and 4) alternative submission tools, including self-deposit options for institutional records.
Requirements for phases 2-4 are still in development, and development on phase 1 will begin this summer. (Note that work on phase 1 is focused on an administrative interface rather than a public interface.)
Posted on behalf of the ArchiveSphere project team.

ScholarSphere Springs Forward

Judging by the recurring snowfall and the lingering cold in various parts of the country, it may not feel quite like spring yet, but we on the ScholarSphere service team have a spring in our step! We’re pleased to announce the newest release of ScholarSphere – version 1.4. This version adds two new features as well as improvements to the user interface.

First, the features

Look who’s full-text indexing! That’s right. ScholarSphere functionality now includes full-text indexing and searching, a standard feature in most repository software applications. Because this feature enables keyword searching on more than the metadata that users input to describe their files, it expands the possibilities for rich content discovery.

A note about access for the files searched . . . Which files appear in search results depends on each file’s visibility level (open access, Penn State, or private) and on whether you are logged in. If you are not logged in, the results will include public files only; neither private files nor files restricted to the Penn State community will appear. If you are logged in as a Penn State user, the results will include public files as well as files visible to the Penn State community, plus any of your own private files that match the search.
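
The visibility rules above can be sketched in a few lines of Python. This is only an illustration of the logic as described in this post, not how ScholarSphere implements it; the visibility labels, the `FileRecord` type, and the user IDs are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical labels for the three visibility levels named in the post.
OPEN, PENN_STATE, PRIVATE = "open", "psu", "private"


@dataclass
class FileRecord:
    title: str
    visibility: str
    owner: str  # user ID of the depositor (assumed field)


def visible_results(hits: List[FileRecord], user: Optional[str]) -> List[FileRecord]:
    """Filter search hits; `user` is None when nobody is logged in."""
    results = []
    for f in hits:
        if f.visibility == OPEN:
            results.append(f)  # public files: everyone sees them
        elif user is not None and f.visibility == PENN_STATE:
            results.append(f)  # Penn State files: any logged-in user
        elif user is not None and f.visibility == PRIVATE and f.owner == user:
            results.append(f)  # private files: only their owner
    return results


hits = [
    FileRecord("Open dataset", OPEN, "abc123"),
    FileRecord("Campus report", PENN_STATE, "abc123"),
    FileRecord("Draft notes", PRIVATE, "abc123"),
]

print([f.title for f in visible_results(hits, None)])      # not logged in
print([f.title for f in visible_results(hits, "xyz789")])  # another Penn State user
print([f.title for f in visible_results(hits, "abc123")])  # the owner
```

An anonymous visitor gets only the open file, a different logged-in user also gets the campus report, and the owner gets all three.
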

Are you LinkedIn? From its launch ScholarSphere has integrated widgets for social networking services, such as Twitter and Facebook, and equally social citation management tools, such as Zotero and Mendeley, for easy, outward sharing of one’s research. ScholarSphere now includes a widget for LinkedIn, the popular professional networking tool.

The profile page each user receives upon logging into ScholarSphere also shows the most prominent networking sites, now including LinkedIn.

To link out to your page in these social networking services, just click on the “Edit your profile” button on the upper right-hand corner of the profile page and enter your handles for the ones you belong to. For example, for my LinkedIn account, I entered “/in/patriciah.”

Next, the user experience

Cleaner layout for the user profile page. The user profile page makes more efficient use of space by incorporating a tabbed interface to represent aspects of a user – namely, her highlighted files, profile (user information), and activity.

Functional, felicitous facets. ScholarSphere has always had a “Browse By” list of facets on the left-hand side of the site. At the end of each shortened facet list is a link taking users to the complete list of whatever facet is being accessed, be it “resource type,” “creator,” “keyword,” etc. This link opens up a dialogue box, now with an improved user interface, displaying the complete list, allowing users to sort numerically (in descending order) or alphabetically.
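
For the curious, the two sort orders in the facet dialogue can be illustrated with a small Python sketch. The function name and sample keywords are made up for the example, and the real interface gets its facet counts from the search index rather than computing them like this.

```python
from collections import Counter


def facet_values(values, sort="count"):
    """Return (value, count) pairs sorted by count (descending) or alphabetically."""
    counts = Counter(values)
    if sort == "count":
        # Highest count first; ties broken alphabetically for a stable display.
        return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0].lower()))
    if sort == "alpha":
        return sorted(counts.items(), key=lambda kv: kv[0].lower())
    raise ValueError("sort must be 'count' or 'alpha'")


# Hypothetical keywords attached to six deposited files.
keywords = ["preservation", "Hydra", "preservation", "archives", "Hydra", "preservation"]

by_count = facet_values(keywords, sort="count")
by_alpha = facet_values(keywords, sort="alpha")
print(by_count)
print(by_alpha)
```
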

Features under active development

This spring is a busy one for the ScholarSphere development team. They’ll be working on integration of collections functionality, deposit by proxy, and a hook to the Dropbox service (to enable deposit of larger files). These features, which currently are the most in demand by our users, will position the service well for increasing adoption by campus entities, such as colleges and departments interested in showcasing the best of their students’ work, or grant-funded research projects wishing to disseminate their outputs in the form of presentations, preprints, data sets, and project reports. 

Also, a heads up: we will be doing another round of usability testing and thus recruiting test users to give us feedback on the new features and functionalities in ScholarSphere. Recruitment emails to the Penn State community should go out sometime in April.

Last but not least, stay tuned to the Content Stewardship Council blog to learn more about our partnerships with other institutions on developing ScholarSphere, and how version 1.4 of ScholarSphere has been a concerted community effort.

Lots of possibilities abound! We’ve only touched the surface of what ScholarSphere can, and will, achieve.