“ScholarSphering”

It’s been a while since we’ve blogged about ScholarSphere activities – but not for any lack of them! The service team has been incredibly busy since the release of ScholarSphere 1.0. Developers, especially, are coding away on ScholarSphere 2.0, for release this fall, when we will unveil – TADA! – a fabulous new user interface. In other words, we’ve been “ScholarSphering”! Below is what’s been happening of late.

ScholarSphere Collaborates with Zotero!

  • Ellysa Cahoy (Education Librarian at Penn State) developed the grant, in collaboration with Sean Takats (Associate Professor of History and Director of Research Projects at the Roy Rosenzweig Center for History and New Media).

ScholarSphere 2.0 and the ScholarSphere Users Group

  • In spring 2014 we launched the ScholarSphere Users Group (SUG), consisting of teaching faculty, librarians, and library staff. The SUG is one of the engines driving the 2.0 release!
  • This is a lightweight commitment, since SUG interactions take place mainly in Yammer.
  • Here’s our process:
      • Michael Tribone, UI/UX designer, posts user interface designs to Yammer.
      • SUG members offer feedback, generating ideas for improvements.
      • A revised design goes up, and additional responses are gathered.
      • Rinse. Repeat.
      • We do a quick poll to decide on a design; once decided, it becomes the design for implementation.
  • The service team will be conducting user interviews with a few SUG members in summer 2014, to help us get a fuller sense of what the UX needs to be like for researchers, what tools they currently use, and how ScholarSphere fits, or could fit, into that workflow.
  • Interested in being an SUG member? Request to join our group in Yammer!
  • Watch for a future blog post that will tell more about the SUG, its members, and its activities.

A Gentle Reminder of these Key Features in ScholarSphere

  • Create sets, groups, or collections of files
  • Get large files (> 500 MB) into ScholarSphere via Dropbox
  • Give others permission to deposit files on your behalf
  • Transfer ownership of files (perhaps after you’ve given another person permission to deposit those files)

Next blog post: On promoting and marketing ScholarSphere

ArchiveSphere FAQs

1. What are the main ways in which the architecture of ArchiveSphere will differ from that of ScholarSphere? 

In terms of system architecture, ArchiveSphere and ScholarSphere are identical, though they will live on different machines because of the extra level of protection the data in ArchiveSphere needs: both are Rails web applications that speak to Fedora (as an asset management system with preservation functions) and Solr (as a search index) via a suite of community-developed Ruby components. That is, they’re both Hydra applications.

In terms of software architecture, there’s a lot of overlap.  Both are based on a gem called Sufia, which was originally developed as the “guts” of ScholarSphere, and is now used by nearly a dozen institutions to power their own repository applications.  And both use core Hydra components such as hydra-head, active-fedora, and blacklight.  We have also been working on two new community components called hydra-collections and hydra-derivatives, which will again be used across both of our spheres and within the Hydra community.
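For the curious, here is a minimal Gemfile sketch of how a Hydra head along these lines pulls those pieces together. The gem names are the real community components mentioned above; the rest of the file (and the lack of version pins) is illustrative rather than a copy of the actual ScholarSphere or ArchiveSphere Gemfile.

    # Gemfile -- illustrative sketch, not the actual ScholarSphere/ArchiveSphere Gemfile
    source 'https://rubygems.org'

    gem 'rails'

    # Core Hydra stack
    gem 'hydra-head'        # glues a Rails app to Fedora and Solr, including access controls
    gem 'active-fedora'     # ActiveModel-style persistence against Fedora
    gem 'blacklight'        # Solr-backed discovery interface

    # ScholarSphere's "guts", now reused by other institutions' repositories
    gem 'sufia'

    # Newer community components mentioned above
    gem 'hydra-collections'
    gem 'hydra-derivatives'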

2. What metadata schemas will be used with ArchiveSphere?

We’re looking at PREMIS implementation for preservation metadata — and, in particular, the RDF-based version — which is an exciting challenge to consider. Descriptive metadata needs a little more fleshing out. At the object level in ScholarSphere we primarily use the RDF-based Dublin Core terms vocabulary (with a couple of other elements thrown in where DC had gaps), but ArchiveSphere is an archival repository, and one of the aspects that sets it apart from our work with ScholarSphere is the need to consider aggregate metadata — the description we assign to collections, series, boxes, etc., all of which apply to the individual objects as well.
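To make the object-level piece concrete, here is a tiny Ruby sketch (using the rdf gem) of a few Dublin Core terms statements about a deposited file. The subject URI, the values, and the extra non-DC predicate are all invented for illustration; ScholarSphere’s actual models manage this RDF inside Fedora rather than in a standalone graph like this.

    require 'rdf'
    require 'rdf/ntriples'

    DCTERMS = 'http://purl.org/dc/terms/'

    # Hypothetical object URI and values
    file_uri = RDF::URI.new('http://example.org/scholarsphere/files/abc123')

    graph = RDF::Graph.new
    graph << [file_uri, RDF::URI.new("#{DCTERMS}title"),   'Sample conference paper']
    graph << [file_uri, RDF::URI.new("#{DCTERMS}creator"), 'Jane Researcher']
    graph << [file_uri, RDF::URI.new("#{DCTERMS}subject"), 'digital repositories']
    # Where DC has a gap, another vocabulary can be thrown in (this predicate is made up)
    graph << [file_uri, RDF::URI.new('http://example.org/terms/mimeType'), 'application/pdf']

    puts graph.dump(:ntriples)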

Obviously, EAD is the standard we have to work with here, but we want to look at ways to make it less EAD-ish on the public side, and more integrated with ArchivesSpace on the administrative end.

3. Is the team considering building in support for forensic processing? What about automated metadata extraction?

Forensic processing is very much on the radar of our group, but it’s not a high priority. Why? For one, Penn State is still in the process of hashing out its forensic workflows, yet development on ArchiveSphere has already started. Furthermore, we feel that other Hydra partners with more mature forensic workflows may be better positioned to take what we’re doing with ArchiveSphere and work in their digital forensic concerns (UVa, Stanford, etc.), which we can then adopt/adapt locally (the joys of community development!). Finally, we have a lot of material that has come to us (and material that continues to be acquired) virtually, without any use of physical media at all, or using physical media supplied by the archives. These constitute our largest born-digital collections.

For now, we plan to create disk images using local workflows, while keeping an eye on how our development might evolve to include disk images down the line. 

We are planning to use FITS (File Information Tool Set, an open source characterization tool from Harvard U. Libraries) to extract metadata and file format features from deposited files, and Tika for full-text extraction and search.
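As a rough sketch of what that could look like in practice, the snippet below shells out to the FITS and Tika command-line tools from Ruby. The install paths and helper methods are hypothetical; the real integration lives inside the repository’s ingest code rather than in a standalone script like this.

    require 'open3'

    # Hypothetical install locations; adjust to wherever FITS and the Tika app jar live.
    FITS_SH  = '/opt/fits/fits.sh'
    TIKA_JAR = '/opt/tika/tika-app.jar'

    # Characterize a deposited file with FITS (FITS XML comes back on stdout).
    def characterize(path)
      xml, status = Open3.capture2(FITS_SH, '-i', path)
      raise "FITS failed for #{path}" unless status.success?
      xml
    end

    # Extract plain text with Tika for full-text indexing.
    def extract_text(path)
      text, status = Open3.capture2('java', '-jar', TIKA_JAR, '--text', path)
      raise "Tika failed for #{path}" unless status.success?
      text
    end

    puts characterize('deposit/article.pdf')[0, 200]
    puts extract_text('deposit/article.pdf')[0, 200]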

4. Is the team considering incorporating any of the functionality of tools like Archivematica and/or BitCurator?

ArchiveSphere draws a lot of its influence from the work of Archivematica, and there is some overlap in functionality between the two. Even so, we see real benefits to developing Archivematica-like tools within the Hydra framework and community rather than pursuing integration between Archivematica and ArchiveSphere, which run on different technologies and make different architectural and workflow assumptions.

BitCurator: TBD. Penn State is moving toward adoption of the BitCurator tools for its developing forensic workflows, so opportunities may arise. Archivematica will have support for disk images soon, but it won’t actually wrap the BitCurator tools into its interface (as far as we know). There might be some future development opportunities with Hydra here, but we feel this level of forensic tool integration with repository apps is still a ways off in the profession.

5. Are you planning to facilitate various layers of access? For example, items restricted to Penn State users, items restricted to a group defined by the archivist, items kept dark for a period of time, only metadata (not bitstream) accessible to public, certain administrative metadata fields hidden from public, etc.?

Yes, ALL of the above.

Nuanced access options are highly desirable and one of the primary motivations for this project. The first phase of the project, however, is a back-office tool only, so we may not build out this level of access controls in our first release. It should be noted that Hydra already provides tooling for most of this out of the box, so it’s not “hard”; it’s mostly that we have lots of other priorities for the first phase of the project.
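To give a flavor of that out-of-the-box tooling, here is roughly how the tiers in the question map onto Hydra-style access grants. The model, identifier, and group names are invented, and the method names follow hydra-access-controls conventions as we understand them, so treat this as an illustrative sketch rather than ArchiveSphere code.

    # Illustrative sketch only; ArchivalObject and the identifier are hypothetical.
    obj = ArchivalObject.find('asph:1234')

    obj.read_groups     = ['registered']            # e.g. restrict reading to authenticated Penn State users
    obj.edit_groups     = ['university-archives']   # e.g. an archivist-defined group with edit rights
    obj.discover_groups = ['public']                # roughly: findable by the public without full access
    obj.embargo_release_date = '2020-01-01'         # kept dark until a release date
    obj.save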

6. Are you planning to facilitate the rendering of various file types within the user interface? For example, video, audio, CAD?

Our collection development priorities and use cases will drive these decisions. For instance, we don’t have CAD files, so no, but we do have design files, so we will need to accommodate formats from QuarkXPress and InDesign. We do have audio and video to consider, but it’s an open question whether we’ll develop custom functionality for this or build on tools developed by other Hydra partners (such as Indiana’s and Northwestern’s work on Avalon, and WGBH’s work on their audiovisual repository).

7. Are you planning to incorporate automated derivative creation for access copies?

Yep. We realize there is some debate about the best time to do this kind of transformation, and sorting out the costs and benefits is murky; for now we’re operating on the assumption that normalization for access will occur at the point of ingest, at the same time as normalization for preservation. (We are using a brand-new Ruby component for this called hydra-derivatives.)
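As an illustration of the idea (not of the hydra-derivatives API itself), here is a sketch of access-copy normalization at ingest: a TIFF master gets a downsized JPEG access copy via ImageMagick’s convert. The paths and the helper method are hypothetical.

    require 'open3'

    # Create a web-friendly JPEG access copy from a preservation master.
    def make_access_copy(master_path, access_path)
      _out, status = Open3.capture2('convert', master_path,
                                    '-resize', '1200x1200>',   # cap the long edge for web delivery
                                    '-quality', '85',
                                    access_path)
      raise "Derivative creation failed for #{master_path}" unless status.success?
      access_path
    end

    make_access_copy('masters/box12_folder3_scan001.tif', 'access/box12_folder3_scan001.jpg')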

8. Are you planning to facilitate reuse of the collection material by classes and faculty members? For example, in the way that Northwestern’s Digital Image Library proposes to enable users to create their own image galleries. What about support for user-generated remixing and analysis of data, such as visualizations, data mining, mapping, timeline creation, etc.?

Right now, our delivery and access plans are archivally-focused. But once we can deliver born-digital collection materials, wrap them up with metadata about both digital and analog materials, and even begin to incorporate digitized material, we wonder: what is the utility of the traditional finding aid format? We’ll take it as our starting point and then try to disrupt it in strategic ways, and some of these might/should include visualization tools, or interfaces matched to the particular characteristics and usage needs of a particular genre (think email). But still TBD, as it’s not part of current planning cycles. 

Related: we’ve had requests for corpus data sets (e.g. digital newspapers) from remote researchers, so accommodating such needs is definitely on our radar, but as you can see, a lot of features are on our radar and we’re working on multiple Hydra apps concurrently, so we have to push some off for now.

9. Are you planning to incorporate support for web archiving and access to web archive records?

Penn State is an Archive-It partner, but web captures are archived remotely on Archive-It’s servers. Archive-It does provide WARC export of web captures at the end of the partnership (and we are exploring the use of tools like WARCreate to generate local WARC files), but we have not seen many impressive exemplars of how to deliver WARCs for access in locally developed systems. At this point, because of our partnership with Archive-It, it’s not a high-priority use case.

10. Are you planning to facilitate the collection of content as it is created? For example, collecting materials from faculty members as they create their born-digital manuscripts, rather than waiting until they retire to accession them.

Absolutely! We need to work out how material deposited in ScholarSphere ultimately flows into a formal archival repository (university archives/ArchiveSphere), but at a much higher level, we simply need to work out the organizational collaboration that gets everyone on the same page about this. Furthermore, our collecting from faculty members would have to become more active and less reactive (which is not a knock on our fantastic university archives program but an acknowledgement of a problem all archives currently face). We also want to enable this kind of functionality for offices to deposit files into the institutional/university archives.

Introducing ArchiveSphere

ArchiveSphere is the given name for a project between Penn State University Libraries and Information Technology Services. “Sphere” brands the project as being part of a set of repository services built using Project Hydra technologies; our first such service was ScholarSphere. “Archive” conveys that the project will create services for preserving, managing, and providing access to digital objects, in a way that is informed by archival thinking and practices.

Just what does this mean? We’ve spent the first few months of the project figuring that out ourselves. Many institutions have tested the utility of repository applications like DSpace or Fedora to store and deliver digital objects acquired as part of larger (and largely analog) archival collections. But what should that storage and delivery actually look like?

With ArchiveSphere, we will provide a platform that archivists can use to deposit hierarchies of digital material from legacy media. The system will preserve the relational and hierarchical connections between files, while also providing archivists with tools that permit rearrangement and classification. Preservation actions like file characterization and normalization will be automated, as will virus checking and provenance event logging. We will leverage existing collection-level description found in collection management tools in a way that makes explicit the connections between digital and analog materials in hybrid collections. Access mechanisms will be provided that build on some of the great features found in ScholarSphere such as persistent unique identifiers, full-text indexing, integrity checking, and simple deposit, while also leveraging traditional archival discovery mechanisms like finding aids.
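To make a couple of those preservation actions concrete, here is a small sketch of a fixity (integrity) check that returns a provenance-style event record. The event structure is an ad hoc illustration, not ArchiveSphere’s schema; as discussed in the FAQ above, PREMIS is the likely basis for the real thing.

    require 'digest'
    require 'time'

    # Hypothetical, minimal provenance event; a real record would follow a
    # PREMIS-informed schema rather than this ad hoc hash.
    def fixity_event(path, expected_sha1)
      actual = Digest::SHA1.file(path).hexdigest
      {
        event_type: 'fixity check',
        file:       path,
        outcome:    actual == expected_sha1 ? 'pass' : 'fail',
        checksum:   actual,
        timestamp:  Time.now.utc.iso8601
      }
    end

    p fixity_event('deposit/minutes_1998.doc', 'da39a3ee5e6b4b0d3255bfef95601890afd80709')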

We’ve identified four main phases for development: 1) ingest and preservation services for archives staff, 2) administrative tools for managing, arranging, and describing submissions for public access and discovery interfaces, 3) integration with ArchivesSpace for holistic management of archival context around repository materials, and 4) alternative submission tools, including self-deposit options for institutional records.

Requirements for phases 2-4 are still in development, and development on phase 1 will begin this summer. (Note that work on phase 1 is focused on an administrative interface rather than a public interface.)

Posted on behalf of the ArchiveSphere project team.