Monthly Archives: February 2011

The ARL/DLF E-Science Institute

The Association of Research Libraries and the Digital Library Federation are jointly sponsoring an E-Science Institute which will begin later this year.  The program is a part of ARL’s Transforming Research Libraries initiative.  Out of this Institute, ARL and DLF hope to assist libraries in developing program plans in this area, provide professional development opportunities, and promote cooperation across libraries as they develop services for the sciences on their own campuses.  

The Institute will be designed so that each sponsoring library convenes a small team that will work on assignments in the second half of 2011 and then meet for a 2.5-day workshop towards the end of the year.  Penn State is a sponsoring institution, so we will have an opportunity to participate in the program when it starts up.

I have agreed to be one of the faculty for the Institute and will be working intensely with the others in the next several months to design the curriculum. 

In the meantime:  HELP!  Our first planning meeting will be held in one week, and by mid-week I will prepare a two-page outline of Penn State’s “near-term (12-18 month) E-Science strategic agenda–its  assumptions, priorities, and activities–for the perspective of your own organization and experience in this area.”  For our planning group this is a conversation starter. It’s not meant to be a true strategic plan or to draw heavily from existing documents.

But perhaps for us, this is a good opportunity to consider some issues. How might the Institute help us at Penn State?   If you were writing this, what would you highlight?  Or what questions might you pose?  What are the local factors that we need to consider going forward?   What activities do we need to plan for?  

Please post comments here, and I’ll try to incorporate your ideas to the extent that I can in a short document.  I’ll post the final version for discussion and I’ll keep everyone informed as the plans for the program develop.

CAPS: An ingest tool and curation platform prototype

Mairéad wrote about the shortcomings of the digital library applications currently in use by the University Libraries:
To a large degree, our existing applications support discovery and access but do not address digital preservation needs – the management of the digital object over time. The storage model for our digital library collections has also not included digital preservation requirements, such as support for mitigation of format obsolescence, replication, and tiered storage strategies. Managing digital assets across their entire life span is thus a key goal of our program.
She then introduced the CAPS project as the first phase of the Content Stewardship program’s effort to address these shortcomings:
The results of the platform review reinforced our determination to develop a new architecture and platform to support needs not currently being met  – foremost amongst these being the deposit of scholarly content, research data, and electronic business records from the University Archives. We initiated the Curation Architecture Prototype Services (CAPS) project this month with a projected four-month period for the prototype phase. The platform is based on a service-oriented architecture model, and entails the development of “microservices” – atomistic services to support functionalities such as “ingest,” “store,” “replicate,” or “annotate,” for example.
[Image: CAPS screenshot]

CAPS, in short, is a curation tool for the ingest and management (description, versioning, audit, and storage) of digital objects.  What we’ve developed during the prototype phase (12/2010 – 03/2011) is a Django-based web application that allows curators to ingest digital objects and metadata into a curation environment.  We define a digital object as one or more files of any type, the idea being that curators are free to define what constitutes a digital object for their needs.  The microservices we have developed so far are Python modules, but we’re looking at service frameworks, such as HTTP REST and OpenSRF (XMPP), for scalability and separation of application and service layers.

Every digital object is assigned an ARK identifier, which is laid out on the filesystem according to the Pairtree specification.  Objects stored in the curation environment are serialized into BagIt bags, so fixity information is a core part of the object.  Each digital object is also a git repository, and the storage microservice knows how to talk to git, so all changes are tracked in a widely used version control system.  Metadata, both descriptive and administrative, are mapped to RDF vocabularies and stored both within the object (serialized on disk in the N-Triples format) and in a central triplestore, which should allow us to expose objects easily via linked data once we build a public display/exhibit app.
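To make the identifier-to-path layout a bit more concrete, here is a minimal sketch of the Pairtree mapping in Python.  This is not the CAPS code: the function names and the `pairtree_root` default are illustrative, and the sample ARK is the example used in the Pairtree specification rather than a real CAPS identifier.

```python
# Characters that Pairtree hex-encodes (as ^hh) before any other substitution.
_HEX_ENCODE = set('"*+,<=>?\\^|')

# Single-character substitutions applied after hex-encoding.
_SUBSTITUTIONS = str.maketrans("/:.", "=+,")


def pairtree_clean(identifier):
    """Return the filesystem-safe form of an identifier per Pairtree."""
    cleaned = []
    for byte in identifier.encode("utf-8"):
        char = chr(byte)
        # Non-visible ASCII, non-ASCII bytes, and the reserved set become ^hh.
        if byte < 0x21 or byte > 0x7E or char in _HEX_ENCODE:
            cleaned.append("^%02x" % byte)
        else:
            cleaned.append(char)
    return "".join(cleaned).translate(_SUBSTITUTIONS)


def pairtree_path(identifier, root="pairtree_root"):
    """Split the cleaned identifier into two-character directory names."""
    cleaned = pairtree_clean(identifier)
    segments = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join([root] + segments)


print(pairtree_path("ark:/13030/xt12t3"))
# pairtree_root/ar/k+/=1/30/30/=x/t1/2t/3
```

The object itself (in our case a bag that is also a git repository) would then live in a directory beneath that final path segment.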
While developing the prototype ingest & management application, we’ve tried to track and avoid common issues that arise with ingest.  I’ve written more elsewhere about ingest lessons to be learned.
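Since objects are serialized as BagIt bags, fixity can be checked at ingest and re-checked at any later point.  The sketch below shows the general idea using only the Python standard library.  It is not how CAPS does it: the function names are hypothetical, SHA-256 is an assumed choice of algorithm, and a complete bag would also need a `bagit.txt` declaration, which is omitted here.

```python
import hashlib
from pathlib import Path


def write_manifest(bag_dir):
    """Write manifest-sha256.txt covering every file under data/."""
    bag = Path(bag_dir)
    lines = []
    for path in sorted((bag / "data").rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            # Manifest paths are relative to the bag and use forward slashes.
            lines.append(f"{digest}  {path.relative_to(bag).as_posix()}")
    manifest = bag / "manifest-sha256.txt"
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


def verify_manifest(bag_dir):
    """Return the payload paths that are missing or whose checksum changed."""
    bag = Path(bag_dir)
    manifest = bag / "manifest-sha256.txt"
    failures = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(None, 1)
        target = bag / rel_path
        if not target.is_file():
            failures.append(rel_path)
        elif hashlib.sha256(target.read_bytes()).hexdigest() != expected:
            failures.append(rel_path)
    return failures
```

An empty list from `verify_manifest` means every payload file still matches its recorded checksum; anything else is a candidate for an audit event or a notification to the curator.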
Next Steps
The prototype phase of development is winding down and it’s looking like the project will be a successful one with buy-in from curators within the University Libraries.  What’s next?  There are a number of next-phase projects that have been discussed:
  • Electronic records ingest & management
  • Electronic thesis & dissertation workflow
  • Public browse/exhibit/search application
  • Back-end scalability & performance testing
We have also gathered a laundry list of features we’re excited to implement but that we either haven’t yet had time for or that were explicitly out of scope:
  • Retention periods
  • Event logging service
  • Notification service
  • Replication
  • Routine audits
  • Object versioning/difference views
  • Exposure of objects and metadata via the linked data pattern
  • “Frameworkizing” the curation services
  • Format migration tools 
  • Digital object & collection usage statistics and reporting for curators
  • Controlled vocabulary/ontology management
  • Rights, access control, and authorization for digital objects
  • Publication of digital objects from ingest applications to display/exhibit applications (including legacy applications)
Coding and testing of CAPS have been happening out in the open.  You can find the code and unit tests on GitHub.