
CAPS: An ingest tool and curation platform prototype

Background
Mairéad wrote about the shortcomings of the digital library applications currently in use by the University Libraries:
To a large degree, our existing applications support discovery and access but do not address digital preservation needs – the management of the digital object over time. The storage model for our digital library collections has also not included digital preservation requirements, such as support for mitigation of format obsolescence, replication, and tiered storage strategies. Managing digital assets across their entire life span is thus a key goal of our program.
She then introduced the CAPS project as the first phase of the Content Stewardship program’s effort to address these shortcomings:
The results of the platform review reinforced our determination to develop a new architecture and platform to support needs not currently being met  – foremost amongst these being the deposit of scholarly content, research data, and electronic business records from the University Archives. We initiated the Curation Architecture Prototype Services (CAPS) project this month with a projected four-month period for the prototype phase. The platform is based on a service-oriented architecture model, and entails the development of “microservices” – atomistic services to support functionalities such as “ingest,” “store,” “replicate,” or “annotate,” for example.
[Image: CAPS screenshot]
Introduction

CAPS, in short, is a curation tool for the ingest and management (description, versioning, audit, and storage) of digital objects. What we’ve developed during the prototype phase (December 2010 to March 2011) is a Django-based web application that allows curators to ingest digital objects and metadata into a curation environment. We define a digital object as one or more files of any type, the idea being that curators are free to decide what constitutes a digital object for their needs. The microservices we have developed so far are Python modules, but we’re looking at service frameworks, such as HTTP REST and OpenSRF (XMPP), for scalability and for separation of the application and service layers.

Every digital object is assigned an ARK identifier, which is mapped to a location on the filesystem via the Pairtree specification. Objects stored in the curation environment are serialized into BagIt bags, so fixity information (checksums) is a core part of the object. Each digital object is also a Git repository, and the storage microservice knows how to talk to Git, so all changes are tracked in a widely used version control system. Metadata, both descriptive and administrative, are mapped to RDF vocabularies and stored both within the object (serialized on disk in the N-Triples format) and in a central triplestore, which should allow us to expose objects easily as linked data once we build a public display/exhibit app.
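To make the identifier-to-filesystem mapping concrete, here is a minimal sketch of how a Pairtree path can be derived from an ARK. It is not taken from the CAPS code: the function names and the `pairtree_root` default are hypothetical, and the sample ARK is the one used as an example in the Pairtree specification, not a real CAPS identifier. The sketch covers only identifier cleaning and the split into two-character directory names; it leaves out how the object directory itself is named at the end of the path.

```python
# Characters the Pairtree spec hex-encodes as ^hh before the one-to-one swaps.
_HEX_ENCODE = set('"*+,<=>?\\^|')

# One-to-one swaps applied after hex-encoding: / -> =, : -> +, . -> ,
_SWAPS = str.maketrans({"/": "=", ":": "+", ".": ","})


def clean_identifier(identifier: str) -> str:
    """Make an identifier safe to use as a sequence of directory names."""
    out = []
    for byte in identifier.encode("utf-8"):
        ch = chr(byte)
        # Anything outside visible ASCII, plus the reserved characters,
        # becomes a caret followed by two lowercase hex digits.
        if byte < 0x21 or byte > 0x7E or ch in _HEX_ENCODE:
            out.append(f"^{byte:02x}")
        else:
            out.append(ch)
    return "".join(out).translate(_SWAPS)


def pairtree_path(identifier: str, root: str = "pairtree_root") -> str:
    """Split the cleaned identifier into two-character directory names."""
    cleaned = clean_identifier(identifier)
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join([root] + pairs)


if __name__ == "__main__":
    # Sample ARK from the Pairtree specification (an assumption, not a CAPS object).
    print(pairtree_path("ark:/13030/xt12t3"))
    # pairtree_root/ar/k+/=1/30/30/=x/t1/2t/3
```

Because the mapping is purely mechanical, any tool that knows an object's ARK can find it on disk without consulting a database.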
While developing the prototype ingest & management application, we’ve tried to track and avoid common issues that arise with ingest. I’ve written more elsewhere about the ingest lessons to be learned.
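Since BagIt bags carry checksums for every payload file, a routine fixity check amounts to recomputing those checksums and comparing them to the bag's manifest. The following is a rough sketch of that idea using only the Python standard library. It is not the CAPS implementation: the function names are hypothetical, SHA-256 is an assumed algorithm, and a real deployment would more likely rely on an existing BagIt library.

```python
import hashlib
from pathlib import Path


def manifest_lines(bag_dir, algorithm="sha256"):
    """Return 'checksum  path' lines for every file under the bag's data/ directory."""
    bag = Path(bag_dir)
    lines = []
    for path in sorted(p for p in (bag / "data").rglob("*") if p.is_file()):
        digest = hashlib.new(algorithm, path.read_bytes()).hexdigest()
        # Manifest paths are relative to the bag and use forward slashes.
        lines.append(f"{digest}  {path.relative_to(bag).as_posix()}")
    return lines


def failed_fixity(bag_dir, lines, algorithm="sha256"):
    """Return payload paths that are missing or whose checksum no longer matches."""
    bag = Path(bag_dir)
    failures = []
    for line in lines:
        expected, rel_path = line.split(None, 1)
        target = bag / rel_path
        if not target.is_file():
            failures.append(rel_path)
            continue
        actual = hashlib.new(algorithm, target.read_bytes()).hexdigest()
        if actual != expected:
            failures.append(rel_path)
    return failures
```

An audit job could write the output of `manifest_lines` to the bag's manifest file at ingest and call `failed_fixity` on a schedule, reporting any paths it returns to a curator.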
Next Steps
The prototype phase of development is winding down, and the project looks like it will be a success, with buy-in from curators within the University Libraries. What’s next? A number of next-phase projects have been discussed:
  • Electronic records ingest & management
  • Electronic thesis & dissertation workflow
  • Public browse/exhibit/search application
  • Back-end scalability & performance testing
We have also gathered a laundry list of features we’re excited to implement but that we either haven’t yet had time for or that were explicitly out of scope for the prototype:
  • Retention periods
  • Event logging service
  • Notification service
  • Replication
  • Routine audits
  • Object versioning/difference views
  • Exposure of objects and metadata via the linked data pattern
  • “Frameworkizing” the curation services
  • Format migration tools 
  • Digital object & collection usage statistics and reporting for curators
  • Controlled vocabulary/ontology management
  • Rights, access control, and authorization for digital objects
  • Publication of digital objects from ingest applications to display/exhibit applications (including legacy applications)
Transparency
Coding and testing of CAPS have been happening in the open. You can find the code and unit tests on GitHub.