Author Archives: Patricia Hswe

A Tale of Redaction

By Mahir Akgun, Digital Scholarship Graduate Assistant, and Patricia Hswe, ScholarSphere Service Manager 


What do you do when your self-deposit institutional repository has personally identifiable information, such as signatures, on content that has been uploaded and made publicly available? This is the situation that the ScholarSphere service team found itself in when we discovered student theses containing cover sheets with the signatures of thesis committee members. Actual personal signatures in cursive. What follows is a narrative of what we did, what we learned, and what we plan to do, to help prevent recurrences in the future.

What is redaction, and how should we do it?

After we encountered the signatures on the theses, our lead Hydra developer immediately made all the files private, putting them in a Box folder accessible to service team members. The ScholarSphere service manager and the digital scholarship services graduate assistant, in Publishing and Curation Services, began investigating methods for redacting the theses.

Paper redaction by Alex Wellerstein via Flickr CC BY 2.0


Redaction is the process of permanently removing content (text or graphic) from a document. If a document has sensitive content (e.g., SSNs, signatures, or even residential addresses), that content must be redacted before the document is shared publicly. We consulted colleagues at other institutions who have addressed similar situations, and we explored and tested various methods for redaction. (Thank you, Cathy Flynn-Purvis, Lisa Johnston, and Steve Van Tuyl, for your suggestions!)

We tried out the following:

  1. Manual redaction – This is a low-tech, traditional way to redact electronic files that requires using a heavy-duty, thick, black marker. The process is relatively easy but time-consuming and has only minor risks. The method takes five steps:
    • Print out the page of the document containing the signature.
    • Using a heavy-duty, thick, black permanent Magic Marker (and a ruler), redact or cover all of the signature.
    • Photocopy the page to make sure no one can read through the redacted area from either back or front.
    • Scan the redacted page back into the system.
    • Use Adobe Acrobat Pro (or other PDF software) to replace the signature page, and upload the new version.
  2. Redaction using a rectangle tool to draw a filled black box – AVOID!
    • Adding an image layer over the sensitive content, or using a rectangle tool to draw a filled black box over it, is sometimes used to redact electronic files. This approach is not foolproof, however: when sensitive information is blacked out with an image layer or rectangular box, the information still exists in the document, which means a reader can access it easily by removing the layer or box. In addition, if someone searches the document, the sensitive information will be discoverable in the search.
  3. Redaction using the “Redact toolset” in Adobe Acrobat Pro
    • The Redact toolset is designed for redacting PDF documents, allowing selected text to be removed or blacked out. Using it is a good alternative to redacting documents manually, and it is an easy, time-saving method. For detailed instructions on how to use the toolset, see Adobe’s official web site.

As explained above, the second method is not appropriate, since it is subject to serious flaws. We tried both the first and last methods several times to see whether we got consistent results. The manual method provided consistent results each time. The Redact toolset worked very well most of the time, but it failed in two instances, and we were not able to determine why. It could have been a user error, or a problem with one of the tools in the toolset. Since we were not 100% sure about the third method, we decided to go with the first one. Nonetheless, Adobe Acrobat Pro (or other PDF manipulation software) is still required to redact electronic files, even with the manual approach.
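To make the flaw in the second method concrete, here is a minimal Python sketch. It is a toy model only, not how Acrobat or any PDF library works: `Page`, `cover_with_box`, and `redact` are hypothetical names invented for illustration, and the signer's name is made up. It shows that drawing a box on top leaves the underlying text searchable, while true redaction removes the content itself.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Page:
    """Toy stand-in for a document page (hypothetical model, not a real PDF)."""
    text_objects: list[str] = field(default_factory=list)  # the stored text
    overlays: list[str] = field(default_factory=list)      # shapes drawn on top

    def searchable_text(self) -> str:
        # Search and copy/paste read the stored text; overlays do not change it.
        return " ".join(self.text_objects)


def cover_with_box(page: Page, secret: str) -> Page:
    """Method 2: draw a black box over the secret; the text stays in the page."""
    return Page(list(page.text_objects), page.overlays + [f"black box over {secret!r}"])


def redact(page: Page, secret: str) -> Page:
    """True redaction: remove the content itself and leave only a marker."""
    cleaned = [t.replace(secret, "[REDACTED]") for t in page.text_objects]
    return Page(cleaned, list(page.overlays))


page = Page(text_objects=["Approved by", "Jane Q. Advisor"])  # made-up name
boxed = cover_with_box(page, "Jane Q. Advisor")
redacted = redact(page, "Jane Q. Advisor")

print("Jane Q. Advisor" in boxed.searchable_text())     # True: still discoverable
print("Jane Q. Advisor" in redacted.searchable_text())  # False: content is gone
```

The same kind of check is worth running on any redacted file before it goes public again: search the result for the content that was supposed to be removed.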

After the theses were redacted, our lead Hydra developer submitted the redacted files and made them open access again. We deleted the originally deposited files; most of them were scanned from print originals still held at Penn State, and some of them were deposited by the authors of the theses (and they have their original copies).

Lessons Learned

  1. We need a defined policy and process for redacting personally identifiable information, including in theses. The above experience (and our documentation of it) puts us well on our way to completing this.
  2. We need explicit wording in our Content Policy to help deter users from depositing this kind of sensitive content. The Policy has been updated to caution users against uploading items with personal signatures on them.
  3. We need to create a new administrative role in ScholarSphere – the “super user.” This feature is on our Hydra development roadmap for 2016. With a super user role, the service manager would be able to make sensitive files private, before implementing the redaction process, and “re-deposit” the redacted theses without relying more than necessary on developer time and support.
  4. We need to implement user guidance on the upload page in ScholarSphere regarding deposit of sensitive content. This approach would be an additional way to deter users from uploading content that we would have to redact later. Before the upload process begins, for example, there could be a pop-up that asks, “Does this item contain sensitive content, like personal signatures or SSNs?”
  5. We may choose to implement email notification that alerts the service manager when new content is deposited into ScholarSphere. Such notification would keep the service team apprised of the type of content uploaded to the repository and alert members to any need to kick off the redaction process if sensitive content has been deposited.
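
As a rough illustration of an automated check that could complement the guidance and notifications in items 4 and 5, the Python sketch below flags SSN-like patterns in text that has already been extracted from a deposited file. This is only an assumption about how such a check might look, not an existing ScholarSphere feature: `find_ssn_like` is a hypothetical helper, and a text pattern cannot detect handwritten signatures in scanned images.

```python
import re

# Three digits, two digits, four digits, separated by hyphens (e.g., 123-45-6789).
SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssn_like(text: str) -> list[str]:
    """Return every SSN-shaped string found in the given text."""
    return SSN_LIKE.findall(text)


matches = find_ssn_like("Student record, SSN: 123-45-6789, filed in 2015.")
if matches:
    print(f"Possible sensitive content, review before publishing: {matches}")
```

A match is only a prompt for human review; the pattern will miss SSNs written without hyphens and can flag unrelated numbers that happen to share the format.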

Also, the National Institute of Standards and Technology released its “Guide to Protecting the Confidentiality of Personally Identifiable Information (PII)” (PDF) in 2010, and five years later it’s still excellent guidance.

Thanks to Karen Estlund, Associate Dean for Technology and Digital Strategies, for her advice and insight on this blog post.

ScholarSphere “Office Hours”


office desk by Sean MacEntee
via Flickr CC BY 2.0

More ScholarSphere office hours are coming up, with a chance for the campuses to tune in as well! Brandy Karl, Copyright Office in the Libraries, will also be on hand to field questions.

Dates and times for virtual and in-person office hours are as follows (UP location is Paterno 126A):

  • Tuesday, March 10: 3-4 PM
  • Wednesday, March 11: 12-1 PM
  • Thursday, March 12: 1-2 PM

Simply go to (no need to register), and we’ll kick off with a brief overview of the service, followed by Q&A. Note: Any UP colleagues may join me in Paterno 126A and ask questions in person, in addition to viewing the PowerPoint.

What’s the advantage? With the passage of the Open Access Policy on February 11, 2015, in the Libraries, these office hours are an excellent opportunity to brush up on ScholarSphere and ask questions about uploading content, creating collections, sharing permissions, transferring ownership, proxy deposit, and more! Use this time, too, to tell us what features you’d like to see!

ScholarSphere has a sandbox to “play” in. Never deposited before to ScholarSphere and wary of using the “real” site to do it? We have a demo, or sandbox, environment you can use to try out ScholarSphere. Come to the office hours to learn about “ScholarSphere Demo”!

Code Sprinting for Fedora 4

Dan Coughlin, Director of Software Development in ITS, blogs about Penn State migrating ScholarSphere to Fedora 4 – a leading contribution for the Hydra software community.

As we closed out January, the ScholarSphere development team was hard at work on a two-week code sprint. What’s a code sprint? All developers clear their calendars and work collectively on a common project or, in this case, a specific feature within a project, to create a big push towards completion. Usually a sprint lasts for a week or two, during which developers share a conference room, chocolate, coffee, and goldfish.

Photo of cheese-flavored goldfish crackers inside their package.

Goldfish by Cindy Stuntz at Flickr, CC BY 2.0

The goal for this sprint was to integrate the latest and greatest version of Fedora (Fedora 4) into our ScholarSphere testing environment. We want to make sure it works on our testing environment before releasing it on the live site to help minimize inconveniences to you, the user. ScholarSphere currently runs on Fedora 3; Fedora 3 handles most of the preservation functionality for our files (checksums, storage of technical metadata, versioning, etc.).
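
For readers unfamiliar with checksums, the idea can be sketched in a few lines of Python. This is not Fedora's API, just a generic illustration using the standard library: `sha256_of` and `is_intact` are hypothetical helper names, and the choice of SHA-256 is an assumption. A checksum is recorded when a file is deposited, and recomputing it later reveals whether the file has changed.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute a file's SHA-256 checksum, reading in chunks to handle large files."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: Path, recorded_checksum: str) -> bool:
    """Return True if the file still matches the checksum recorded at deposit."""
    return sha256_of(path) == recorded_checksum
```

Reading in chunks keeps memory use flat, which matters for the multi-gigabyte files discussed below.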

Penn State is likely to have the first Hydra application running Fedora 4 later this spring. Great! What does this mean to you? The two biggest areas of focus for improvement on Fedora 4 are file size and speed. ScholarSphere will be able to handle larger files and process those files more quickly than before. Unfortunately, the web isn’t the greatest method for uploading files of multiple gigabytes, so we will be exploring other ways (besides via the ScholarSphere web page) for our users to deposit large files into the service. In fact, if you have large files (1-10 GB) we would love to hear from you and discuss your ideas about how to deposit these files in a way that best meets your needs. Thanks for reading, and we look forward to hearing from you!