By Mahir Akgun, Digital Scholarship Graduate Assistant, and Patricia Hswe, ScholarSphere Service Manager
What do you do when your self-deposit institutional repository has personally identifiable information, such as signatures, on content that has been uploaded and made publicly available? This is the situation that the ScholarSphere service team found itself in when we discovered student theses containing cover sheets with the signatures of thesis committee members. Actual personal signatures in cursive. What follows is a narrative of what we did, what we learned, and what we plan to do, to help prevent recurrences in the future.
What is redaction, and how should we do it?
After we encountered the signatures on the theses, our lead Hydra developer immediately made all the files private, putting them in a Box folder accessible to service team members. The ScholarSphere service manager and the digital scholarship services graduate assistant, in Publishing and Curation Services, began investigating methods for redacting the theses.
Redaction is the process of permanently removing content (text or graphics) from a document. If a document has sensitive content (e.g., SSNs, signatures, or even residential addresses), that content must be redacted before the document is shared publicly. We consulted colleagues at other institutions who have addressed similar situations, and we explored and tested various methods for redaction. (Thank you, Cathy Flynn-Purvis, Lisa Johnston, and Steve Van Tuyl, for your suggestions!)
We tried out the following:
- Manual redaction – This is a low-tech, traditional way to redact electronic files that requires using a heavy-duty, thick, black marker. The process is relatively easy but time-consuming and has only minor risks. The method takes five steps:
- Print out the page of the document containing the signature.
- Using a heavy-duty, thick, black permanent Magic Marker (and a ruler), redact or cover all of the signature.
- Photocopy the page to make sure no one can read through the redacted area from either back or front.
- Scan the redacted page back into the system.
- Use Adobe Acrobat Pro (or other PDF software) to replace the signature page, and upload the new version.
- Redaction using a rectangle tool to draw a filled black box – AVOID!
- Adding an image layer over the sensitive content, or using a rectangle tool to draw a filled black box over it, is sometimes used to redact electronic files. This approach is not foolproof, however: the blacked-out information still exists in the document, so a reader can recover it easily by removing the image layer or box. In addition, if the covered content is text, it will still be discoverable when someone searches the document.
- Redaction using the “Redact toolset” in Adobe Acrobat Pro
- The Redact toolset removes or blacks out selected text in PDF documents. Using it in Adobe Acrobat Pro is a good alternative to redacting documents manually: it is very easy and saves time. For detailed instructions on how to use the toolset, see Adobe’s official web site: http://help.adobe.com/en_US/acrobat/X/pro/using/WS4E397D8A-B438-4b93-BB5F-E3161811C9C0.w.html.
As explained above, the second method is not appropriate, since it is subject to serious flaws. We tried both the first and last methods several times to see whether we got consistent results each time. The manual method did. The Redact toolset worked very well most of the time, but it failed in two instances, and we were not able to determine why. It could have been user error, or a problem with one of the tools in the toolset. Since we were not 100% sure about the third method, we decided to go with the first one. Nonetheless, Adobe Acrobat Pro (or other PDF manipulation software) is still required to redact electronic files, even with the manual approach.
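To illustrate why covering content differs from removing it, here is a minimal Python sketch. It is a toy model, not a real PDF: the `Page` class, its text layer, and the `cover_with_box`, `redact`, and `searchable` helpers are all hypothetical names invented for this illustration, and the committee member's name is made up.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Toy stand-in for a PDF page: a text layer plus anything drawn on top."""
    text: str
    overlays: list = field(default_factory=list)


def cover_with_box(page: Page, secret: str) -> Page:
    """Draw a black box over the secret. The text layer is left untouched."""
    return Page(page.text, page.overlays + [("black_box", secret)])


def redact(page: Page, secret: str) -> Page:
    """Remove the secret from the text layer itself."""
    return Page(page.text.replace(secret, "[REDACTED]"), list(page.overlays))


def searchable(page: Page, term: str) -> bool:
    """Search reads the text layer and ignores whatever is drawn over it."""
    return term in page.text


page = Page("Approved by: Jane Q. Example, Committee Chair")
covered = cover_with_box(page, "Jane Q. Example")
redacted = redact(page, "Jane Q. Example")

print(searchable(covered, "Jane Q. Example"))   # True: the box hides nothing from search
print(searchable(redacted, "Jane Q. Example"))  # False: the content is gone
```

In this model the covered page looks redacted but still carries the name, which is the same weakness described for the black-box method above.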
After the theses were redacted, our lead Hydra developer submitted the redacted files and made them open access again. We deleted the originally deposited files; most of them were scanned from print originals still held at Penn State, and some of them were deposited by the authors of the theses (and they have their original copies).
- We need a defined policy and process for redacting personally identifiable information, including in theses. The above experience (and our documentation of it) puts us well on our way to completing this.
- We need explicit wording in our Content Policy to help deter users from depositing this kind of sensitive content. The Policy has been updated to caution users against uploading items with personal signatures on them.
- We need to create a new administrative role in ScholarSphere – the “super user.” This feature is on our Hydra development roadmap for 2016. With a super user role, the service manager would be able to make sensitive files private, before implementing the redaction process, and “re-deposit” the redacted theses without relying more than necessary on developer time and support.
- We need to implement user guidance on the upload page in ScholarSphere regarding deposit of sensitive content. This approach would be an additional way to deter users from uploading content that we would have to redact later. Before the upload process begins, for example, there could be a pop-up that asks, “Does this item contain sensitive content, like personal signatures or SSNs?”
- We may choose to implement email notification for the service manager whenever new content is deposited into ScholarSphere. Such notification would keep the service team regularly apprised of the type of content uploaded to the repository and alert members to any need to kick off the redaction process if sensitive content has been deposited.
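As one possible complement to the upload guidance and notifications above, a deposit could be screened for text that looks like an SSN. The Python sketch below is only an illustration under stated assumptions: it is not part of ScholarSphere, the function names are hypothetical, the text is assumed to have been extracted from the file already, and the pattern only matches SSNs written as three, two, and four digits separated by hyphens or spaces. It would not catch handwritten signatures, which are images rather than text.

```python
import re

# Assumed format: 123-45-6789 or 123 45 6789. Other layouts are not matched.
SSN_PATTERN = re.compile(r"\b\d{3}[- ]\d{2}[- ]\d{4}\b")


def find_ssn_like(text: str) -> list:
    """Return every substring of the text that looks like an SSN."""
    return SSN_PATTERN.findall(text)


def needs_review(text: str) -> bool:
    """Flag a deposit for human review if anything SSN-like is found."""
    return bool(find_ssn_like(text))


sample = "Applicant SSN: 123-45-6789. Office phone: 814-555-0100."
print(find_ssn_like(sample))  # ['123-45-6789']
print(needs_review("No identifiers in this abstract."))  # False
```

A match would only be a prompt for a person to look at the file; the decision to redact would still rest with the service team.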
Also, the National Institute of Standards and Technology released its “Guide to Protecting the Confidentiality of Personally Identifiable Information (PII)” (PDF) in 2010, and five years later it’s still excellent guidance.
Thanks to Karen Estlund, Associate Dean for Technology and Digital Strategies, for her advice and insight on this blog post.