The Art Of Naming Things (Part 2 of 2)

This series of posts was written by Rob Olendorf, Prystowsky Early Career Science Librarian, Physical and Mathematical Sciences Library.

In the first part of this post, we described some pretty good basic rules for naming things and applied them to naming variables in data, methods in software and analysis codes, and other internally facing things. However, our files and directories require just as much care, especially considering a modern research project can easily generate hundreds of thousands or even millions of files. Even with tens or hundreds of files, managing the complexity among collaborators can easily become overwhelming. A good plan for naming and organizing things from the start is critical.

Just as before, making your names expressive and concise is important. For instance, mt.csv isn’t quite as expressive as mars_temperatures.csv. Conciseness is important too. Most operating systems limit file names to 255 characters, which is still far longer than you probably want them to be. A good rule of thumb is 64 characters; however, when in doubt, go over a little for clarity.

Although we often don’t think of it, the path is also part of a file’s name. We can leverage this to make even better names. Expanding on the example above, if we were measuring temperatures on several planets and keeping separate CSV files for each day of measurements, it might make sense to go with the scheme below.

  • planets/mars/temperatures_2015january4.csv
  • planets/mars/temperatures_2015january4.csv
  • planets/saturn/temperatures_2015january5.csv

The scheme itself documents the data, and it makes sense in the context of our data. There are additional things to note. First, I prefer using month names rather than numbers. I have spent too much time cleaning up data where people mixed the two up when they are just numbers. Second, I always go from general to specific in file names. In this case, planets/mars. Third, we can do the same things for dates, as shown below.

  • temperatures/planets/2015/january/4/mars.csv

I don’t feel it’s better one way or the other. Just pick a naming scheme that fits your project, document it in your README file, ensure your collaborators are all in sync, and follow it.
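If you build file names in code, it helps to generate them from one small function so that every collaborator produces the same scheme. Below is a minimal sketch of the first scheme above, written in Python. The function name `temperature_file_path` and the `MONTH_NAMES` tuple are my own illustrative choices, not part of any standard.

```python
from datetime import date
from pathlib import PurePosixPath

# Month names are spelled out explicitly so the result does not depend
# on the locale settings of the machine running the code.
MONTH_NAMES = (
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
)


def temperature_file_path(planet: str, measured_on: date) -> PurePosixPath:
    """Build a relative path that goes from general (planets) to specific (one day)."""
    month_name = MONTH_NAMES[measured_on.month - 1]
    file_name = f"temperatures_{measured_on.year}{month_name}{measured_on.day}.csv"
    return PurePosixPath("planets") / planet.lower() / file_name


print(temperature_file_path("mars", date(2015, 1, 4)))
# planets/mars/temperatures_2015january4.csv
```

Switching to the date-first scheme would only require changing this one function, which is another reason to keep the naming rule in a single place.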

Often, files change over time, requiring you to track which version a file is and who changed it. Such changes can result in file names like the one below.

  • planets/mars/temperatures_2015january4_2_ro.csv

This approach almost always fails: It’s too easy for multiple people to use version 2; we don’t really know who RO is; and if there are a number of files, we don’t know if version 2 of one file is supposed to work with version 3 or 4 of another. Also, you are relying on people carefully versioning copies, which is rare. It will almost always result in a mess. Just don’t version this way. Use versioning software like Git.

Many research projects also result in a large number of automatically named files. Often, “badly” named files are really just examples of automatically named files. You might be tempted to rename them, and there is software that can do this well. However, it may not be a good idea. Often software that automatically names files uses conventions that allow it to open or process that data. If you rename the files, you might break the system. For instance, these file names –

  • planets/mars/temperatures_1292869231
  • planets/mars/temperatures_1292869256

– are actually read by some software to understand that the data about foo was taken on Monday, 20 December 2010 18:20:31 +0000 and Monday, 20 December 2010 18:20:56 +0000 respectively, and the software uses those dates to merge the data correctly. The number in this case is Unix (or Epoch) time, which is the number of seconds since 1 January 1970 (UTC). Almost all programming languages understand this standard.
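To see how a program might read those names, here is a small Python sketch that pulls the number off the end of a file name and converts it to a date. The function name `timestamp_from_file_name` is hypothetical, and I am assuming the number always follows the last underscore, as it does in the two examples above.

```python
from datetime import datetime, timezone


def timestamp_from_file_name(file_name: str) -> datetime:
    """Interpret the digits after the last underscore as Unix time (UTC)."""
    epoch_seconds = int(file_name.rsplit("_", 1)[1])
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


measured_at = timestamp_from_file_name("planets/mars/temperatures_1292869231")
print(measured_at.isoformat())
# 2010-12-20T18:20:31+00:00
```

Renaming such a file to something friendlier would make this conversion fail, which is exactly the kind of breakage to watch out for.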

Finally, when dealing with file names and paths, always use relative paths. We might be tempted to use an absolute path like the example below.

  • C:/Users/random_citizen/projects/astro_temps/planets/mars/temperatures_2015january4.csv

However, this path makes sense only within a Windows environment. If someone were to try to use your project on a Mac, it would not work. Cross-platform functionality is never easy, but relative paths such as those I have used elsewhere are preferred.
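In analysis code, one way to stay portable is to store only the relative part of a path and join it to wherever the project happens to live on the current machine. The Python sketch below assumes that approach. The names `MARS_TEMPERATURES` and `locate_in_project` are made up for illustration.

```python
from pathlib import Path

# Only the relative part is written into the code.
MARS_TEMPERATURES = Path("planets") / "mars" / "temperatures_2015january4.csv"


def locate_in_project(project_root: Path, relative_path: Path) -> Path:
    """Join a relative path to the project directory on the current machine."""
    if relative_path.is_absolute():
        raise ValueError("expected a path relative to the project directory")
    return project_root / relative_path


# Each collaborator supplies their own project location; the rest is shared.
print(locate_in_project(Path.cwd(), MARS_TEMPERATURES))
```

Because `pathlib` picks the right separator for the operating system, the same code should work on Windows, Mac, and Linux.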

Naming things, directory structures, relative paths: they may not seem important at first. However, most people who work a lot with computer systems learn with experience how important they are. Naming things consistently and well will do several things for you. It will decrease your time cleaning up messes. It will also make it easier for you and your collaborators to understand and remember what you have done. Clean, well-organized data inspires trust, and may also increase the chances that it will be used and cited by others. Remember, the primary goals are to make your data as organized and self-documenting as possible. Feel free to post suggestions and other ideas for naming as comments too! For such a seemingly simple thing, there are many ways of doing it well.


The Art of Naming Things (Part 1 of 2)

This series of posts was written by Rob Olendorf, Prystowsky Early Career Science Librarian, Physical and Mathematical Sciences Library.

There are only two hard things in Computer Science: cache invalidation and naming things. — Phil Karlton

I’ve always liked that saying, although I think for non-programmers, it probably doesn’t have the power it should have. Like everyone else, I’ve spent a lot of time naming things, often with little thought. Over time I’ve learned to give it more and more thought. Why would programmers care so much about naming things? The truth is they haven’t always, but as the field has progressed, and software development has become more collaborative, more distributed and more open, many hard lessons have been learned. Perhaps the most obvious lesson is that, like everyone else, programmers forget things. Second, with any project of any size, there will be multiple people involved, all with slightly different information, all forgetting things. Then with personnel turnover, there is even more loss. The third lesson is that knowing what a piece of code (or data) is supposed to do is not always easy, especially if you weren’t originally involved with the project.


Hypothetical loss of data over time. (Michener et al. 1997, with this original caption: “Fig. 1. Example of the normal degradation in information content associated with data and metadata over time (“information entropy”). Accidents or changes in storage technology (dashed line) may eliminate access to remaining raw data and metadata at any time.”)

Research data shares a lot with software, especially if you take research data in the largest sense, which includes the analysis, visualization codes, processing software, settings, machine output and everything else used to derive conclusions from the data. As the figure depicts, loss starts almost immediately with memory loss. Loss continues as the creators move on with their lives and, sadly for all of us, die. Also, for most datasets, without significant input from the creators, the data would be difficult for anyone outside of the project to understand. The trick, then, is to come up with easy, and easy to learn, ways to bake meaning into your data. Software developers have actually done a lot of the work for us, and at the very base of this practice is the art of naming things. Naming is the core of doing research that is organized, open and reproducible.

Let’s consider a very simple example consisting of tabular data with one column (based on a real example). There is a list of numbers. Obviously the list of numbers meant something to the researchers, and they may have even published from it. However, for anyone else, it is meaningless and useless. It also doesn’t inspire a great amount of trust in the resulting publication. Obviously, we’re missing a column header, so let’s say it’s temp. Great! At least we didn’t use “x”. We now know it’s temperature… or do we? It could be “temporary”; probably not, but it could be something about temporary storage on a computer system. It’s good to be concise, but favor full words when in doubt.

  • Be expressive but concise.
  • Favor complete words over abbreviations or acronyms.
  • Name carefully; it can be hard to switch names later.

We rename our header to temperature and it’s much better. It can be improved, though. Is this parameter measured in Fahrenheit, Celsius or Kelvin? One solution, and not a bad one, is to create a data dictionary. A data dictionary is a separate document where you record additional information about the columns in a table. Data dictionaries are a best practice, but they suffer from one BIG problem – rot. This happens when you change a variable name, or something about it, but fail to update your data dictionary. This is a ubiquitous problem for all metadata and documentation. For this reason, I like to keep some information right in my data or code. I would prefer temperature_celsius in this case. Now the critical information is embedded directly in the data. An outside user is spared the trouble of looking things up in the data dictionary. I prefer to use the data dictionary for more difficult and usually less immediately critical information, such as the instrument used to measure the temperature.

  • Create a data dictionary.
  • Beware of rot. Keep critical information close to the data.
  • Document as you go. Include time to update documentation as needed.
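One way to catch rot early is to compare your column headers against your data dictionary automatically. The Python sketch below assumes the data dictionary can be loaded as a simple mapping from column name to description. The function name, the `pressure_pascals` column and the dictionary entries are all invented for the example.

```python
import csv
import io


def compare_header_to_dictionary(csv_text, data_dictionary):
    """Return (columns missing from the dictionary, dictionary entries with no column)."""
    header = next(csv.reader(io.StringIO(csv_text)))
    undocumented = [column for column in header if column not in data_dictionary]
    stale = [name for name in data_dictionary if name not in header]
    return undocumented, stale


# Hypothetical data: the column was renamed from "temp", but the
# dictionary was never updated, and a new column was never documented.
example_csv = "temperature_celsius,pressure_pascals\n-62.0,610.0\n"
example_dictionary = {
    "temperature_celsius": "Surface temperature in degrees Celsius",
    "temp": "Old name for the temperature column",
}

undocumented, stale = compare_header_to_dictionary(example_csv, example_dictionary)
print(undocumented)  # ['pressure_pascals']
print(stale)         # ['temp']
```

Running a check like this whenever the data changes is a cheap way to build the "document as you go" habit into your workflow.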

Naming is just as important in every other part of your project as well. If you are writing analysis code, you should follow the same basic principles when naming variables in your code. If you are tempted to use x or i, just don’t. It’s common practice to use these variables in loops, but I prefer index generically, or, if I am looping over a collection of items, say squids, I would prefer the index to just be squid. It’s an art: read your code (or data) and do what seems the most informative. When naming functions or methods, describe what the function does, and don’t be afraid to use enough words. For instance, if you need a special function to compute the mean temperature on Mars, call it compute_mars_mean_temperature or mars_mean_temperature.
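As a rough illustration of these naming habits, here is how such a function might look in Python. The temperature values are made up, and the loop is written out by hand only to show the loop variable name.

```python
def compute_mars_mean_temperature(mars_temperatures_celsius):
    """Return the arithmetic mean of a non-empty list of temperatures."""
    temperature_total = 0.0
    # The loop variable is named for the thing it holds, not "x" or "i".
    for temperature in mars_temperatures_celsius:
        temperature_total += temperature
    return temperature_total / len(mars_temperatures_celsius)


# Made-up measurements, in degrees Celsius.
print(compute_mars_mean_temperature([-60.0, -62.0, -64.0]))
# -62.0
```

Even without comments, a reader can tell what goes in, what comes out, and in which units.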

One final consideration is consistency. Software developers can get very nitpicky, often for good reason. For instance, within programming languages, there are usually best practices for when to use singular or plural. I suggest using plural and singular consistently. Avoid spaces – they will work well, until they don’t; then you’re stuck with renaming things. Instead of spaces, use snake_case or CamelCase. It often doesn’t matter what you choose, but by being consistent, you make things much more readable and organized. Before you start a project, get together with anyone who will be involved with the data and write down your style guide. Decide if you will use snake_case or CamelCase, and where they should be used. Try to make as many decisions as you can before you do anything. Include your style guide in your README file and keep that README file in the top level of your project directory. Anytime you modify or add to your rules, make a note in your README. The README is also a great place for general information about your project, an abstract, how to cite your data, how to rerun your analyses, etc.

  • Be consistent.
  • Make a README file.
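If your group settles on snake_case, a small helper can bring stray names into line with the style guide. This Python sketch is one possible approach, not a standard tool. The function name `to_snake_case` is my own, and it deliberately handles only simple cases such as spaces, punctuation and basic CamelCase.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert names with spaces, punctuation or simple CamelCase to snake_case."""
    # Mark CamelCase boundaries: a lowercase letter or digit followed by a capital.
    with_boundaries = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Replace any run of characters that are not letters or digits with one underscore.
    underscored = re.sub(r"[^A-Za-z0-9]+", "_", with_boundaries)
    return underscored.strip("_").lower()


print(to_snake_case("Mars Temperatures"))  # mars_temperatures
print(to_snake_case("MarsTemperatures"))   # mars_temperatures
```

Whatever convention you pick, write it down in the README so that everyone applies the same rule.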

Following these practices, you should be able to create a decent data file or analysis script. In my mind I always try to make things read like English as much as possible (assuming English is your language of choice). I have found that taking some time thinking about my names saves me time in the long run. Also, in collaborative situations, time spent planning this with your collaborators is time not spent in the future trying to figure out what you as a group did three years ago at the beginning of the project.

So far, I’ve discussed how to name single things. However, rarely will a project consist of a single file. It’s not unusual for even a modest project to generate 100,000+ files. In part 2 I’ll describe how we can expand these principles to allow us to make even very large projects transparent and understandable.

If You’re Curious

One book that really inspired me is Clean Code (Martin, 2008). Although written for programmers, with most examples in Java, the foundational concepts, especially in the first few chapters apply to a wide variety of situations.

Cited Works

Martin, Robert C., and Michael C. Feathers. Clean Code: A Handbook of Agile Software Craftsmanship. Upper Saddle River, N.J: Prentice Hall, 2009.

Michener, W.K., Brunt, J.W., Helly, J.J., Kirchner, T.B. and Stafford, S.G. 1997. “Nongeospatial Metadata for the Ecological Sciences,” Ecological Applications 7: 330-342.

A Tale of Redaction

By Mahir Akgun, Digital Scholarship Graduate Assistant, and Patricia Hswe, ScholarSphere Service Manager 


What do you do when your self-deposit institutional repository has personally identifiable information, such as signatures, on content that has been uploaded and made publicly available? This is the situation that the ScholarSphere service team found itself in when we discovered student theses containing cover sheets with the signatures of thesis committee members. Actual personal signatures in cursive. What follows is a narrative of what we did, what we learned, and what we plan to do, to help prevent recurrences in the future.

What is redaction, and how should we do it?

After we encountered the signatures on the theses, our lead Hydra developer immediately made all the files private, putting them in a Box folder accessible to service team members. The ScholarSphere service manager and the digital scholarship services graduate assistant, in Publishing and Curation Services, began investigating methods for redacting the theses.

Paper redaction by Alex Wellerstein via Flickr CC BY 2.0

Redaction is the process of permanently removing content (text or graphic) from a document. If a document has sensitive content (e.g., SSNs, signatures, even residential addresses, etc.), redacting such content is necessary before sharing it publicly. We consulted various colleagues at other institutions who have addressed similar situations and explored and tested various methods for redaction. (Thank you, Cathy Flynn-Purvis, Lisa Johnston, and Steve Van Tuyl, for your suggestions!).

We tried out the following:

  1. Manual redaction – This is a low-tech, traditional way to redact electronic files that requires using a heavy-duty, thick, black marker. The process is relatively easy but time-consuming and has only minor risks. The method takes five steps:
    • Print out the page of the document containing the signature.
    • Using a heavy-duty, thick, black permanent Magic Marker (and a ruler), redact or cover all of the signature.
    • Photocopy the page to make sure no one can read through the redacted area from either back or front.
    • Scan the redacted page back into the system.
    • Use Adobe Acrobat Pro (or other PDF software) to replace the signature page, and upload the new version.
  2. Redaction using a rectangle tool to draw a filled black box – AVOID!
    • Adding an image layer over the sensitive content, or using a rectangle tool to draw a filled black box over it, is sometimes used to redact electronic files. This approach is not foolproof, however: When sensitive information is blacked out with an image layer or rectangular box, the redacted information still exists in the document, which means a reader of the document can access the information easily by removing the image layer. In addition, if someone searches the document, sensitive text will still be discoverable in the search.
  3. Redaction using the “Redact toolset” in Adobe Acrobat Pro
    • The Redact toolset is used for redacting PDF documents, allowing removal or blacking out of selected text in PDF documents. Using the Redact toolset in Adobe Acrobat Pro is a good alternative to redacting documents manually. It is a very easy and time-saving method. For detailed instructions on how to use the toolset, see Adobe’s official web site.

As explained above, the second method is not appropriate, since it is subject to serious flaws. We tried both the first and last methods several times to see whether we got consistent results each time we tried them. The manual method provided consistent results each time. The Redact toolset worked very well most of the time we tried it, but it failed in two instances, and we were not able to determine why it failed. It could have been a user error, or a problem with one of the tools in the toolset. Since we were not 100% sure about the third method, we decided to go with the first one. Nonetheless, Adobe Acrobat Pro (or other PDF manipulation software) is still required for redacting electronic files, even in the manual approach.

After the theses were redacted, our lead Hydra developer submitted the redacted files and made them open access again. We deleted the originally deposited files; most of them were scanned from print originals still held at Penn State, and some of them were deposited by the authors of the theses (and they have their original copies).

Lessons Learned

  1. We need a defined policy and process for redacting personally identifiable information, including in theses. The above experience (and our documentation of it) puts us well on our way to completing this.
  2. We need explicit wording in our Content Policy to help deter users from depositing this kind of sensitive content. The Policy has been updated to caution users against uploading items with personal signatures on them.
  3. We need to create a new administrative role in ScholarSphere – the “super user.” This feature is on our Hydra development roadmap for 2016. With a super user role, the service manager would be able to make sensitive files private, before implementing the redaction process, and “re-deposit” the redacted theses without relying more than necessary on developer time and support.
  4. We need to implement user guidance on the upload page in ScholarSphere regarding deposit of sensitive content. This approach would be an additional way to deter users from uploading content that we would have to redact later. Before the upload process begins, for example, there could be a pop-up that asks, “Does this item contain sensitive content, like personal signatures or SSNs?”
  5. We may choose to implement email notification for the service manager, alerting them when new content is deposited into ScholarSphere. Such notification would keep the service team regularly apprised of the type of content uploaded to the repository and alert members of any need to kick off the redaction process if sensitive content has been deposited.

Also, the National Institute of Standards and Technology released its “Guide to Protecting the Confidentiality of Personally Identifiable Information (PII)” (PDF) in 2010, and five years later it’s still excellent guidance.

Thanks to Karen Estlund, Associate Dean for Technology and Digital Strategies, for her advice and insight on this blog post.