This series of posts was written by Rob Olendorf, Prystowsky Early Career Science Librarian, Physical and Mathematical Sciences Library.
There are only two hard things in Computer Science: cache invalidation and naming things. — Phil Karlton
I’ve always liked that saying, although I think for non-programmers it probably doesn’t have the power it should have. Like everyone else, I’ve spent a lot of time naming things, often with little thought. Over time I’ve learned to give it more and more thought. Why would programmers care so much about naming things? The truth is they haven’t always, but as the field has progressed, and software development has become more collaborative, more distributed and more open, many hard lessons have been learned. Perhaps the most obvious lesson is that, like everyone else, programmers forget things. Second, with any project of any size, there will be multiple people involved, all with slightly different information, all forgetting things. Then, with personnel turnover, there is even more loss. The third lesson is that knowing what a piece of code (or data) is supposed to do is not always easy, especially if you weren’t originally involved with the project.
Research data shares a lot with software, especially if you take research data in the largest sense, which includes the analysis, visualization codes, processing software, settings, machine output and everything else used to derive conclusions from the data. As the figure depicts, loss starts almost immediately with memory loss. Loss continues as the creators move on with their lives and, sadly for all of us, die. Also, for most datasets, without significant input from the creators, the data would be difficult for anyone outside of the project to understand. The trick, then, is to come up with easy, and easy to learn, ways to bake meaning into your data. Software developers have actually done a lot of the work for us, and at the very base of this practice is the art of naming things. Naming is the core of doing research that is organized, open and reproducible.
Let’s consider a very simple example consisting of tabular data with one column (based on a real example). There is a list of numbers. Obviously the list of numbers meant something to the researchers, and they may have even published from it. However, for anyone else, it is meaningless and useless. It also doesn’t inspire a great amount of trust in the resulting publication. Obviously, we’re missing a column header, so let’s say it’s temp. Great! At least we didn’t use “x”. We now know it’s temperature… or do we? It could be “temporary”. Probably not, but it could be something about temporary storage on a computer system. It’s good to be concise, but favor full words when in doubt.
- Be expressive but concise.
- Favor complete words over abbreviations or acronyms.
- Name carefully; it can be hard to switch names later.
We rename our header to temperature and it’s much better. It can be improved, though. Is this parameter measured in Fahrenheit, Celsius or Kelvin? One solution, and not a bad one, is to create a data dictionary. A data dictionary is a separate document where you record additional information about the columns in a table. Data dictionaries are a best practice, but they suffer from one BIG problem: rot. This happens when you change a variable name, or something about it, but fail to update your data dictionary. This is a ubiquitous problem for all metadata and documentation. For this reason, I like to keep some information right in my data or code. I would prefer temperature_celsius in this case. Now the critical information is embedded directly in the data. An outside user is spared the trouble of looking things up in the data dictionary. I prefer to use the data dictionary for more difficult and usually less immediately critical information, such as the instrument used to measure the temperature.
- Create a data dictionary.
- Beware of rot. Keep critical information close to the data.
- Document as you go. Include time to update documentation as needed.
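One way to catch rot early is to compare the column names in a data file against the entries in its data dictionary. The following is only a minimal sketch in Python: the file contents, the dictionary entries and the function name find_dictionary_rot are all invented for illustration, and a real project would read both from files.

```python
import csv
import io

# Hypothetical data file contents; the values are invented for illustration.
data_csv = "temperature_celsius\n21.4\n22.0\n19.8\n"

# Hypothetical data dictionary: one entry per column, holding the details
# that do not need to live in the column name itself.
data_dictionary = {
    "temperature_celsius": {
        "description": "Air temperature at the sampling site",
        "instrument": "handheld digital thermometer",
    },
}


def find_dictionary_rot(csv_text, dictionary):
    """Return (undocumented_columns, stale_entries) as sorted lists.

    undocumented_columns: columns in the data with no dictionary entry.
    stale_entries: dictionary entries with no matching column in the data.
    """
    header = next(csv.reader(io.StringIO(csv_text)))
    columns = set(header)
    documented = set(dictionary)
    return sorted(columns - documented), sorted(documented - columns)


undocumented_columns, stale_entries = find_dictionary_rot(data_csv, data_dictionary)
print("Undocumented columns:", undocumented_columns)
print("Stale dictionary entries:", stale_entries)
```

If the header were later renamed without touching the dictionary, the new name would show up as undocumented and the old entry as stale, which is exactly the mismatch that tends to go unnoticed.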
Naming is just as important in every other part of your project as well. If you are writing analysis code, you should follow the same basic principles when naming variables in your code. If you are tempted to use x or i, just don’t. It’s common practice to use these variables in loops, but I prefer index generically, or, if I am looping over a collection of items, say squids, I would prefer the loop variable to just be squid. It’s an art: read your code (or data) and do what seems the most informative. When naming functions or methods, describe what the function does, and don’t be afraid to use enough words. For instance, if you need a special function to compute the mean temperature on Mars, call it compute_mars_mean_temperature or mars_mean_temperature.
One final consideration is consistency. Software developers can get very nitpicky, often for good reason. For instance, within programming languages, there are usually best practices for when to use singular or plural. I suggest using plural and singular consistently. Avoid spaces: they will work well, until they don’t, and then you’re stuck with renaming things. Instead of spaces, use snake_case or CamelCase. It often doesn’t matter what you choose, but by being consistent, you make things much more readable and organized. Before you start a project, get together with anyone who will be involved with the data and write down your style guide. Decide if you will use snake_case or CamelCase, and where they should be used. Try to make as many decisions as you can before you do anything. Include your style guide in your README file and keep that README file in the top level of your project directory. Anytime you modify or add to your rules, make a note in your README. The README is also a great place for general information about your project, such as an abstract, how to cite your data, and how to rerun your analyses.
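To make the difference concrete, here is a short Python sketch of these naming habits. The measurements, the squid records and their field names are made up for illustration; only the function name comes from the text above.

```python
from statistics import mean

# Hypothetical measurements, invented for illustration.
mars_temperatures_celsius = [-60.0, -70.0, -65.0]


def compute_mars_mean_temperature(temperatures_celsius):
    """Return the mean of a sequence of Mars temperatures in degrees Celsius."""
    return mean(temperatures_celsius)


# Hypothetical collection of records; the plural name describes the whole
# collection and the singular name describes one item inside the loop.
squids = [
    {"name": "squid_a", "mantle_length_cm": 12.5},
    {"name": "squid_b", "mantle_length_cm": 14.0},
]

mantle_lengths_cm = []
for squid in squids:
    mantle_lengths_cm.append(squid["mantle_length_cm"])

print(compute_mars_mean_temperature(mars_temperatures_celsius))  # -65.0
print(mantle_lengths_cm)  # [12.5, 14.0]
```

Compare the loop with a version written as `for i in x:` and the call with `f(t)`. The behavior is identical, but a reader would have to look elsewhere to learn what is being looped over and in what units.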
- Be consistent.
- Make a README file.
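Once a style guide exists, a small script can flag names that drift from it. The sketch below assumes the project chose snake_case and uses one possible definition of it (lowercase letters and digits, words separated by single underscores). The column names and function names are hypothetical.

```python
import re

# One possible definition of snake_case: starts with a lowercase letter,
# then lowercase letters or digits, with single underscores between words.
SNAKE_CASE_PATTERN = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def is_snake_case(name):
    """Return True if the whole name matches the snake_case pattern."""
    return SNAKE_CASE_PATTERN.fullmatch(name) is not None


def find_inconsistent_names(names):
    """Return the names that do not follow the snake_case convention."""
    return [name for name in names if not is_snake_case(name)]


# Hypothetical column names mixing three styles.
column_names = ["temperature_celsius", "SampleDate", "site name", "depth_meters"]
print(find_inconsistent_names(column_names))  # ['SampleDate', 'site name']
```

A check like this could be run whenever a new file is added, with the chosen convention itself written down in the README.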
Following these practices, you should be able to create a decent data file or analysis script. In my mind I always try to make things read like English as much as possible (assuming English is your language of choice). I have found that taking some time to think about my names saves me time in the long run. Also, in collaborative situations, time spent planning this with your collaborators is time not spent in the future trying to figure out what you as a group did three years ago at the beginning of the project.
So far, I’ve discussed how to name single things. However, rarely will a project consist of a single file. It’s not unusual for even a modest project to generate 100,000+ files. In part 2 I’ll describe how we can expand these principles to allow us to make even very large projects transparent and understandable.
If You’re Curious
One book that really inspired me is Clean Code (Martin, 2008). Although it was written for programmers, with most examples in Java, the foundational concepts, especially in the first few chapters, apply to a wide variety of situations.
Martin, Robert C., and Michael C. Feathers. Clean Code: A Handbook of Agile Software Craftsmanship. Upper Saddle River, N.J: Prentice Hall, 2009.
Michener, W.K., Brunt, J.W., Helly, J.J., Kirchner, T.B. and Stafford, S.G. 1997. “Nongeospatial Metadata for the Ecological Sciences,” Ecological Applications 7: 330-342.