Monthly Archives: February 2016

The Art Of Naming Things (Part 2 of 2)

This series of posts was written by Rob Olendorf, Prystowsky Early Career Science Librarian, Physical and Mathematical Sciences Library.

In the first part of this post, we described some pretty good basic rules for naming things and applied them to naming variables in data, methods in software and analysis codes, and other internally facing things. However, our files and directories require just as much care, especially considering a modern research project can easily generate hundreds of thousands or even millions of files. Even with tens or hundreds of files, managing the complexity among collaborators can easily become overwhelming. A good plan for naming and organizing things from the start is critical.

Just as before, making your names expressive and concise is important. For instance, mt.csv isn’t as expressive as mars_temperatures.csv. Conciseness matters too. Most operating systems limit a file name to 255 characters, which is still far longer than you probably want. A good rule of thumb is 64 characters; when in doubt, however, go over a little for clarity.

Although we often don’t think of it, the path is also part of a file’s name. We can leverage this to make even better names. Expanding on the example above, if we were measuring temperatures on several planets and keeping separate CSV files for each day of measurements, it might make sense to go with the scheme below.

  • planets/mars/temperatures_2015january4.csv
  • planets/mars/temperatures_2015january5.csv
  • planets/saturn/temperatures_2015january5.csv

The scheme itself documents the data, and it makes sense in the context of our data. There are additional things to note. First, I prefer using month names rather than numbers. I have spent too much time cleaning up data where people mixed up the month and the day because both were just numbers. Second, I always go from general to specific in file names. In this case, planets/mars. Third, we can do the same thing for dates, as shown below.

  • temperatures/planets/2015/january/4/mars.csv

I don’t feel it’s better one way or the other. Just pick a naming scheme that fits your project, document it in your README file, ensure your collaborators are all in sync, and follow it.
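Whichever scheme you pick, it is easier to follow consistently if a small helper builds the paths for you. The sketch below is one way this might look in Python. The function names planet_first_path and date_first_path are hypothetical, chosen only for illustration, and the two layouts are the ones shown above.

```python
from datetime import date
from pathlib import PurePosixPath

# Month names are spelled out so that month and day can never be confused.
MONTH_NAMES = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]


def planet_first_path(planet: str, day: date) -> PurePosixPath:
    """Build planets/<planet>/temperatures_<year><month><day>.csv."""
    month = MONTH_NAMES[day.month - 1]
    file_name = f"temperatures_{day.year}{month}{day.day}.csv"
    return PurePosixPath("planets") / planet / file_name


def date_first_path(planet: str, day: date) -> PurePosixPath:
    """Build temperatures/planets/<year>/<month>/<day>/<planet>.csv."""
    month = MONTH_NAMES[day.month - 1]
    return PurePosixPath(
        "temperatures", "planets", str(day.year), month, str(day.day), f"{planet}.csv"
    )


measurement_day = date(2015, 1, 4)
print(planet_first_path("mars", measurement_day))
print(date_first_path("mars", measurement_day))
```

Both functions go from general to specific; they differ only in whether the planet or the date comes first.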

Often, files change over time, requiring you to track which version a file is and who changed it. Such changes can result in file names like the one below.

  • planets/mars/temperatures_2015january4_2_ro.csv

This approach almost always fails: It’s too easy for multiple people to use version 2; we don’t really know who RO is; and if there are a number of files, we don’t know if version 2 of one file is supposed to work with version 3 or 4 of another. Also, you are relying on people carefully versioning copies, which is rare. It will almost always result in a mess. Just don’t version this way. Use versioning software like Git.

Many research projects also result in a large number of automatically named files. Often, “badly” named files are really just examples of automatically named files. You might be tempted to rename them, and there is software that can do this well. However, it may not be a good idea. Often, software that automatically names files uses conventions that allow it to open or process that data. If you rename the files, you might break the system. For instance, these file names –

  • planets/mars/temperatures_1292869231
  • planets/mars/temperatures_1292869256

– are actually read by some software to understand that the data were taken on Monday, 20 December 2010 18:20:31 +0000 and Monday, 20 December 2010 18:20:56 +0000, respectively, and the software uses those dates to merge the data correctly. The number in this case is Unix or Epoch time: the number of seconds since 1970 January 1, 00:00:00 UTC. Almost all programming languages understand this standard.
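To see how software might read such a name, here is a minimal Python sketch. It assumes the timestamp is everything after the last underscore; the helper name timestamp_from_file_name is hypothetical, not part of any particular tool.

```python
from datetime import datetime, timezone


def timestamp_from_file_name(file_name: str) -> datetime:
    """Read the trailing Unix timestamp from a name like temperatures_1292869231."""
    seconds_since_epoch = int(file_name.rsplit("_", 1)[1])
    # Interpret the seconds as UTC so the result does not depend on the local time zone.
    return datetime.fromtimestamp(seconds_since_epoch, tz=timezone.utc)


for file_name in ["temperatures_1292869231", "temperatures_1292869256"]:
    recorded_at = timestamp_from_file_name(file_name)
    print(file_name, "->", recorded_at.isoformat())
```

If you renamed these files, a function like this one would have nothing left to parse.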

Finally, when dealing with file names and paths, always use relative paths. We might be tempted to use the example below:

  • C:/Users/random_citizen/projects/astro_temps/planets/mars/temperatures_2015january4.csv

However, this path makes sense only within a Windows environment. If someone were to try to use your project on a Mac, it would not work. Cross-platform functionality is never easy, but relative paths such as I have used elsewhere are preferred.
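In analysis code, one way to keep paths relative is to store only the part below the project root and join it onto the root at run time. The Python sketch below assumes the project root is the current working directory; the helper name resolve_in_project is hypothetical.

```python
from pathlib import Path

# Relative to the project root, not to any one person's machine.
RELATIVE_DATA_PATH = Path("planets") / "mars" / "temperatures_2015january4.csv"


def resolve_in_project(project_root: Path, relative_path: Path) -> Path:
    """Join a relative path onto whichever project root this machine uses."""
    if relative_path.is_absolute():
        raise ValueError(f"expected a relative path, got {relative_path}")
    return project_root / relative_path


# Assumption: the script is run from the top level of the project directory.
print(resolve_in_project(Path.cwd(), RELATIVE_DATA_PATH))
```

Because pathlib chooses the separator for the current operating system, the same code should work on Windows, Mac and Linux.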

Naming things, directory structures, relative paths. They may not seem important at first. However, with experience, most people who work a lot with computer systems know how important they are. Naming things consistently and well will do several things for you. It will decrease your time cleaning up messes. It will also make it easier for you and your collaborators to understand and remember what you have done. Clean, well-organized data inspires trust, and may also increase the chances that it will be used and cited by others. Remember, the primary goals are to make your data as organized and self-documenting as possible. Feel free to post suggestions and other ideas for naming as comments too! For such a seemingly simple thing, there are many ways of doing it well.

 

The Art of Naming Things (Part 1 of 2)

This series of posts was written by Rob Olendorf, Prystowsky Early Career Science Librarian, Physical and Mathematical Sciences Library.

There are only two hard things in Computer Science: cache invalidation and naming things. — Phil Karlton

I’ve always liked that saying, although I think for non-programmers, it probably doesn’t have the power it should have. Like everyone else, I’ve spent a lot of time naming things, often with little thought. Over time I’ve learned to give it more and more thought. Why would programmers care so much about naming things? The truth is they haven’t always, but as the field has progressed, and software development has become more collaborative, more distributed and more open, many hard lessons have been learned. Perhaps the most obvious lesson is that, like everyone else, programmers forget things. Second, with any project of any size, there will be multiple people involved, all with slightly different information, all forgetting things. Then with personnel turnover, there is even more loss. The third lesson is that knowing what a piece of code (or data) is supposed to do is not always easy, especially if you weren’t originally involved with the project.

Hypothetical loss of data over time. (Michener et al. 1997, with this original caption: “Fig. 1. Example of the normal degradation in information content associated with data and metadata over time (“information entropy”). Accidents or changes in storage technology (dashed line) may eliminate access to remaining raw data and metadata at any time.”)

Research data shares a lot with software, especially if you take research data in the largest sense, which includes the analysis, visualization codes, processing software, settings, machine output and everything else used to derive conclusions from the data. As the figure depicts, loss starts almost immediately with memory loss. Loss continues as the creators move on with their lives and, sadly for all of us, die. Also, for most datasets, without significant input from the creators, the data would be difficult for anyone outside of the project to understand. The trick, then, is to come up with easy, and easy to learn, ways to bake meaning into your data. Software developers have actually done a lot of the work for us, and at the very base of this practice is the art of naming things. Naming is the core of doing research that is organized, open and reproducible.

Let’s consider a very simple example consisting of tabular data with one column (based on a real example). There is a list of numbers. Obviously the list of numbers meant something to the researchers, and they may have even published from it. However, for anyone else, it is meaningless and useless. It also doesn’t inspire a great amount of trust in the resulting publication. Obviously, we’re missing a column header, so let’s say it’s temp. Great! At least we didn’t use “x”. We now know it’s temperature… or do we? It could be “temporary”. Probably not, but it could be something about temporary storage on a computer system. It’s good to be concise, but favor full words when in doubt.

  • Be expressive but concise.
  • Favor complete words over abbreviations or acronyms.
  • Name carefully; it can be hard to switch names later.

We rename our header to temperature and it’s much better. It can be improved, though. Is this parameter measured in Fahrenheit, Celsius or Kelvin? One solution, and not a bad one, is to create a data dictionary. A data dictionary is a separate document where you record additional information about the columns in a table. Data dictionaries are a best practice, but suffer from one BIG problem – rot. This happens when you change a variable name, or something about it, but fail to update your data dictionary. This is a ubiquitous problem for all metadata and documentation. For this reason, I like to keep some information right in my data or code. I would prefer temperature_celsius in this case. Now the critical information is embedded directly in the data. An outside user is spared the trouble of looking things up in the data dictionary. I prefer to use the data dictionary for more difficult and usually less immediately critical information, such as the instrument used to measure the temperature.

  • Create a data dictionary.
  • Beware of rot. Keep critical information close to the data.
  • Document as you go. Include time to update documentation as needed.
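As a small illustration of keeping critical information close to the data, the Python sketch below writes a one-column table whose header carries the unit. The readings are made-up values used only for this example.

```python
import csv
import io

# Hypothetical readings, invented for this example only.
readings_celsius = [-63.0, -61.5, -58.2]

# An in-memory buffer stands in for a real file such as mars_temperatures.csv.
output = io.StringIO()
writer = csv.writer(output, lineterminator="\n")
writer.writerow(["temperature_celsius"])  # the unit travels with the data
for reading in readings_celsius:
    writer.writerow([reading])

print(output.getvalue())
```

Anyone opening this file sees the unit in the header without having to consult the data dictionary.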

Naming is just as important in every other part of your project as well. If you are writing analysis code, you should follow the same basic principles when naming variables in your code. If you are tempted to use x or i, just don’t. It’s common practice to use these variables in loops, but I prefer index generically, or, if I am looping over a collection of items, say squids, I would prefer the index to just be squid. It’s an art: read your code (or data) and do what seems the most informative. When naming functions or methods, describe what the function does, and don’t be afraid to use enough words. For instance, if you need a special function to compute the mean temperature on Mars, call it compute_mars_mean_temperature or mars_mean_temperature.
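Here is a minimal Python sketch of both ideas: a loop variable named after the thing it holds, and a function named after what it does. The squid names and temperature values are made up for illustration.

```python
from statistics import mean


def compute_mars_mean_temperature(mars_temperatures_celsius):
    """Return the mean of a sequence of Mars temperatures in degrees Celsius."""
    return mean(mars_temperatures_celsius)


# Hypothetical collection; the loop variable is named for what it holds.
squids = ["humboldt", "giant", "colossal"]
for squid in squids:
    print(squid)

# Hypothetical temperatures, invented for this example only.
print(compute_mars_mean_temperature([-60.0, -62.0, -64.0]))
```

Read aloud, "for squid in squids" is close to plain English, which is the goal.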

One final consideration is consistency. Software developers can get very nitpicky, often for good reason. For instance, within programming languages, there are usually best practices for when to use singular or plural. I suggest using plural and singular consistently. Avoid spaces – they will work well, until they don’t; then you’re stuck with renaming things. Instead of spaces, use snake_case or CamelCase. It often doesn’t matter what you choose, but by being consistent, you make things much more readable and organized. Before you start a project, get together with anyone who will be involved with the data and write down your style guide. Decide if you will use snake_case or CamelCase, and where they should be used. Try to make as many decisions as you can before you do anything. Include your style guide in your README file and keep that README file in the top level of your project directory. Anytime you modify or add to your rules, make a note in your README. The README is also a great place for general information about your project, an abstract, how to cite your data, how to rerun your analyses, etc.

  • Be consistent.
  • Make a README file.
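If a project settles on snake_case, a short helper can bring stray names into line. The Python sketch below is one possible approach, not a complete solution; the function name to_snake_case is hypothetical, and it only handles spaces, hyphens and simple CamelCase.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a name containing spaces, hyphens or CamelCase into snake_case."""
    # Insert an underscore wherever a lowercase letter or digit meets a capital.
    with_breaks = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Collapse runs of whitespace and hyphens into single underscores.
    return re.sub(r"[\s\-]+", "_", with_breaks).lower()


for original_name in ["MarsTemperatures", "mars temperatures", "Mean-Temperature Celsius"]:
    print(original_name, "->", to_snake_case(original_name))
```

Running something like this once at the start of a project is far cheaper than renaming things after collaborators have begun to depend on them.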

Following these practices, you should be able to create a decent data file or analysis script. In my mind I always try to make things read like English as much as possible (assuming English is your language of choice). I have found that taking some time thinking about my names saves me time in the long run. Also, in collaborative situations, time spent planning this with your collaborators is time not spent in the future trying to figure out what you as a group did three years ago at the beginning of the project.

So far, I’ve discussed how to name single things. However, rarely will a project consist of a single file. It’s not unusual for even a modest project to generate 100,000+ files. In part 2 I’ll describe how we can expand these principles to allow us to make even very large projects transparent and understandable.

If You’re Curious

One book that really inspired me is Clean Code (Martin, 2008). Although it was written for programmers, with most examples in Java, the foundational concepts, especially in the first few chapters, apply to a wide variety of situations.

Cited Works

Martin, Robert C., and Michael C. Feathers. Clean Code: A Handbook of Agile Software Craftsmanship. Upper Saddle River, N.J: Prentice Hall, 2009.

Michener, W.K., Brunt, J.W., Helly, J.J., Kirchner, T.B. and Stafford, S.G. 1997. “Nongeospatial Metadata for the Ecological Sciences,” Ecological Applications 7: 330-342.