This series of posts was written by Rob Olendorf, Prystowsky Early Career Science Librarian, Physical and Mathematical Sciences Library.
In the first part of this post, we described some pretty good basic rules for naming things and applied them to naming variables in data, methods in software and analysis codes, and other internally facing things. However, our files and directories require just as much care, especially considering a modern research project can easily generate hundreds of thousands or even millions of files. Even with tens or hundreds of files, managing the complexity among collaborators can easily become overwhelming. A good plan for naming and organizing things from the start is critical.
Just as before, making your names expressive and concise is important. For instance, mt.csv isn’t as expressive as mars_temperatures.csv. Conciseness matters too: most operating systems limit file names to 255 characters, which is still far longer than you probably want. A good rule of thumb is 64 characters; when in doubt, go over a little for clarity.
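If you want to check names automatically, a few lines of Python are enough. This is only a sketch: the `check_name` function and the two constants are made up for illustration, and `len` counts characters, whereas many file systems actually count bytes.

```python
from pathlib import PurePath

SOFT_LIMIT = 64    # rule of thumb from the text
HARD_LIMIT = 255   # common operating-system limit on a single file name


def check_name(path: str) -> str:
    """Classify the final component of a path by its length."""
    name = PurePath(path).name  # ignore directories, look only at the file name
    if len(name) > HARD_LIMIT:
        return "too long"
    if len(name) > SOFT_LIMIT:
        return "long"
    return "ok"


print(check_name("mars_temperatures.csv"))
```

A "long" result is a prompt to look again, not an error; going a little over for clarity is still fine.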
Although we often don’t think of it, the path is also part of a file’s name. We can leverage this to make even better names. Expanding on the example above, if we were measuring temperatures on several planets and keeping separate CSV files for each day of measurements, it might make sense to go with the scheme below.
The scheme itself documents the data, and it makes sense in the context of our data. There are additional things to note. First, I prefer using month names rather than numbers. I have spent too much time cleaning up data where people mixed up the month and the day because both were just numbers. Second, I always go from general to specific in file names; in this case, planets/mars. Third, we can do the same thing for dates, as shown below.
I don’t feel it’s better one way or the other. Just pick a naming scheme that fits your project, document it in your README file, ensure your collaborators are all in sync, and follow it.
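One way to keep whichever scheme you pick consistent is to generate paths in code rather than typing them by hand. The Python sketch below is only an illustration: the `measurement_path` function, the `temperatures` folder, and the exact layout are assumptions, not the scheme from this post.

```python
from datetime import date
from pathlib import PurePosixPath

# Spelled out explicitly so the result does not depend on the system locale.
MONTHS = (
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
)


def measurement_path(planet: str, day: date) -> PurePosixPath:
    """Build a relative path that runs from general to specific."""
    return PurePosixPath(
        "planets",
        planet.lower(),
        "temperatures",
        str(day.year),
        MONTHS[day.month - 1],   # month name, not number
        f"{day.day:02d}.csv",    # zero-padded so files sort in order
    )


print(measurement_path("Mars", date(2010, 12, 20)))
# planets/mars/temperatures/2010/december/20.csv
```

Because every collaborator calls the same function, nobody has to remember whether the month comes before the day or whether names are capitalized.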
Often, files change over time, requiring you to track which version a file is and who changed it. Such changes can result in file names like those below.
This approach almost always fails: It’s too easy for multiple people to use version 2; we don’t really know who RO is; and if there are a number of files, we don’t know if version 2 of one file is supposed to work with version 3 or 4 of another. Also, you are relying on people carefully versioning copies, which is rare. It will almost always result in a mess. Just don’t version this way. Use versioning software like Git.
Many research projects also result in a large number of automatically named files. Often, “badly” named files are really just examples of automatically named files. You might be tempted to rename them, and there is software that can do this well. However, it may not be a good idea. Software that automatically names files often uses conventions that allow it to open or process that data. If you rename the files, you might break the system. For instance, these file names –
– are actually read by some software to understand that the data about foo was taken on Monday, 20 December 2010 18:20:31 +0000 and Monday, 20 December 2010 18:20:56 +0000 respectively, and the software uses those dates to merge the data correctly. The number in this case is Unix or Epoch time: the number of seconds since 00:00:00 UTC on 1 January 1970. Almost all programming languages understand this standard.
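To see how software might read such a name, here is a short Python sketch. The file name format (`foo_<seconds>.csv`) and the `parse_name` function are assumptions made for illustration; only the two timestamps come from the dates quoted above.

```python
import re
from datetime import datetime, timezone
from typing import Tuple

# Hypothetical pattern: a lowercase label, an underscore, then Unix seconds.
NAME_PATTERN = re.compile(r"^(?P<label>[a-z]+)_(?P<seconds>\d+)\.csv$")


def parse_name(name: str) -> Tuple[str, datetime]:
    """Split a file name into its label and the UTC time it encodes."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized file name: {name!r}")
    # Interpret the digits as seconds since 1970-01-01 00:00:00 UTC.
    taken = datetime.fromtimestamp(int(match["seconds"]), tz=timezone.utc)
    return match["label"], taken


for name in ("foo_1292869231.csv", "foo_1292869256.csv"):
    label, taken = parse_name(name)
    print(label, taken)
# foo 2010-12-20 18:20:31+00:00
# foo 2010-12-20 18:20:56+00:00
```

Renaming either file to something friendlier would make `parse_name` raise an error, which is exactly the kind of breakage to watch out for.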
Finally, when dealing with file names and paths, always use relative paths. We might be tempted to use a path like the example below.
However, this path makes sense only within an MS Windows environment. If someone were to try to use your project on a Mac, it would not work. Cross-platform functionality is never easy, but relative paths such as those I have used elsewhere are preferred.
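In code, the usual trick is to store only the relative part and join it onto whatever root the current machine uses. The sketch below uses Python's standard `pathlib`; the `resolve` function, the `RELATIVE_DATA` constant, and the `project` folder are hypothetical names chosen for illustration.

```python
from pathlib import Path, PurePosixPath

# Hypothetical relative location, stored with forward slashes.
RELATIVE_DATA = PurePosixPath("planets/mars/mars_temperatures.csv")


def resolve(project_root: Path, relative: PurePosixPath) -> Path:
    """Join a stored relative path onto the root used on this machine."""
    if relative.is_absolute():
        raise ValueError(f"expected a relative path, got {relative}")
    # Joining part by part lets pathlib pick the right separator for the OS.
    return project_root.joinpath(*relative.parts)


print(resolve(Path("project"), RELATIVE_DATA))
```

On Windows this prints the path with backslashes and on a Mac or Linux with forward slashes, but the stored relative path never has to change.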
Naming things, directory structures, relative paths: they may not seem important at first. However, most people who work a lot with computer systems learn from experience how important they are. Naming things consistently and well will do several things for you. It will decrease the time you spend cleaning up messes. It will also make it easier for you and your collaborators to understand and remember what you have done. Clean, well-organized data inspires trust, and may also increase the chances that it will be used and cited by others. Remember, the primary goals are to make your data as organized and self-documenting as possible. Feel free to post suggestions and other ideas for naming as comments too! For such a seemingly simple thing, there are many ways of doing it well.