The Art of Naming Things (Part 1 of 2)

This series of posts was written by Rob Olendorf, Prystowsky Early Career Science Librarian, Physical and Mathematical Sciences Library.

There are only two hard things in Computer Science: cache invalidation and naming things. — Phil Karlton

I’ve always liked that saying, although I think for non-programmers it probably doesn’t have the power it should. Like everyone else, I’ve spent a lot of time naming things, often with little thought. Over time I’ve learned to give it more and more thought. Why would programmers care so much about naming things? The truth is they haven’t always, but as the field has progressed, and software development has become more collaborative, more distributed, and more open, many hard lessons have been learned. Perhaps the most obvious lesson is that, like everyone else, programmers forget things. Second, with a project of any size, there will be multiple people involved, all with slightly different information, all forgetting things. Then, with personnel turnover, there is even more loss. The third lesson is that knowing what a piece of code (or data) is supposed to do is not always easy, especially if you weren’t originally involved with the project.

Hypothetical loss of data over time. (Michener et al. 1997, with this original caption: “Fig. 1. Example of the normal degradation in information content associated with data and metadata over time (“information entropy”). Accidents or changes in storage technology (dashed line) may eliminate access to remaining raw data and metadata at any time.”)

Research data shares a lot with software, especially if you take research data in the largest sense, which includes the analysis and visualization code, processing software, settings, machine output, and everything else used to derive conclusions from the data. As the figure depicts, loss starts almost immediately with memory loss. Loss continues as the creators move on with their lives and, sadly for all of us, die. Also, for most datasets, the data would be difficult for anyone outside the project to understand without significant input from the creators. The trick, then, is to come up with easy, and easy to learn, ways to bake meaning into your data. Software developers have actually done a lot of this work for us, and at the very base of their practice is the art of naming things. Naming is the core of doing research that is organized, open, and reproducible.

Let’s consider a very simple example consisting of tabular data with one column (based on a real example). There is a list of numbers. Obviously the list of numbers meant something to the researchers, and they may even have published from it. However, for anyone else, it is meaningless and useless. It also doesn’t inspire a great amount of trust in the resulting publication. Obviously, we’re missing a column header, so let’s say it’s temp (a sketch follows the list below). Great! At least we didn’t use “x”. We now know it’s temperature… or do we? It could be “temporary”; probably not, but it could be something about temporary storage on a computer system. It’s good to be concise, but favor full words when in doubt.

  • Be expressive but concise.
  • Favor complete words over abbreviations or acronyms.
  • Name carefully, it can be hard to switch names later.
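To make this concrete, the file at this stage might look something like this (the values are invented for illustration):

    temp
    21.4
    19.8
    23.1

Better than a bare list of numbers, but still ambiguous.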

We rename our header to temperature and it’s much better. It can be improved, though. Is this parameter measured in Fahrenheit, Celsius, or Kelvin? One solution, and not a bad one, is to create a data dictionary. A data dictionary is a separate document where you record additional information about the columns in a table. Data dictionaries are a best practice, but they suffer from one BIG problem – rot. Rot happens when you change a variable name, or something about it, but fail to update your data dictionary. This is a ubiquitous problem for all metadata and documentation. For this reason, I like to keep some information right in my data or code. I would prefer temperature_celsius in this case (see the sketch after the list below). Now the critical information is embedded directly in the data, and an outside user is spared the trouble of looking things up in the data dictionary. I prefer to use the data dictionary for more difficult and usually less immediately critical information, such as the instrument used to measure the temperature.

  • Create a data dictionary.
  • Beware of rot. Keep critical information close to the data.
  • Document as you go. Include time to update documentation as needed.
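Here is a minimal sketch of how this might look, with the units baked into the column header and the slower-moving details kept in a separate data dictionary (the file names, values, and instrument are all hypothetical):

    # field_temperatures.csv
    temperature_celsius
    21.4
    19.8
    23.1

    # data_dictionary.txt
    temperature_celsius: air temperature in degrees Celsius, measured
      hourly with a shielded digital logger 1.5 m above the ground.

Even if the dictionary entry rots, the units still travel with the data.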

Naming is just as important in every other part of your project. If you are writing analysis code, you should follow the same basic principles when naming variables in your code. If you are tempted to use x or i, just don’t. It’s common practice to use these names in loops, but I prefer the generic index, or, if I am looping over a collection of items, say squids, I would name the loop variable squid. It’s an art: read your code (or data) and do what seems the most informative. When naming functions or methods, describe what the function does, and don’t be afraid to use enough words. For instance, if you need a special function to compute the mean temperature on Mars, call it compute_mars_mean_temperature or mars_mean_temperature.
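In Python, for example, the contrast reads something like this (a sketch; the data, function names, and simplified mean calculation are all invented for illustration):

    # Opaque: single-letter names force the reader to reverse-engineer intent.
    def f(x):
        t = 0.0
        for i in x:
            t += i
        return t / len(x)

    # Expressive: the same logic, but the names carry the meaning.
    def compute_mars_mean_temperature(temperatures_celsius):
        total = 0.0
        for temperature in temperatures_celsius:
            total += temperature
        return total / len(temperatures_celsius)

    # Looping over a collection: let the items name the loop variable.
    squids = ["Architeuthis dux", "Dosidicus gigas", "Loligo vulgaris"]
    for squid in squids:
        print(squid)

The second function needs no comment explaining what it computes; the name does that work.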

One final consideration is consistency. Software developers can get very nitpicky, often for good reason. For instance, within programming languages there are usually best practices for when to use singular or plural names; I suggest using plural and singular consistently. Avoid spaces – they will work well, until they don’t; then you’re stuck with renaming things. Instead of spaces, use snake_case or CamelCase. It often doesn’t matter which you choose, but by being consistent you make things much more readable and organized. Before you start a project, get together with anyone who will be involved with the data and write down your style guide. Decide if you will use snake_case or CamelCase, and where each should be used. Try to make as many decisions as you can before you do anything. Include your style guide in your README file and keep that README file in the top level of your project directory (a sketch follows the list below). Anytime you modify or add to your rules, make a note in your README. The README is also a great place for general information about your project: an abstract, how to cite your data, how to rerun your analyses, etc.

  • Be consistent.
  • Make a README file.
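A style-guide section of a README might look something like this (an invented example; the point is to write the decisions down, not these particular choices):

    STYLE GUIDE (excerpt from a hypothetical README)
    - Files and directories: snake_case, no spaces (raw_data, mean_temperatures.csv)
    - Variables and functions: snake_case (temperature_celsius)
    - Class names: CamelCase (TemperatureLogger)
    - Column headers: full words, with units where applicable
    - Collections are plural (squids); single items are singular (squid)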

Following these practices, you should be able to create a decent data file or analysis script. I always try to make things read like English as much as possible (assuming English is your language of choice). I have found that taking some time to think about my names saves me time in the long run. Also, in collaborative situations, time spent planning with your collaborators now is time not spent in the future trying to figure out what you as a group did three years ago at the beginning of the project.

So far, I’ve discussed how to name single things. However, rarely will a project consist of a single file. It’s not unusual for even a modest project to generate 100,000+ files. In Part 2, I’ll describe how to expand these principles to make even very large projects transparent and understandable.

If You’re Curious

One book that really inspired me is Clean Code (Martin, 2008). Although written for programmers, with most examples in Java, the foundational concepts, especially in the first few chapters, apply to a wide variety of situations.

Cited Works

Martin, Robert C. Clean Code: A Handbook of Agile Software Craftsmanship. Upper Saddle River, NJ: Prentice Hall, 2008.

Michener, W.K., Brunt, J.W., Helly, J.J., Kirchner, T.B. and Stafford, S.G. 1997. “Nongeospatial Metadata for the Ecological Sciences,” Ecological Applications 7: 330-342.

ScholarSphere Drop-In Hours: New Year, New Habits

It’s 2015 – do you know where your data are kept? Do you have publications you’ve been meaning to share in post-print or pre-print form? Need help understanding what IS a pre-print or a post-print? Or, perhaps more important, determining what’s OK to put in ScholarSphere?
Want a walk-through of the service and answers to your questions about what it can do for you? Have ideas for the service you’d like to share?

The ScholarSphere Service Team has organized a few drop-in hours for anyone to come by and get advice and help on using ScholarSphere, or to talk with us about what you’re interested in seeing in the service in the future. We’ll be available in various instruction rooms on the following dates/times:

ScholarSphere overview sessions will be conducted for the campuses via Adobe Connect. Those will be organized for the latter half of February and early part of March – stay tuned!

If you need a refresher on ScholarSphere, which has gone through several releases since 1.0 in September 2012, then these are the sessions for you. The start of a new year is also a great time to get serious about keeping track of your data sets and other research, with experts in the room to help get you launched on a new habit!

Launching ScholarSphere

The repository services project to which other posts have alluded now has a name: ScholarSphere.

Penn State ScholarSphere is a new research repository service offered by the University Libraries and Information Technology Services, enabling Penn State faculty, staff, and students to share their scholarly works such as research datasets, working papers, research reports, and image collections, to name a few examples. ScholarSphere will make these works more discoverable, accessible, usable, and thus broadly recognized and known. 

The ScholarSphere service will help researchers actively manage stored versions of their research and preserve it, ensuring its longevity for future generations of scholars to find, use, and build on. The preservation functions include scheduled and on-demand verification of deposited works, characterization of files to mitigate future format obsolescence, regular file backups, and replication to disaster recovery sites.

The repository renders research works immediately citable via stable, short URLs, and metadata about research is immediately exportable to citation managers. ScholarSphere enables documentation and description of research data for optimal discovery and curation through their lifecycle of use and reuse.

Researchers will be able to share works stored in ScholarSphere with the Penn State community, either directly with specified individuals or with established groups. Researchers will also be able to share each of their files at different access levels, including read-only and edit modes, allowing full control over who can view and edit deposited works.

A trusted institutional service, ScholarSphere has safeguards in place for keeping private research secure and unchanged over time, as researchers warrant, as well as for keeping access restricted to the individual researcher.

ScholarSphere will be undergoing usability and accessibility testing throughout the summer for a beta release in September of 2012. Stay tuned for more information about the ScholarSphere launch and about the technologies underlying ScholarSphere.