
Data best practices and case studies

How to use best practices for managing your research data, along with case studies and examples to help you use these techniques.

Introduction

How you organize and name your files will have a big impact on your ability to find those files later and to understand what they contain. You should be consistent and descriptive in naming and organizing files so that it is obvious where to find specific data and what the files contain.

It's a good idea to set up a clear directory structure that includes information like the project title, a date, and some type of unique identifier. Individual directories may be set up by date, researcher, experimental run, or whatever makes sense for you and your research.

How to name files

File names should allow you to identify a precise experiment from the name. Choose a format for naming your files and use it consistently. 

You might consider including some of the following information in your file names, but you can include any information that will allow you to distinguish your files from one another. 

  • Project or experiment name or acronym
  • Location/spatial coordinates
  • Researcher name/initials
  • Date or date range of experiment
  • Type of data
  • Conditions
  • Version number of file
  • Three-letter file extension for application-specific files

Another good idea is to include in the directory a readme.txt file that explains your naming format along with any abbreviations or codes you have used.

Consider these additional tips as you develop a file naming scheme:

  • A good format for date designations is YYYYMMDD or YYMMDD. This format makes sure all of your files stay in chronological order, even over the span of many years.
  • Try not to make file names too long, since long file names do not work well with all types of software.
  • Special characters such as ~ ! @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ' " and | should be avoided.
  • When using a sequential numbering system, use leading zeros for clarity and to make sure files sort in sequential order. For example, use "001, 002, ...010, 011 ... 100, 101, etc." instead of "1, 2, ...10, 11 ... 100, 101, etc."
  • Do not use spaces. Some software will not recognize file names with spaces, and file names with spaces must be enclosed in quotes when using the command line. Other options include:
    • Underscores, e.g. file_name.xxx
    • Dashes, e.g. file-name.xxx
    • No separation, e.g. filename.xxx
    • Camel case, where the first letter of each section of text is capitalized, e.g. FileName.xxx
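  • Periods can be used in file names, but consider these points before doing so and proceed cautiously:
    • Periods are wildcards in regular expressions (a period matches any single character), so pattern searches on such names can match more than intended.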
    • Periods at the start of a file name are used to indicate configuration and/or hidden files in a file directory.
    • Periods are used to separate file names from file extensions.
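The tips above can be combined into a small helper that assembles names the same way every time. The following is a minimal sketch, not a standard: the function name build_file_name, the choice of fields, and their order are all illustrative assumptions. It uses a YYYYMMDD date, zero-padded run and version numbers, underscores between fields, and replaces spaces and special characters inside a field with a dash.

```python
import re
from datetime import date


def build_file_name(project, researcher, day, data_type, run, version, ext):
    """Assemble a file name from descriptive fields (hypothetical field order)."""
    parts = [
        project,
        researcher,
        day.strftime("%Y%m%d"),  # YYYYMMDD keeps names in chronological order
        data_type,
        f"run{run:03d}",         # leading zeros keep 002 ahead of 010
        f"v{version:02d}",
    ]
    # Inside each field, replace runs of anything other than letters, digits,
    # and dashes with a single dash, so underscores only ever separate fields.
    cleaned = [re.sub(r"[^A-Za-z0-9-]+", "-", part).strip("-") for part in parts]
    return "_".join(cleaned) + "." + ext.lower().lstrip(".")


# Example with made-up values:
name = build_file_name("Reef Survey", "AB", date(2014, 6, 23), "photo", 7, 2, "JPG")
print(name)  # Reef-Survey_AB_20140623_photo_run007_v02.jpg
```

Whatever fields and order you settle on, record them in your readme.txt so the names can still be decoded later.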

You may already have a lot of data collected for your project and wish to organize and rename these files for easier data management. If you have too many files to rename them all by hand, try one of the following applications for renaming your files:
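If you are comfortable with a little scripting, a short script is another way to rename files in bulk. The sketch below is only an illustration under stated assumptions: the function names plan_renames and apply_renames, the prefix, and the "*.tif" pattern are hypothetical, and numbering simply follows the alphabetical order of the existing names. It first builds a list of proposed renames so you can review them, and only renames files in a separate step.

```python
from pathlib import Path


def plan_renames(folder, prefix, pattern="*.tif", digits=3):
    """Return (old_path, new_path) pairs without touching any files."""
    # Only files matching the pattern are included, so readme.txt and
    # hidden files are left alone. Matching is case-sensitive on most systems.
    files = sorted(p for p in Path(folder).glob(pattern) if p.is_file())
    plan = []
    for number, old in enumerate(files, start=1):
        new_name = f"{prefix}_{number:0{digits}d}{old.suffix.lower()}"
        plan.append((old, old.with_name(new_name)))
    return plan


def apply_renames(plan):
    """Carry out a reviewed plan, refusing to overwrite existing files."""
    for old, new in plan:
        if new.exists():
            raise FileExistsError(f"refusing to overwrite {new}")
        old.rename(new)
```

A cautious way to use it is to print each pair from plan_renames, check the result on a copy of the folder, and only then call apply_renames.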

Case studies

This is an example from a collection of digital research data collected by Science Data Librarian Amy Hodge from 1997-1999 for her dissertation research. It illustrates some of the problems that you might experience if you do not establish appropriate naming conventions for your files.

Screenshot of file directory

The good news

Amy still understands what some portions of these file names mean:

  • DAPI detects the location of DNA.
  • 1-284 refers to the portion of the protein that is present in the cells. Based on these numbers, Amy also knows what the protein was.
  • 12CA5 is an antibody. HA is the antigen that 12CA5 recognizes. 

The bad news

Some of the information in the file names no longer makes any sense to Amy. For example, she no longer knows what "-10," "-20" or "noPrim" refer to. She also no longer remembers what DM1A and 3F10 are, though they may be other antibodies. When the 12CA5 and HA notations are used in different file names, do they indicate the same thing or different things about the experiments? Amy doesn't know.

These file names also lack a lot of information that Amy would need to understand what each of these experiments is, such as what kind of yeast was used in each experiment, whether the expression of the protein was turned on or not, and what portion of the protein is present for all those file names that do not say "1-284."

Best practices 

  • The files shown above are named inconsistently; a consistent naming scheme would have helped make their names more comprehensible. 
  • Use of more descriptive information in the file names would also have made it easier to figure out 20+ years later what the files contain. 
  • Including a readme.txt file in this folder with explanations of the experiments or at a minimum the naming scheme for the files would have also been helpful.

This is an example from a research project conducted by a group led by Professors Douglas McCauley and Fiorenza Micheli. It illustrates the organized and thorough method they used to name the thousands of image files that they collected for this project.

Images of one study tile in place at the Palmyra Atoll (left) and in the lab after collection (right).

The research

The project involved installing approximately 180 tiles (also referred to here as plates) in an underwater area near the Palmyra Atoll in the South Pacific and leaving them in place for a specified amount of time. At the end of that time, the plates were retrieved for analysis. The researchers photographed the plates in place during the research, and then again after they were retrieved. The images above show one particular plate in place during the study (left) and then again after retrieval (right).

The researchers wanted to track several things about the plates:

 

  • at which of the study sites the plate was installed
  • depth of the water at the site
  • date
  • number of the tile
  • whether the tile had been caged or uncaged
  • number assigned to photo by the camera
  • whether the post-removal photo was of the entire tile or only a certain section of the tile

 

The naming convention

Here is the general naming convention decided upon for the photographs:

 

  • Sites are named FR3, FR7, and FR9. Those designations are used in the file names.
  • Site name is followed immediately by a letter to indicate depth. S=shallow, M=middle, D=deep. This is followed by a period.
  • Dates are formatted as YYMMDD, for example 140623 is June 23, 2014. Dates are followed by a period.
  • Tile number (these are on the tiles)
  • Tile number is followed immediately by a letter to indicate treatment. C=caged and U=uncaged. This is followed by a period.
  • Photo number assigned by camera, followed by a period.
  • Single letter designation for photo coverage. W=whole plate, A=upper right, B=lower right, C=lower left, D=upper left (tiles are photographed in a uniform orientation when possible).

 

Example

The example photo shown on the right above was named using this convention as

FR3S.140623.129C.2653.W.JPG

How does this translate?

 

  • FR3 = study site FR3
  • S = shallow
  • 140623 = June 23, 2014
  • 129 = tile number 129
  • C = caged treatment
  • 2653 = photo number assigned by camera
  • W = whole tile
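Because every field sits in a fixed position, the same translation can be done by a script. The following is a minimal sketch, assuming the convention exactly as listed above (with C read as caged, per the convention list); the names PHOTO_NAME and parse_photo_name are hypothetical and are not part of the researchers' own tooling.

```python
import re
from datetime import datetime

# One named group per field of the convention, separated by literal periods.
PHOTO_NAME = re.compile(
    r"^(?P<site>FR[379])(?P<depth>[SMD])\."
    r"(?P<date>\d{6})\."
    r"(?P<tile>\d+)(?P<treatment>[CU])\."
    r"(?P<photo>\d+)\."
    r"(?P<coverage>[WABCD])\."
    r"(?P<ext>\w+)$"
)

DEPTHS = {"S": "shallow", "M": "middle", "D": "deep"}
TREATMENTS = {"C": "caged", "U": "uncaged"}
COVERAGE = {
    "W": "whole",
    "A": "upper right",
    "B": "lower right",
    "C": "lower left",
    "D": "upper left",
}


def parse_photo_name(name):
    """Split a photo file name into its documented fields."""
    match = PHOTO_NAME.match(name)
    if match is None:
        raise ValueError(f"name does not follow the convention: {name!r}")
    return {
        "site": match["site"],
        "depth": DEPTHS[match["depth"]],
        "date": datetime.strptime(match["date"], "%y%m%d").date(),
        "tile": int(match["tile"]),
        "treatment": TREATMENTS[match["treatment"]],
        "photo": int(match["photo"]),
        "coverage": COVERAGE[match["coverage"]],
    }


info = parse_photo_name("FR3S.140623.129C.2653.W.JPG")
print(info["site"], info["depth"], info["date"], info["tile"])
# FR3 shallow 2014-06-23 129
```

A name that does not fit the pattern raises an error rather than being guessed at, which also makes the function a quick check for misnamed files.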

 

Imagine how easy it will be for these researchers to track these files and to search or scan through their thousands of images to find all the whole tile images, all the images from deep water, or all the images of tiles that had been uncaged.
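As a rough illustration of that kind of search, the sketch below filters a list of names with simple wildcard patterns. Only the first file name comes from the example above; the other three names, the select function, and the patterns themselves are made up for illustration and assume the convention described earlier.

```python
from fnmatch import fnmatchcase


def select(names, pattern):
    """Return the names that match a shell-style wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


names = [
    "FR3S.140623.129C.2653.W.JPG",  # from the example above
    "FR7D.140623.045U.2701.A.JPG",  # hypothetical
    "FR9M.140624.102U.2710.W.JPG",  # hypothetical
    "FR9D.140624.017C.2722.W.JPG",  # hypothetical
]

whole_tile = select(names, "*.W.JPG")     # coverage letter W
deep_water = select(names, "FR?D.*")      # depth letter D after the site
uncaged = select(names, "*U.*.?.JPG")     # treatment letter U after the tile number

print(len(whole_tile), len(deep_water), len(uncaged))  # 3 2 2
```

The same patterns work in most file browsers and on the command line, which is one practical payoff of putting each field in a fixed position.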

The use of a well-documented and consistent naming scheme containing relevant and descriptive information about your files will make your research faster and easier to manage as well.

And don't forget to include your naming scheme documentation in a readme.txt file in your data folder.