Skip to Main Content

Data best practices and case studies

How to use best practices for managing your research data, along with case studies and examples to help you use these techniques.

Spreadsheet best practices

Icon representing data in its raw state1. Keep your raw data raw.

Keep one master copy of your raw data that you never touch

Icon representing the presence of empty cells 2. Avoid empty cells

These can trip up other software that are trying to read a contiguous data table. They also make it unclear if the data were not gathered, were not reported for some reason, or were accidentally deleted. If there is no data for the cell, indicate why.

Icon representing the presence of empty columns3. Avoid empty columns and rows

These make it appear as if data are missing. They may also indicate that there are multiple tables in the same worksheet; every table should be in its own worksheet.

Icon representing the use of special characters4. Avoid special characters

These may not be readable by other software programs, and they may be modified when you export your data.

Icon representing missing or multiple headers5. Avoid missing headers & headers in multiple places

Ideally, every table should have a single header row in the first row. Each column header should be unique. One row of headers is best; some software may have problems with multiple header rows.

Icon representing merged cells6. Avoid merged cells

These are not maintained when data are exported as a .csv file, which may cause information to be lost or confused. It is best not to use merged cells.

Icon representing the use of colors7. Avoid colored text and cell shading

These are not maintained when data are exported as a .csv file. Add an extra column where information indicated by colors can be captured.

8. Avoid commas

These may be misinterpreted when data are exported as a .csv file. They may also indicate multiple pieces of data in the same cell (like “city, state”) that should be captured in separate columns. Use other separators if necessary.

Icon representing the use of comments9. Avoid embedded comments

These are not maintained when data are exported as a .csv file or if the file is opened in other programs. Create a new column for listing comments.

Icon representing multiple data types10. Avoid multiple data types in a single column.

This can be very confusing, as can mixing numbers and text in the same column. Split data into multiple columns if necessary.

Clean your data

OpenRefine

OpenRefine (formerly Google Refine) is a desktop application that can help you clean up messy or inconsistent data by helping you locate those inconsistencies and providing easy ways to fix them. It can also help you transform your data and extend it with web services. All editing operations are tracked, allowing you to roll back to previous versions.

  • Platforms: Mac, PC, or Linux
  • Cost: Free

Case studies

This example is from a collection of research data collected by Science Data Librarian Amy Hodge from 1997-1999 for her dissertation research. At least five best practices are ignored in this first spreadsheet. See if you can spot the problems before you scroll down to reveal the answers!

Spreadsheet for example 1

 

Best practice problems

  • Headers should be in a single row. This spreadsheet contains headers in row 1 and row 2, as well as redundant headers for two columns in rows 14 and 26 that are confusing. There should be a single unique header for each column, preferably in row 1. In addition, many of the columns in this spreadsheet do not have headers at all!
  • Avoid the use of special characters. Special characters are often not exported correctly or not read correctly by other software programs. In this case, it would be easy to remove the percent signs from all of the cells in column H and then indicate in the header that the values in this column are percentages.
  • Do not leave empty rows or columns. These may cause problems when data are exported. Empty rows and columns also tend to indicate the presence of multiple tables in one sheet, which appears to be the case with this example. A sheet should contain only one table of data.
  • Do not leave empty cells. Empty cells sometimes cause problems for other software or when exporting data. Empty cells are also confusing, because it is unclear why the cell is empty. Was this not measured? Did the value seem unreliable so was omitted? Was this value deleted by accident? If a cell must be left empty, make a notation in a column of comments about why the cell is empty.
  • Do not embed charts, graphs, or images. They are not included when data are exported. Charts and graphs should be included in a separate sheet if necessary. These may also be exported as images from Excel. Also note that the chart in this spreadsheet has no labels of any kind!

This example is from a collection of research data collected by Science Data Librarian Amy Hodge from 1997-1999 for her dissertation research. At least three best practices are ignored in this second spreadsheet. See if you can spot the problems before you scroll down to reveal the answers!

Spreadsheet for example 2

 

Best practice problems

Do not use colored text or cell shading. This formatting is lost when the data are exported or the file is opened in another program. Consider adding another column where the information indicated by the coloring and shading can be included in text form.

Do not use commas. When data are exported as a comma separated value (.csv) file, the commas within your data cells can cause confusion. They also sometimes indicate that multiple pieces of data (like city, state) are included in a single column. Separate data into multiple fields if appropriate or use another kind of separator (colon, semicolon -- but not a symbol!) if one is necessary. 

Do not merge cells. Cell merging is typically lost when data are exported and is often misread by other software programs. This may result in the shifting of data cells in the affected rows or columns. In the case shown here, the cells in column 1 should be unmerged and the date information entered into the first cell in each row.