Data best practices and case studies

This guide describes best practices for managing your research data, with case studies and examples to help you apply these techniques.

File formats

The file formats you use have a direct impact on your ability to open those files at a later date and on the ability of other people to access those data.

You should save data in a non-proprietary (open) file format when possible, for two reasons. First, it makes your content more accessible to others who don't have access to proprietary software. Second, it makes it easier for you to access your own content in the future if the proprietary software no longer works on your machine or you no longer have access to it.

If conversion to an open data format will result in some data loss from your files, you might consider saving the data in both the proprietary format and an open format. Having at least some of the information available to you later will be better than having none of it available.

When it is necessary to save files in a proprietary format, consider including a readme.txt file in your directory that documents the name and version of the software used to generate the file, as well as the company that made the software. This could help you down the road if you need to figure out how to open these files again.
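
The readme suggestion above can be scripted so that it is applied the same way in every data directory. The following is a minimal sketch in Python, not a prescribed method: the function name `write_readme` and the field labels are assumptions made for illustration, and the values you pass in should describe your own software.

```python
from pathlib import Path


def write_readme(directory, software, version, vendor, extensions):
    """Write a readme.txt describing the software behind proprietary files.

    The field labels below are illustrative assumptions, not a standard.
    """
    lines = [
        "Files in this directory were created with proprietary software.",
        f"Software name: {software}",
        f"Software version: {version}",
        f"Vendor: {vendor}",
        "File extensions: " + ", ".join(extensions),
    ]
    readme_path = Path(directory) / "readme.txt"
    # Plain UTF-8 text keeps the readme itself in an open format.
    readme_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return readme_path
```

For example, `write_readme(".", "ExampleImager", "1.0", "Example Corp", [".gel", ".adt"])` would write a readme.txt into the current directory. The software and vendor names in that call are hypothetical placeholders.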

The Library of Congress has published a Recommended Formats Statement that discusses this topic in great depth.

When selecting file formats for archiving, the formats should ideally be:

Non-proprietary
Unencrypted
Uncompressed
In common usage by the research community
Adherent to an open, documented standard, such as described by the State of California (see AB 1668, 2007):
  - Interoperable among diverse platforms and applications
  - Fully published and available royalty-free
  - Fully and independently implementable by multiple software providers on multiple platforms without any intellectual property restrictions for necessary technology
  - Developed and maintained by an open standards organization with a well-defined inclusive process for evolution of the standard

Some preferred file formats for different content types include:

Containers: TAR, GZIP, ZIP
Databases: XML, CSV
Geospatial: SHP, DBF, GeoTIFF, NetCDF
Moving images: MOV, MPEG, AVI, MXF
Sounds: WAVE, AIFF, MP3, MXF
Statistics: ASCII, DTA, POR, SAS, SAV
Still images: TIFF, JPEG 2000, PDF, PNG, GIF, BMP
Tabular data: CSV
Text: XML, PDF/A, HTML, ASCII, UTF-8
Web archive: WARC

See the Library of Congress' Sustainability of Digital Formats website for more complete listings and discussions of formats, including guidance for the preservation of data sets, geospatial data, and web archives. You can also visit the Library of Congress' page on Recommended Format Specifications for preservation.
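
One way to apply the list of preferred formats is to scan a data directory and flag files whose extensions are not on it. The sketch below is a rough illustration in Python. The mapping from format names to file extensions (for example, NetCDF to `.nc` or JPEG 2000 to `.jp2`) is an assumption made for this example and is neither official nor complete, and the function name `find_nonpreferred_files` is hypothetical.

```python
from pathlib import Path

# Assumed mapping from the preferred formats to common file extensions.
# This set is illustrative only; adjust it for your own field and tools.
PREFERRED_EXTENSIONS = {
    ".tar", ".gz", ".zip", ".xml", ".csv", ".shp", ".dbf", ".nc",
    ".mov", ".mpg", ".mpeg", ".avi", ".mxf", ".wav", ".aif", ".aiff",
    ".mp3", ".dta", ".por", ".sas", ".sav", ".tif", ".tiff", ".jp2",
    ".pdf", ".png", ".gif", ".bmp", ".txt", ".html", ".htm", ".warc",
}


def find_nonpreferred_files(directory):
    """Return files under `directory` whose extension is not in the set."""
    flagged = []
    for path in sorted(Path(directory).rglob("*")):
        # Compare in lowercase so that .TIF and .tif are treated alike.
        if path.is_file() and path.suffix.lower() not in PREFERRED_EXTENSIONS:
            flagged.append(path)
    return flagged
```

A flagged file is not necessarily a problem. The result is simply a list of files to review, export to an open format, or document in a readme.txt file.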

The following examples come from a collection of digital research data that Science Data Librarian Amy Hodge collected from 1997 to 1999 for her dissertation research. They illustrate some of the problems that you might experience if you do not choose appropriate file formats for your data.

The files in the screenshot below were saved in proprietary formats (.adt and .gel) produced by a piece of equipment called a phosphorimager. Amy no longer has access to the software and does not remember its name. An internet search did not provide any useful information about software that might be able to open these files. To avoid an issue like this, save or export files in an open file format whenever possible to help preserve future access.

screenshot of file directory

In the instance shown below, the phosphorimager data were saved in the proprietary format but were also exported as a .tif file. TIFF is an open format that can be read and understood by a wide variety of software even 20+ years later.

Screenshot of file directory

In this example, the file named GROWTH CURVE EQUATION was saved as a .eqn file. Amy's computer could not automatically open this file, but it turned out to be readable in a text editor. This file should have been saved with a .txt extension (in addition to the .eqn file, if the software she was using required that extension). The file contains a script for calculating best fit curves for the growth curve analysis.
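
Before giving up on a file with an unfamiliar extension, it can be worth checking whether it is plain text, as the .eqn file turned out to be. The following Python sketch shows one rough heuristic, assuming that a file whose first few kilobytes decode as UTF-8 and contain no NUL bytes is probably text. The function name `looks_like_text` and the sample size are illustrative choices.

```python
import codecs


def looks_like_text(path, sample_size=4096):
    """Heuristic: True if the start of the file decodes as UTF-8 text."""
    with open(path, "rb") as handle:
        sample = handle.read(sample_size)
    if b"\x00" in sample:
        return False  # NUL bytes are a common sign of binary content
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multi-byte character that was cut off
        # at the edge of the sample.
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True
```

This is only a first check. Older text files saved in other encodings may fail the UTF-8 test even though a text editor can still display them, so a False result does not prove that a file is unreadable.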

screenshot of file directory

The growth curve data shown were saved in a proprietary spreadsheet format called .pgw that Amy can no longer open. The software used to generate these files is not known. An internet search indicated that .pgw is a map file, which is not correct for these files. These data tables should have been exported as .csv or some other open format to preserve their future readability.
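
Exporting a table to CSV is usually a short step if it is done while the original software is still available. As a minimal sketch, assuming the table is already held in memory as a header and a list of rows, Python's standard `csv` module can write it out. The function name `export_table_to_csv` and the column names in the usage note are hypothetical.

```python
import csv


def export_table_to_csv(path, header, rows):
    """Write a table to a UTF-8 encoded CSV file with a header row."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
```

For example, `export_table_to_csv("growth_curve.csv", ["time_hours", "optical_density"], rows)` would write one line per measurement, where `rows` is your own list of value pairs. The resulting file can be opened in any text editor or spreadsheet program.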

Last Updated: Sep 13, 2023 4:20 PM
URL: https://guides.library.stanford.edu/data-best-practices
