Data best practices and case studies

How to use best practices for managing your research data, along with case studies and examples to help you use these techniques.

Work with sensitive data

All data at Stanford is classified into risk categories. Many researchers on our campus work with patient health or other personal information. These types of data are classified into different categories, each requiring its own level of security. 

Detailed descriptions and explanations of these data classifications can be found on University IT's web site. At the bottom of that page you'll find a chart of services that shows which ones can be used for which categories of data. 

If your research involves human subjects, your work may need to be overseen by Stanford's Institutional Review Board (IRB). The IRB's goal is to protect human research participants in both medical and non-medical research projects. You should contact the IRB when you are planning a research project involving human subjects. You may also want to review information from the Stanford Research Compliance Office on protecting the confidentiality of patient information, as well as information from the Dean of Research on human subjects and stem cells in research.

Sensitive data that contain potentially identifying information, whether human subject data or other types of sensitive data, will likely need to be modified before they are shared with the public. These modifications protect participant confidentiality, the location of endangered wildlife, or other relevant interests. However, they may alter the data to the point where reproducibility or subsequent research by others is no longer possible. Consider retaining multiple versions of the data: one that is suitable for public release, and one that is suitable for further research but available only on a highly restricted basis.

For protected health information (PHI), the HIPAA Privacy Rule provides two methods for de-identification: the expert determination method and the safe harbor method. See the resources tab for information on these methods from the US Department of Health and Human Services.

Direct identifiers

These data point directly to an individual and are typically removed from data sets before sharing with the public.

These may include:

  • name
  • initials
  • mailing address
  • phone number
  • email address
  • unique identifying numbers, like Social Security numbers or driver's license numbers
  • vehicle identifiers
  • medical device identifiers
  • web or IP addresses
  • biometric data
  • photographs of the person
  • audio recordings
  • names of relatives
  • dates specific to the individual, like date of birth, marriage, etc.
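
As a minimal sketch of removing direct identifiers before public release, the Python below drops a set of fields from each record while leaving the restricted version untouched. The field names in DIRECT_IDENTIFIERS and the sample record are hypothetical; a real dataset needs its own project-specific list, and dropping fields alone is not a substitute for either HIPAA de-identification method.

```python
# Hypothetical field names; adapt the set to your own dataset.
DIRECT_IDENTIFIERS = {
    "name", "initials", "mailing_address", "phone_number",
    "email_address", "ssn", "date_of_birth",
}


def strip_direct_identifiers(records, identifiers=DIRECT_IDENTIFIERS):
    """Return copies of the records without the listed identifier fields."""
    return [
        {key: value for key, value in record.items() if key not in identifiers}
        for record in records
    ]


# Made-up example record standing in for the restricted version of the data.
restricted = [
    {"name": "Example Person", "email_address": "person@example.org",
     "date_of_birth": "1980-01-01", "diagnosis_code": "A00", "visit_count": 3},
]

# The public version is built from copies, so the restricted data is unchanged.
public = strip_direct_identifiers(restricted)
print(public)
```

Because the function returns new dictionaries, the same source can back both a restricted version and a public version of the data.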

 

Indirect identifiers

These may seem harmless on their own, but can point to an individual when combined with other data. It has been recommended (see the BMJ article on the Resources tab) that datasets containing three or more indirect identifiers be reviewed by an independent researcher or ethics committee to evaluate identification risk. Any indirect identifier not needed for the analysis should be removed. It may be reasonable to supply some of these types of data in aggregated form (for example, ranges of annual income instead of exact figures).

Indirect identifiers may include:

  • place of medical treatment or doctor's name
  • gender
  • rare disease or treatment
  • sensitive data like illicit drug use or other "risky behaviors"
  • place of birth
  • socioeconomic data, like workplace, occupation, annual income, education, etc.
  • general geographic indicators, like postal code of residence
  • household and family composition
  • ethnicity
  • birth year or age
  • verbatim responses or transcripts
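
The two recommendations above (flagging datasets with three or more indirect identifiers, and releasing values such as income as ranges) can be sketched in Python as follows. The field names, the 25,000 range width, and both function names are assumptions made for illustration; the sketch also assumes incomes are stored as non-negative integers. It is a screening aid, not a replacement for review by an independent researcher or ethics committee.

```python
# Hypothetical field names for indirect identifiers; adapt to your dataset.
INDIRECT_IDENTIFIERS = {
    "gender", "place_of_birth", "occupation", "annual_income",
    "postal_code", "ethnicity", "birth_year",
}
REVIEW_THRESHOLD = 3  # "three or more" indirect identifiers


def needs_independent_review(columns, threshold=REVIEW_THRESHOLD):
    """True when the dataset holds `threshold` or more indirect identifiers."""
    present = INDIRECT_IDENTIFIERS.intersection(columns)
    return len(present) >= threshold


def income_range(income, width=25_000):
    """Replace an exact integer income with a range label."""
    lower = (income // width) * width
    return f"{lower}-{lower + width - 1}"


columns = ["gender", "postal_code", "birth_year", "visit_count"]
print(needs_independent_review(columns))
print(income_range(62_500))
```

A wider range gives more protection but less analytic detail, so the width is a trade-off to settle for each dataset.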

"Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule," US Department of Health and Human Services, Office for Civil Rights.

Hrynaszkiewicz, I, Norton, ML, Vickers, AJ and Altman, DG. "Preparing raw clinical data for publication: guidance for journal editors, authors, and peer reviewers." BMJ 2010;340:c181.

"Preparing Data for Sharing" from the Inter-University Consortium for Political and Social Research (ICPSR). (2012). Guide to Social Science Data Preparation and Archiving: Best Practice Throughout the Data Life Cycle (5th ed.). Ann Arbor, MI. 

Tools for sensitive information

The following are tools available to Stanford researchers who are collecting and managing patient health or other sensitive information. It is not recommended that you collect sensitive data using Excel. Use Excel only for analysis of de-identified or anonymized data. 

Visit Stanford Medicine Research IT for more information and to request a free consultation about using these tools or working with sensitive data.

REDCap

REDCap (Research Electronic Data Capture) is an application for building and managing online databases. The Stanford Center for Clinical Informatics (SCCI) runs and supports a secure, local Stanford installation of REDCap for the Stanford research community at no cost. REDCap provides a web-based interface for collecting data with data validation and supports automated export to statistical packages. The software also includes data logging for HIPAA compliance and lets administrators define access rights on a per-user basis. Data stored in production REDCap databases is not automatically purged, but archiving completed projects within REDCap is recommended. If the REDCap service were replaced or discontinued, all project owners would be notified and a plan would be devised to allow ample time for owners to export their data.

STARR

The STAnford Research Repository, or STARR, is Stanford Medicine's approved resource for working with clinical data for research purposes. STARR is a data resource containing data from Stanford Health Care and the Stanford Children’s Hospital. It supports diverse use cases and research applications. The STARR IRB permits the collection and aggregation of all data generated at Stanford for clinical care purposes, and articulates the formal approval process each research project must follow in order to obtain and work with this data for research purposes.

STARR is the home of two web tools, one for Cohort Discovery, the other for Chart Review.

Qualtrics

Qualtrics is an online survey tool with customizable templates, the ability to send and track invitations and reminders, and in-depth reporting. The service includes the ability to generate reports, view statistics, and export data for analysis. Qualtrics may be used to store and transmit Low, Moderate, and High Risk Data containing protected health information (PHI). It may not be used to store or transmit High Risk Data that is not PHI.