Guidelines for NHLBI Data Set Preparation
The purpose of this document is to provide information and guidance in the preparation of NHLBI data repository datasets and associated documentation for submission to the Biological Specimen and Data Repository Information Coordinating Center (BioLINCC) in accordance with the NHLBI Policy for Data Sharing from Clinical Trials and Epidemiological Studies.
Refer to the NHLBI Clinical Research Guide Glossary for additional terms not defined below.
Data - Information collected and recorded from study participants through periodic examinations and follow-up contacts, not to include original specimens or images.
Commercial purpose - Data will be considered as being for a commercial purpose if they are to be used by an investigator who is an employee of a for-profit organization, if they are to be used by an investigator to satisfy a contractual relationship with a for-profit organization, or if they are to be used by an investigator as the basis for a consulting relationship with a for-profit organization. Data will also be considered as being for a commercial purpose if the investigator(s) take any affirmative steps to facilitate commercial use of results derived from the data.
Non-Commercial Purpose Data Set - A data set consisting of all records except those for participants who requested that their data not be shared beyond the initial study investigators.
Commercial Purpose Data Set - A data set consisting of all records except those for participants who requested that their data not be shared beyond the initial study investigators or used for commercial purposes.
Non-Commercial Purpose Pedigree/Genetic Data Set - A pedigree/genetic data set consisting of all pedigree and genetic data except those for participants who requested that their data not be shared beyond the initial study investigators.
Commercial Purpose Pedigree/Genetic Data Set - A pedigree/genetic data set consisting of all pedigree and genetic data except those for participants who requested that their data not be shared beyond the initial study investigators or used for commercial purposes.
Overview of Responsibilities in Preparing Data Sets for Sharing
Investigators in NHLBI studies covered by the Policy for Data Sharing from Clinical Trials and Epidemiological Studies are required, as part of the terms and conditions of their awards, to prepare and deliver to the NHLBI data sets that satisfy NHLBI requirements. These requirements include the elimination of personal identifiers and the modification of other data elements so as to reduce the likelihood that any individual participant can be identified. Additional requirements include the provision of adequate dataset documentation to enable the use of prepared datasets by outside investigators, as well as the submission of key study documents (protocol, data collection forms, manuals of procedures, etc.).
Two data sets, i.e., a Non-Commercial Purpose Data Set and a Commercial Purpose Data Set, and, if applicable, two pedigree/genetic data sets, i.e., a Non-Commercial Purpose Pedigree/Genetic Data Set and a Commercial Purpose Pedigree/Genetic Data Set, and associated documentation, must be provided in electronic form to the Institute. In addition, investigators must provide the Institute with two separate lists of participant identification numbers, one consisting of those participants who asked that their data not be shared beyond the initial study investigators and the other of those participants who asked that their data not be used for commercial purposes.
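The relationship between the two data sets and the two exclusion lists can be sketched as follows. This is a minimal illustration in Python, not a prescribed procedure: the function name `split_for_sharing`, the `participant_id` field, and the sample records are all hypothetical, and a real study would apply the same filtering in its own data management system.

```python
def split_for_sharing(records, no_sharing_ids, no_commercial_ids):
    """Return (non_commercial, commercial) lists of records.

    records: list of dicts, each with a hypothetical "participant_id" key.
    no_sharing_ids: participants who asked that their data not be shared
        beyond the initial study investigators.
    no_commercial_ids: participants who asked that their data not be used
        for commercial purposes.
    """
    no_sharing = set(no_sharing_ids)
    no_commercial = set(no_commercial_ids)

    # Non-commercial set: everyone except those who refused all sharing.
    non_commercial = [
        r for r in records if r["participant_id"] not in no_sharing
    ]
    # Commercial set: additionally drop those who refused commercial use.
    commercial = [
        r for r in non_commercial if r["participant_id"] not in no_commercial
    ]
    return non_commercial, commercial


# Made-up example records for illustration only.
records = [
    {"participant_id": "A01", "sbp": 128},
    {"participant_id": "A02", "sbp": 141},
    {"participant_id": "A03", "sbp": 117},
    {"participant_id": "A04", "sbp": 135},
]
non_commercial, commercial = split_for_sharing(records, ["A02"], ["A04"])
print([r["participant_id"] for r in non_commercial])  # ['A01', 'A03', 'A04']
print([r["participant_id"] for r in commercial])      # ['A01', 'A03']
```

Because the commercial set is derived from the non-commercial set, it is always a subset of it, which matches the definitions given earlier in this document.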
Investigators in ancillary studies based on ongoing (parent) studies that are required by this policy to produce data sets must submit ancillary study data to the NHLBI through the parent study coordinating center or data submission process established by the parent study.
Types of Data to be Included in NHLBI Repository Data Sets
In addition to summary information, data sets include for each participant those raw data elements (e.g., food item data, individual electrocardiographic lead scores, etc.) that have not otherwise been processed into summary information.
- Clinical Trials - included are baseline, interim visit, ancillary, and outcome data, along with laboratory measurements not otherwise summarized.
- Observational Epidemiology Studies - included are all of the examination data obtained in each examination cycle, ancillary data, and/or all of the follow-up information available up to the last follow-up cycle cutoff date.
Guidelines for Redaction/Summarization of NHLBI Data Sets
The NHLBI requires that the data be provided in a manner that protects the privacy of study participants. The Institute requires appropriate documentation of the steps taken to protect their privacy in preparing a data set. A summary of all proposed modifications and deletions to be made to a data set must be submitted to and approved by the NHLBI Data Repository representative prior to their implementation.
The following guidelines provide a framework for decision-making regarding preparation of data sets:
- All data for participants who refused to permit sharing of their data with other researchers must be deleted from both the Non-Commercial Purpose Data Set and the Commercial Purpose Data Set.
- All data for participants who refused only to permit use of their data for commercial purposes must also be deleted from the Commercial Purpose Data Set.
- Participant identifiers:
- Obvious identifiers (e.g., name, addresses, social security numbers, place of birth, city of birth, contact data) must be deleted.
- New identification numbers must replace original identification numbers. Codes linking the new and original data should be sent to the NHLBI in a separate file, not included on the CD ROM, so that linkage may be made if necessary for future research.
- Variables that might lead to the identification of participants and of centers in multicenter studies, or variables that are sensitive, inaccurate, or of limited scientific utility:
- Clinical center identifier -- In trials or studies that have only a few centers and relatively few participants per center, the data set should not contain center identifiers. In trials that have either many centers or a large number of participants per center, the data may offer little possibility of identifying individuals. In such cases, the investigators and the NHLBI will determine on a case-by-case basis whether to include center identifiers.
- Interviewer or technician identification numbers must be recoded or deleted.
- Sensitive data, including illicit drug use, risky behaviors (e.g., carrying a gun or exhibiting violent behavior), sexual behaviors, and selected medical conditions (e.g., alcoholism, HIV/AIDS) must be deleted.
- Regional variables with little or no variation within a center must be deleted, because they could be used to identify that center.
- Unedited, verbatim responses that are stored as text data (e.g., specified in "other" category) must be deleted.
- Pedigree and genetic data will be distributed in separate data sets only to investigators specifically requesting them. Genotyping data for any person in whom potential pedigree errors are detected must be deleted.
- Dates: All dates should be coded relative to a specific reference point (e.g., date of randomization or study entry). This provides privacy protection for individuals known to be in a study who are known to have had some significant event (e.g., a myocardial infarction) on a particular date.
- Variables with low frequencies for some values, that might be used to identify participants, may be recoded. These might include:
- Socioeconomic and demographic data (e.g., marital status, occupation, income, education, language, number of years married).
- Household and family composition (e.g., number in household, number of siblings or children, ages of children or step-children, number of brothers and sisters, relationships, spouse in study).
- Number of pregnancies, births, or multiple children within a birth.
- Anthropometric measures (e.g., height, weight, waist girth, hip girth, body mass index).
- Physical characteristics (e.g., missing limbs).
- Detailed medication, hospitalization, and cause of death codes, especially those related to sensitive medical conditions as listed above, such as HIV/AIDS or psychiatric disorders.
- Prior medical conditions with low frequency (e.g., group specific cancers into broader categories) and related questions such as age at diagnosis and current status.
- Parent and sibling medical history (e.g., parents' ages at death).
- Race/ethnicity and sex information when very few participants are in certain groups or cells.
- Polychotomous variables: values or groups should be collapsed so as to ensure a minimum number of participants (e.g., at least 20) for each value within each race-sex cell.
- Continuous variables: distributions should be truncated if needed to ensure that a minimum number of participants (e.g., at least 20) have the same highest and lowest values in each race-sex cell.
- Dichotomous variables: data should either be grouped with other related variables so as to ensure a minimum number of participants (e.g., at least 20) in each race-sex cell or deleted.
- The investigators may realize that other variables may make it easy to identify individuals. All such variables should be recoded or removed. The NHLBI Data Repository representative should be consulted concerning such variables.
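Three of the guidelines above lend themselves to a short illustration: replacing identification numbers, coding dates relative to a reference point, and truncating continuous variables. The sketch below uses Python's standard library only. The function names, the `P00001`-style ID format, and the shuffled sequential numbering are assumptions made for illustration; the guidelines do not prescribe any particular scheme. The value 20 is the example minimum mentioned above, not a fixed rule.

```python
import random
from datetime import date


def assign_new_ids(original_ids, rng=None):
    """Map each original ID to a new ID assigned in shuffled order.

    The returned dict is the linkage between original and new IDs; it
    would be stored in a separate file, apart from the shared data set.
    """
    rng = rng or random.Random()
    shuffled = sorted(set(original_ids))
    rng.shuffle(shuffled)  # new numbers carry no trace of original order
    return {orig: f"P{n:05d}" for n, orig in enumerate(shuffled, start=1)}


def days_from_reference(event_date, reference_date):
    """Code a date as days since a reference point (e.g., randomization)."""
    return (event_date - reference_date).days


def truncate_tails(values, min_count=20):
    """Top- and bottom-code values so at least min_count share each extreme.

    Intended to be applied separately within each race-sex cell.
    """
    if len(values) < 2 * min_count:
        raise ValueError("cell too small to truncate; recode or delete")
    ordered = sorted(values)
    low = ordered[min_count - 1]   # the min_count-th smallest value
    high = ordered[-min_count]     # the min_count-th largest value
    return [min(max(v, low), high) for v in values]


# Made-up values for illustration only.
linkage = assign_new_ids(["1047", "1102", "1003"])
print(sorted(linkage.values()))  # ['P00001', 'P00002', 'P00003']

print(days_from_reference(date(2004, 3, 15), date(2004, 1, 10)))  # 65

truncated = truncate_tails(list(range(100)), min_count=20)
print(min(truncated), max(truncated))  # 19 80
```

Collapsing polychotomous variables and grouping dichotomous variables depend on the meaning of each category, so they are not shown here; those decisions generally need case-by-case review.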
Dataset and Study Documentation
Documentation for data sets must be comprehensive and sufficiently clear to enable investigators who are not familiar with a data set to use it. The documentation must include data collection forms, study procedures and protocols, descriptions of all variable recoding performed, and a list of major study publications.
In addition, a summary documentation file, usually called a "readme" file, is required. It must provide a complete overview of the data and a description of their use for investigators who are not familiar with the data set. It must also contain a brief description of the study (including a general orientation to the study, its components, and its examination and follow-up timeline), a listing of all files being provided, a description of system requirements, program code for generating a SAS data file from the SAS export data file, and a frequency distribution for selected key variables.
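As a rough illustration of the frequency distribution mentioned above, the following Python sketch tabulates counts and percentages for one variable. The function name `frequency_table`, the `sex` variable, and the sample records are hypothetical; studies that work in SAS would more likely produce the equivalent table there.

```python
from collections import Counter


def frequency_table(records, variable):
    """Return (value, count, percent) rows for one variable.

    Records lacking the variable are counted under the value None.
    """
    counts = Counter(r.get(variable) for r in records)
    total = sum(counts.values())
    return [
        (value, n, round(100 * n / total, 1))
        for value, n in sorted(counts.items(), key=lambda kv: str(kv[0]))
    ]


# Made-up records for illustration only.
sample = [{"sex": "F"}, {"sex": "M"}, {"sex": "F"}, {"sex": "F"}]
for value, n, pct in frequency_table(sample, "sex"):
    print(value, n, pct)
# F 3 75.0
# M 1 25.0
```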
Selected study documentation will be used to describe the study on the Data Repository website. Examples include Forms, Data Dictionaries, Descriptive Statistics, and the Study Protocol. These documents will need to be accessible to those with disabilities according to section 508 of the Rehabilitation Act. The HHS maintains a website devoted to 508 issues with links to resources on creating and checking accessibility at http://www.hhs.gov/web/508/index.html.
Format, Storage and Delivery of Study Materials
Both the comprehensive documentation and the summary documentation must be prepared in a consistent format, as WordPerfect, MS Word, ASCII, or portable document format (PDF) files, and included on the same storage medium as the data set. To ensure access by users with disabilities, all PDF files must be created in Adobe Acrobat version 5.0 or higher. Documentation that is not available in electronic form, such as data collection forms, should be scanned into a graphics file, converted to a PDF file using Adobe Acrobat version 5.0 or higher, and saved on the same medium as the data set. Pedigree data should be provided in a format readable by standard genetic analysis programs such as SAGE and SOLAR, with one individual's data per line beginning with pedigree identifier, individual's ID, father's ID, mother's ID, and individual's sex.
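The pedigree line layout described above can be sketched as follows. The field order comes from this document; everything else is an assumption made for illustration, including the function name `pedigree_line`, the space delimiter, the use of `0` for an unknown parent, and the sex codes. The exact conventions expected by a given analysis program should be checked against that program's own documentation.

```python
def pedigree_line(pedigree_id, individual_id, father_id, mother_id, sex,
                  *other_fields):
    """Format one individual's record as a single delimited line.

    Field order: pedigree ID, individual ID, father ID, mother ID, sex,
    followed by any further data fields.
    """
    unknown_parent = "0"  # assumed placeholder, not specified in the source
    fields = [
        pedigree_id,
        individual_id,
        father_id or unknown_parent,
        mother_id or unknown_parent,
        sex,
        *other_fields,
    ]
    return " ".join(str(f) for f in fields)


# Made-up three-person pedigree: two founders and one child.
family = [
    ("F1", "1", None, None, "M"),
    ("F1", "2", None, None, "F"),
    ("F1", "3", "1", "2", "F"),
]
for person in family:
    print(pedigree_line(*person))
# F1 1 0 0 M
# F1 2 0 0 F
# F1 3 1 2 F
```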
Data are to be stored on a CD ROM unless the investigators and the NHLBI mutually agree upon an alternative storage medium.
Data and study materials are to be sent to the NHLBI prior to the end of funding according to the timelines described in the NHLBI Policy for Data Sharing from Clinical Trials and Epidemiological Studies.
The following links highlight NIH policy and related guidance on sharing of research data developed with NIH funding.
NHLBI Policy for Data Sharing from Clinical Trials and Epidemiological Studies
Biological Specimen and Data Repository Information Coordinating Center - BioLINCC
6701 Rockledge Dr.
Bethesda, MD 20892
For questions and/or concerns regarding the content of this page, please contact the Clinical Research Policy Manager.
Last Updated: December 2011