Accessing CGCI Data
- ANNOUNCEMENT -
Harmonized CGCI (phs000235) Data Released in the NCI GDC Data Portal
- RNA-Seq and miRNA-Seq
- Targeted Sequencing
See GDC Data Release Notes for more information on data releases.
The CGCI produces large-scale genomic data sets for adult and pediatric cancers as a “community resource project”. Participants in the HIV+ Tumor Molecular Characterization Project (HTMCP) and the Burkitt Lymphoma Genome Sequencing Project (BLGSP) must be familiar with the CGCI Publication Guidelines.
Read the following user guide to learn how to search and download data generated by CGCI.
CGCI employs human subjects’ protection and data access policies to protect the privacy and confidentiality of the research participants. Depending on the risk of patient identification, CGCI data are available to the scientific community in two tiers: open- or controlled-access. Both types of data can be found through the CGCI Data Matrix.
Data within this tier present minimal risk of patient identification. Much of CGCI data, excluding patient identifiers, are open access. These data can be analyzed, for example, to make correlations between molecular subtypes and clinical outcomes. CGCI provides the scientific community the maximum amount of open-access data allowable under HIPAA guidelines. Researchers may explore open-access CGCI data content without restriction.
Researchers can access these data by clicking on any link labeled "DCC Open" in the CGCI Data Matrix.
- Clinical information that could not be used to identify the patient
- Tissue pathology data
- Gene expression data (other than 1º exon array data or mRNA-seq)
- Chromosome-specific (segmented) copy number alterations and loss of heterozygosity
- Tumor-associated (somatic) mutations
Data within this category present a small risk of patient re-identification. While stripped of direct patient identifiers as defined by HIPAA, controlled-access data contain specific patient/tumor information and unverified or raw molecular data (e.g., array-based and sequence files). Examples of controlled-access data are:
- Specific demographic and clinical data
- Specific genotype or phenotype data for each case
- Whole genome, exome or transcriptome sequences for an individual case
These data can be used to perform sophisticated bioinformatics analyses. Access to these data requires user certification, which can be obtained through NCBI’s dbGaP (National Center for Biotechnology Information’s database of Genotypes and Phenotypes). Researchers apply for access by filling out a Data Access Request form. Read “How to Access Protected Data” below for more information.
General Outline of Instructions:
- Obtain Data Use Certification through dbGaP
- Maintain User Account for Data Access
- Access Data via the CGCI Data Matrix
(b) Use HHS credentials (intramural investigators) or an eRA Commons account (extramural investigators) to access data stored in NCBI databases
All users requesting access to controlled data must:
Have an eRA Commons account (extramural investigators) or HHS credentials (intramural investigators) to submit requests for access. Further information can be found on the NCBI dbGaP homepage.
Complete the electronic dbGaP Data Access Request (SF 424 (R&R)) form, which requires a brief description of the investigator’s intended use of the data. To get approved for a Data Use Certification (DUC), the requestors must:
- Agree to use the information for general research. Note, however, that data use limitations apply to some cases contributed by specific tissue source sites.
- Agree not to try to identify and/or contact the patients.
- Submit requests that agree with the Data Use Limitations specific to the desired data’s appropriate consent group. Investigators must get a DUC for each consent group to gain access to all CGCI datasets.
CGCI Data Use Certifications: Cancer Research and General Methods
Types of Data: HIV-related cancers (DLBCL, NSCLC and Cervical Cancer) and Burkitt Lymphoma
Data Use Limitations: Use of the data is limited to scientific research relevant to the biology, prevention, treatment, and late complications of cancers and for the development of applications proposing analytical methods, software, and other research tools.
- Submit the completed SF 424 (R&R) form electronically to dbGaP for consideration of data access approval.
- Upon SF 424 (R&R) form submission, the signing official of the Principal Investigator’s institution will be notified of the submission and asked to certify agreement with the Data Use Limitations stated within the Data Access Request form.
- After the signing official has certified agreement, the SF 424 (R&R) application will be sent to the NCI Data Access Committee (DAC) to review for approval. The approval review process can take 2-4 weeks.
- Approval in the form of an individual DUC allows the investigator data access to that consent group’s data for one calendar year.
- Submit a progress report to the DAC no later than one year after obtaining the DUC. The requestor must understand that submitting a progress report is a condition of continued data access. Approved users may also apply to renew their access to protected data when they submit the report. DAC staff will send a reminder to submit the annual progress report and, if needed, renew approval status approximately one month before the access termination deadline. If the requestor does not submit the progress report or request a renewal, access to the data will cease.
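The timing in the steps above (one year of access, with a reminder roughly a month before it ends) can be sketched as a small date calculation. This is an illustration under stated assumptions, not a dbGaP or DAC tool: the function name `duc_timeline` is hypothetical, "one calendar year" is read as the same date in the following year, and "approximately one month" is taken to be 30 days.

```python
from datetime import date, timedelta


def duc_timeline(approval: date) -> dict:
    """Estimate key dates for a Data Use Certification (DUC).

    Assumptions (not stated in the guide): access ends on the same
    calendar date one year after approval, and the reminder goes out
    30 days before that date.
    """
    try:
        expires = approval.replace(year=approval.year + 1)
    except ValueError:
        # Approval fell on 29 February; the following year has no such day.
        expires = approval.replace(year=approval.year + 1, day=28)
    reminder = expires - timedelta(days=30)
    return {"expires": expires, "reminder": reminder}


timeline = duc_timeline(date(2023, 3, 15))
print(timeline["reminder"], timeline["expires"])
```

The actual deadlines are set by the DAC, so dates computed this way should be treated as planning estimates only.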
Intramural investigators with an approved DUC may access protected CGCI data using their HHS credentials.
Investigators outside of HHS with an approved DUC require two separate user accounts to access protected CGCI data:
- For access to CGCI data stored and maintained at the National Cancer Institute's (NCI) Genomic Data Commons (GDC) and at NCBI – approved users can access these data using the eRA Commons account associated with the original Data Access Request. CGCI data at the GDC include raw and aligned reads from next-generation sequencing (FASTQ and BAM files). CGCI data stored at NCBI include aligned reads from next-generation sequencing (BAM files), all of which are accessible via the NCBI Sequence Read Archive (SRA).
- For access to data stored and maintained at the OCG Data Coordinating Center (DCC) at NCI – approved users outside of HHS must use their eRA Commons credentials to log on to Globus.org to access CGCI controlled-access data housed at the National Cancer Institute’s Office of Cancer Genomics Data Coordinating Center (NCI OCG DCC). (Upon approval, approved PIs and designated downloaders will receive an email with detailed instructions on how to use Globus.org to access OCG DCC data.) This account is used to access data at the OCG DCC, which include most of the genomic data generated for the CGCI initiative (clinical information, all levels of chip-based molecular characterization, and higher-level sequencing data). ***The password on this account must be updated every 90 days, although in some instances this period can be extended. Instructions are distributed when the account is created.***
Approved users may access protected CGCI data through the CGCI Data Matrix with either HHS credentials or with eRA Commons credentials via Globus.org (as outlined in #2).
A Globus.org account associated with eRA Commons is required to access controlled-access CGCI data. (Users may already have another Globus account that is not linked to eRA Commons; however, only Globus accounts linked to eRA Commons can be used to access this dataset.)
- Access data stored at the OCG DCC directly through the CGCI Data Matrix (requires eRA Commons account for extramural investigators):
- Protected clinical information
- Raw chip-based molecular characterization data
- Processed sequencing data (upper level files, except for Epstein-Barr virus BAM files from pediatric Burkitt lymphoma cases)
- Access sequence files stored at NCBI indirectly through hyperlinks on the CGCI Data Matrix (requires eRA Commons account for extramural investigators):
- BAM files stored in the Sequence Read Archive (SRA), accessible through NCBI dbGaP: next-generation whole genome, exome, mRNA-seq, miRNA-seq
- FASTQ/BAM files stored at the GDC, accessible through the NCI GDC website with eRA login: whole genome, exome, mRNA-seq, miRNA-seq
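The access routes listed above can be summarized as a lookup keyed by where the data are stored. This is only a sketch for orientation: the names `EXTRAMURAL_ROUTES` and `access_route` are hypothetical, the route descriptions paraphrase this guide, and nothing here calls a real GDC, NCBI, or Globus API.

```python
# Route descriptions paraphrase this guide; all names here are hypothetical.
EXTRAMURAL_ROUTES = {
    "OCG DCC": "eRA Commons credentials via Globus.org",
    "NCBI": "eRA Commons account via NCBI dbGaP (SRA)",
    "GDC": "eRA Commons login at the NCI GDC website",
}


def access_route(location: str, intramural: bool) -> str:
    """Return the login route for controlled-access data at a storage site."""
    if location not in EXTRAMURAL_ROUTES:
        raise KeyError(f"unknown storage location: {location!r}")
    if intramural:
        # Per the guide, intramural investigators with an approved DUC
        # use their HHS credentials.
        return "HHS credentials"
    return EXTRAMURAL_ROUTES[location]


print(access_route("OCG DCC", intramural=False))
```

In every case an approved DUC is assumed; the lookup only indicates which credentials apply once approval is in place.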
- For NCI-stored data – OCG@mail.nih.gov
- For NCBI-stored data – https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=email&from=login
CGCI data are accessible through a tabular, easy-to-use Data Matrix. New data from ongoing projects are incorporated into the Data Matrix as they become available.
The CGCI Data Matrix links to both open- and controlled-access CGCI data. To obtain specific datasets or metadata, including descriptions of each project, users can click the text within the table to access the appropriate files.
- Raw or low level data files (level 1)
- Normalized and integrated data (levels 2 and 3)
- Summarized findings (level 4)
Data Access Code
- Brown = open access
- Red = controlled access (NCI & NCBI)
- Black = unavailable
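For anyone scripting against an exported copy of the matrix, the color legend above could be encoded as a small lookup. This is a hypothetical sketch: the guide does not describe any machine-readable export, and the names `ACCESS_BY_COLOR`, `describe_link`, and `requires_duc` are assumptions made for illustration.

```python
# Legend copied from the "Data Access Code" list; the names are hypothetical.
ACCESS_BY_COLOR = {
    "brown": "open access",
    "red": "controlled access",
    "black": "unavailable",
}


def describe_link(color: str) -> str:
    """Return the access status that a Data Matrix link color stands for."""
    return ACCESS_BY_COLOR[color.strip().lower()]


def requires_duc(color: str) -> bool:
    """True when the link points to controlled-access data, which needs a DUC."""
    return describe_link(color) == "controlled access"


for color in ("Brown", "Red", "Black"):
    print(color, "->", describe_link(color), "| DUC needed:", requires_duc(color))
```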
Types of Data Found in the Matrix
- Names of diseases studied
- Clinical information, including outcomes
- Types of molecular data generated and platforms used
- Metadata descriptions about each individual project
- Multi-level chip-based and sequencing data links