      TCGA Expedition: A Data Acquisition and Management System for TCGA Data





          The Cancer Genome Atlas Project (TCGA) is a National Cancer Institute effort to profile at least 500 cases of 20 different tumor types using genomic platforms and to make these data, both raw and processed, available to all researchers. TCGA data are currently over 1.2 petabytes in size and include whole genome sequence (WGS), whole exome sequence, methylation, RNA expression, proteomic, and clinical datasets. Publicly accessible TCGA data are released through public portals, but many challenges exist in navigating and using data obtained from these sites. We developed TCGA Expedition to support the research community focused on computational methods for cancer research. Data obtained, versioned, and archived using TCGA Expedition support command-line access at high-performance computing facilities as well as some functionality with third-party tools. For a subset of TCGA data collected at the University of Pittsburgh, we also re-associate TCGA data with de-identified data from the electronic health records. Here we describe the software as well as the architecture of our repository, methods for loading TCGA data onto multiple platforms, and security and regulatory controls that conform to federal best practices.


          TCGA Expedition software consists of a set of scripts written in Bash, Python, and Java that download, extract, harmonize, version, and store all TCGA data and metadata. The software generates a versioned, participant- and sample-centered, local TCGA data directory with metadata structures that directly reference the local data files as well as the original data files. The software supports flexible searches of the data via a web portal, user-centric data tracking tools, and data provenance tools. Using this software, we created a collaborative repository, the Pittsburgh Genome Resource Repository (PGRR), that enabled investigators at our institution to work with all TCGA data formats and to interrogate these data with analysis pipelines and associated tools. WGS data are especially challenging for individual investigators to use because of issues with downloading, storage, and processing; having locally accessible WGS BAM files has proven invaluable.
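
          The idea of a versioned, participant- and sample-centered local directory whose metadata reference both the local file and the original file can be illustrated with a minimal sketch. This is not the TCGA Expedition code: the directory layout, the `metadata.json` index, the `FileRecord` fields, and the `store_versioned` function are all hypothetical names chosen for illustration, and the sketch assumes that a new version is created only when the content differs from the most recently stored version.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class FileRecord:
    """Hypothetical metadata record linking a local copy to its source."""
    participant_id: str
    sample_id: str
    data_type: str
    version: int
    local_path: str    # where the archived copy lives locally
    original_url: str  # where the file was originally obtained
    sha256: str        # content checksum used to detect changes


def store_versioned(root: Path, participant_id: str, sample_id: str,
                    data_type: str, filename: str, content: bytes,
                    original_url: str) -> FileRecord:
    """Store content under root/participant/sample/data_type/vN/filename.

    A new version directory is created only when the checksum differs
    from the latest recorded version; otherwise the latest record is
    returned unchanged.
    """
    sample_dir = root / participant_id / sample_id / data_type
    sample_dir.mkdir(parents=True, exist_ok=True)

    index_path = sample_dir / "metadata.json"  # assumed index file name
    records = json.loads(index_path.read_text()) if index_path.exists() else []

    digest = hashlib.sha256(content).hexdigest()
    if records and records[-1]["sha256"] == digest:
        return FileRecord(**records[-1])  # unchanged since last download

    version = len(records) + 1
    version_dir = sample_dir / f"v{version}"
    version_dir.mkdir(exist_ok=True)
    local_path = version_dir / filename
    local_path.write_bytes(content)

    record = FileRecord(participant_id, sample_id, data_type, version,
                        str(local_path), original_url, digest)
    records.append(asdict(record))
    index_path.write_text(json.dumps(records, indent=2))
    return record
```

          In this sketch older versions are never overwritten, so an analysis can keep pointing at the exact file version it used, and each record retains the original location for provenance.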


          Our open-source, freely available TCGA Expedition software can be used to create a local collaborative infrastructure for acquiring, managing, and analyzing TCGA data and other large public datasets.

                Author and article information

                Role: Editor
                PLoS ONE
                Public Library of Science (San Francisco, CA, USA)
                27 October 2016; 11(10): e0165395
                [1] Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, PA, United States of America
                [2] University of Pittsburgh Cancer Institute, Pittsburgh, PA, United States of America
                [3] Department of Human Genetics, University of Pittsburgh School of Public Health, Pittsburgh, PA, United States of America
                [4] Center for Simulation and Modeling, University of Pittsburgh, Pittsburgh, PA, United States of America
                [5] Pittsburgh Supercomputing Center, Carnegie Mellon University, Pittsburgh, PA, United States of America
                [6] Department of Pharmacology and Cell Biology, University of Pittsburgh, Pittsburgh, PA, United States of America
                [7] Magee-Women’s Research Institute, Pittsburgh, PA, United States of America
                [8] UPMC Corporate Services, Pittsburgh, PA, United States of America
                [9] Institute for Precision Medicine, University of Pittsburgh, Pittsburgh, PA, United States of America
                Editor affiliation: Flinders University, Australia
                Author notes

                Competing Interests: The authors have declared that no competing interests exist.

                • Conceptualization: URC MMB JMB RSJ.

                • Data curation: OPM PDB.

                • Formal analysis: URC OPM PDB AC SL AVL.

                • Funding acquisition: JMB RSJ.

                • Methodology: MMB RSJ.

                • Project administration: RSJ.

                • Resources: PDB AF KFW ZZ RB JRS.

                • Software: OPM PDB AC SL AF KFW ZZ RB JRS RSJ.

                • Supervision: RSJ.

                • Validation: OPM PDB AC SL AF KFW ZZ RB JRS RSJ.

                • Visualization: OPM RSJ.

                • Writing – original draft: URC MMB RDB RSJ OPM.

                • Writing – review & editing: URV OPM MMD PDB AB RSJ.

                © 2016 Chandran et al.

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                Received: 17 March 2016
                Accepted: 11 October 2016
                Page count
                Figures: 3, Tables: 1, Pages: 14
                Funded by: National Science Foundation (funder ID: http://dx.doi.org/10.13039/100000001)
                Award ID: 144064
                Funded by: National Cancer Institute (funder ID: http://dx.doi.org/10.13039/100000054)
                Award ID: P30CA047904
                Funded by: University of Pittsburgh Institute for Personalized Medicine
                We gratefully acknowledge support from the Institute for Precision Medicine at the University of Pittsburgh and the University of Pittsburgh Cancer Institute. The project used the UPCI Tissue and Research Pathology Services, which is supported in part by award P30CA047904 from the National Cancer Institute (www.cancer.gov). Upgrades to the Pitt networking infrastructure to support the collaboration were funded through National Science Foundation CC*IIE award #144064. This work used the Data Exacell, which is supported by National Science Foundation award number ACI-1261721, at the Pittsburgh Supercomputing Center (PSC). Additionally, this project used the UPCI Cancer Bioinformatics Services, which is supported in part by the National Cancer Institute award P30CA047904. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Research Article
                Biology and Life Sciences > Computational Biology > Genome Analysis
                Biology and Life Sciences > Genome Analysis
                Research and Analysis Methods > Research Facilities > Information Centers
                Computer and Information Sciences > Data Management
                Computer and Information Sciences > Software Engineering > Software Tools
                Engineering and Technology > Software Engineering > Software Tools
                Medicine and Health Sciences > Basic Cancer Research > Cancer Genomics
                Biology and Life Sciences > Genomic Medicine > Cancer Genomics
                Engineering and Technology > Human Factors Engineering > Man-Computer Interface > Graphical User Interface
                Computer and Information Sciences > Computer Architecture > User Interfaces > Graphical User Interface
                Computer and Information Sciences > Data Acquisition
                Research and Analysis Methods > Database and Informatics Methods
                Custom metadata
                Source code and documentation needed to replicate this study are available from GitHub at the following two links: https://github.com/TCGAExpedition/tcga-expedition/blob/master/TCGA-Expedition.User.Guide.docx and http://github.com/TCGAExpedition. Other relevant links: Project home page: https://www.ipm.pitt.edu/cancer-genome-atlas-project; Training movie: https://www.youtube.com/watch?v=bpcQiBNf8Fc.


