
      Ten Simple Rules to Enable Multi-site Collaborations through Data Sharing


          Abstract

Open access, open data, and open software are critical for advancing science and enabling collaboration across multiple institutions and throughout the world. Despite near-universal recognition of its importance, major barriers still exist to sharing raw data, software, and research products throughout the scientific community. Many of these barriers vary by specialty [1], increasing the difficulty for interdisciplinary and/or translational researchers to engage in collaborative research. Multi-site collaborations are vital for increasing both the impact and the generalizability of research results, but they often present unique data sharing challenges. We discuss enabling multi-site collaborations through enhanced data sharing in this set of Ten Simple Rules.

Collaboration is an essential component of research [2] that takes many forms, including internal collaborations (across departments within a single institution) and external collaborations (across institutions). However, multi-site collaborations involving more than two institutions encounter more complex challenges because of institution-specific restrictions and guidelines [3]. Vicens and Bourne focus on collaborators working together on a shared research grant [4]; they do not discuss the specific complexities of multi-site collaborations and the vital need for enhanced data sharing in the multi-site and large-scale collaboration context, in which participants may or may not share the same funding source and/or research grant. While challenging, multi-site collaborations are equally rewarding and result in increased research productivity [5, 6].

One highly successful multi-site and translational collaboration is the Electronic Medical Records and Genomics (eMERGE) network (https://emerge.mc.vanderbilt.edu/), initiated in 2007 [7]. The eMERGE network links biorepository data with clinical information from Electronic Health Records (EHRs). Its members were able to find novel associations, and replicate many known associations, between genetic variants and clinical phenotypes that would have been more difficult to establish without the collaboration [8]. eMERGE members have also collaborated with other consortia and networks, including the Alzheimer's Disease Genetics Consortium [9] and the NINDS Stroke Genetics Network [10], to name a few. Other successful collaborations include OHDSI: Observational Health Data Sciences and Informatics (http://www.ohdsi.org/), which builds on the methodology of the Observational Medical Outcomes Partnership (OMOP) [11], and CIRCLE: Clinical Informatics Research Collaborative (http://circleinformatics.org/). In genetics, there are many consortia, including ExAC: the Exome Aggregation Consortium (http://exac.broadinstitute.org/), the 1000 Genomes Project Consortium (http://www.1000genomes.org/), the Australian BioGRID (https://www.biogrid.org.au/), The Cancer Genome Atlas (TCGA) (http://cancergenome.nih.gov/), the Genotype-Tissue Expression Portal (GTEx: http://www.gtexportal.org/home/), and the Encyclopedia of DNA Elements at UCSC (ENCODE: https://genome.ucsc.edu/ENCODE/), among others.

Based on our experiences as both users of and participants in collaborations, we present ten simple rules on how to enable multi-site collaborations within the scientific community through enhanced data sharing. The rules focus on understanding privacy constraints, utilizing proper platforms to facilitate data sharing, thinking in global terms, and encouraging researcher engagement through incentives.
We present these ten rules in the form of a pictograph of modern life (Fig 1), and we provide a table of example sources and sites that can be referred to for each of the ten rules (Table 1). Please note that this table is not meant to be exhaustive, only to provide some sample resources of use to the research community.

Fig 1. Modern life context for the ten simple rules (doi:10.1371/journal.pcbi.1005278.g001). This figure provides a framework for understanding how the "Ten Simple Rules to Enable Multi-site Collaborations through Data Sharing" can be translated into easily understood modern life concepts. Rule 1 is open-source software; the openness is signified by a window to a room filled with algorithms, represented by gears. Rule 2 involves making the source data available whenever possible. Source data can be very useful for researchers, but data are often housed in institutions and are not publicly accessible. These files are often stored externally; therefore, we depict this as a shed or storehouse of data, which, if possible, should be provided to research collaborators. Rule 3 is to "use multiple platforms to share research products"; this increases the chances that other researchers will find and be able to utilize your research product and is represented by multiple locations (i.e., shed and house). Rule 4 involves the need to secure all necessary permissions a priori. Many datasets have data use agreements that restrict usage. These restrictions can sometimes prevent researchers from performing certain types of analyses or publishing in certain journals (e.g., journals that require all data to be openly accessible); therefore, we represent this rule as a key that can lock or unlock the door of your research. Rule 5 discusses the privacy issues that surround source data. Researchers need to understand what they can and cannot do (i.e., the privacy rules) with their data. Privacy often requires allowing certain users access to some sections of data while restricting access to others; researchers need to understand what can and cannot be revealed about their data (i.e., when to open and close the curtains). Rule 6 is to facilitate reproducibility whenever possible. Because communication is central to reproducibility, we depict it as two researchers sharing a giant scroll, since data documentation is required and is often substantial. Rule 7 is to "think global"; we conceptualize this as a cloud that allows the research property (i.e., the house and shed) to be accessed across large distances. Rule 8 is to publicize your work: think of it as "shouting from the rooftops." Publicizing is critical for enabling other researchers to access your research product. Rule 9 is to "stay realistic"; it is important for researchers to "stay grounded" and resist the urge to overstate the claims made by their research. Rule 10 is to be engaged, depicted as a person waving an "I heart research" sign. It is vitally important to stay engaged and enthusiastic about one's research; this draws others to care about it as well.

Table 1. Example sources and sites for each of the ten simple rules (doi:10.1371/journal.pcbi.1005278.t001).
Rule 1: Make Software Open-Source
    GitHub: https://github.com
    CRAN: https://cran.r-project.org
    Bioconductor: https://www.bioconductor.org

Rule 2: Provide Open-Source Data (When Possible)
    Deposit source data in appropriate repositories:
        Sequence Read Archive (SRA): https://www.ncbi.nlm.nih.gov/sra
        Gene Expression Omnibus (GEO): https://www.ncbi.nlm.nih.gov/geo
        ClinVar: https://www.ncbi.nlm.nih.gov/clinvar
    Consider middle-ground data sharing approaches for sensitive data:
        dbGaP: https://www.ncbi.nlm.nih.gov/gap
        Shared Health Research Information Network (SHRINE): https://catalyst.harvard.edu/services/shrine
        BioGrid Australia: https://www.biogrid.org.au

Rule 3: Use Multiple Platforms to Share Research Products
    Figshare: https://figshare.com
    GitHub: https://github.com
    ExAC Browser: http://exac.broadinstitute.org
    Google Forums

Rule 4: Secure Necessary Permissions/Data Use Agreements A Priori
    Guides for creating a DUA:
        Department of Health and Human Services best practice guide for DUAs: http://www.hhs.gov/ocio/eplc/EPLC%20Archive%20Documents/55-Data%20Use%20Agreement%20(DUA)/eplc_dua_practices_guide.pdf
        Health Care Systems Research Network DUA toolkit: http://www.hcsrn.org/en/Tools%20&%20Materials/GrantsContracting/HCSRN_DUAToolkit.pdf
    Example DUAs:
        NASA DUA: http://above.nasa.gov/Documents/NGA_Data_Access_Agreement_new.pdf
        SEER-Medicare DUA: https://healthcaredelivery.cancer.gov/seermedicare/obtain/seerdua.docx

Rule 5: Know the Privacy Rules for Your Data
    Health Insurance Portability and Accountability Act (HIPAA): http://www.hhs.gov/hipaa/for-professionals/privacy

Rule 6: Facilitate Reproducibility
    Resources for increasing research reproducibility:
        MetaSub Research Integrity and Reproducibility: http://metasub.org/research-integrity-and-reproducibility/
        Reproducibility and Open Science Working Group (GitHub): http://uwescience.github.io/reproducible/guidelines.html and https://github.com/uwescience/reproducible
    Example projects with assessed reproducibility:
        eMERGE PheKB: https://phekb.org/network-associations/emerge

Rule 7: Think Global
    Guides for collaborating globally:
        National Academies "Collaborating with Foreign Partners to Meet Global Challenges" resources: http://sites.nationalacademies.org/PGA/PGA_041691
        Global Alliance for Genomics and Health: http://genomicsandhealth.org/work-products-demonstration-projects/catalogue-global-activities-international-genomic-data-initiati
        The global strategy of the US Department of Health and Human Services: http://www.hhs.gov/sites/default/files/hhs-global-strategy.pdf
    Examples of successful international projects:
        Human Fertility Database: http://www.humanfertility.org/cgi-bin/main.php
        Human Mortality Database: http://www.mortality.org

Rule 8: Publicize Your Work
    Journals without a novelty requirement:
        PLOS ONE: http://journals.plos.org/plosone
        Scientific Reports: http://www.nature.com/srep
        Cell Reports: http://www.cell.com/cell-reports/home
    Data resources (web browsers, databases):
        Scientific Data: http://www.nature.com/sdata
        Database: https://database.oxfordjournals.org
    Pure open science research (all data must be open):
        F1000: https://f1000research.com

Rule 9: Stay Realistic
    Retraction Watch: http://retractionwatch.com

Rule 10: Be Engaged
    Resources to facilitate researcher engagement:
        KNAER "Creating Partnerships: Learning New Ways to Connect": http://www.knaer-recrae.ca/blog-news-events
    Example projects with researcher engagement:
        STAN: http://mc-stan.org
        STAN "swag": http://mc-stan.org/shop

Definitions

In this paper, we use the term "research product" to include all results from research.
This includes algorithms, developed software tools, databases, raw source data, cleaned data, and various metadata generated as a result of the research activity. We differentiate this from "data," which comprise the primary "facts and statistics collected together for analysis" for that particular collaboration. Therefore, data could include genetic data or clinical data. By these definitions, developed software tools are not "data" but "research products." Novel genetic sequences collected for analysis would be considered "raw source data," which is a type of "research product."

Rule 1: Make Software Open-Source

The cornerstone of facilitating multi-site collaborations is to enhance data sharing and make software open-source [12]. By opening the source code, researchers allow others both to reproduce their work and to build upon it in novel ways. To engage in multi-site collaborations, collaborators need access to code in a shared repository (which may be private rather than open to the general public). When the study is complete and the paper is under review and/or published, a stable copy of the code should be made available to the general public. Internal sharing allows the code to be developed, while public sharing of a stable version allows the code to be refined and built upon by others.

Many researchers still limit access to their work despite the known advantages of making software open-source upon publication (e.g., higher-impact publications [5]). For example, they allow users to interact with their algorithm by inputting data and receiving results on a web platform, while the backend algorithm remains inaccessible. Masum et al. advocate the reuse of existing code in their Ten Simple Rules for cultivating open science [13]. However, this is often easier said than done: as long as backend algorithms remain hidden, open science will not be possible. It is therefore essential for researchers interested in participating in multi-site collaborations to make their software code and algorithms open. Because making software truly "open" can be complex, Prlic and Proctor provide Ten Simple Rules to assist researchers in making their software open-source [12]. Truly open-source software is an essential component of collaborations [13]. Openness also has advantages for the researchers themselves: with more eyes on the source code, others within the community can refine it, leading to greater identification and correction of errors.

There are several methods for sharing software code. If you use the R platform, libraries can be shared with the entire open-source community via CRAN (https://cran.r-project.org/) and Bioconductor, which is specifically for biologically related algorithms (https://www.bioconductor.org/). Code can also be shared on GitHub with issue trackers for error detection. A minimal packaging workflow is sketched below.
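As an illustration, the following R sketch uses the devtools and usethis packages to scaffold, document, and check a package before submission to CRAN or Bioconductor. The package name "collabtools" is a hypothetical placeholder, and the commands assume an interactive session inside the new package project.

```r
# Sketch: scaffolding an open-source R package for sharing on CRAN/Bioconductor.
# "collabtools" is a hypothetical package name used only for illustration.
install.packages(c("devtools", "usethis"))

usethis::create_package("collabtools")   # generates DESCRIPTION, NAMESPACE, R/
# (run the following from within the new package project)
usethis::use_mit_license()               # declare an explicit open-source license
usethis::use_vignette("getting-started") # vignettes help new users (see Rule 3)
usethis::use_github()                    # publish the repository, issue tracker included

devtools::document()  # build help files from roxygen2 comments
devtools::check()     # run the same checks CRAN applies to submissions
devtools::build()     # produce the .tar.gz bundle for submission
```

Bioconductor submissions additionally go through a formal package review, so the vignette and checked examples created above are what reviewers, and new collaborators, will encounter first.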
Rule 2: Provide Open-Source Data

Deposit Source Data in Appropriate Repositories

Whenever possible, it is important to make source data available. Openness benefits your collaborators by allowing them to perform additional analyses easily. Source data include not only the processed or cleaned data used in algorithms but also raw data files. These files can be very large and are therefore often stored in an external site or data warehouse. The National Center for Biotechnology Information (NCBI) maintains the Sequence Read Archive (SRA) (https://www.ncbi.nlm.nih.gov/sra) and the Gene Expression Omnibus (GEO) (https://www.ncbi.nlm.nih.gov/geo/); both are excellent places to deposit source data, if appropriate. In addition to raw data files, it is also helpful to provide intermediate data files at various stages of processing. If comparing your results to those in the literature, it can also be useful to provide a meta-analysis citing the publications (along with PubMed IDs) that support and refute the results you obtained. Data sharing is vitally important for multi-site collaborations because it allows researchers to compare results across vastly different study populations, which increases the generalizability of the findings [14]. While a multi-site research project is still ongoing, data can be shared in a private shared space until all necessary data quality checks have been conducted and the findings have been published. After publication, data can be deposited in GEO, SRA, ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/), and any other domain-specific sites that are appropriate for source data deposition.

Consider Middle-Ground Data Sharing Approaches for Sensitive Data

Raw source data are not always fully shareable with the public, whether because of data use restrictions (see Rule 4) or privacy concerns (see Rule 5). Alternative mechanisms exist for sharing portions of data with the research community. For example, the database of Genotypes and Phenotypes, or dbGaP (https://www.ncbi.nlm.nih.gov/gap), provides data holders with two levels of access: open and controlled. The open level allows broad online release of nonsensitive data, whereas controlled release allows sensitive datasets to be shared with other investigators, provided certain restrictions are met. This increases researchers' ability to share portions of their data that would not otherwise be shareable. In addition to the restricted data sharing option provided by dbGaP, others have developed middle-ground approaches for sharing sensitive raw data or metadata. Several of these mid-level approaches use federated access systems that allow researchers to query databases containing sensitive data while preventing direct access to the data itself. An example within the United States is the Shared Health Research Information Network (SHRINE), which provides a federated system that is HIPAA compliant [15]. International groups have also seen success in this area: BioGrid Australia (https://www.biogrid.org.au/) allows researchers to access hundreds of thousands of health records through a linked data platform in which individual data holders maintain control of their data [16]. Researchers can then be granted authorized access to certain elements within the data while access to private sections of the medical data remains restricted. These mid-level approaches facilitate collaboration both within an institution (i.e., across departments) and across institutions by allowing researchers to access sensitive data indirectly. They can even match patients to similar patients (for association analyses) while maintaining stringent privacy constraints [17]. Others provide summary statistics computed over large cohorts (e.g., the ExAC browser/database), which maintains privacy while giving others important information about the populations that can be used in subsequent analyses and comparisons. Retrieving a deposited dataset is sketched below.
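To show what deposited data look like from the consumer side, here is a minimal sketch, using the Bioconductor package GEOquery, of pulling a processed expression series back out of GEO. The accession "GSE00000" is a placeholder to be replaced with a real series of interest.

```r
# Sketch: retrieving a deposited expression series from GEO with Bioconductor's
# GEOquery package. "GSE00000" is a placeholder accession, not a real series.
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("GEOquery")

library(GEOquery)
library(Biobase)

gse <- getGEO("GSE00000", GSEMatrix = TRUE)  # download the series matrix
eset <- gse[[1]]                             # an ExpressionSet object

expr  <- exprs(eset)  # probe-by-sample expression matrix
pheno <- pData(eset)  # per-sample phenotype/annotation table
head(pheno)
```

This round trip is also a useful self-check before depositing: if you cannot reconstruct your analysis from what the repository returns, neither can your collaborators.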
Rule 3: Use Multiple Platforms to Share Research Products

To collaborate with researchers from different backgrounds, it is often necessary to use multiple platforms when sharing data, as different disciplines often have different policies. Using multiple platforms gives individuals from diverse backgrounds access to your research product. General phrases like "open data" and "open science" are common in the research community but provide little direction [13]. Research products take many forms, including 1) raw source data regardless of collection type (e.g., health data, genomic data, survey data, and epidemiological data), 2) software code (see Rule 1), and 3) metadata elements and results of computations used to generate published figures. Some data types cannot be fully shared (e.g., EHR data; see Rule 5), but most algorithms and summary results/statistics are shareable. Each of these types of open data calls for a different sharing platform. Figshare (https://figshare.com/) allows users to share data underlying published figures. GitHub (https://github.com/) allows users to share code that is in development or published. For code that is well developed, open-source packages can be created, for example an R library, which can be deposited in CRAN or Bioconductor. R libraries can be shared immediately on GitHub without any code checking, which is advisable for code still in development; when code is finalized, it can be submitted to Bioconductor as an R library, where submitted libraries are vetted to ensure that the code works well. Writing vignettes also helps new users become familiar with an R package. When collaborating across multiple sites, it is important to provide vignettes and sample source data to help users learn how to use the code, even if R is not your language of choice.

Data formats, differences among formats, and programming languages are important to consider when sharing data across multiple platforms, since different platforms often require different formats. While it may seem tedious to translate code, source data, and documentation across multiple formats and data schemas, doing so is very helpful and will increase the number of users who find your data and results interesting (see the brief export sketch below). To facilitate communication among members of a collaborative effort, there are many options, including Google forums and wiki webpages, among others. Others have specially designed websites for the sole purpose of allowing users to browse and download the data directly; one such website is the ExAC Browser (http://exac.broadinstitute.org/), which integrates data obtained from 17 different consortia (http://exac.broadinstitute.org/about) [18].
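As a sketch of the format-translation point above, the following R snippet writes one cleaned dataset to three common formats so that it can be deposited on platforms with different requirements. The file names, columns, and the jsonlite dependency are our illustrative choices, not prescriptions.

```r
# Sketch: exporting the same cleaned dataset in several formats so it can be
# shared on platforms with different format requirements. Columns are invented.
library(jsonlite)  # for JSON export; install.packages("jsonlite") if needed

cleaned <- data.frame(
  sample_id   = c("S1", "S2", "S3"),
  phenotype   = c("case", "control", "case"),
  measurement = c(4.2, 3.7, 5.1)
)

write.csv(cleaned, "cleaned_data.csv", row.names = FALSE)  # universal plain text
saveRDS(cleaned, "cleaned_data.rds")                       # lossless for R users
write_json(cleaned, "cleaned_data.json", pretty = TRUE)    # web-friendly

# A short README describing columns and units travels with every copy (Rule 6).
writeLines(c("sample_id: anonymized sample label",
             "phenotype: case/control status",
             "measurement: assay value in arbitrary units"),
           "README.txt")
```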
Rule 4: Secure Necessary Permissions/Data Use Agreements A Priori

Some datasets have provisos that affect publication, and these need to be addressed a priori. For example, whether researchers can publish an algorithm that uses a government dataset can depend on the department that generated the data. Certain National Aeronautics and Space Administration (NASA) datasets stipulate that usage requires adding certain NASA employees to subsequent publications; this is an important stipulation. Others may disallow the deposition of data into an "open" platform as part of their data use agreements (http://above.nasa.gov/Documents/NGA_Data_Access_Agreement_new.pdf). Such stipulations can hinder researchers attempting to produce transparent science. Other datasets have data use agreements as an added layer to ensure that patients are protected. For example, the Surveillance, Epidemiology, and End Results (SEER) dataset linked with Medicare (i.e., the SEER-Medicare dataset) requires that users submit the intended publication to their offices for pre-submission approval. This can seem burdensome to researchers; however, it is a condition of the data use agreement and therefore must be complied with. Researchers need to be aware of all provisos when including such data in their studies. Before publishing, or providing data on any type of platform, whether open, restricted, or closed, it is important to secure all necessary permissions and data use agreements.

Rule 5: Know the Privacy Rules for Your Data

Data come with many caveats. For this reason, it is important to understand what you can and cannot do (i.e., the privacy rules) with your data. Maintaining data privacy is different from complying with data use agreements (DUAs; see Rule 4). For example, data that are not sensitive may have restrictive DUAs for other reasons (e.g., data from a collaborator in industry). Also, privacy rules often govern your own source data, whereas DUAs become necessary when using data from collaborators or a government source. Certain datasets, e.g., genomic and EHR data, may be impossible to publish fully on an open platform because of the Health Insurance Portability and Accountability Act (HIPAA) privacy rules and other privacy concerns related to patient re-identifiability (http://www.hhs.gov/hipaa/for-professionals/privacy/). Therefore, it is important to know the privacy stipulations of all data used in your collaborations and how they affect the ability to share results among members of the team (especially when team members are at different institutions). Methods that anonymize patient information while allowing patient-level data sharing may be the way of the future [19]; however, institution-specific policies and/or country-specific laws can limit or prevent the use of such methods. This is an important item to consider and discuss with all collaborators at the outset of any collaboration. We discuss some methods that can be used to provide certain forms of sensitive data in a shareable federated space in Rule 2, and a toy de-identification sketch appears below.
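To make the privacy discussion concrete, here is a deliberately simplified R sketch of Safe Harbor-style transformations: dropping direct identifiers, removing date elements, aggregating ages over 89, and coarsening geography. It is a toy illustration only; real de-identification must follow HIPAA guidance and your institution's privacy office, and the column names are hypothetical.

```r
# Toy sketch of Safe Harbor-style de-identification before sharing patient-level
# data. Illustration only: a real pipeline requires HIPAA guidance and
# institutional review. Column names are hypothetical.
raw <- data.frame(
  name       = c("A. Smith", "B. Jones"),
  mrn        = c("123-45", "678-90"),               # medical record number
  birth_date = as.Date(c("1951-03-02", "1924-07-19")),
  zip        = c("10032", "02115"),
  lab_value  = c(6.1, 7.4)
)

deid <- raw
deid$name <- NULL                                   # drop direct identifiers
deid$mrn  <- NULL
age <- floor(as.numeric(Sys.Date() - raw$birth_date) / 365.25)
deid$age_group  <- ifelse(age > 89, "90+", as.character(age))  # aggregate 90+
deid$birth_date <- NULL                             # remove all date elements
deid$zip <- substr(deid$zip, 1, 3)                  # coarsen geography

deid
```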
Rule 6: Facilitate Reproducibility

Another aspect of both data sharing and enabling multi-site collaborations is reproducibility. Sandve et al. provide Ten Simple Rules for facilitating research reproducibility in general [20]. Keeping track of research results and how data were generated is vital for reproducibility [20], and this site-level record keeping becomes critical when engaging in multi-site collaborations: if one aspect of a methodology is not conducted in the same way at one site, the overall results can be affected in drastic ways. In other words, reproducibility is a core requirement for successful collaborations. In genetics and computational biology, standardizing results from across different types of gene sequencing platforms is a major issue [21]. Researchers who use a mixture of clinical and genetic data (for Phenome-Wide Association Studies, PheWAS [22]) often depend on local EHR terminology systems for identifying patient populations. Therefore, standard phenotype definitions are required and must be harmonized across multiple sites to ensure that the definitions are accurate at each site [23]. Several multi-site collaborations, including the eMERGE network, have developed platforms that provide links to all necessary documentation, code, and data schemas to help facilitate this process [24]. This step is integral to data sharing and enabling multi-site collaborations; a minimal record-keeping sketch follows.
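One lightweight habit that supports this rule is writing the analysis environment and parameters to files that travel with the shared code. The sketch below does this with base R only; the parameter names are hypothetical.

```r
# Sketch: recording the analysis environment and parameters alongside results,
# so collaborators at other sites can rerun the analysis under the same
# conditions. Parameter names are illustrative.
params <- list(
  phenotype_def = "type2_diabetes_v2",  # shared phenotype definition version
  maf_cutoff    = 0.01,                 # minor allele frequency threshold
  date_run      = format(Sys.time(), "%Y-%m-%d %H:%M:%S")
)

# Persist the exact R version and loaded packages (see also Sandve et al. [20]).
writeLines(capture.output(sessionInfo()), "session_info.txt")

# Persist the parameters next to the results they produced.
saveRDS(params, "run_params.rds")
write.csv(as.data.frame(params), "run_params.csv", row.names = FALSE)
```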
Rule 7: Think Global

The importance of thinking globally cannot be overstated. Health care, genetics, climate, and all aspects of science affect the world as a whole; therefore, it is important to think globally when performing scientific research. Most software languages are designed to be agnostic to the local language of a country. However, understanding and using these languages requires adequate documentation and user manuals in the local languages of the programmers/implementers, yet open-source languages often provide user manuals in only a few languages. For example, R is a popular open-source language yet has official documented translations in only four languages: English, Russian, German, and Chinese (https://www.r-project.org/other-docs.html). Problems can surface when collaborators in other regions run into difficulties running R. This affects data sharing on a global scale and should be considered when collaborating in an international context.

Translational mechanisms may also be necessary to understand and harmonize country-specific terminology. This is especially important because definitions of obesity and of many psychiatric conditions vary widely across the globe [25]. Even seemingly simple biological features (e.g., tall versus short) can be difficult to translate in global terms; for example, an average-height Norwegian may appear tall in a different country. Translating biological features to common absolute metrics (e.g., height) helps alleviate ambiguities that arise from categorical variables (see the small recoding sketch below). Certain diseases, especially psychiatric conditions, are extremely important to study at the multi-site level to increase the generalizability of the results [14]. However, psychiatric conditions are more difficult to translate without a thorough knowledge of how the condition is defined in the underlying country or region [25]. Solutions often involve using concrete measures, e.g., brain imaging analysis, rather than subjective measures such as the presence or absence of depression [14].

There are many layers to thinking on a global scale: mechanical differences (i.e., the software language and documentation) and conceptual differences (i.e., country- or region-specific medical definitions). Organizations such as the World Health Organization work tirelessly to integrate different conceptual interpretations of diseases into a standard guideline. Using these guidelines rather than a country-specific guideline helps your research reach the broader scientific community.

Several groups have successfully integrated data across multiple countries and provided their data in an open form. The Max Planck Institute for Demographic Research (MPIDR) in Germany collaborated with two separate groups to produce two databases containing international data; both contain integrated results from over 30 countries. All finished data (after cleaning) are made available to users in an open format via two specially designed databases: the Human Fertility Database (http://www.humanfertility.org/cgi-bin/main.php) [26] and the Human Mortality Database (http://www.mortality.org/) [27]. Only cleaned data are returned to users, in a standardized format that allows users to easily compare countries with one another. The MPIDR collaborated with the Vienna Institute of Demography (Austria) on the Human Fertility Database and with the University of California, Berkeley on the Human Mortality Database. These projects are a good example of a group that successfully harmonized definitions across countries by overcoming international barriers and provided data back to researchers in an easily usable and standardized format. The group provides detailed descriptions of how they harmonized various timescales across countries in a methods document (http://www.humanfertility.org/Docs/methods.pdf) that could easily be submitted as a research report (see Rule 6).
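The categorical-to-absolute translation above can be made concrete with a small sketch: two sites label height categorically under different local conventions, and a shared lookup converts both to a common metric before pooling. All cutoffs, labels, and site names here are invented for illustration.

```r
# Toy sketch: harmonizing site-specific categorical labels by mapping them to a
# common absolute metric before pooling. All cutoffs/labels are invented.
site_a <- data.frame(id = c("A1", "A2"), height_cat = c("tall", "short"))
site_b <- data.frame(id = c("B1", "B2"), height_cat = c("hoy", "lav"))  # pretend local labels

# Site-specific mapping of categories to representative heights in centimeters.
map_a <- c(short = 165, average = 175, tall = 185)
map_b <- c(lav = 170, middels = 180, hoy = 190)

site_a$height_cm <- map_a[site_a$height_cat]
site_b$height_cm <- map_b[site_b$height_cat]

pooled <- rbind(site_a[, c("id", "height_cm")], site_b[, c("id", "height_cm")])
pooled  # both sites are now on one absolute scale
```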
Rule 8: Publicize Your Work

Publishing all aspects of your work in the appropriate venues is vital for maintaining a multi-site collaboration, as it enables each aspect of your research to be assessed by appropriate peer reviewers. Publishing different aspects of your work in separate papers in separate journals allows your contributions to be seen by those most able to learn from your work. Remember, it is important to make your research available to those who can benefit from your results; depending on your findings, this can include methodologists, clinicians, epidemiologists, geneticists, and others. New journals focused on certain aspects of research have been developed recently to facilitate open science. For instance, several journals, such as PLOS ONE, Scientific Reports, and Cell Reports, do not require novelty; these are good choices for results that are part of a larger research project or collaboration but are not inherently novel. Other journals, such as Scientific Data and Database, are good choices for publishing a resource containing your collected research source data. It is often advisable to publish in data-focused journals simultaneously with an algorithm- or results-focused paper that highlights the novel aspects of your research. In some cases, the data can be published afterwards if they are part of a large collaboration and the database or user interface is still in production when the main contribution is published. Publishing in multiple venues is highly important for those engaged in multi-site collaborations, because these projects often involve a tremendous investment of time and resources from many different organizations; it is therefore vital to highlight each and every research contribution that the collaboration has generated, to encourage further engagement from the community. If you are able to provide all raw source data on an open platform, new journals designed specifically to facilitate open science, such as F1000 (https://f1000research.com/), may be worth considering. F1000 is also a great venue for intermediate results, such as posters that collaborators may have presented at various conferences while working towards the final paper. After publication, some collaborative groups effectively use blogging (both macro and micro) to communicate with other researchers and the general public. However, it is also important not to overstate the claims in any paper submission/publication or in media coverage of that publication, but to stay focused on the individual contribution of that particular work.

Rule 9: Stay Realistic, but Aim High

When performing quality research and collaborating with others, it is important not to overstate the claims of your research, whether in publication or online. It is vitally important to resist the urge to overstate claims and to remain both humble and grounded. This is critical in collaborations: if a researcher overstates the claims in a paper, or worse, publicly shares data that he or she is not legally permitted to share (e.g., under the stipulations of a DUA), the paper may be retracted, which could do irreparable damage to the collaborative group. This rule also links back to Rule 2, making the source data available, which allows others in the research community to check your work interactively and can help prevent overstated research claims [28]. Retraction Watch (retractionwatch.com) posts retracted journal articles on a public forum; it includes not only instances of plagiarism and data fabrication but also papers retracted because of human error in the conduct of an experiment (e.g., a protocol was not followed exactly as specified in the paper) or in the analysis (e.g., the wrong statistical test was performed, leaving the conclusions unsubstantiated by the data). So, stay realistic, but do not be afraid to challenge the status quo. Some of the most respected research today challenged the understanding of the leading scientists of its time, including the seminal works on Pangaea and the discovery that DNA is a double helix. These concepts were earth-shattering at the time and could have been completely wrong, but the researchers backing them were not afraid to make their theories, data, and results public. These are the things that change science. So, remain humble and do not intentionally overstate the claims of your research, but at the same time do not be afraid to challenge the current mindset and way of thinking. You may be completely off, or you may just be a groundbreaking innovator.

Rule 10: Be Engaged

Be engaged with those using your research, your data, and your code. Communicate with them using various social software platforms (GitHub, figshare, and so forth), and respond readily when users have questions and concerns. Attempt to follow the motto: release early, release often. Engage with researchers in non-traditional ways as well. For example, several collaborative efforts have created their own gear, e.g., t-shirts, to engage the community; one such collaborative is the open-source statistical modeling language STAN (http://mc-stan.org/), which has created its own line of STAN "swag" (http://mc-stan.org/shop/) to facilitate user engagement. Communicate often with the research community to convince them your research is worth caring about. The bottom line in collaboration is to care deeply about your research: if you care, and you make it known that you care deeply about the problem, it becomes possible to convince others that your research is important.

Concluding Remarks

Collaborations, especially large, multi-site collaborations, involve many pitfalls that must be overcome. In this paper, we present ten simple rules to help researchers share their data and methods in ways that facilitate successful and meaningful multi-site collaborations, and we highlight several successful multi-site collaborations along the way.

Most cited references


          Ten Simple Rules for Reproducible Computational Research

Replication is the cornerstone of a cumulative science [1]. However, new tools and technologies, massive amounts of data, interdisciplinary approaches, and the complexity of the questions being asked are complicating replication efforts, as are increased pressures on scientists to advance their research [2]. As full replication of studies on independently collected data is often not feasible, there has recently been a call for reproducible research as an attainable minimum standard for assessing the value of scientific claims [3]. This requires that papers in experimental science describe the results and provide a sufficiently clear protocol to allow successful repetition and extension of analyses based on original data [4].

The importance of replication and reproducibility has recently been exemplified through studies showing that scientific papers commonly leave out experimental details essential for reproduction [5], studies showing difficulties with replicating published experimental results [6], an increase in retracted papers [7], and through a high number of failing clinical trials [8], [9]. This has led to discussions on how individual researchers, institutions, funding bodies, and journals can establish routines that increase transparency and reproducibility. In order to foster such aspects, it has been suggested that the scientific community needs to develop a "culture of reproducibility" for computational science, and to require it for published claims [3].

We want to emphasize that reproducibility is not only a moral responsibility with respect to the scientific field, but that a lack of reproducibility can also be a burden for you as an individual researcher. As an example, a good practice of reproducibility is necessary in order to allow previously developed methodology to be effectively applied on new data, or to allow reuse of code and results for new projects. In other words, good habits of reproducibility may actually turn out to be a time-saver in the longer run. We further note that reproducibility is just as much about the habits that ensure reproducible research as the technologies that can make these processes efficient and realistic.

Each of the following ten rules captures a specific aspect of reproducibility, and discusses what is needed in terms of information handling and tracking of procedures. If you are taking a bare-bones approach to bioinformatics analysis, i.e., running various custom scripts from the command line, you will probably need to handle each rule explicitly. If you are instead performing your analyses through an integrated framework (such as GenePattern [10], Galaxy [11], LONI pipeline [12], or Taverna [13]), the system may already provide full or partial support for most of the rules. What is needed on your part is then merely the knowledge of how to exploit these existing possibilities.

In a pragmatic setting, with publication pressure and deadlines, one may face the need to make a trade-off between the ideals of reproducibility and the need to get the research out while it is still relevant. This trade-off becomes more important when considering that a large part of the analyses being tried out never end up yielding any results. However, frequently one will, with the wisdom of hindsight, contemplate the missed opportunity to ensure reproducibility, as it may already be too late to take the necessary notes from memory (or at least much more difficult than to do it while underway).
We believe that the rewards of reproducibility will compensate for the risk of having spent valuable time developing an annotated catalog of analyses that turned out to be blind alleys. As a minimal requirement, you should at least be able to reproduce the results yourself. This would satisfy the most basic requirements of sound research, allowing any substantial future questioning of the research to be met with a precise explanation. Although it may sound like a very weak requirement, even this level of reproducibility will often require a certain level of care in order to be met. For a given analysis, there will be an exponential number of possible combinations of software versions, parameter values, pre-processing steps, and so on, meaning that a failure to take notes may make exact reproduction essentially impossible.

With this basic level of reproducibility in place, there is much more that can be wished for. An obvious extension is to go from a level where you can reproduce results in case of a critical situation to a level where you can practically and routinely reuse your previous work and increase your productivity. A second extension is to ensure that peers have a practical possibility of reproducing your results, which can lead to increased trust in, interest in, and citations of your work [6], [14]. We here present ten simple rules for reproducibility of computational research. These rules can be at your disposal whenever you want to make your research more accessible, be it for peers or for your future self.

Rule 1: For Every Result, Keep Track of How It Was Produced

Whenever a result may be of potential interest, keep track of how it was produced. When doing this, one will frequently find that getting from raw data to the final result involves many interrelated steps (single commands, scripts, programs). We refer to such a sequence of steps, whether automated or performed manually, as an analysis workflow. While the essential part of an analysis is often represented by only one of the steps, the full sequence of pre- and post-processing steps is often critical in order to reach the achieved result. For every step involved, you should ensure that every detail that may influence its execution is recorded. If the step is performed by a computer program, the critical details include the name and version of the program, as well as the exact parameters and inputs that were used.

Although manually noting the precise sequence of steps taken allows for an analysis to be reproduced, the documentation can easily get out of sync with how the analysis was really performed in its final version. By instead specifying the full analysis workflow in a form that allows for direct execution, one can ensure that the specification matches the analysis that was (subsequently) performed, and that the analysis can be reproduced by yourself or others in an automated way. Such executable descriptions [10] might come in the form of simple shell scripts or makefiles [15], [16] at the command line, or in the form of stored workflows in a workflow management system [10], [11], [13], [17], [18]. As a minimum, you should at least record sufficient details on programs, parameters, and manual procedures to allow yourself, in a year or so, to approximately reproduce the results.

Rule 2: Avoid Manual Data Manipulation Steps

Whenever possible, rely on the execution of programs instead of manual procedures to modify data.
Such manual procedures are not only inefficient and error-prone, they are also difficult to reproduce. If working at the UNIX command line, manual modification of files can usually be replaced by the use of standard UNIX commands or small custom scripts. If working with integrated frameworks, there will typically be a quite rich collection of components for data manipulation. As an example, manual tweaking of data files to attain format compatibility should be replaced by format converters that can be reenacted and included in executable workflows. Other manual operations, like the use of copy and paste between documents, should also be avoided. If manual operations cannot be avoided, you should as a minimum note down which data files were modified or moved, and for what purpose.

Rule 3: Archive the Exact Versions of All External Programs Used

In order to exactly reproduce a given result, it may be necessary to use programs in the exact versions used originally. Also, as both input and output formats may change between versions, a newer version of a program may not even run without modifying its inputs. Even having noted which version of a given program was used, it is not always trivial to get hold of a program in anything but the current version. Archiving the exact versions of programs actually used may thus save a lot of hassle at later stages. In some cases, all that is needed is to store a single executable or source code file. In other cases, a given program may in turn have specific requirements for other installed programs/packages, or dependencies on specific operating system components. To ensure future availability, the only viable solution may then be to store a full virtual machine image of the operating system and program. As a minimum, you should note the exact names and versions of the main programs you use.
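As one R-flavored way to apply this rule (our sketch, not from the original paper), the renv package records exact package versions in a lockfile that can be archived alongside the analysis and restored later.

```r
# Sketch: pinning exact package versions with renv so an analysis environment
# can be restored later. renv is one option; full VM or container images go
# further when operating-system dependencies also matter.
install.packages("renv")

renv::init()      # create a project-local library and renv.lock
# ... install and use packages, run the analysis ...
renv::snapshot()  # record the exact versions in renv.lock (archive this file)

# Later, on another machine or in the future:
renv::restore()   # reinstall the exact recorded versions
```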
Rule 4: Version Control All Custom Scripts

Even the slightest change to a computer program can have large intended or unintended consequences. When a continually developed piece of code (typically a small script) has been used to generate a certain result, only that exact state of the script may be able to produce that exact output, even given the same input data and parameters. As also discussed for Rules 3 and 6, exact reproduction of results may in certain situations be essential. If computer code is not systematically archived along its evolution, backtracking to a code state that gave a certain result may be a hopeless task. This can cast doubt on previous results, as it may be impossible to know if they were partly the result of a bug or otherwise unfortunate behavior. The standard solution for tracking the evolution of code is to use a version control system [15], such as Subversion, Git, or Mercurial. These systems are relatively easy to set up and use, and may be used to systematically store the state of the code throughout development at any desired time granularity. As a minimum, you should archive copies of your scripts from time to time, so that you keep a rough record of the various states the code has taken during development.

Rule 5: Record All Intermediate Results, When Possible in Standardized Formats

In principle, as long as the full process used to produce a given result is tracked, all intermediate data can also be regenerated. In practice, having easily accessible intermediate results may be of great value. First, quickly browsing through intermediate results can reveal discrepancies from what is assumed, and can in this way uncover bugs or faulty interpretations that are not apparent in the final results. Second, it more directly reveals the consequences of alternative programs and parameter choices at individual steps. Third, when the full process is not readily executable, it allows parts of the process to be rerun. Fourth, when reproducing results, it allows any experienced inconsistencies to be tracked to the steps where the problems arise. Fifth, it allows critical examination of the full process behind a result, without the need to have all executables operational. When possible, store such intermediate results in standardized formats. As a minimum, archive any intermediate result files that are produced when running an analysis (as long as the required storage space is not prohibitive).

Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds

Many analyses and predictions include some element of randomness, meaning the same program will typically give slightly different results every time it is executed (even when receiving identical inputs and parameters). However, given the same initial seed, all random numbers used in an analysis will be equal, thus giving identical results every time it is run. There is a large difference between observing that a result has been reproduced exactly and observing that it has been reproduced only approximately. While achieving equal results is a strong indication that a procedure has been reproduced exactly, it is often hard to conclude anything when achieving only approximately equal results. For analyses that involve random numbers, this means that the random seed should be recorded. This allows results to be reproduced exactly by providing the same seed to the random number generator in future runs. As a minimum, you should note which analysis steps involve randomness, so that a certain level of discrepancy can be anticipated when reproducing the results.
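A small illustration of this rule in R (our example, not the paper's): record and set the seed so that stochastic steps give identical output on every rerun. The seed value itself is arbitrary; what matters is that it is saved with the results.

```r
# Sketch: recording the random seed so a stochastic analysis can be reproduced
# exactly. The seed value is arbitrary; saving it is what matters.
seed <- 20170119
set.seed(seed)

# A stochastic step: bootstrap confidence interval for a mean.
boot_means <- replicate(1000, mean(sample(iris$Sepal.Length, replace = TRUE)))
quantile(boot_means, c(0.025, 0.975))  # identical on every rerun with this seed

writeLines(sprintf("random seed: %d", seed), "seed.txt")  # archive with results
```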
Rule 7: Always Store Raw Data behind Plots

From the time a figure is first generated until it becomes part of a published article, it is often modified several times. In some cases, such modifications are merely visual adjustments to improve readability, or to ensure visual consistency between figures. If the raw data behind figures are stored in a systematic manner, so that the raw data for a given figure can be easily retrieved, one can simply modify the plotting procedure instead of having to redo the whole analysis. An additional advantage is that, if one really wants to read fine values in a figure, one can consult the raw numbers. In cases where plotting involves more than a direct visualization of underlying numbers, it can be useful to store both the underlying data and the processed values that are directly visualized. An example of this is the plotting of histograms, where both the values before binning (the original data) and the counts per bin (the heights of the visualized bars) could be stored. When plotting is performed using a command-based system like R, it is convenient to also store the code used to make the plot. One can then apply slight modifications to these commands, instead of having to specify the plot from scratch. As a minimum, one should note which data formed the basis of a given plot and how this data could be reconstructed.
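In R terms (our sketch, echoing the rule's suggestion to store plot code), one can save the raw values, the binned counts, and the figure together; the file names are illustrative.

```r
# Sketch: storing the raw data behind a histogram together with the plotted
# summaries and the figure itself. File names are illustrative.
set.seed(42)                      # see the rule on random seeds
values <- rnorm(500, mean = 10)   # raw data underlying the figure

h <- hist(values, breaks = 30, plot = FALSE)  # compute bins without drawing

saveRDS(values, "fig2_raw_values.rds")                  # original data
write.csv(data.frame(bin_midpoint = h$mids,             # plotted summaries
                     count = h$counts),
          "fig2_bin_counts.csv", row.names = FALSE)

png("fig2_histogram.png")
plot(h, main = "Distribution of simulated values")      # draw from stored bins
dev.off()
```

Keeping this script under version control (Rule 4) means the figure can be regenerated or restyled at any time without rerunning the analysis.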
Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected

The final results that make it into an article, be they plots or tables, often represent highly summarized data. For instance, each value along a curve may in turn represent averages over an underlying distribution. In order to validate and fully understand the main result, it is often useful to inspect the detailed values underlying the summaries. A common but impractical way of doing this is to incorporate various debug outputs in the source code of scripts and programs. When the storage context allows, it is better to simply incorporate permanent output of all underlying data when a main result is generated, using a systematic naming convention that allows the full data underlying a given summarized value to be easily found. We find hypertext (i.e., HTML file output) to be particularly useful for this purpose: summarized results can be generated along with links that can be conveniently followed (by simply clicking) to the full data underlying each summarized value. When working with summarized results, you should as a minimum at least once generate, inspect, and validate the detailed values underlying the summaries.

Rule 9: Connect Textual Statements to Underlying Results

Throughout a typical research project, a range of different analyses are tried and interpretations of the results are made. Although the results of analyses and their corresponding textual interpretations are clearly interconnected at the conceptual level, they tend to live quite separate lives in their representations: results usually live in a data area on a server or personal computer, while interpretations live in text documents in the form of personal notes or emails to collaborators. Such textual interpretations are not generally mere shadows of the results; they often involve viewing the results in light of other theories and results. As such, they carry extra information, while at the same time having their necessary support in a given result. If you want to reevaluate your previous interpretations, or allow peers to make their own assessment of claims you make in a scientific paper, you will have to connect a given textual statement (interpretation, claim, conclusion) to the precise results underlying the statement. Making this connection when it is needed may be difficult and error-prone, as it may be hard to locate the exact result underlying and supporting the statement in a large pool of different analyses with various versions. To allow efficient retrieval of the details behind textual statements, we suggest that statements be connected to underlying results from the time the statements are initially formulated (for instance, in notes or emails). Such a connection can, for instance, be a simple file path to detailed results, or the ID of a result in an analysis framework, included within the text itself. For an even tighter integration, there are tools available to help integrate reproducible analyses directly into textual documents, such as Sweave [19], the GenePattern Word add-in [4], and Galaxy Pages [20]. These solutions can also subsequently be used in connection with publications, as discussed in the next rule. As a minimum, you should provide enough details along with your textual interpretations to allow the exact underlying results, or at least some related results, to be tracked down in the future.

Rule 10: Provide Public Access to Scripts, Runs, and Results

Last, but not least, all input data, scripts, versions, parameters, and intermediate results should be made publicly and easily accessible. Various solutions have now become available to make data sharing more convenient, standardized, and accessible in particular domains, such as for gene expression data [21]–[23]. Most journals allow articles to be supplemented with online material, and some journals have initiated further efforts to make data and code more integrated with publications [3], [24]. As a minimum, you should submit the main data and source code as supplementary material, and be prepared to respond to any requests for further data or methodology details from peers. Making reproducibility of your work by peers a realistic possibility sends a strong signal of quality, trustworthiness, and transparency. This can increase the quality and speed of the reviewing process on your work, the chances of your work getting published, and the chances of your work being taken further and cited by other researchers after publication [25].

            Challenges and opportunities of open data in ecology.

            Ecology is a synthetic discipline benefiting from open access to data from the earth, life, and social sciences. Technological challenges exist, however, due to the dispersed and heterogeneous nature of these data. Standardization of methods and development of robust metadata can increase data access but are not sufficient. Reproducibility of analyses is also important, and executable workflows are addressing this issue by capturing data provenance. Sociological challenges, including inadequate rewards for sharing data, must also be resolved. The establishment of well-curated, federated data repositories will provide a means to preserve data while promoting attribution and acknowledgement of its use.

              Validation of electronic medical record-based phenotyping algorithms: results and lessons learned from the eMERGE network.

Genetic studies require precise phenotype definitions, but electronic medical record (EMR) phenotype data are recorded inconsistently and in a variety of formats. Here we present lessons learned about the validation of EMR-based phenotypes from the Electronic Medical Records and Genomics (eMERGE) studies. The eMERGE network created and validated 13 EMR-derived phenotype algorithms. Network sites are Group Health, Marshfield Clinic, Mayo Clinic, Northwestern University, and Vanderbilt University. By validating EMR-derived phenotypes we learned that: (1) multisite validation improves phenotype algorithm accuracy; (2) targets for validation should be carefully considered and defined; (3) specifying time frames for review of variables eases validation time and improves accuracy; (4) using repeated measures requires defining the relevant time period and specifying the most meaningful value to be studied; (5) patient movement in and out of the health plan (transience) can result in incomplete or fragmented data; (6) the review scope should be defined carefully; (7) particular care is required in combining EMR and research data; (8) medication data can be assessed using claims, medications dispensed, or medications prescribed; (9) algorithm development and validation work best as an iterative process; and (10) validation by content experts or structured chart review can provide accurate results. Despite the diverse structure of the five EMRs of the eMERGE sites, we developed, validated, and successfully deployed 13 electronic phenotype algorithms. Validation is a worthwhile process that not only measures phenotype performance but also strengthens phenotype algorithm definitions and enhances their inter-institutional sharing.

                Author and article information

Journal: PLoS Computational Biology (PLoS Comput Biol), Public Library of Science, San Francisco, CA, USA
ISSN: 1553-734X (print); 1553-7358 (electronic)
Published: 19 January 2017 (January 2017 issue)
Citation: PLoS Comput Biol 13(1): e1005278
                Affiliations
                [1 ]Department of Biomedical Informatics, Columbia University, New York, New York, United States of America
                [2 ]Department of Systems Biology, Columbia University, New York, New York, United States of America
                [3 ]Department of Medicine, Columbia University, New York, New York, United States of America
                [4 ]Observational Health Data Sciences and Informatics, Columbia University, New York, New York, United States of America
                [5 ]Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
                [6 ]Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts, United States of America
                Author notes

                The authors have declared that no competing interests exist.

                Author information
                http://orcid.org/0000-0001-8576-6408
                http://orcid.org/0000-0003-2878-4671
                http://orcid.org/0000-0002-2700-2597
Article
Publisher ID: PCOMPBIOL-D-16-01506
DOI: 10.1371/journal.pcbi.1005278
PMCID: PMC5245793
PMID: 28103227
                © 2017 Boland et al

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Page count
                Figures: 1, Tables: 1, Pages: 12
                Funding
MRB was supported by NLM grant T15 LM00707 (Jul 2014–Jun 2016) and by NCATS, NIH, through grant TL1 TR000082, formerly NCRR grant TL1 RR024158 (Jul 2016–Jun 2017). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Categories
Editorial
Subject areas: Health Services Research; Open Source Software; Open Science; Research Assessment; Reproducibility; Language; Consortia; Genomic Medicine; Quantitative & Systems Biology
