      Is Open Access

      Achieving human and machine accessibility of cited data in scholarly publications


          Abstract

          Reproducibility and reusability of research results is an important concern in scientific communication and science policy. A foundational element of reproducibility and reusability is the open and persistently available presentation of research data. However, many common approaches for primary data publication in use today do not achieve sufficient long-term robustness, openness, accessibility or uniformity. Nor do they permit comprehensive exploitation by modern Web technologies. This has led to several authoritative studies recommending uniform direct citation of data archived in persistent repositories. Data are to be considered as first-class scholarly objects, and treated similarly in many ways to cited and archived scientific and scholarly literature. Here we briefly review the most current and widely agreed set of principle-based recommendations for scholarly data citation, the Joint Declaration of Data Citation Principles (JDDCP). We then present a framework for operationalizing the JDDCP; and a set of initial recommendations on identifier schemes, identifier resolution behavior, required metadata elements, and best practices for realizing programmatic machine actionability of cited data. The main target audience for the common implementation guidelines in this article consists of publishers, scholarly organizations, and persistent data repositories, including technical staff members in these organizations. But ordinary researchers can also benefit from these recommendations. The guidance provided here is intended to help achieve widespread, uniform human and machine accessibility of deposited data, in support of significantly improved verification, validation, reproducibility and re-use of scholarly/scientific data.
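As a minimal sketch of what machine-actionable identifier resolution could look like, the Python snippet below builds (but does not send) an HTTP request that asks a DOI resolver for machine-readable metadata through content negotiation. The DOI is this article's own; the helper name `build_metadata_request`, the choice of the `https://doi.org/` resolver and the CSL JSON media type are illustrative assumptions, not recommendations taken from the article.

```python
from urllib.parse import quote
from urllib.request import Request

DOI_RESOLVER = "https://doi.org/"  # assumed resolver base URL
CSL_JSON = "application/vnd.citationstyles.csl+json"  # assumed metadata format


def build_metadata_request(doi: str, media_type: str = CSL_JSON) -> Request:
    """Build (without sending) a request for machine-readable metadata about a DOI."""
    # Percent-encode anything unusual in the DOI, but keep the prefix/suffix slash.
    url = DOI_RESOLVER + quote(doi.strip(), safe="/")
    # The Accept header asks the resolver for metadata instead of a landing page.
    return Request(url, headers={"Accept": media_type})


request = build_metadata_request("10.7717/peerj-cs.1")
print(request.full_url)
print(request.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` would perform the actual resolution over the network. A human reader following the same identifier in a browser would normally be taken to a landing page instead, which is the human/machine split the abstract refers to.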

          Most cited references 26


          Principled design of the modern Web architecture


            The case for cloud computing in genome informatics

The impending collapse of the genome informatics ecosystem

Since the 1980s, we have had the great fortune to work in a comfortable and effective ecosystem for the production and consumption of genomic information (Figure 1). Sequencing labs submit their data to big archival databases such as GenBank at the National Center for Biotechnology Information (NCBI) [1], the European Bioinformatics Institute EMBL database [2], DNA Data Bank of Japan (DDBJ) [3], the Short Read Archive (SRA) [4], the Gene Expression Omnibus (GEO) [5] and the microarray database ArrayExpress [6]. These databases maintain, organize and distribute the sequencing data. Most users access the information either through websites created by the archival databases, or through value-added integrators of genomic data, such as Ensembl [7], the University of California at Santa Cruz (UCSC) Genome Browser [8], Galaxy [9], or one of the many model organism databases [10-13]. Bioinformaticians and other power users download genomic data from these primary and secondary sources to their high performance clusters of computers ('compute clusters'), work with them and discard them when no longer needed (Figure 1).

Figure 1. The old genome informatics ecosystem. Under the traditional flow of genome information, sequencing laboratories transmit raw and interpreted sequencing information across the internet to one of several sequencing archives. This information is accessed either directly by casual users or indirectly via a website run by one of the value-added genome integrators. Power users typically download large datasets from the archives onto their local compute clusters for computationally intensive number crunching. Under this model, the sequencing archives, value-added integrators and power users all maintain their own compute and storage clusters and keep local copies of the sequencing datasets.

The whole basis for this ecosystem is Moore's Law [14], a long-term trend first described in 1965 by Intel co-founder Gordon Moore. Moore's Law states that the number of transistors that can be placed on an integrated circuit board is increasing exponentially, with a doubling time of roughly 18 months. The trend has held up remarkably well for 35 years across multiple changes in semiconductor technology and manufacturing techniques. Similar laws for disk storage and network capacity have also been observed. Hard disk capacity doubles roughly annually (Kryder's Law [15]), and the cost of sending a bit of information over optical networks halves every 9 months (Butter's Law [16]). Genome sequencing technology has also improved dramatically, and the number of bases that can be sequenced per unit cost has also been growing at an exponential rate. However, until just a few years ago, the doubling time for DNA sequencing was just a bit slower than the growth of compute and storage capacity. This was great for the genome informatics ecosystem. The archival databases and the value-added genome distributors did not need to worry about running out of disk storage space because the long-term trends allowed them to upgrade their capacity faster than the world's sequencing labs could update theirs. Computational biologists did not worry about not having access to sufficiently powerful networks or compute clusters because they were always slightly ahead of the curve. However, the advent of 'next generation' sequencing technologies in the mid-2000s changed these long-term trends and now threatens the conventional genome informatics ecosystem. To illustrate this, I recently plotted long-term trends in hard disk prices and DNA sequencing prices by using the Internet Archive's 'Wayback Machine' [17], which keeps archives of websites as they appeared in the past, to view vendors' catalogs, websites and press releases as they appeared over the past 20 years (Figure 2). 
Notice that this is a logarithmic plot, so exponential curves appear as straight lines. I made no attempt to factor in inflation or to calculate the cost of DNA sequencing with labor and overheads included, but the trends are clear. From 1990 to 2010, the cost of storing a byte of data has halved every 14 months, consistent with Kryder's Law. From 1990 to 2004, the cost of sequencing a base decreased more slowly than this, halving every 19 months - good news if you are running the bioinformatics core for a genome sequencing center.

Figure 2. Historical trends in storage prices versus DNA sequencing costs. The blue squares describe the historic cost of disk prices in megabytes per US dollar. The long-term trend (blue line, which is a straight line here because the plot is logarithmic) shows exponential growth in storage per dollar with a doubling time of roughly 1.5 years. The cost of DNA sequencing, expressed in base pairs per dollar, is shown by the red triangles. It follows an exponential curve (yellow line) with a doubling time slightly slower than disk storage until 2004, when next generation sequencing (NGS) causes an inflection in the curve to a doubling time of less than 6 months (red line). These curves are not corrected for inflation or for the 'fully loaded' cost of sequencing and disk storage, which would include personnel costs, depreciation and overhead.

However, from 2005 the slope of the DNA sequencing curve increases abruptly. This corresponds to the advent of the 454 Sequencer [18], quickly followed by the Solexa/Illumina [19] and ABI SOLiD [20] technologies. Since then, the cost of sequencing a base has been dropping by half every 5 months. The cost of genome sequencing is now decreasing several times faster than the cost of storage, promising that at some time in the not too distant future it will cost less to sequence a base of DNA than to store it on a hard disk.
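The halving times quoted in this passage can be turned into a quick back-of-envelope comparison. The sketch below assumes an idealized exponential decline, cost(t) = cost(0) / 2^(t / halving time), and uses the 14-month (storage) and 5-month (sequencing after 2005) halving times from the text. The function name and the five-year horizon are illustrative choices, not part of the original analysis.

```python
def cost_reduction_factor(months: float, halving_time_months: float) -> float:
    """Factor by which a cost falls after `months`, given a fixed halving time."""
    return 2.0 ** (months / halving_time_months)


STORAGE_HALVING_MONTHS = 14.0    # cost per byte stored, 1990-2010 (from the text)
SEQUENCING_HALVING_MONTHS = 5.0  # cost per base sequenced after 2005 (from the text)

horizon = 60.0  # five years; an arbitrary illustrative horizon
storage = cost_reduction_factor(horizon, STORAGE_HALVING_MONTHS)
sequencing = cost_reduction_factor(horizon, SEQUENCING_HALVING_MONTHS)

print(f"storage cost falls about {storage:.1f}-fold")
print(f"sequencing cost falls about {sequencing:.0f}-fold")
print(f"sequencing gains on storage by about {sequencing / storage:.0f}-fold")
```

Under these assumptions, five years is twelve sequencing halvings (a 4,096-fold drop) but only a little over four storage halvings (roughly 20-fold), which is the widening gap the passage describes.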
Of course there is no guarantee that this accelerated trend will continue indefinitely, but recent and announced offerings from Illumina [21], Pacific Biosciences [22], Helicos [23] and Ion Torrent [24], among others, promise to continue the trend until the middle of the decade. This change in the long-term trend overthrows the assumptions that support the current ecosystem. The various members of the genome informatics ecosystem are now facing a potential tsunami of genome data that will swamp our storage systems and crush our compute clusters. Just consider this one statistic: the first big genome project based on next generation sequencing technologies, the 1000 Genomes Project [25], which is cataloguing human genetic variation, deposited twice as much raw sequencing data into GenBank's SRA division during the project's first 6 months of operation as had been deposited into all of GenBank for the entire 30 years preceding (Paul Flicek, personal communication). But the 1000 Genomes Project is just the first ripple of the tsunami. Projects like ENCODE [26] and modENCODE [27], which use next generation sequencing for high-resolution mapping of epigenetic marks, chromatin-binding proteins and other functional elements, are currently generating raw sequence at tremendous rates. Cancer genome projects such as The Cancer Genome Atlas [28] and the International Cancer Genome Sequencing Consortium [29] are an order of magnitude larger than the 1000 Genomes Project, and the various Human Microbiome Projects [30,31] are potentially even larger still.

Run for the hills?

First, we must face up to reality. The ability of laboratories around the world to produce sequence faster and more cheaply than information technology groups can upgrade their storage systems is a fundamental challenge that admits no easy solution. At some future point it will become simply unfeasible to store all raw sequencing reads in a central archive or even in local storage.
Genome biologists will have to start acting like the high energy physicists, who filter the huge datasets coming out of their collectors for a tiny number of informative events and then discard the rest. Even though raw read sets may not be preserved in their entirety, it will remain imperative for the assembled genomes of animals, plants and ecological communities to be maintained in publicly accessible form. But these are also rapidly growing in size and complexity because of the drop in sequencing costs and the growth of derivative technologies such as chromatin immunoprecipitation with sequencing (ChIP-seq [32]), DNA methylation sequencing [33] and chromatin interaction mapping [34]. These large datasets pose significant challenges for both the primary and secondary genome sequence repositories who must maintain the data, as well as the 'power users' who are accustomed to downloading the data to local computers for analysis. Reconsider the traditional genome informatics ecosystem of Figure 1. It is inefficient and wasteful in several ways. For the value-added genome integrators to do their magic with the data, they must download it from the archival databases across the internet and store copies in their local storage systems. The power users must do the same thing: either downloading the data directly from the archive, or downloading it from one of the integrators. This entails moving the same datasets across the network repeatedly and mirroring them in multiple local storage systems. When datasets are updated, each of the mirrors must detect that fact and refresh their copies. As datasets get larger, this process of mirroring and refreshing becomes increasingly cumbersome, error prone and expensive. A less obvious inefficiency comes from the need of the archives, integrators and power users to maintain local compute clusters to meet their analysis needs. 
NCBI, UCSC and the other genome data providers maintain large server farms that process genome data and serve it out via the web. The load on the server farm fluctuates hourly, daily and seasonally. At any time, a good portion of their clusters is sitting idle, waiting in reserve for periods of peak activity when a big new genome dataset comes in, or a major scientific meeting is getting close. However, even though much of the cluster is idle, it still consumes electricity and requires the care of a systems administration staff. Bioinformaticians and other computational biologists face similar problems. They can choose between building a cluster that is adequate to meet their everyday needs, or build one with the capacity to handle peak usage. In the former case, the researcher risks being unable to run an unusually involved analysis in reasonable running time and possibly being scooped by a competitor. In the latter case, they waste money purchasing and maintaining a system that they are not using to capacity much of the time. These inefficiencies have been tolerable in a world in which most genome-scale datasets have fit on a DVD (uncompressed, the human genome is about 3 gigabytes). When datasets are measured in terabytes these inefficiencies add up.

Cloud computing to the rescue

Which brings us, at last, to 'cloud computing.' This is a general term for computation-as-a-service. There are various different types of cloud computing, but the one that is closest to the way that computational biologists currently work depends on the concept of a 'virtual machine'. In the traditional economic model of computation, customers purchase server, storage and networking hardware, configure it the way they need, and run software on it. In computation-as-a-service, customers essentially rent the hardware and storage for as long or as short a time as they need to achieve their goals.
Customers pay only for the time the rented systems are running and only for the storage they actually use. This model would be lunatic if the rented machines were physical ones. However, in cloud computing, the rentals are virtual: without ever touching a power cable, customers can power up a fully functional 10-computer server farm with a terabyte of shared storage, upgrade the cluster in minutes to 100 servers when needed for some heavy duty calculations, and then return to the baseline 10-server system when the extra virtual machines are no longer needed. The way it works is that a service provider puts up the capital expenditure of creating an extremely large compute and storage farm (tens of thousands of nodes and petabytes of storage) with all the frills needed to maintain an operation of this size, including a dedicated system administration staff, storage redundancy, data centers distributed to strategically placed parts of the world, and broadband network connectivity. The service provider then implements the infrastructure to give users the ability to create, upload and launch virtual machines on this compute farm. Because of economies of scale, the service provider can obtain highly discounted rates on hardware, electricity and network connectivity, and can pass these savings on to the end users to make virtual machine rental economically competitive with purchasing the real thing. A virtual machine is a piece of software running on the host computer (the real hardware) that emulates the properties of a computer: the emulator provides a virtual central processing unit (CPU), network card, hard disk, keyboard and so forth. You can run the operating system of your choice on the virtual machine, log into it remotely via the internet, configure it to run web servers, databases, load management software, parallel computation libraries, and any other software you favor. 
You may be familiar with virtual machines from working with consumer products such as VMware [35] or open source projects such as KVM [36]. A single physical machine can host multiple virtual machines, and software running on the physical server farm can distribute requests for new virtual machines across the server farm in a way that intelligently distributes load. The experience of working with virtual machines is relatively painless. Choose the physical aspects of the virtual machine you wish to make, including CPU type, memory size and hard disk capacity, specify the operating system you wish to run, and power up one or more machines. Within a couple of minutes, your virtual machines are up and running. Log into them over the network and get to work. When a virtual machine is not running, you can store an image of its bootable hard disk. You can then use this image as a template on which to start up multiple virtual machines, which is how you can launch a virtual compute cluster in a matter of minutes. For the field of genome informatics, a key feature of cloud computing is the ability of service providers and their customers to store large datasets in the cloud. These datasets typically take the form of virtual disk images that can be attached to virtual machines as local hard disks and/or shared as networked volumes. For example, the entire GenBank archive could be (and in fact is, see below) stored in the cloud as a disk image that can be loaded and unloaded as needed. Figure 3 shows what the genome informatics ecosystem might look like in a cloud computing environment. Here, instead of there being separate copies of genome datasets stored at diverse locations and groups copying the data to their local machines in order to work with them, most datasets are stored in the cloud as virtual disks and databases. 
Web services that run on top of these datasets, including both the primary archives and the value-added integrators, run as virtual machines within the cloud. Casual users, who are accustomed to accessing the data via the web pages at NCBI, DDBJ, Ensembl or UCSC, continue to work with the data in their accustomed way; the fact that these servers are now located inside the cloud is invisible to them.

Figure 3. The 'new' genome informatics ecosystem based on cloud computing. In this model, the community's storage and compute resources are co-located in a 'cloud' maintained by a large service provider. The sequence archives and value-added integrators maintain servers and storage systems within the cloud, and use more or less capacity as needed for daily and seasonal fluctuations in usage. Casual users continue to access the data via the websites of the archives and integrators, but power users now have the option of creating virtual on-demand compute clusters within the cloud, which have direct access to the sequencing datasets.

Power users can continue to download the data, but they now have an attractive alternative. Instead of moving the data to the compute cluster, they move the compute cluster to the data. Using the facilities provided by the service provider, they configure a virtual machine image that contains the software they wish to run, launch as many copies as they need, mount the disks and databases containing the public datasets they need, and do the analysis. When the job is complete, their virtual cluster sends them the results and then vanishes until it is needed again. Cloud computing also creates a new niche in the ecosystem for genome software developers to package their work in the form of virtual machines. For example, many genome annotation groups have developed pipelines for identifying and classifying genes and other functional elements.
Although many of these pipelines are open source, packaging and distributing them for use by other groups has been challenging given their many software dependencies and site-specific configuration options. In a cloud computing environment these pipelines can be packaged into virtual machine images and stored in a way that lets anyone copy them, run them and customize them for their own needs, thus avoiding the software installation and configuration complexities.

But will it work?

Cloud computing is real. The earliest service provider to realize a practical cloud computing environment was Amazon, with its Elastic Compute Cloud (EC2) service [37] introduced in 2005. It supports a variety of Linux and Windows virtual machines, a virtual storage system, and mechanisms for managing internet protocol (IP) addresses. Amazon also provides a virtual private network service that allows organizations with their own compute resources to extend their local area network into Amazon's cloud to create what is sometimes called a 'hybrid' cloud. Other service providers, notably Rackspace Cloud [38] and Flexiant [39], offer cloud services with similar overall functionality but many distinguishing differences of detail. As of today, you can establish an account with Amazon Web Services or one of the other commercial vendors, launch a virtual machine instance from a wide variety of generic and bioinformatics-oriented images and attach any one of several large public genome-oriented datasets.
For virtual machine images, you can choose images prepopulated with Galaxy [40], a powerful web-based system for performing many common genome analysis tasks, Bioconductor [41], a programming environment that is integrated with the R statistics package [42], GBrowse [43], a genome browser, BioPerl [44], a comprehensive set of bioinformatics modules written in the Perl programming language, JCVI Cloud BioLinux [45], a collection of bioinformatics tools including the Celera Assembler, and a variety of others. Several images that run specialized instances of the UCSC Genome Browser are under development [46]. In addition to these useful images, Amazon provides several large genomic datasets in its cloud. These include a complete copy of GenBank (200 gigabytes), the 30× coverage sequencing reads of a trio of individuals from the 1000 Genomes Project (700 gigabytes) and the genome databases from Ensembl, which includes the annotated genomes of human and 50 other species (150 gigabytes of annotations plus 100 gigabytes of sequence). These datasets were contributed to Amazon's repository of public datasets by a variety of institutions and can be attached to virtual machine images for a nominal fee. There are also a growing number of academic compute cloud projects based on open source cloud management software, such as Eucalyptus [47]. One such project is the Open Cloud Consortium [48], with participants from a group of American universities and industrial partners; another is the Cloud Computing University Initiative, an effort initiated by IBM and Google in partnership with a series of academic institutions [49], and supplemented by grants from the US National Science Foundation [50], for use by themselves and the community. 
Academic clouds may in fact be a better long-term solution for genome informatics than using a commercial system, because genome computing has requirements for high data read and write speeds that are quite different from typical business applications. Academic clouds will likely be able to tune their performance characteristics to the needs of scientific computing.

The economics of cloud computing

Is this change in the ecosystem really going to happen? There are some significant downsides to moving genomics into the cloud. An important one is the cost of migrating existing systems into an environment that is unlike what exists today. Both the genome databases and the value-added integrators will need to make significant changes in their standard operating procedures and their funding models as capital expenditures are shifted into recurrent costs; genomics power users will also need to adjust to the new paradigm. Another issue that needs to be dealt with is how to handle potentially identifiable genetic data, such as that produced by whole genome association studies or disease sequencing projects. These data are currently stored in restricted-access databases. In order to move such datasets into a public cloud operated by Amazon or another service provider, they will have to be encrypted before entering the cloud and a layer of software developed that allows authorized users access to them. Such a system would be covered by a variety of privacy regulations and would take time to get right at both the technological and the legal level. Then there is the money question. Does cloud computing make economic sense for genomics? It is difficult to make blanket conclusions about the relative costs of renting versus buying computational services, but a good discussion of the issues can be found in a technical report on Cloud Computing published about a year ago by the UC Berkeley Reliable Adaptive Distributed Systems Laboratory [51].
The conclusion of this report is that when all the costs of running a data center are factored in, including hardware depreciation, electricity, cooling, network connectivity, service contracts and administrator salaries, the cost of renting a data center from Amazon is marginally more expensive than buying one. However, when the flexibility of the cloud to support a virtual data center that shrinks and grows as needed is factored in, the economics start to look downright good. For genomics, the biggest obstacle to moving to the cloud may well be network bandwidth. A typical research institution will have network bandwidth of about a gigabit/second (roughly 125 megabytes/second). On a good day this will support sustained transfer rates of 5 to 10 megabytes/second across the internet. Transferring a 100 gigabyte next-generation sequencing data file across such a link will take about a week in the best case. A 10 gigabit/second connection (1.25 gigabytes/second), which is typical for major universities and some of the larger research institutions, reduces the transfer time to under a day, but only at the cost of hogging much of the institution's bandwidth. Clearly cloud services will not be used for production sequencing any time soon. If cloud computing is to work for genomics, the service providers will have to offer some flexibility in how large datasets get into the system. For instance, they could accept external disks shipped by mail the way that the Protein Data Bank [52] once accepted atomic structure submissions on tape and floppy disk. In fact, a now-defunct Google initiative called Google Research Datasets once planned to collect large scientific datasets by shipping around 3-terabyte disk arrays [53]. The reversal of the advantage that Moore's Law has had over sequencing costs will have long-term consequences for the field of genome informatics.
In my opinion the most likely outcome is to turn the current genome analysis paradigm on its head and force the software to come to the data rather than the other way around. Cloud computing is an attractive technology at this critical juncture.

              Bioinformatics challenges of new sequencing technology.

              New DNA sequencing technologies can sequence up to one billion bases in a single day at low cost, putting large-scale sequencing within the reach of many scientists. Many researchers are forging ahead with projects to sequence a range of species using the new technologies. However, these new technologies produce read lengths as short as 35-40 nucleotides, posing challenges for genome assembly and annotation. Here we review the challenges and describe some of the bioinformatics systems that are being proposed to solve them. We specifically address issues arising from using these technologies in assembly projects, both de novo and for resequencing purposes, as well as efforts to improve genome annotation in the fragmented assemblies produced by short read lengths.

                Author and article information

                Contributors
                Journal
                peerj-cs
                PeerJ Computer Science
                PeerJ Comput. Sci.
                PeerJ Inc. (San Francisco, USA)
                2376-5992
                27 May 2015
                : 1
                Affiliations
                [1] California Digital Library, Oakland, CA, United States of America
                [2] Institute of Quantitative Social Sciences, Harvard University, Cambridge, MA, United States of America
                [3] Stanford University School of Medicine, Stanford, CA, United States of America
                [4] Center for International Earth Science Information Network (CIESIN), Columbia University, Palisades, NY, United States of America
                [5] National Snow and Ice Data Center, Boulder, CO, United States of America
                [6] ORCID, Inc., Bethesda, MD, United States of America
                [7] Oregon Health and Science University, Portland, OR, United States of America
                [8] World Wide Web Consortium (W3C)/Centrum Wiskunde en Informatica (CWI), Amsterdam, Netherlands
                [9] ICSU Committee on Data for Science and Technology (CODATA), Paris, France
                [10] Solar Data Analysis Center, NASA Goddard Space Flight Center, Greenbelt, MD, United States of America
                [11] Public Library of Science, San Francisco, CA, United States of America
                [12] European Organization for Nuclear Research (CERN), Geneva, Switzerland
                [13] Columbia University Libraries/Information Services, New York, NY, United States of America
                [14] SBA Research, Vienna, Austria
                [15] Institute of Software Technology and Interactive Systems, Vienna University of Technology/TU Wien, Austria
                [16] American Physical Society, Ridge, NY, United States of America
                [17] Elsevier, Oxford, United Kingdom
                [18] Harvard Medical School, Boston, MA, United States of America
                Article
                cs-1
                DOI: 10.7717/peerj-cs.1
                PMID: 26167542

                This is an open access article, free of all copyright, made available under the Creative Commons Public Domain Dedication. This work may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose.

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.

                Product
                Self URI (journal-page): https://peerj.com/computer-science/
                Funding
                Funded by: National Institutes of Health (NIH)
                Award ID: # NIH 1U54AI117925-01
                Funded by: Alfred P. Sloan Foundation
                Award ID: #2012-3-23
                Funded by: European Union (FP7)
                Award ID: #269977
                Award ID: #269940
                Funded by: National Aeronautics and Space Administration (NASA)
                Award ID: NNG13HQ04C
                This work was funded in part by generous grants from the US National Institutes of Health and National Aeronautics and Space Administration, the Alfred P. Sloan Foundation, and the European Union (FP7). Support from the National Institutes of Health (NIH) was provided via grant # NIH 1U54AI117925-01 in the Big Data to Knowledge program, supporting the Center for Expanded Data Annotation and Retrieval (CEDAR). Support from the National Aeronautics and Space Administration (NASA) was provided under Contract NNG13HQ04C for the Continued Operation of the Socioeconomic Data and Applications Center (SEDAC). Support from The Alfred P. Sloan Foundation was provided under two grants: a. Grant # 2012-3-23 to the Harvard Institute for Quantitative Social Sciences, “Helping Journals to Upgrade Data Publication for Reusable Research”; and b. a grant to the California Digital Library, “CLIR/DLF Postdoctoral Fellowship in Data Curation for the Sciences and Social Sciences”. The European Union partially supported this work under the FP7 contracts #269977 supporting the Alliance for Permanent Access and #269940 supporting Digital Preservation for Timeless Business Processes and Services. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Human–Computer Interaction
                Data Science
                Digital Libraries
                World Wide Web and Web Science

                Computer science

                Data archiving, Machine accessibility, Data citation, Data accessibility
