Shortcomings of SARS-CoV-2 genomic metadata

Abstract

Objective

The SARS-CoV-2 pandemic has prompted one of the most extensive and expeditious genomic sequencing efforts in history. Each viral genome is accompanied by a set of metadata that supplies important information, such as the geographic origin of the sample, the age of the host, and the lab at which the sample was sequenced, and that is integral to epidemiological efforts and public health direction. Here, we interrogate some shortcomings of metadata within the GISAID database to raise awareness of common errors and inconsistencies that may affect data-driven analyses, and we suggest possible avenues for resolution.

Results

Our analysis reveals a startling prevalence of spelling errors and inconsistent naming conventions, which together occur in an estimated 9.8% and 11.6% of “originating lab” and “submitting lab” GISAID metadata entries, respectively. We also find numerous ambiguous entries that provide very little information about the actual source of a sample and could easily be associated with multiple sources worldwide. Importantly, all of these issues can impair association studies and reduce their accuracy by making a group of samples appear to come from multiple sources when they all come from a single source, or vice versa.
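
To illustrate how spelling errors and inconsistent naming can split one source into several, the following is a minimal sketch, not the method used in the article. It assumes lab names are available as plain strings, normalizes case and punctuation, and then greedily groups names whose normalized forms are nearly identical according to Python's standard `difflib.SequenceMatcher`. The lab names, the function names `normalize` and `group_similar`, and the 0.9 similarity threshold are all hypothetical choices made for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase the name and replace runs of non-alphanumerics with one space."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def group_similar(names, threshold=0.9):
    """Greedily group names whose normalized forms are near-identical.

    A name joins the first group whose representative has a difflib
    similarity ratio of at least `threshold`; otherwise it starts a new group.
    The threshold is an illustrative assumption, not a recommended value.
    """
    groups = []  # list of (normalized representative, [original names])
    for name in names:
        key = normalize(name)
        for representative, members in groups:
            if SequenceMatcher(None, representative, key).ratio() >= threshold:
                members.append(name)
                break
        else:
            groups.append((key, [name]))
    return [members for _, members in groups]


# Hypothetical "originating lab" entries (not taken from GISAID).
entries = [
    "Example Institute of Virology",
    "Example Institue of Virology",    # spelling error
    "example institute of virology.",  # inconsistent case and punctuation
    "Unrelated Public Health Lab",
]

groups = group_similar(entries)
for members in groups:
    print(len(members), members)
```

Without such grouping, the three variants of the first lab would be counted as three distinct sources. A fixed similarity threshold can also merge labs that are genuinely different but similarly named, so any grouping of this kind would need manual review.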

Author and article information

Contributors

Landen Gozashti

ORCID: http://orcid.org/0000-0001-6023-3138

lgozashti@g.harvard.edu

Journal

Journal ID (nlm-ta): BMC Res Notes

Journal ID (iso-abbrev): BMC Res Notes

Title: BMC Research Notes

Publisher: BioMed Central (London)

ISSN (Electronic): 1756-0500

Publication date (Electronic): 17 May 2021

Publication date PMC-release: 17 May 2021

Publication date Collection: 2021

Volume: 14

Electronic Location Identifier: 189

Affiliations

[1] GRID grid.38142.3c, ISNI 0000 0004 1936 754X, Department of Organismic and Evolutionary Biology and Museum of Comparative Zoology, Harvard University, Cambridge, MA 02138, USA

[2] GRID grid.205975.c, ISNI 0000 0001 0740 6917, Department of Biomolecular Engineering and Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA

Article

Publisher ID: 5605

DOI: 10.1186/s13104-021-05605-9

PMC ID: 8128092

PubMed ID: 34001211

SO-VID: 750e6282-cb94-4b98-8761-e17cbecf21d5

License:

Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

History

Date received: 19 February 2021

Date accepted: 6 May 2021

Custom metadata

ScienceOpen disciplines: Medicine

Keywords: sars-cov-2, metadata, genomics, databases, data quality, covid-19
