      Is Open Access

      SeqScrub: a web tool for automatic cleaning and annotation of FASTA file headers for bioinformatic applications

      research-article


          Abstract

          Data consistency is necessary for effective bioinformatic analysis. SeqScrub is a web tool that parses and maintains consistent information about protein and DNA sequences in FASTA file format, checks if records are current, and adds taxonomic information by matching identifiers against entries in authoritative biological sequence databases. SeqScrub provides a powerful, yet simple workflow for managing, enriching and exchanging data, which is crucial to establish a record of provenance for sequences found from broad and varied searches; for example, using BLAST on continually updated genome sequence sets. Headers standardized using SeqScrub can be parsed by a majority of bioinformatic tools, stay uniformly named between collaborators and contain informative labels to aid management of reproducible, scientific data.

          SeqScrub is available at http://bioinf.scmb.uq.edu.au/seqscrub

          METHOD SUMMARY

          SeqScrub is a web tool that takes a set of biological sequences in FASTA format and allows the user to: 1) ‘scrub’ files by removing unnecessary information from the sequence identifier such as characters and spaces that can cause input errors in bioinformatic tools; 2) check that sequences are not obsolete; and 3) annotate a sequence's taxonomic information onto the header.
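          Steps 1 and 3 above can be illustrated with a minimal Python sketch. This is not SeqScrub's own implementation: the function names (parse_fasta, scrub_header, scrub_fasta), the set of characters treated as safe, the "|" separator, and the user-supplied taxonomy dictionary are all assumptions made for illustration. Step 2, checking whether a record is obsolete, requires querying an external sequence database and is left out.

```python
import re


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def scrub_header(header, taxonomy=None):
    """Keep only the identifier and optionally append a taxonomy label.

    `taxonomy` is a hypothetical dict mapping cleaned identifiers to
    species names; a real tool would look these up in a database.
    """
    tokens = header.split()
    identifier = tokens[0] if tokens else ""
    # Assumed "safe" character set: letters, digits, '_', '.', '-'.
    identifier = re.sub(r"[^A-Za-z0-9_.\-]", "_", identifier)
    label = (taxonomy or {}).get(identifier)
    if label:
        # Spaces in labels are replaced so the header stays one token.
        label = re.sub(r"\s+", "_", label.strip())
        return f"{identifier}|{label}"
    return identifier


def scrub_fasta(text, taxonomy=None):
    """Return FASTA text with scrubbed headers, one sequence line per record."""
    return "".join(
        f">{scrub_header(header, taxonomy)}\n{seq}\n"
        for header, seq in parse_fasta(text)
    )


raw = (
    ">XP_012345.1 hypothetical protein [Homo sapiens]\n"
    "MKV\nLLA\n"
    ">seq:2 (draft)\n"
    "ACGT\n"
)
cleaned = scrub_fasta(raw, {"XP_012345.1": "Homo sapiens"})
print(cleaned, end="")
```

          In this sketch the free-text description after the first whitespace is dropped, so every header becomes a single token that downstream tools read the same way. The identifier XP_012345.1 and both sequences are made-up placeholders, not real records.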


            Author and article information

            Journal
            BTN
            BioTechniques
            Future Science Ltd (London, UK)
            0736-6205
            1940-9818
            20 June 2019
            August 2019
            Volume: 67
            Issue: 2
            Pages: 50-54
            Affiliations
            1School of Chemistry & Molecular Biosciences, The University of Queensland, Brisbane, QLD 4072, Australia
            2Food Biotechnology Laboratory, Department of Food Sciences & Technology, BOKU University of Natural Resources & Life Sciences, Vienna, Austria
            Author notes
            *Author for correspondence: gabriel.foley@uqconnect.edu.au
            Article
            DOI: 10.2144/btn-2018-0188
            PMID: 31218882
            © 2019 Gabriel Foley

            This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License

            History
            Received: 13 December 2018
            Accepted: 30 April 2019
            Published: 20 June 2019
            Page count
            Pages: 5
            Categories
            Report

            General life sciences, Cell biology, Molecular biology, Biotechnology, Genetics, Life sciences
            web application, data sanitization, data curation, data consistency, ancestral sequence reconstruction, taxonomic annotation
