11
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Eleven quick tips for data cleaning and feature engineering

      research-article
      1 , * , , 2 , 3 , 4
      PLOS Computational Biology
      Public Library of Science

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Applying computational statistics or machine learning methods to data is a key component of many scientific studies, in any field, but alone might not be sufficient to generate robust and reliable outcomes and results. Before applying any discovery method, preprocessing steps are necessary to prepare the data to the computational analysis. In this framework, data cleaning and feature engineering are key pillars of any scientific study involving data analysis and that should be adequately designed and performed since the first phases of the project. We call “feature” a variable describing a particular trait of a person or an observation, recorded usually as a column in a dataset. Even if pivotal, these data cleaning and feature engineering steps sometimes are done poorly or inefficiently, especially by beginners and unexperienced researchers. For this reason, we propose here our quick tips for data cleaning and feature engineering on how to carry out these important preprocessing steps correctly avoiding common mistakes and pitfalls. Although we designed these guidelines with bioinformatics and health informatics scenarios in mind, we believe they can more in general be applied to any scientific area. We therefore target these guidelines to any researcher or practitioners wanting to perform data cleaning or feature engineering. We believe our simple recommendations can help researchers and scholars perform better computational analyses that can lead, in turn, to more solid outcomes and more reliable discoveries.

          Related collections

          Most cited references201

          • Record: found
          • Abstract: found
          • Article: found
          Is Open Access

          Highly accurate protein structure prediction with AlphaFold

          Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort 1 – 4 , the structures of around 100,000 unique proteins have been determined 5 , but this represents a small fraction of the billions of known protein sequences 6 , 7 . Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence—the structure prediction component of the ‘protein folding problem’ 8 —has been an important open research problem for more than 50 years 9 . Despite recent progress 10 – 14 , existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14) 15 , demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm. AlphaFold predicts protein structures with an accuracy competitive with experimental structures in the majority of cases using a novel deep learning architecture.
            Bookmark
            • Record: found
            • Abstract: not found
            • Article: not found

            Matplotlib: A 2D Graphics Environment

              Bookmark
              • Record: found
              • Abstract: found
              • Article: found
              Is Open Access

              The FAIR Guiding Principles for scientific data management and stewardship

              There is an urgent need to improve the infrastructure supporting the reuse of scholarly data. A diverse set of stakeholders—representing academia, industry, funding agencies, and scholarly publishers—have come together to design and jointly endorse a concise and measureable set of principles that we refer to as the FAIR Data Principles. The intent is that these may act as a guideline for those wishing to enhance the reusability of their data holdings. Distinct from peer initiatives that focus on the human scholar, the FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals. This Comment is the first formal publication of the FAIR Principles, and includes the rationale behind them, and some exemplar implementations in the community.
                Bookmark

                Author and article information

                Contributors
                Role: Editor
                Journal
                PLoS Comput Biol
                PLoS Comput Biol
                plos
                PLOS Computational Biology
                Public Library of Science (San Francisco, CA USA )
                1553-734X
                1553-7358
                15 December 2022
                December 2022
                : 18
                : 12
                : e1010718
                Affiliations
                [1 ] Institute of Health Policy Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
                [2 ] Dipartimento di Informatica Bioingegneria Robotica e Ingegneria dei Sistemi, Università di Genova, Genoa, Italy
                [3 ] ZenaByte S.r.l., Genoa, Italy
                [4 ] Dipartimento di Ingegneria dell’Informazione, Università di Padova, Padua, Italy
                McGill University, CANADA
                Author notes

                The authors declare they have no conflict of interest.

                Author information
                https://orcid.org/0000-0001-9655-7142
                https://orcid.org/0000-0002-8445-395X
                Article
                PCOMPBIOL-D-22-01382
                10.1371/journal.pcbi.1010718
                9754225
                36520712
                69a2ac79-3992-412b-98a3-f121fec8fe94
                © 2022 Chicco et al

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                Page count
                Figures: 0, Tables: 0, Pages: 21
                Funding
                The authors received no specific funding for this work.
                Categories
                Education
                Computer and Information Sciences
                Artificial Intelligence
                Machine Learning
                Computer and Information Sciences
                Software Engineering
                Preprocessing
                Engineering and Technology
                Software Engineering
                Preprocessing
                Computer and Information Sciences
                Software Engineering
                Computer Software
                Engineering and Technology
                Software Engineering
                Computer Software
                Computer and Information Sciences
                Data Management
                Data Visualization
                Physical Sciences
                Mathematics
                Applied Mathematics
                Algorithms
                Research and Analysis Methods
                Simulation and Modeling
                Algorithms
                Physical Sciences
                Mathematics
                Applied Mathematics
                Algorithms
                Machine Learning Algorithms
                Research and Analysis Methods
                Simulation and Modeling
                Algorithms
                Machine Learning Algorithms
                Computer and Information Sciences
                Artificial Intelligence
                Machine Learning
                Machine Learning Algorithms
                Engineering and Technology
                Biology and Life Sciences
                Bioengineering
                Engineering and Technology
                Bioengineering

                Quantitative & Systems biology
                Quantitative & Systems biology

                Comments

                Comment on this article