Eleven quick tips for data cleaning and feature engineering

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

Applying computational statistics or machine learning methods to data is a key component of many scientific studies, in any field, but alone might not be sufficient to generate robust and reliable outcomes and results. Before applying any discovery method, preprocessing steps are necessary to prepare the data to the computational analysis. In this framework, data cleaning and feature engineering are key pillars of any scientific study involving data analysis and that should be adequately designed and performed since the first phases of the project. We call “feature” a variable describing a particular trait of a person or an observation, recorded usually as a column in a dataset. Even if pivotal, these data cleaning and feature engineering steps sometimes are done poorly or inefficiently, especially by beginners and unexperienced researchers. For this reason, we propose here our quick tips for data cleaning and feature engineering on how to carry out these important preprocessing steps correctly avoiding common mistakes and pitfalls. Although we designed these guidelines with bioinformatics and health informatics scenarios in mind, we believe they can more in general be applied to any scientific area. We therefore target these guidelines to any researcher or practitioners wanting to perform data cleaning or feature engineering. We believe our simple recommendations can help researchers and scholars perform better computational analyses that can lead, in turn, to more solid outcomes and more reliable discoveries.

Related collections

Most cited references 201

Record: found
Abstract: found
Article: found

Is Open Access

Highly accurate protein structure prediction with AlphaFold

John Jumper, Richard Evans, Alexander Pritzel … (2021)

Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort 1 – 4 , the structures of around 100,000 unique proteins have been determined 5 , but this represents a small fraction of the billions of known protein sequences 6 , 7 . Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence—the structure prediction component of the ‘protein folding problem’ 8 —has been an important open research problem for more than 50 years 9 . Despite recent progress 10 – 14 , existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14) 15 , demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm. AlphaFold predicts protein structures with an accuracy competitive with experimental structures in the majority of cases using a novel deep learning architecture.

0 comments Cited 8422 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: not found
Article: not found

Matplotlib: A 2D Graphics Environment

John Hunter (2007)

0 comments Cited 5434 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: found

Is Open Access

The FAIR Guiding Principles for scientific data management and stewardship

Mark D Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg … (2016)

There is an urgent need to improve the infrastructure supporting the reuse of scholarly data. A diverse set of stakeholders—representing academia, industry, funding agencies, and scholarly publishers—have come together to design and jointly endorse a concise and measureable set of principles that we refer to as the FAIR Data Principles. The intent is that these may act as a guideline for those wishing to enhance the reusability of their data holdings. Distinct from peer initiatives that focus on the human scholar, the FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals. This Comment is the first formal publication of the FAIR Principles, and includes the rationale behind them, and some exemplar implementations in the community.

0 comments Cited 2970 times – based on 0 reviews      Review now

Bookmark

All references

Author and article information

Contributors

Francis Ouellette: Role: Editor

Journal

Journal ID (nlm-ta): PLoS Comput Biol

Journal ID (iso-abbrev): PLoS Comput Biol

Journal ID (publisher-id): plos

Title: PLOS Computational Biology

Publisher: Public Library of Science (San Francisco, CA USA )

ISSN (Print): 1553-734X

ISSN (Electronic): 1553-7358

Publication date (Electronic): 15 December 2022

Publication date Collection: December 2022

Volume: 18

Issue: 12

Electronic Location Identifier: e1010718

Affiliations

[1 ] Institute of Health Policy Management and Evaluation, University of Toronto, Toronto, Ontario, Canada

[2 ] Dipartimento di Informatica Bioingegneria Robotica e Ingegneria dei Sistemi, Università di Genova, Genoa, Italy

[3 ] ZenaByte S.r.l., Genoa, Italy

[4 ] Dipartimento di Ingegneria dell’Informazione, Università di Padova, Padua, Italy

McGill University, CANADA

Author notes

The authors declare they have no conflict of interest.

* E-mail: davidechicco@ 123456davidechicco.it

Author information

Davide Chicco https://orcid.org/0000-0001-9655-7142

Luca Oneto https://orcid.org/0000-0002-8445-395X

Article

Publisher ID: PCOMPBIOL-D-22-01382

DOI: 10.1371/journal.pcbi.1010718

PMC ID: 9754225

PubMed ID: 36520712

SO-VID: 69a2ac79-3992-412b-98a3-f121fec8fe94

License:

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

History

Page count

Figures: 0, Tables: 0, Pages: 21

Funding

The authors received no specific funding for this work.

Comments

Comment on this article

scite_

Cited by 5

See all cited by

Most referenced authors 3,954

See all reference authors

Eleven quick tips for data cleaning and feature engineering

Read this article at

Abstract

Related collections

Network and Systems Medicine

Most cited references 201

Highly accurate protein structure prediction with AlphaFold

Matplotlib: A 2D Graphics Environment

The FAIR Guiding Principles for scientific data management and stewardship

Author and article information

Contributors

Journal

Affiliations

Author notes

Author information

Article

History

Page count

Funding

Categories

Comments

Comment on this article

Similar content 108

Cited by 5

Most referenced authors 3,954