Implementation of machine learning in DNA barcoding for determining the plant family taxonomy

Abstract

DNA barcoding has been used extensively in taxonomy and phylogenetics. Differences in certain DNA sequences can distinguish organisms and help classify them into taxa, and the approach has been applied to taxonomic disputes in which morphology alone is insufficient. This research aimed to use hierarchical clustering, an unsupervised machine learning method, to resolve disputes in plant family taxonomy. As a case study we take Leguminosae, which some authors have historically split into three families (Fabaceae, Caesalpiniaceae, and Mimosaceae) and others have treated as a single family (Leguminosae). The study is divided into four phases: (i) data collection, (ii) data preprocessing, (iii) finding the best distance method, and (iv) determining the disputed family. The data were collected from several sources, including the National Center for Biotechnology Information (NCBI), journals, and websites. The validation data, collected from NCBI, were used to determine the best distance method for differentiating families or genera. The data for the Leguminosae case study were collected from journals and a website. In our experiments, the Pearson method was the best distance method for clustering plant ITS sequences, in terms of both accuracy and computational cost. We then used the Pearson method to resolve the disputed family classification within Leguminosae. Based on our results, the Leguminosae in this case study should be grouped into one family.
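To make the clustering step concrete, the following is a minimal sketch of hierarchical clustering with a Pearson-based distance. It is not the authors' implementation. The abstract does not say how sequences are turned into numeric vectors or which linkage is used, so the k-mer count encoding (`kmer_profile`), the distance defined as 1 minus the Pearson correlation, and the average linkage are all assumptions made for illustration. The four sequences are invented toy strings, not real ITS data.

```python
from itertools import product
from math import sqrt


def kmer_profile(seq, k=2):
    """Count overlapping k-mers over ACGT (an assumed encoding, not the paper's)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # ignore k-mers with ambiguous bases such as N
            counts[kmer] += 1
    return [counts[m] for m in kmers]


def pearson_distance(x, y):
    """Return 1 - r, where r is the Pearson correlation between x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 1.0  # r is undefined for a constant vector; treat as uncorrelated
    return 1.0 - cov / (norm_x * norm_y)


def average_linkage(profiles, n_clusters):
    """Agglomerative clustering: merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(profiles))]

    def between(a, b):
        # Average pairwise Pearson distance between members of two clusters.
        total = sum(pearson_distance(profiles[i], profiles[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        _, a, b = min(
            (between(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return [sorted(c) for c in clusters]


# Invented toy sequences (hypothetical labels), two AT-rich and two GC-rich.
toy_sequences = {
    "toy_A1": "ATATATATATATATAT",
    "toy_A2": "ATATATATTATATATA",
    "toy_B1": "GCGCGCGCGCGCGCGC",
    "toy_B2": "GCGCGCGGCGCGCGCG",
}
profiles = [kmer_profile(s) for s in toy_sequences.values()]
clusters = average_linkage(profiles, n_clusters=2)
print(clusters)
```

In a dispute like the Leguminosae case, one would compare whether sequences from the candidate families separate into distinct clusters or merge into one. Real ITS sequences would normally be aligned or otherwise preprocessed first, which this sketch omits.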

Author and article information

Contributors

Lala Septem Riza

Muhammad Iqbal Zain

Ahmad Izzuddin

Yudi Prasetyo

Topik Hidayat

Khyrina Airin Fariza Abu Samah

Journal

Journal ID (nlm-ta): Heliyon

Journal ID (iso-abbrev): Heliyon

Title: Heliyon

Publisher: Elsevier

ISSN (Electronic): 2405-8440

Publication date PMC-release: 21 September 2023

Publication date Collection: October 2023

Publication date (Electronic): 21 September 2023

Volume: 9

Issue: 10

Electronic Location Identifier: e20161

Affiliations

[a] Department of Computer Science Education, Universitas Pendidikan Indonesia, Bandung, Indonesia

[b] Department of Biology Education, Universitas Pendidikan Indonesia, Bandung, Indonesia

[c] Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA Cawangan Melaka Kampus Jasin, Melaka, Malaysia

Author notes

[∗] Corresponding author. lala.s.riza@upi.edu

Article

Publisher Item ID: S2405-8440(23)07369-3

Publisher ID: e20161

DOI: 10.1016/j.heliyon.2023.e20161

PMC ID: 10520734

PubMed ID: 37767518

SO-VID: 130a32ca-8f9c-408f-80ab-791a05636e13

License:

This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

History

Date received: 9 October 2022

Date revision received: 5 September 2023

Date accepted: 13 September 2023
