A fast and efficient algorithm for DNA sequence similarity identification

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

DNA sequence similarity analysis is necessary for enormous purposes including genome analysis, extracting biological information, finding the evolutionary relationship of species. There are two types of sequence analysis which are alignment-based (AB) and alignment-free (AF). AB is effective for small homologous sequences but becomes NP-hard problem for long sequences. However, AF algorithms can solve the major limitations of AB. But most of the existing AF methods show high time complexity and memory consumption, less precision, and less performance on benchmark datasets. To minimize these limitations, we develop an AF algorithm using a 2D \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k-mer$$\end{document} count matrix inspired by the CGR approach. Then we shrink the matrix by analyzing the neighbors and then measure similarities using the best combinations of pairwise distance (PD) and phylogenetic tree methods. We also dynamically choose the value of k for \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k-mer$$\end{document} . We develop an efficient system for finding the positions of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k-mer$$\end{document} in the count matrix. We apply our system in six different datasets. We achieve the top rank for two benchmark datasets from AFproject, 100% accuracy for two datasets (16 S Ribosomal, 18 Eutherian), and achieve a milestone for time complexity and memory consumption in comparison to the existing study datasets (HEV, HIV-1). Therefore, the comparative results of the benchmark datasets and existing studies demonstrate that our method is highly effective, efficient, and accurate. Thus, our method can be used with the top level of authenticity for DNA sequence similarity measurement.

Related collections

Most cited references 34

Record: found
Abstract: found
Article: not found

The neighbor-joining method: a new method for reconstructing phylogenetic trees.

N Saitou, M Nei (1987)

A new method called the neighbor-joining method is proposed for reconstructing phylogenetic trees from evolutionary distance data. The principle of this method is to find pairs of operational taxonomic units (OTUs [= neighbors]) that minimize the total branch length at each stage of clustering of OTUs starting with a starlike tree. The branch lengths as well as the topology of a parsimonious tree can quickly be obtained by using this method. Using computer simulation, we studied the efficiency of this method in obtaining the correct unrooted tree in comparison with that of five other tree-making methods: the unweighted pair group method of analysis, Farris's method, Sattath and Tversky's method, Li's method, and Tateno et al.'s modified Farris method. The new, neighbor-joining method and Sattath and Tversky's method are shown to be generally better than the other methods.

0 comments Cited 2287 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: found

Is Open Access

Mash: fast genome and metagenome distance estimation using MinHash

Brian Ondov, Todd Treangen, Páll Melsted … (2016)

Mash extends the MinHash dimensionality-reduction technique to include a pairwise mutation distance and P value significance test, enabling the efficient clustering and search of massive sequence collections. Mash reduces large sequences and sequence sets to small, representative sketches, from which global mutation distances can be rapidly estimated. We demonstrate several use cases, including the clustering of all 54,118 NCBI RefSeq genomes in 33 CPU h; real-time database search using assembled or unassembled Illumina, Pacific Biosciences, and Oxford Nanopore data; and the scalable clustering of hundreds of metagenomic samples by composition. Mash is freely released under a BSD license (https://github.com/marbl/mash). Electronic supplementary material The online version of this article (doi:10.1186/s13059-016-0997-x) contains supplementary material, which is available to authorized users.

0 comments Cited 849 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: not found

BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data.

O. Gascuel (1997)

We propose an improved version of the neighbor-joining (NJ) algorithm of Saitou and Nei. This new algorithm, BIONJ, follows the same agglomerative scheme as NJ, which consists of iteratively picking a pair of taxa, creating a new mode which represents the cluster of these taxa, and reducing the distance matrix by replacing both taxa by this node. Moreover, BIONJ uses a simple first-order model of the variances and covariances of evolutionary distance estimates. This model is well adapted when these estimates are obtained from aligned sequences. At each step it permits the selection, from the class of admissible reductions, of the reduction which minimizes the variance of the new distance matrix. In this way, we obtain better estimates to choose the pair of taxa to be agglomerated during the next steps. Moreover, in comparison with NJ's estimates, these estimates become better and better as the algorithm proceeds. BIONJ retains the good properties of NJ--especially its low run time. Computer simulations have been performed with 12-taxon model trees to determine BIONJ's efficiency. When the substitution rates are low (maximum pairwise divergence approximately 0.1 substitutions per site) or when they are constant among lineages, BIONJ is only slightly better than NJ. When the substitution rates are higher and vary among lineages,BIONJ clearly has better topological accuracy. In the latter case, for the model trees and the conditions of evolution tested, the topological error reduction is on the average around 20%. With highly-varying-rate trees and with high substitution rates (maximum pairwise divergence approximately 1.0 substitutions per site), the error reduction may even rise above 50%, while the probability of finding the correct tree may be augmented by as much as 15%.

0 comments Cited 613 times – based on 0 reviews      Review now

Bookmark

All references

Author and article information

Contributors

Machbah Uddin:

ORCID: http://orcid.org/0000-0002-1572-8606

machbah.csm@bau.edu.bd

Mohammad Khairul Islam:

ORCID: http://orcid.org/0000-0001-5432-1406

mkislam@cu.ac.bd

Md. Rakib Hassan:

ORCID: http://orcid.org/0000-0001-9368-6942

rakib@bau.edu.bd

Farah Jahan:

ORCID: http://orcid.org/0000-0001-9991-9575

farah_csc@cu.ac.bd

Joong Hwan Baek:

ORCID: http://orcid.org/0000-0003-4606-6737

jhbaek@kau.ac.kr

Journal

Journal ID (nlm-ta): Complex Intell Systems

Journal ID (iso-abbrev): Complex Intell Systems

Title: Complex & Intelligent Systems

Publisher: Springer International Publishing (Cham )

ISSN (Print): 2199-4536

ISSN (Electronic): 2198-6053

Publication date (Electronic): 23 August 2022

Publication date PMC-release: 23 August 2022

Pages: 1-16

Affiliations

[1 ]GRID grid.413089.7, ISNI 0000 0000 9744 3393, Department of Computer Science and Engineering, , University of Chittagong, ; Chittagong, 4331 Bangladesh

[2 ]GRID grid.411511.1, ISNI 0000 0001 2179 3896, Department of Computer Science and Mathematics, , Bangladesh Agricultural University, ; Mymensingh, 2202 Bangladesh

[3 ]GRID grid.440941.c, ISNI 0000 0000 9881 3149, School of Electronics, Telecommunication and Computer Engineering, , Korea Aerospace University, ; Goyang-si, South Korea

Author information

Machbah Uddin http://orcid.org/0000-0002-1572-8606

Mohammad Khairul Islam http://orcid.org/0000-0001-5432-1406

Md. Rakib Hassan http://orcid.org/0000-0001-9368-6942

Farah Jahan http://orcid.org/0000-0001-9991-9575

Joong Hwan Baek http://orcid.org/0000-0003-4606-6737

Article

Publisher ID: 846

DOI: 10.1007/s40747-022-00846-y

PMC ID: 9395857

PubMed ID: 36035628

SO-VID: 4766edcc-2fc7-4bbc-b0a5-36e79bd5c6e7

License:

Open AccessThis article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

History

Date received : 2 September 2021

Date accepted : 5 August 2022

Funding

Funded by: ICT Division, Ministry of Posts, Telecommunications and Information Technology, Government of Bangladesh

Award ID: SL 311, 1st Round, 2020-2021

Award Recipient : Machbah Uddin

Comments

Comment on this article

scite_

Cited by 1

Mottle: Accurate pairwise substitution distance at high divergence through the exploitation of short-read mappers and gradient descent
Authors: Alisa Prusokiene, Neil Boonham, Adrian Fox …

See all cited by

Most referenced authors 332

See all reference authors

A fast and efficient algorithm for DNA sequence similarity identification

Read this article at

Abstract

Related collections

Resource Identification

Most cited references 34

The neighbor-joining method: a new method for reconstructing phylogenetic trees.

Mash: fast genome and metagenome distance estimation using MinHash

BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data.

Author and article information

Contributors

Journal

Affiliations

Author information

Article

History

Funding

Categories

Comments

Comment on this article

Similar content 13

Cited by 1

Most referenced authors 332