      Statistical representation models for mutation information within genomic data

          Abstract

          Background

          As DNA sequencing technologies improve and become cheaper, genomic data can be used to diagnose many diseases, such as cancer. Raw human genome data is very large for computational systems to process, so there is a need for a compact and accurate representation of the valuable information in DNA. Complex genetic disorders often result from multiple gene mutations, and these mutations do not contribute equally to the development of a disease. Inspired by the field of information retrieval, we propose using the term frequency (tf) and BM25 term weighting measures together with the inverse document frequency (idf) and relevance frequency (rf) measures to weight genes based on their mutations. The underlying assumption is that the more mutations a gene has in patients with a certain disease and the fewer mutations it has in other patients, the more discriminative that gene is.
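
The weighting idea can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so the code below assumes the textbook information retrieval forms: idf = log(N / df), rf = log2(2 + a / max(1, c)), and the usual BM25 saturation of tf with parameters k1 and b. It also assumes that a patient plays the role of a document, a gene the role of a term, and the mutation count of a gene the role of tf. All function names, parameter defaults, and the toy data are hypothetical and are not taken from the article.

```python
import math


def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Saturated term frequency in the standard BM25 form (assumed, not from the article)."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))


def idf(n_patients, n_with_mutation):
    """Inverse document frequency: genes mutated in few patients get a larger weight."""
    return math.log(n_patients / n_with_mutation)


def rf(pos_with_mutation, neg_with_mutation):
    """Relevance frequency: large when a gene is mutated mostly in the target class."""
    return math.log2(2 + pos_with_mutation / max(1, neg_with_mutation))


def tf_rf_vectors(counts, labels, target, use_bm25=True):
    """Turn per-patient mutation counts into (BM25-)tf-rf vectors for one target class.

    counts: list of dicts mapping gene name -> mutation count (one dict per patient)
    labels: list of disease labels, aligned with counts
    """
    genes = sorted({g for c in counts for g in c})
    avg_len = sum(sum(c.values()) for c in counts) / len(counts)

    # Class-dependent weight of each gene.
    rf_by_gene = {}
    for g in genes:
        pos = sum(1 for c, y in zip(counts, labels) if y == target and c.get(g, 0) > 0)
        neg = sum(1 for c, y in zip(counts, labels) if y != target and c.get(g, 0) > 0)
        rf_by_gene[g] = rf(pos, neg)

    # Patient-dependent frequency of each gene, scaled by the gene weight.
    vectors = []
    for c in counts:
        doc_len = sum(c.values())
        row = []
        for g in genes:
            tf = c.get(g, 0)
            if use_bm25:
                tf = bm25_tf(tf, doc_len, avg_len)
            row.append(tf * rf_by_gene[g])
        vectors.append(row)
    return genes, vectors


# Toy data (hypothetical): two "lung" patients and two "breast" patients.
toy_counts = [
    {"TP53": 3, "KRAS": 1},
    {"TP53": 2},
    {"KRAS": 2, "BRCA1": 1},
    {"BRCA1": 4},
]
toy_labels = ["lung", "lung", "breast", "breast"]

if __name__ == "__main__":
    genes, vectors = tf_rf_vectors(toy_counts, toy_labels, target="lung")
    print(genes)
    for row in vectors:
        print([round(w, 3) for w in row])
```

In this toy data, TP53 is mutated only in the target class and therefore receives the largest rf weight, which matches the assumption stated above: a gene mutated often in one disease and rarely elsewhere is treated as more discriminative.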

          Results

          We evaluated the proposed representations on the task of cancer type classification. We applied various machine learning techniques using the tf-idf and tf-rf schemes and their BM25 versions. Our results show that the BM25-tf-rf representation leads to improved classification accuracy and f-score values compared to the other representations. The highest accuracy (76.44%) and f-score (76.95%) are achieved with the BM25-tf-rf based data representation.
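
For readers unfamiliar with the two reported metrics, the sketch below computes accuracy and an f-score for a multi-class prediction. The abstract does not say how the f-score was averaged across cancer types, so the macro average used here is an assumption, and the labels are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of patients whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f_score(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is an assumption)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical labels for four patients.
    truth = ["lung", "lung", "breast", "breast"]
    predicted = ["lung", "breast", "breast", "breast"]
    print(accuracy(truth, predicted))
    print(macro_f_score(truth, predicted))
```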

          Conclusions

          In our experiments, the BM25-tf-rf scheme combined with the proposed neural network model is shown to be the best performing classification system for our case study of cancer type classification. This system is further used for causal gene analysis. Examples of the most effective genes used for decision making are reported in the literature as target or causal genes.


                Author and article information

                Contributors
                arzucan.ozgur@boun.edu.tr
                gurgen@boun.edu.tr
                Journal
                BMC Bioinformatics
                BioMed Central (London)
                1471-2105
                13 June 2019
                2019
                Volume: 20
                Article number: 324
                Affiliations
                ISNI 0000 0001 2253 9056, GRID grid.11220.30, Department of Computer Engineering, Boğaziçi University, İstanbul, Turkey
                Article
                2868
                10.1186/s12859-019-2868-4
                6567431
                31195961
                b0295d26-bc19-40ab-b3e1-8d319f92e3eb
                © The Author(s) 2019

                Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

                History
                Received: 15 November 2018
                Accepted: 30 April 2019
                Categories
                Research Article

                Bioinformatics & Computational biology
                information retrieval,machine learning,tf-idf,tf-rf,bm25,dna mutations,gene weighting,disease classification
