
      Optimisation of cancer classification by machine learning generates an enriched list of candidate drug targets and biomarkers


          Abstract

          A novel list of potential biomarkers was generated from RNA-seq expression data and used to optimise cancer classification.

          The Cancer Genome Atlas has provided expression values of 18 015 genes for different cancer types. Studies on the classification of cancers by machine learning algorithms have used different data and methods, which makes it difficult to compare their performance. It is unclear which algorithm performs best and whether maximum levels of accuracy have been obtained. In this study, we aimed to optimise the diagnosis of cancer by comparing the performance of five algorithms using the same data, and by identifying the smallest possible number of differentiator genes. The accuracies of the five algorithms in classifying cancer type and primary site were determined using a gene expression dataset of 5629 samples and a dataset of 9144 samples, respectively. When trained with training sets ranging in size from 16 718 genes down to 40 genes, Random Forest (RF), Gradient Boosting Machine (GBM), and Neural Network (NN) consistently achieved 100% or near 100% accuracy in the classification of both cancer type and primary site. Reduction of training sets to the 40 highest-ranked genes resulted in 78-fold and 45-fold faster processing times for RF and GBM, respectively. Genes from the olfactory receptor family, the keratin-associated proteins, and the defensin beta family were among the highest ranked. The ensemble algorithms (RF and GBM) and NN were the most accurate at distinguishing between cancer types and primary sites, whereas k-Nearest Neighbours (KNN) was the fastest. Training sets can be reduced to the 40 highest-ranked differentiator genes without any significant loss of accuracy, and these genes include potential drug targets and biomarkers.


                Author and article information

                Journal
                MOOMAW
                Molecular Omics
                Mol. Omics
                Royal Society of Chemistry (RSC)
                2515-4184
                April 14 2020
                2020
                Volume 16, Issue 2, Pages 113-125
                Affiliations
                [1] Department of Electrical and Computer Engineering
                [2] University of the West Indies
                [3] Saint Augustine
                [4] Trinidad and Tobago
                [5] Department of Pre-Clinical Sciences
                Article
                10.1039/C9MO00198K
                940012e3-efe3-413f-84b4-3296aecc1f56
                © 2020

                http://rsc.li/journals-terms-of-use
