      Optimal prediction of the number of unseen species

          Abstract

          <p id="d6535592e163">Many scientific applications ranging from ecology to genetics use a small sample to estimate the number of distinct elements, known as “species,” in a population. Classical results have shown that <i>n</i> samples can be used to estimate the number of species that would be observed if the sample size were doubled to <span class="inline-formula"> <math id="i1" overflow="scroll"> <mrow> <mn>2</mn> <mi>n</mi> </mrow> </math> </span>. We obtain a class of simple algorithms that extend the estimate all the way to <span class="inline-formula"> <math id="i2" overflow="scroll"> <mrow> <mi>n</mi> <mo> </mo> <mi mathvariant="bold">log</mi> <mo> </mo> <mi>n</mi> </mrow> </math> </span> samples, and we show that this is also the largest possible estimation range. Therefore, statistically speaking, the proverbial bird in the hand is worth log <i>n</i> in the bush. The proposed estimators outperform existing ones on several synthetic and real datasets collected in various disciplines. </p>
          <p class="first" id="d6535592e198">Estimating the number of unseen species is an important problem in many scientific endeavors. Its most popular formulation, introduced by Fisher et al. [Fisher RA, Corbet AS, Williams CB (1943) <i>J Animal Ecol</i> 12(1):42−58], uses <i>n</i> samples to predict the number <i>U</i> of hitherto unseen species that would be observed if <span class="inline-formula"> <math id="i4" overflow="scroll"> <mrow> <mi>t</mi> <mo>⋅</mo> <mi>n</mi> </mrow> </math> </span> new samples were collected. Of considerable interest is the largest ratio <i>t</i> between the number of new and existing samples for which <i>U</i> can be accurately predicted. In seminal works, Good and Toulmin [Good I, Toulmin G (1956) <i>Biometrika</i> 43(1-2):45−63] constructed an intriguing estimator that predicts <i>U</i> for all <span class="inline-formula"> <math id="i5" overflow="scroll"> <mrow> <mi>t</mi> <mo>≤</mo> <mn>1</mn> </mrow> </math> </span>. Subsequently, Efron and Thisted [Efron B, Thisted R (1976) <i>Biometrika</i> 63(3):435−447] proposed a modification that empirically predicts <i>U</i> even for some <span class="inline-formula"> <math id="i6" overflow="scroll"> <mrow> <mi>t</mi> <mo>&gt;</mo> <mn>1</mn> </mrow> </math> </span>, but without provable guarantees. We derive a class of estimators that provably predict <i>U</i> all the way up to <span class="inline-formula"> <math id="i7" overflow="scroll"> <mrow> <mi>t</mi> <mo>∝</mo> <mi>log</mi> <mo>⁡</mo> <mi>n</mi> </mrow> </math> </span>. We also show that this range is the best possible and that the estimator’s mean-square error is near optimal for any <i>t</i>. Our approach yields a provable guarantee for the Efron−Thisted estimator and, in addition, a variant with stronger theoretical and experimental performance than existing methodologies on a variety of synthetic and real datasets. The estimators are simple, linear, computationally efficient, and scalable to massive datasets. Their performance guarantees hold uniformly for all distributions, and apply to all four standard sampling models commonly used across various scientific disciplines: multinomial, Poisson, hypergeometric, and Bernoulli product. </p>
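
To make the word "linear" in the abstract concrete, here is a minimal Python sketch of the classical Good-Toulmin estimator, which is usually written as U = -Σ_i (-t)^i Φ_i, where Φ_i is the number of species seen exactly i times in the original n samples. The second function illustrates one common way of damping the alternating series for t > 1, by weighting each term with a Poisson tail probability. That weighting, the parameter `r`, and all function names are assumptions made for illustration; the abstract does not specify the smoothing used by the proposed estimators.

```python
from collections import Counter
from math import exp


def prevalences(sample):
    """Map i -> number of species observed exactly i times in the sample."""
    species_counts = Counter(sample)
    return Counter(species_counts.values())


def good_toulmin(sample, t):
    """Classical Good-Toulmin estimate of the number of new species
    expected in t * n further samples: -sum_i (-t)^i * Phi_i."""
    phi = prevalences(sample)
    return -sum((-t) ** i * count for i, count in phi.items())


def poisson_tail(i, r):
    """P(L >= i) for L ~ Poisson(r), computed from the cumulative sum."""
    term = exp(-r)  # P(L = 0)
    cdf = 0.0
    for j in range(i):
        cdf += term
        term *= r / (j + 1)  # P(L = j + 1) from P(L = j)
    return max(0.0, 1.0 - cdf)


def smoothed_good_toulmin(sample, t, r):
    """Illustrative smoothed variant (an assumption, not the abstract's
    stated method): each term is weighted by P(L >= i), L ~ Poisson(r),
    so that high-order terms are damped when t > 1."""
    phi = prevalences(sample)
    return -sum(
        (-t) ** i * poisson_tail(i, r) * count for i, count in phi.items()
    )


if __name__ == "__main__":
    # Toy sample: species a seen twice, b once, c three times, d once.
    sample = ["a", "a", "b", "c", "c", "c", "d"]
    print(good_toulmin(sample, 1))    # 2
    print(good_toulmin(sample, 0.5))  # 0.875
    print(smoothed_good_toulmin(sample, 2, r=1.5))
```

Both functions are linear in the prevalences Φ_i and need only one pass over the sample, which is consistent with the abstract's description of the estimators as simple and scalable. Choosing `r` well is the hard part and is not addressed by this sketch.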

          Most cited references (17)

          The Relation Between the Number of Species and the Number of Individuals in a Random Sample of an Animal Population

            Molecular analysis of human forearm superficial skin bacterial biota.

            The microbial ecology of human skin is complex, but little is known about its species composition. We examined the diversity of the skin biota from the superficial volar left and right forearms in six healthy subjects using broad-range small subunit rRNA genes (16S rDNA) PCR-based sequencing of randomly selected clones. For the initial 1,221 clones analyzed, 182 species-level operational taxonomic units (SLOTUs) belonging to eight phyla were identified, estimated as 74.0% [95% confidence interval (C.I.), approximately 64.8-77.9%] of the SLOTUs in this ecosystem; 48.0 +/- 12.2 SLOTUs were found in each subject. Three phyla (Actinobacteria, Firmicutes, and Proteobacteria) accounted for 94.6% of the clones. Most (85.3%) of the bacterial sequences corresponded to known and cultivated species, but 98 (8.0%) clones, comprising 30 phylotypes, had <97% similarity to prior database sequences. Only 6 (6.6%) of the 91 genera and 4 (2.2%) of the 182 SLOTUs, respectively, were found in all six subjects. Analysis of 817 clones obtained 8-10 months later from four subjects showed additional phyla (numbering 2), genera (numbering 28), and SLOTUs (numbering 65). Only four (3.4%) of the 119 genera (Propionibacteria, Corynebacteria, Staphylococcus, and Streptococcus) were observed in each subject tested twice, but these genera represented 54.4% of all clones. These results show that the bacterial biota in normal superficial skin is highly diverse, with few well conserved and well represented genera, but otherwise low-level interpersonal consensus.

              Counting the Uncountable: Statistical Approaches to Estimating Microbial Diversity

                Author and article information

                Journal: Proceedings of the National Academy of Sciences (Proc Natl Acad Sci USA)
                ISSN: 0027-8424 (print), 1091-6490 (online)
                Published: November 22, 2016
                Volume 113, issue 47, pages 13283-13288
                Type: Article
                DOI: 10.1073/pnas.1607774113
                PMCID: 5127330
                PMID: 27830649
                © 2016
