2
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Information Theory and the Length Distribution of all Discrete Systems

      Preprint
      ,

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          We begin with the extraordinary observation that the length distribution of 80 million proteins in UniProt, the Universal Protein Resource, measured in amino acids, is qualitatively identical to the length distribution of large collections of computer functions measured in programming language tokens, at all scales. That two such disparate discrete systems share important structural properties suggests that yet other apparently unrelated discrete systems might share the same properties, and certainly invites an explanation. We demonstrate that this is inevitable for all discrete systems of components built from tokens or symbols. Departing from existing work by embedding the Conservation of Hartley-Shannon information (CoHSI) in a classical statistical mechanics framework, we identify two kinds of discrete system, heterogeneous and homogeneous. Heterogeneous systems contain components built from a unique alphabet of tokens and yield an implicit CoHSI distribution with a sharp unimodal peak asymptoting to a power-law. Homogeneous systems contain components each built from just one kind of token unique to that component and yield a CoHSI distribution corresponding to Zipf's law. This theory is applied to heterogeneous systems, (proteome, computer software, music); homogeneous systems (language texts, abundance of the elements); and to systems in which both heterogeneous and homogeneous behaviour co-exist (word frequencies and word length frequencies in language texts). In each case, the predictions of the theory are tested and supported to high levels of statistical significance. We also show that in the same heterogeneous system, different but consistent alphabets must be related by a power-law. We demonstrate this on a large body of music by excluding and including note duration in the definition of the unique alphabet of notes.

          Related collections

          Author and article information

          Journal
          06 September 2017
          Article
          1709.01712
          0a450351-a08d-45bc-88c9-88eef755b3f9

          http://arxiv.org/licenses/nonexclusive-distrib/1.0/

          History
          Custom metadata
          70 pages, 53 figures, inc. 30 pages of Appendices
          q-bio.OT cs.IT cs.PL math.IT physics.bio-ph physics.soc-ph q-bio.PE

          Comments

          Comment on this article