22
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      A Study of "Wheat" and "Chaff" in Source Code

      Preprint
      , , , ,

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Natural language is robust against noise. The meaning of many sentences survives the loss of words, sometimes many of them. Some words in a sentence, however, cannot be lost without changing the meaning of the sentence. We call these words "wheat" and the rest "chaff." The word "not" in the sentence "I do not like rain" is wheat and "do" is chaff. For human understanding of the purpose and behavior of source code, we hypothesize that the same holds. To quantify the extent to which we can separate code into "wheat" and "chaff", we study a large (100M LOC), diverse corpus of real-world projects in Java. Since methods represent natural, likely distinct units of code, we use the, approximately, 9M Java methods in the corpus to approximate a universe of "sentences." We "thresh", or lex, functions, then "winnow" them to extract their wheat by computing the minimal distinguishing subset (MINSET). Our results confirm that programs contain much chaff. On average, MINSETS have 1.56 words (none exceeds 6) and comprise 4% of their methods. Beyond its intrinsic scientific interest, our work offers the first quantitative evidence for recent promising work on keyword-based programming and insight into how to develop powerful, alternative programming systems.

          Related collections

          Most cited references20

          • Record: found
          • Abstract: not found
          • Conference Proceedings: not found

          On the naturalness of software

            Bookmark
            • Record: found
            • Abstract: found
            • Article: not found

            From essential to persistent genes: a functional approach to constructing synthetic life

            A central undertaking in synthetic biology (SB) is the quest for the ‘minimal genome’. However, ‘minimal sets’ of essential genes are strongly context-dependent and, in all prokaryotic genomes sequenced to date, not a single protein-coding gene is entirely conserved. Furthermore, a lack of consensus in the field as to what attributes make a gene truly essential adds another aspect of variation. Thus, a universal minimal genome remains elusive. Here, as an alternative to defining a minimal genome, we propose that the concept of gene persistence can be used to classify genes needed for robust long-term survival. Persistent genes, although not ubiquitous, are conserved in a majority of genomes, tend to be expressed at high levels, and are frequently located on the leading DNA strand. These criteria impose constraints on genome organization, and these are important considerations for engineering cells and for creating cellular life-like forms in SB.
              Bookmark
              • Record: found
              • Abstract: not found
              • Conference Proceedings: not found

              A study of the uniqueness of source code

                Bookmark

                Author and article information

                Journal
                2015-02-04
                Article
                1502.01410
                9dd6fd50-6cb0-420b-92d0-d8d3a1c4c9cd

                http://arxiv.org/licenses/nonexclusive-distrib/1.0/

                History
                Custom metadata
                10 pages, Under Submission
                cs.SE

                Software engineering
                Software engineering

                Comments

                Comment on this article