A Study of "Wheat" and "Chaff" in Source Code

Abstract

Natural language is robust against noise. The meaning of many sentences survives the loss of words, sometimes many of them. Some words in a sentence, however, cannot be lost without changing the meaning of the sentence. We call these words "wheat" and the rest "chaff." The word "not" in the sentence "I do not like rain" is wheat and "do" is chaff. We hypothesize that the same holds for human understanding of the purpose and behavior of source code. To quantify the extent to which we can separate code into "wheat" and "chaff," we study a large (100M LOC), diverse corpus of real-world projects in Java. Since methods represent natural, likely distinct units of code, we use the approximately 9M Java methods in the corpus to approximate a universe of "sentences." We "thresh," or lex, functions, then "winnow" them to extract their wheat by computing the minimal distinguishing subset (MINSET). Our results confirm that programs contain much chaff. On average, MINSETs have 1.56 words (none exceeds 6) and comprise 4% of their methods. Beyond its intrinsic scientific interest, our work offers the first quantitative evidence for recent promising work on keyword-based programming and insight into how to develop powerful, alternative programming systems.
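The thresh-then-winnow idea can be illustrated with a small sketch. It rests on assumptions that go beyond the abstract: the regex lexer is a crude stand-in for a real Java lexer, the brute-force search is chosen only for clarity and is not claimed to be the authors' algorithm, and the names `thresh`, `winnow`, and `max_size` are hypothetical. Each method is treated as a set of words, and a minimal distinguishing subset is taken to be a smallest subset of those words that no other method in the universe fully contains.

```python
import re
from itertools import combinations


def thresh(source):
    """Lex a method body into a set of word-like tokens.

    Assumption: a simple identifier/keyword regex replaces a real Java lexer.
    """
    return frozenset(re.findall(r"[A-Za-z_]\w*", source))


def winnow(target, others, max_size=6):
    """Return a smallest subset of `target` that no set in `others` contains.

    Brute force over subsets of increasing size, so the first hit is minimal.
    Returns None when every subset up to `max_size` also occurs inside some
    other method. `max_size` is a hypothetical cap that bounds the search.
    """
    for size in range(1, min(max_size, len(target)) + 1):
        for candidate in combinations(sorted(target), size):
            words = frozenset(candidate)
            if not any(words <= other for other in others):
                return words
    return None


# A toy "universe" of three method bodies (illustrative, not from the corpus).
get_name = thresh("return this.name;")
get_age = thresh("return this.age;")
set_name = thresh("this.name = name;")

minset_get_name = winnow(get_name, [get_age, set_name])
minset_get_age = winnow(get_age, [get_name, set_name])
minset_set_name = winnow(set_name, [get_name, get_age])
```

In this toy universe, `get_age` is distinguished by the single word `age`, while `get_name` needs the pair `name` and `return`, because each of its words alone also appears in another method. `set_name` has no distinguishing subset at all, since all of its words occur in `get_name`. On a corpus of millions of methods an exhaustive search like this would be too slow, so it should be read as a definition made executable rather than a practical implementation.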

Author and article information

Journal

Publication date Created: 2015-02-04

Article

arXiv ID: 1502.01410

SO-VID: 9dd6fd50-6cb0-420b-92d0-d8d3a1c4c9cd

License:

http://arxiv.org/licenses/nonexclusive-distrib/1.0/

History

Custom metadata

Comments: 10 pages, under submission

Categories: cs.SE

ScienceOpen disciplines: Software engineering