      Ten simple rules for biologists learning to program

      editorial
      PLoS Computational Biology
      Public Library of Science


          Abstract

Introduction

As big data and multi-omics analyses are becoming mainstream, computational proficiency and literacy are essential skills in a biologist’s tool kit. All “omics” studies require computational biology: the implementation of analyses requires programming skills, while experimental design and interpretation require a solid understanding of the analytical approach. While academic cores, commercial services, and collaborations can aid in the implementation of analyses, the computational literacy required to design and interpret omics studies cannot be replaced or supplemented. However, many biologists are only trained in experimental techniques. We write these 10 simple rules for traditionally trained biologists, particularly graduate students interested in acquiring a computational skill set.

Rule 1: Begin with the end in mind

When picking your first language, focus on your goal. Do you want to become a programmer? Do you want to design bioinformatic tools? Do you want to implement tools? Do you want to just get these data analyzed already? Pick an approach and language that fits your long- and short-term goals. Languages vary in intent and usage. Each language and package was created to solve a particular problem, so there is no universal “best” language (Fig 1). Pick the right tool for the job by choosing a language that is well suited for the biological questions you want to ask. If many people in your field use a language, it likely works well for the types of problems you will encounter. If people in your field use a variety of languages, you have options. To evaluate ease of use, consider how much community support a language has and how many resources that community has created, such as prevalence of user development, package support (documentation and tutorials), and the language’s “presence” on help pages. Practically, languages vary in cost for academic and commercial use. Free languages are more amenable to open source work (i.e., sharing your analyses or packages). See Table 1 for a brief discussion of several programming languages, their key features, and where to learn more.

Fig 1. The “one tool to rule them all” (or: how programming languages do not work).

Table 1. A noninclusive discussion of programming languages.
A shell is a command line (i.e., programming) interface to an operating system, like Unix operating systems. Low-level programming languages deal with a computer’s hardware. The process of moving from the literal processor instructions toward human-readable applications is called “abstraction.” Low-level languages require little abstraction. Interpreted languages are quicker to test (e.g., to run a few lines of code); this facilitates learning through trial and error. Interpreted languages tend to be more human readable. Compiled languages are powerful because they are often more efficient and can be used for low-level tasks. However, the distinction between interpreted and compiled languages is not always rigid. All languages presented below are free unless noted otherwise. The Wikipedia page on programming languages provides a great overview and comparison of languages.
Bash
• Key features: Most common Unix shell; practical for execution of scripts written in all other languages; versatile; easy to delete files or make other drastic changes; weaknesses include executing math and limited data structures; default for macOS and most Linux distributions
• Documentation: gnu.org/software/bash/manual/; on macOS’s terminal, type “man <command>” to get the manual for any command (and “q” to exit the manual page)
• Sample tutorials: The Linux Documentation Project’s Beginner’s Guide: tldp.org/LDP/Bash-Beginners-Guide/html/; Ubuntu’s documentation: help.ubuntu.com/community/Beginners/BashScripting; Azet’s GitHub page: github.com/azet/community_bash_style_guide
• Community groups: Google Plus: plus.google.com/communities/110832059019676429606; GitHub community resources page: github.com/awesome-lists/awesome-bash

Python
• Key features: General purpose language; considered easy to learn due to readability; flexible syntax considered both a strength and weakness; interpreted language
• Documentation: docs.python.org
• Sample tutorials: Google’s Python class: developers.google.com/edu/python/; The Hitchhiker’s Guide to Python: docs.python-guide.org/
• Community groups: Python Users Groups: wiki.python.org/moin/LocalUserGroups; Python Special Interest Groups: python.org/community/sigs/

R
• Key features: Community involvement; application-focused development; easy to learn by coupling basic programming and applications; well-developed visualization; variable package quality; “tidy data” community; interpreted language
• Documentation: rdocumentation.org; r-project.org; cran.r-project.org
• Sample tutorials: R for cats: rforcats.net; books by Hadley Wickham: hadley.nz; R Tutorial’s introduction: r-tutor.com/r-introduction; Cyclismo’s R Tutorial: cyclismo.org/tutorial/R/
• Community groups: R-Ladies: rladies.org; R Users Groups: many

SAS
• Key features: Statistical computing; high-quality development of statistical functions by commercial and academic developers; domain-specific usage; free for students only; typically a compiled language
• Documentation: support.sas.com
• Sample tutorials: Boston University’s SAS Training for Statistics: bu.edu/stat/bu-student-chapter-of-the-asa/sas-training/
• Community groups: SAS User Groups: sas.com/en_us/connect/user-groups.html

MATLAB
• Key features: Well-developed applications in engineering; maintained professionally; interpreted language; discounted academic license
• Documentation: mathworks.com/help/matlab
• Sample tutorials: Cyclismo’s MATLAB Tutorial: cyclismo.org/tutorial/matlab/; for-purchase courses offered at matlabacademy.mathworks.com
• Community groups: MATLAB Central: mathworks.com/matlabcentral/

Perl
• Key features: General purpose language; handles text well; waning community involvement; syntax modelled after human language; interpreted language
• Documentation: perl.org; cpan.org
• Sample tutorials: Beginning Perl: perl.org/books/beginning-perl/; Perl Maven’s tutorial: perlmaven.com; Perl::Learn: learn.perl.org
• Community groups: Perl Mongers: pm.org; Perl Monks: perlmonks.org

Fortran
• Key features: Numeric computation; fast; often used for high-performance computing; limited development; compiled language
• Documentation: fortranwiki.org
• Sample tutorials: many at the Fortran wiki: fortranwiki.org/fortran/show/Tutorials
• Community groups: Fortran Friends: fortran.orpheusweb.co.uk

C/C++
• Key features: Low-level language; powerful, used for the source code of many other languages; challenging to learn as it requires explicit syntax; explicit syntax enforces good programming habits; compiled language
• Documentation: devdocs.io/c; cppreference.com
• Sample tutorials: C Programming’s tutorial: cprogramming.com/tutorial/; Learn-C’s web-based tutorial: learn-c.org
• Community groups: Standard C++ Foundation: isocpp.org; C/C++ Users Group (CUG): hal9k.com/cug

Rule 2: Baby steps are steps

Once you’ve begun, focus on one task at a time and apply your critical thinking and problem solving skills. This requires breaking a problem down into steps. Analyzing omics data may sound challenging, but the individual steps do not: e.g., read your data, decide how to interpret missing values, scale as needed, identify comparison conditions, divide to calculate fold change, calculate significance, correct for multiple testing. Break a large problem into modular tasks and implement one task at a time. Iteratively edit for efficiency, flow, and succinctness. Mistakes will happen. That’s ok; what matters is that you find, correct, and learn from them.
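As one way to make this concrete, the sketch below breaks the fold-change comparison mentioned above into one small, testable function per step. It is a minimal Python sketch, assuming pandas, SciPy, and statsmodels are available; the file layout and column names are hypothetical placeholders, not part of the original article.

# A minimal sketch of Rule 2: one small, testable step at a time.
# The CSV layout and column names below are hypothetical.
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests

def read_data(path):
    # Step 1: read the data, treating blank cells as missing values.
    return pd.read_csv(path, na_values=["", "NA"])

def drop_incomplete(df):
    # Step 2: decide how to interpret missing values (here: drop those rows).
    return df.dropna()

def log_scale(df, columns):
    # Step 3: scale as needed (log2 with a +1 offset).
    df[columns] = np.log2(df[columns] + 1)
    return df

def log2_fold_change(df, treated, control):
    # Step 4: divide (subtract on the log scale) to calculate fold change.
    return df[treated].mean(axis=1) - df[control].mean(axis=1)

def p_values(df, treated, control):
    # Step 5: calculate significance with a per-row two-sample t test.
    return stats.ttest_ind(df[treated], df[control], axis=1).pvalue

def adjust(pvals):
    # Step 6: correct for multiple testing (Benjamini-Hochberg).
    return multipletests(pvals, method="fdr_bh")[1]

Each function can be run, inspected, and corrected on its own before the steps are chained together into a full analysis.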
Rule 3: Immersion is the best learning tool

Don’t stitch together an analysis by switching between or among languages and/or point and click environments (Excel [Microsoft; https://www.microsoft.com/en-us/], etc.). While learning, if a job can be done in one language or environment, do it all there. For example, importing a spreadsheet of data (like you would view in Excel) is not necessarily straightforward; Excel automatically determines how to read text, but the method may differ from conventions in other programming languages. If the import process “misreads” your data (e.g., blank cells are not read as blank or “NA,” numbers are in quotes indicating that they are read as text, or column names are not maintained), it can be tempting to return to Excel to fix these with search-and-replace strategies. However, these problems can be fixed by correctly reading the data and by understanding the language’s data structures. Just like a spoken language [1, 2], immersion is the best learning tool [3, 4]. In addition to slowing the learning curve, transferring across programs induces error. See References [5–7] for additional Excel or word processing–induced errors. Eventually, you may identify tasks that are not well suited to the language you use. At that point, it may be helpful to pick up another language in order to use the right tool for the job (see Rule 1). In fact, understanding one language will make it easier to learn a second. Until then, however, focus on immersion to learn.
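As an illustration of fixing import problems at read time rather than round-tripping through Excel, here is a hedged Python sketch using pandas; the file name and column name are hypothetical.

import pandas as pd

# Read a spreadsheet export without going back to Excel to "fix" it:
# - header=0 keeps the first row as column names
# - na_values turns blank cells and common placeholders into proper missing values
df = pd.read_csv(
    "measurements.csv",           # hypothetical file exported from a spreadsheet
    header=0,
    na_values=["", "NA", "n/a"],
)

# A column that arrived as quoted text (numbers read as strings) can be
# converted explicitly instead of editing the file by hand:
df["intensity"] = pd.to_numeric(df["intensity"], errors="coerce")

print(df.dtypes)   # confirm each column has the type you expect
print(df.head())   # inspect the first rows instead of opening Excel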
Rule 4: Phone a friend

There are numerous online resources: tutorials, documentation, and sites intended for community Q and A (StackOverflow, StackExchange, Biostars, etc.), but nothing replaces a friend or colleague’s help. Find a community of programmers, ranging from beginning to experienced users, to ask for help. You may want to look for both technical support (i.e., a group centered around a language) and support regarding a particular scientific application (e.g., a group centered around omics analyses). Many universities have scientific computing groups, housed in the library or information technology (IT) department; these groups can be your starting point. If your lab or university does not have a community of programmers, seek them out virtually or locally. Coursera courses, for example, have comment boards for students to answer each other’s questions and learn from their peers. Organizations like Software and Data Carpentry or language user groups have mailing lists to connect members. Many cities have events organized by language-specific user groups or interest groups focused on big data, machine learning, or data visualization. These can be found through meetup.com, Google groups, or through a user group’s website; some are included in Table 1. Once you find a community, ask for help. At the beginning stages, in-person help to deconstruct or interpret an online answer is invaluable.

Additionally, ask a friend for code. You wouldn’t write a paper without first reading a lot of papers or begin a new project without shadowing a few experimenters. First, read their code. Implement and interpret, trying to understand each line. Return to discuss your questions. Once you begin writing, ask for edits.

Rule 5: Learn how to ask questions

There’s an answer to almost anything online, but you have to know what to ask to get help. In order to know what to ask, you have to understand the problem. Start by interpreting an error message. Watch for generic errors and learn from them. Identify which component of your error message indicates what the issue is and which component indicates where the issue is (Figs 2–5). Understanding the problem is essential; this process is called “debugging.” Without truly understanding the problem, any “solution” will ultimately propagate and escalate the mistake, making harder-to-interpret errors down the road. Once you understand the problem, look for answers. Looking for answers requires effective googling. Learn the vocabulary (and meta-vocabulary) of the language and its users. Once you understand the problem and have identified that there is no obvious (and publicly available) solution, ask for answers in programming communities (see Rule 4 and Table 1). When asking, paraphrase the fundamental problem. Include error messages and enough information to reproduce the problem (include packages, versions, data or sample data, code, etc.). Present a brief summary of what was done, what was intended, how you interpret the problem, what troubleshooting steps were already taken, and whether you have searched other posts for the answer. See the following website for suggestions: http://codereview.stackexchange.com/help/how-to-ask and [8]. End with a “thank you” and wait for the help to arrive.

Fig 2. Anatomy of an error message, Part 1 (or: How to write more than one line of code). Here we show an example of the debugging process in R using the RStudio environment, with the goal of concatenating two words.

Fig 3. Anatomy of an error message, Part 2 (or: Just because it works, doesn’t mean it’s right). Here we provide more examples of the debugging process. Examples shown in Figs 3–5 are conducted in Python using a Jupyter notebook. Environments like RStudio (in Fig 2) and Jupyter notebooks are two examples of integrated development environments; these environments offer additional support, including built-in debugging tools. First, we show an error that does not induce an error message, but the user must debug nonetheless.

Fig 4. Anatomy of an error message, Part 3 (or: Trace your way back to the problem). Here we show an explicit error message.

Fig 5. Anatomy of an error message, Part 4 (or: Debugging a solution). Lastly, we show how to debug a solution to understand a line of code found on the internet.
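In the same spirit as Figs 2–5, the short Python sketch below shows how a traceback separates what the issue is from where it occurred. The variable names and the exact message wording (recent Python 3 versions) are illustrative assumptions.

# Debugging sketch: reading an error message.
word = "rule"
count = 10

# The next line, if uncommented, raises (in recent Python 3 versions):
#   TypeError: can only concatenate str (not "int") to str
# The *what*: mixing str and int with "+".  The *where*: the traceback's
# "line ..." entry pointing back at this statement.
# label = word + count

# One fix, once the message is understood: convert explicitly.
label = word + " " + str(count)
print(label)   # -> "rule 10"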
Rule 6: Don’t reinvent the wheel

Rule 6 can also be found in “Ten Simple Rules for the Open Development of Scientific Software” [9], “Ten Simple Rules for Developing Public Biological Databases” [10], “Ten Simple Rules for Cultivating Open Science and Collaborative R&D” [11], and “Ten Simple Rules To Combine Teaching and Research” [12]. Use all resources available to you, including online tutorials, examples in the language’s documentation, published code, cool snippets of code your labmate shared, and, yes, your own work. Read widely to identify these resources. Copy-and-paste is your friend. Provide credit if appropriate (i.e., comment “adapted from so-n-so’s X script”) or necessary (e.g., read through details on software licenses). Document your scripts by commenting in notes to yourself so that you can use old code as a template for future work. These comments will help you remember what each line of code intends to do, accelerating your ability to find mistakes.

Rule 7: Develop good habits early on

Computational research is research, so use your best practices. This includes maintaining a computational lab notebook and documenting your code. A computational lab notebook is by definition a lab notebook: your lab notebook includes protocols, so your computational lab notebook should include protocols, too. Computational protocols are scripts, and these should include the code itself and how to access everything needed to implement the code. Include input (raw data) and output (results), too. Figures and interpretation can be included if that’s how you organize your lab notebook. Develop computational “place habits” (file-saving strategies). It is easier to organize one drawer than it is to organize a whole lab, so start as soon as you begin to learn to program. If you can find that experiment you did on June 12, 2011—its protocol and results—in under five minutes, you should be able to find that figure you generated for lab meeting three weeks ago, complete with code and data, in under five minutes as well. This requires good version control or documentation of your work. Like with protocols, each time you run a script, you should note any modifications that are made. Document all changes in experimental and computational protocols. These habits will make you more efficient by enhancing your work’s reproducibility. For specific advice, see “Ten Simple Rules for a Computational Biologist’s Laboratory Notebook” [13], “Ten Simple Rules for Reproducible Computational Research” [14], and “Ten Simple Rules for Taking Advantage of Git and GitHub” [15].

Rule 8: Practice makes perfect

Use toy datasets to practice a problem or analysis. Biological data get big, fast. It’s hard to find the computational needle-in-a-haystack, so set yourself up to succeed by practicing in controlled environments with simpler examples. Generate small toy datasets that use the same structure as your data. Make the toy data simple enough to predict how the numbers, text, etc., should react in your analysis. Test to ensure they do react as expected. This will help you understand what is being done in each step and troubleshoot errors, preparing you to scale up to large, unpredictable datasets. Use these datasets to test your approach, your implementation, and your interpretation. Toy datasets are your negative control, allowing you to differentiate between negative results and simulation failure.
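One way to put this into practice is sketched below: a tiny Python toy dataset whose correct answer is known in advance, with checks that the analysis step behaves as predicted. The names and values are hypothetical.

import numpy as np
import pandas as pd

# A toy dataset small enough to predict the answer by hand:
# two "genes", two conditions, and a known two-fold difference.
toy = pd.DataFrame({
    "control": [10.0, 5.0],
    "treated": [20.0, 5.0],
}, index=["gene_up", "gene_flat"])

log2_fc = np.log2(toy["treated"] / toy["control"])

# Because the data were constructed by hand, the expected results are known:
assert np.isclose(log2_fc["gene_up"], 1.0)    # doubled -> log2 fold change of 1
assert np.isclose(log2_fc["gene_flat"], 0.0)  # unchanged -> log2 fold change of 0
print(log2_fc)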
Rule 9: Teach yourself

How would you teach you if you were another person? You would teach with a little more patience and a bit more empathy than you are practicing now. You are not alone in your occasional frustration (Fig 6). Learning takes time, so plan accordingly. Introductory courses are helpful to learn the basics because the basics are easy to neglect in self-study. Articulate clear expectations for yourself and benchmarks for success. Apply some of the structure (deadlines, assignments, etc.) you would provide a student to help motivate and evaluate your progress. If something isn’t working, adjust; not everyone learns best by any one approach. Explore tutorials, online classes, workshops, books like Practical Computing for Biologists [16], local programming meetups, etc., to find your preferred approach.

Fig 6. “How to exit the vim editor?” (or: We all get stuck at some point). Now viewed >1.33 million times; see http://stackoverflow.com/questions/11828270/how-to-exit-the-vim-editor.

Rule 10: Just do it

Just start coding. You can’t edit a blank page. Learning to program can be intimidating. The power and freedom provided in conducting your own computational analyses bring many decision points, and each decision brings more room for mistakes. Furthermore, evaluating your work is less black-and-white than for some experiments. However, coding has the benefit that failure is risk free. No resources are wasted—not money, time (a student’s job is to learn!), or a scientific reputation. In silico, the playing field is leveled by hard work and conscientiousness. So, while programming can be intimidating, the most intimidating step is starting.

Conclusion

Markowetz recently wrote, “Computational biologists are just biologists using a different tool” [17]. If you are a traditionally trained biologist, we intend these 10 simple rules as instruction (and pep talk) to learn a new, powerful, and exciting tool. The learning curve can be steep; however, the effort will pay dividends. Computational experience will make you more marketable as a scientist (see “Top N Reasons To Do A Ph.D. or Post-Doc in Bioinformatics/Computational Biology” [18]). Computational research has fewer overhead costs and reduces the barrier to entry in transitioning fields [19], opening career doors to interested researchers. Perhaps most importantly, programming skills will make you better able to implement and interpret your own analyses and understand and respect analytical biases, making you a better experimentalist as well. Therefore, the time you spend at your computer is valuable. Acquiring programming expertise will make you a better biologist.

          Related collections

          Most cited references (13)


          Ten Simple Rules for Reproducible Computational Research

Replication is the cornerstone of a cumulative science [1]. However, new tools and technologies, massive amounts of data, interdisciplinary approaches, and the complexity of the questions being asked are complicating replication efforts, as are increased pressures on scientists to advance their research [2]. As full replication of studies on independently collected data is often not feasible, there has recently been a call for reproducible research as an attainable minimum standard for assessing the value of scientific claims [3]. This requires that papers in experimental science describe the results and provide a sufficiently clear protocol to allow successful repetition and extension of analyses based on original data [4].

The importance of replication and reproducibility has recently been exemplified through studies showing that scientific papers commonly leave out experimental details essential for reproduction [5], studies showing difficulties with replicating published experimental results [6], an increase in retracted papers [7], and through a high number of failing clinical trials [8], [9]. This has led to discussions on how individual researchers, institutions, funding bodies, and journals can establish routines that increase transparency and reproducibility. In order to foster such aspects, it has been suggested that the scientific community needs to develop a “culture of reproducibility” for computational science, and to require it for published claims [3].

We want to emphasize that reproducibility is not only a moral responsibility with respect to the scientific field, but that a lack of reproducibility can also be a burden for you as an individual researcher. As an example, a good practice of reproducibility is necessary in order to allow previously developed methodology to be effectively applied on new data, or to allow reuse of code and results for new projects. In other words, good habits of reproducibility may actually turn out to be a time-saver in the longer run. We further note that reproducibility is just as much about the habits that ensure reproducible research as the technologies that can make these processes efficient and realistic.

Each of the following ten rules captures a specific aspect of reproducibility, and discusses what is needed in terms of information handling and tracking of procedures. If you are taking a bare-bones approach to bioinformatics analysis, i.e., running various custom scripts from the command line, you will probably need to handle each rule explicitly. If you are instead performing your analyses through an integrated framework (such as GenePattern [10], Galaxy [11], LONI pipeline [12], or Taverna [13]), the system may already provide full or partial support for most of the rules. What is needed on your part is then merely the knowledge of how to exploit these existing possibilities.

In a pragmatic setting, with publication pressure and deadlines, one may face the need to make a trade-off between the ideals of reproducibility and the need to get the research out while it is still relevant. This trade-off becomes more important when considering that a large part of the analyses being tried out never end up yielding any results. However, frequently one will, with the wisdom of hindsight, contemplate the missed opportunity to ensure reproducibility, as it may already be too late to take the necessary notes from memory (or at least much more difficult than to do it while underway). We believe that the rewards of reproducibility will compensate for the risk of having spent valuable time developing an annotated catalog of analyses that turned out as blind alleys.

As a minimal requirement, you should at least be able to reproduce the results yourself. This would satisfy the most basic requirements of sound research, allowing any substantial future questioning of the research to be met with a precise explanation. Although it may sound like a very weak requirement, even this level of reproducibility will often require a certain level of care in order to be met. There will for a given analysis be an exponential number of possible combinations of software versions, parameter values, pre-processing steps, and so on, meaning that a failure to take notes may make exact reproduction essentially impossible. With this basic level of reproducibility in place, there is much more that can be wished for. An obvious extension is to go from a level where you can reproduce results in case of a critical situation to a level where you can practically and routinely reuse your previous work and increase your productivity. A second extension is to ensure that peers have a practical possibility of reproducing your results, which can lead to increased trust in, interest for, and citations of your work [6], [14].

We here present ten simple rules for reproducibility of computational research. These rules can be at your disposal for whenever you want to make your research more accessible—be it for peers or for your future self.
Rule 1: For Every Result, Keep Track of How It Was Produced

Whenever a result may be of potential interest, keep track of how it was produced. When doing this, one will frequently find that getting from raw data to the final result involves many interrelated steps (single commands, scripts, programs). We refer to such a sequence of steps, whether it is automated or performed manually, as an analysis workflow. While the essential part of an analysis is often represented by only one of the steps, the full sequence of pre- and post-processing steps are often critical in order to reach the achieved result. For every involved step, you should ensure that every detail that may influence the execution of the step is recorded. If the step is performed by a computer program, the critical details include the name and version of the program, as well as the exact parameters and inputs that were used.

Although manually noting the precise sequence of steps taken allows for an analysis to be reproduced, the documentation can easily get out of sync with how the analysis was really performed in its final version. By instead specifying the full analysis workflow in a form that allows for direct execution, one can ensure that the specification matches the analysis that was (subsequently) performed, and that the analysis can be reproduced by yourself or others in an automated way. Such executable descriptions [10] might come in the form of simple shell scripts or makefiles [15], [16] at the command line, or in the form of stored workflows in a workflow management system [10], [11], [13], [17], [18]. As a minimum, you should at least record sufficient details on programs, parameters, and manual procedures to allow yourself, in a year or so, to approximately reproduce the results.
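A minimal way to approach this without a workflow manager is a small driver script that both executes the steps and writes down exactly what was run. The Python sketch below is one possible version, under the assumption of two hypothetical command-line tools and placeholder file names.

import json
import subprocess
import sys
from datetime import datetime, timezone

# Hypothetical two-step workflow; every detail needed to rerun it is
# captured in a machine-readable log stored next to the results.
steps = [
    ["trim_reads", "--quality", "20", "--in", "raw.fastq", "--out", "trimmed.fastq"],
    ["align_reads", "--genome", "ref.fa", "--in", "trimmed.fastq", "--out", "aligned.bam"],
]

log = {
    "python": sys.version,
    "started": datetime.now(timezone.utc).isoformat(),
    "steps": steps,
}

for cmd in steps:
    subprocess.run(cmd, check=True)   # fail loudly if a step breaks

with open("workflow_log.json", "w") as fh:
    json.dump(log, fh, indent=2)

Because the script itself is the record of what was run, the documentation cannot drift away from the analysis that was actually performed.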
Rule 2: Avoid Manual Data Manipulation Steps

Whenever possible, rely on the execution of programs instead of manual procedures to modify data. Such manual procedures are not only inefficient and error-prone, they are also difficult to reproduce. If working at the UNIX command line, manual modification of files can usually be replaced by the use of standard UNIX commands or small custom scripts. If working with integrated frameworks, there will typically be a quite rich collection of components for data manipulation. As an example, manual tweaking of data files to attain format compatibility should be replaced by format converters that can be reenacted and included into executable workflows. Other manual operations like the use of copy and paste between documents should also be avoided. If manual operations cannot be avoided, you should as a minimum note down which data files were modified or moved, and for what purpose.
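For instance, a one-off manual edit such as renaming a column and changing the delimiter of an exported file can be replaced by a tiny converter that can be rerun and recorded in the workflow. A hedged Python sketch, with hypothetical file and column names:

import csv

# A re-runnable replacement for a manual copy-and-paste edit: convert a
# hypothetical tab-separated export into the comma-separated layout a
# downstream tool expects, renaming the columns on the way.
with open("export.tsv", newline="") as src, open("input.csv", "w", newline="") as dst:
    reader = csv.DictReader(src, delimiter="\t")
    writer = csv.DictWriter(dst, fieldnames=["sample_id", "value"])
    writer.writeheader()
    for row in reader:
        writer.writerow({"sample_id": row["Sample"], "value": row["Measurement"]})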
Rule 3: Archive the Exact Versions of All External Programs Used

In order to exactly reproduce a given result, it may be necessary to use programs in the exact versions used originally. Also, as both input and output formats may change between versions, a newer version of a program may not even run without modifying its inputs. Even having noted which version was used of a given program, it is not always trivial to get hold of a program in anything but the current version. Archiving the exact versions of programs actually used may thus save a lot of hassle at later stages. In some cases, all that is needed is to store a single executable or source code file. In other cases, a given program may again have specific requirements to other installed programs/packages, or dependencies to specific operating system components. To ensure future availability, the only viable solution may then be to store a full virtual machine image of the operating system and program. As a minimum, you should note the exact names and versions of the main programs you use.

Rule 4: Version Control All Custom Scripts

Even the slightest change to a computer program can have large intended or unintended consequences. When a continually developed piece of code (typically a small script) has been used to generate a certain result, only that exact state of the script may be able to produce that exact output, even given the same input data and parameters. As also discussed for Rules 3 and 6, exact reproduction of results may in certain situations be essential. If computer code is not systematically archived along its evolution, backtracking to a code state that gave a certain result may be a hopeless task. This can cast doubt on previous results, as it may be impossible to know if they were partly the result of a bug or otherwise unfortunate behavior. The standard solution to track evolution of code is to use a version control system [15], such as Subversion, Git, or Mercurial. These systems are relatively easy to set up and use, and may be used to systematically store the state of the code throughout development at any desired time granularity. As a minimum, you should archive copies of your scripts from time to time, so that you keep a rough record of the various states the code has taken during development.

Rule 5: Record All Intermediate Results, When Possible in Standardized Formats

In principle, as long as the full process used to produce a given result is tracked, all intermediate data can also be regenerated. In practice, having easily accessible intermediate results may be of great value. Quickly browsing through intermediate results can reveal discrepancies toward what is assumed, and can in this way uncover bugs or faulty interpretations that are not apparent in the final results. Secondly, it more directly reveals consequences of alternative programs and parameter choices at individual steps. Thirdly, when the full process is not readily executable, it allows parts of the process to be rerun. Fourthly, when reproducing results, it allows any experienced inconsistencies to be tracked to the steps where the problems arise. Fifthly, it allows critical examination of the full process behind a result, without the need to have all executables operational. When possible, store such intermediate results in standardized formats. As a minimum, archive any intermediate result files that are produced when running an analysis (as long as the required storage space is not prohibitive).

Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds

Many analyses and predictions include some element of randomness, meaning the same program will typically give slightly different results every time it is executed (even when receiving identical inputs and parameters). However, given the same initial seed, all random numbers used in an analysis will be equal, thus giving identical results every time it is run. There is a large difference between observing that a result has been reproduced exactly or only approximately. While achieving equal results is a strong indication that a procedure has been reproduced exactly, it is often hard to conclude anything when achieving only approximately equal results. For analyses that involve random numbers, this means that the random seed should be recorded. This allows results to be reproduced exactly by providing the same seed to the random number generator in future runs. As a minimum, you should note which analysis steps involve randomness, so that a certain level of discrepancy can be anticipated when reproducing the results.
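In Python with NumPy (1.17 or later), following this rule can be as simple as creating the random number generator from an explicitly recorded seed, as in this sketch; the seed value itself is arbitrary.

import numpy as np

SEED = 20171207                      # record this value alongside the results
rng = np.random.default_rng(SEED)    # seeded generator

# Any analysis that draws from `rng` (permutation tests, subsampling,
# simulations, ...) now gives identical results on every rerun.
permutation = rng.permutation(10)
print(SEED, permutation)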
Rule 7: Always Store Raw Data behind Plots

From the time a figure is first generated to it being part of a published article, it is often modified several times. In some cases, such modifications are merely visual adjustments to improve readability, or to ensure visual consistency between figures. If raw data behind figures are stored in a systematic manner, so as to allow raw data for a given figure to be easily retrieved, one can simply modify the plotting procedure, instead of having to redo the whole analysis. An additional advantage of this is that if one really wants to read fine values in a figure, one can consult the raw numbers. In cases where plotting involves more than a direct visualization of underlying numbers, it can be useful to store both the underlying data and the processed values that are directly visualized. An example of this is the plotting of histograms, where both the values before binning (original data) and the counts per bin (heights of visualized bars) could be stored. When plotting is performed using a command-based system like R, it is convenient to also store the code used to make the plot. One can then apply slight modifications to these commands, instead of having to specify the plot from scratch. As a minimum, one should note which data formed the basis of a given plot and how this data could be reconstructed.
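One lightweight way to follow this rule when plotting from Python is to write both the original values and the binned counts that are actually drawn next to the figure file. A hedged sketch with matplotlib and placeholder file names:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

values = np.random.default_rng(1).normal(size=1000)   # stand-in data

# For a histogram, store the values before binning and the counts per bin
# (the heights of the visualized bars), as suggested above.
counts, edges = np.histogram(values, bins=30)
pd.DataFrame({"value": values}).to_csv("figure1_raw_values.csv", index=False)
pd.DataFrame({"bin_left": edges[:-1], "count": counts}).to_csv("figure1_bins.csv", index=False)

plt.hist(values, bins=edges)
plt.xlabel("value")
plt.ylabel("count")
plt.savefig("figure1.png", dpi=300)

Keeping this plotting script under version control then also satisfies the suggestion to store the code used to make the plot.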
Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected

The final results that make it to an article, be it plots or tables, often represent highly summarized data. For instance, each value along a curve may in turn represent averages from an underlying distribution. In order to validate and fully understand the main result, it is often useful to inspect the detailed values underlying the summaries. A common but impractical way of doing this is to incorporate various debug outputs in the source code of scripts and programs. When the storage context allows, it is better to simply incorporate permanent output of all underlying data when a main result is generated, using a systematic naming convention to allow the full data underlying a given summarized value to be easily found. We find hypertext (i.e., html file output) to be particularly useful for this purpose. This allows summarized results to be generated along with links that can be very conveniently followed (by simply clicking) to the full data underlying each summarized value. When working with summarized results, you should as a minimum at least once generate, inspect, and validate the detailed values underlying the summaries.

Rule 9: Connect Textual Statements to Underlying Results

Throughout a typical research project, a range of different analyses are tried and interpretation of the results made. Although the results of analyses and their corresponding textual interpretations are clearly interconnected at the conceptual level, they tend to live quite separate lives in their representations: results usually live on a data area on a server or personal computer, while interpretations live in text documents in the form of personal notes or emails to collaborators. Such textual interpretations are not generally mere shadows of the results—they often involve viewing the results in light of other theories and results. As such, they carry extra information, while at the same time having their necessary support in a given result. If you want to reevaluate your previous interpretations, or allow peers to make their own assessment of claims you make in a scientific paper, you will have to connect a given textual statement (interpretation, claim, conclusion) to the precise results underlying the statement. Making this connection when it is needed may be difficult and error-prone, as it may be hard to locate the exact result underlying and supporting the statement from a large pool of different analyses with various versions. To allow efficient retrieval of details behind textual statements, we suggest that statements are connected to underlying results already from the time the statements are initially formulated (for instance in notes or emails). Such a connection can for instance be a simple file path to detailed results, or the ID of a result in an analysis framework, included within the text itself. For an even tighter integration, there are tools available to help integrate reproducible analyses directly into textual documents, such as Sweave [19], the GenePattern Word add-in [4], and Galaxy Pages [20]. These solutions can also subsequently be used in connection with publications, as discussed in the next rule. As a minimum, you should provide enough details along with your textual interpretations so as to allow the exact underlying results, or at least some related results, to be tracked down in the future.

Rule 10: Provide Public Access to Scripts, Runs, and Results

Last, but not least, all input data, scripts, versions, parameters, and intermediate results should be made publicly and easily accessible. Various solutions have now become available to make data sharing more convenient, standardized, and accessible in particular domains, such as for gene expression data [21]–[23]. Most journals allow articles to be supplemented with online material, and some journals have initiated further efforts for making data and code more integrated with publications [3], [24]. As a minimum, you should submit the main data and source code as supplementary material, and be prepared to respond to any requests for further data or methodology details by peers. Making reproducibility of your work by peers a realistic possibility sends a strong signal of quality, trustworthiness, and transparency. This could increase the quality and speed of the reviewing process on your work, the chances of your work getting published, and the chances of your work being taken further and cited by other researchers after publication [25].

            Ten Simple Rules for Taking Advantage of Git and GitHub

Introduction

Bioinformatics is a broad discipline in which one common denominator is the need to produce and/or use software that can be applied to biological data in different contexts. To enable and ensure the replicability and traceability of scientific claims, it is essential that the scientific publication, the corresponding datasets, and the data analysis are made publicly available [1,2]. All software used for the analysis should be either carefully documented (e.g., for commercial software) or, better yet, openly shared and directly accessible to others [3,4]. The rise of openly available software and source code alongside concomitant collaborative development is facilitated by the existence of several code repository services such as SourceForge, Bitbucket, GitLab, and GitHub, among others. These resources are also essential for collaborative software projects because they enable the organization and sharing of programming tasks between different remote contributors. Here, we introduce the main features of GitHub, a popular web-based platform that offers a free and integrated environment for hosting the source code, documentation, and project-related web content for open-source projects. GitHub also offers paid plans for private repositories (see Box 1) for individuals and businesses as well as free plans including private repositories for research and educational use.

Box 1
By default, GitHub repositories are freely visible to all. Many projects decide to share their work publicly and openly from the start of the project in order to attract visibility and to benefit from contributions from the community early on. Some other groups prefer to work privately on projects until they are ready to share their work. Private repositories ensure that work is hidden but also limit collaborations to just those users who are given access to the repository. These repositories can then be made public at a later stage, such as, for example, upon submission, acceptance, or publication of corresponding journal articles. In some cases, when the collaboration was exclusively meant to be private, some repositories might never be made publicly accessible.

GitHub relies, at its core, on the well-known and open-source version control system Git, originally designed by Linus Torvalds for the development of the Linux kernel and now developed and maintained by the Git community. One reason for GitHub’s success is that it offers more than a simple source code hosting service [5,6]. It provides developers and researchers with a dynamic and collaborative environment, often referred to as a social coding platform, that supports peer review, commenting, and discussion [7]. A diverse range of efforts, ranging from individual to large bioinformatics projects, laboratory repositories, as well as global collaborations, have found GitHub to be a productive place to share code and ideas and to collaborate (see Table 1).

Table 1. Bioinformatics repository examples with good practices of using GitHub. The table contains the name of the repository, the type of example (issue tracking, branch structure, unit tests), and the URL of the example. All URLs are prefixed with https://github.com/.
• Adam (Community Project, Multiple forks): https://github.com/bigdatagenomics/adam
• BioPython [18] (Community Project, Multiple contributors): https://github.com/biopython/biopython/graphs/contributors
• Computational Proteomics Unit (Lab Repository): https://github.com/ComputationalProteomicsUnit
• Galaxy Project [19] (Community Project, Bioinformatics Repository): https://github.com/galaxyproject/galaxy
• GitHub Paper (Manuscript, Issue discussion, Community Project): https://github.com/ypriverol/github-paper
• MSnbase [20] (Individual project repository): https://github.com/lgatto/MSnbase/
• OpenMS [21] (Bioinformatics Repository, Issue discussion, branches): https://github.com/OpenMS/OpenMS/issues/1095
• PRIDE Inspector Toolsuite [22] (Project Organization, Multiple projects): https://github.com/PRIDE-Toolsuite
• Retinal wave data repository [23] (Individual project, Manuscript, Binary Data organized): https://github.com/sje30/waverepo
• SAMtools [24] (Bioinformatics Repository, Project Organization): https://github.com/samtools
• rOpenSci (Community Project, Issue discussion): https://github.com/ropensci
• The Global Alliance For Genomics and Health (Community Project): https://github.com/ga4gh

Some of the recommendations outlined below are broadly applicable to repository hosting services. However, our main aim is to highlight specific GitHub features. We provide a set of recommendations that we believe will help the reader to take full advantage of GitHub’s features for managing and promoting projects in bioinformatics as well as in many other research domains. The recommendations are ordered to reflect a typical development process: learning Git and GitHub basics, collaboration, use of branches and pull requests, labeling and tagging of code snapshots, tracking project bugs and enhancements using issues, and dissemination of the final results.

Rule 1: Use GitHub to Track Your Projects

The backbone of GitHub is the distributed version control system Git. Every change, from fixing a typo to a complete redesign of the software, is tracked and uniquely identified. Although Git has a complex set of commands and can be used for rather complex operations, learning to apply the basics requires only a handful of new concepts and commands and will provide a solid ground to efficiently track code and related content for research projects. Many introductory and detailed tutorials are available (see Table 2 below for a few examples). In particular, we recommend A Quick Introduction to Version Control with Git and GitHub by Blischak et al. [5].

Table 2. Online courses, tutorials, and workshops about GitHub and Git for scientists.
• Git help and git help -a: documentation installed with Git
• Karl Broman’s Git/GitHub Guide: http://kbroman.org/github_tutorial/
• Version Control with Git: http://swcarpentry.github.io/git-novice/
• Introduction to Git: http://git-scm.com/book/ch1-3.html
• GitHub Training: https://training.github.com/
• GitHub Guides: https://guides.github.com/
• Good Resources for Learning Git and GitHub: https://help.github.com/articles/good-resources-for-learning-git-and-github/
• Software Carpentry: Version Control with Git: http://swcarpentry.github.io/git-novice/

In a nutshell, initializing a (local) repository (often abbreviated as repo) marks a directory as one to be tracked (Fig 1). All or parts of its content can be added explicitly to the list of files to track.
Fig 1. The structure of a GitHub-based project, illustrating project structure and interactions with the community.

cd project                        ## move into the directory to be tracked
git init                          ## initialize the local repository
## add individual files such as project description, reports, source code
git add README project.md code.R
git commit -m "initial commit"    ## saves the current local snapshot

Subsequently, every change to the tracked files, once committed, will be recorded as a new revision, or snapshot, uniquely identifying the changes in all the modified files. Git is remarkably effective and efficient in archiving the complete history of a project by, among other things, storing only the differences between files. In addition to local copies of the repository, it is straightforward to create remote repositories on GitHub (called origin, with default branch master—see below) using the web interface, and then synchronize local and remote repositories.

git push origin master            ## push local changes to the remote repository
git pull origin master            ## pull remote changes into the local repository

Following Tony Rossini’s advice in 2005 to “commit early, commit often, and commit in a repository from which we can easily roll-back your mistakes,” one can organize one’s work in small incremental changes. At any time, it is possible to go back to a previous version. In larger projects, multiple users are able to work on the same remote repository, with all contributions being recorded, restorable, and attributed to the author. Users usually track source code, text files, images, and small data files inside their repositories and generally do not track derived files such as build logs or compiled binaries (read Box 2 to see how to handle large binary files in GitHub). And, although the majority of GitHub repositories are used for software development, users can also keep text documents such as analysis reports and manuscripts (see, for example, the repository for this manuscript at https://github.com/ypriverol/github-paper).

Box 2
Using GitHub or any similar versioning/tracking system is not a replacement for good project management; it is an extension, an improvement to good project and file management (see for example [9]). One practical consideration when using GitHub, for example, is dealing with large binary files. Binary files such as images, videos, executable files, or many of the raw data files used in bioinformatics are stored as single large entities in Git. As a result, every change, even if minimal, leads to a complete new copy of the file in the repository, producing large size increments and the inability to search (see https://help.github.com/articles/searching-code/) and compare file content across revisions. Git offers a Large File Storage (LFS) module that replaces such large files with pointers while the large binary file can be stored remotely, which results in smaller and faster repositories. Git LFS is also supported by GitHub, albeit with a space quota or for a fee, to retain your usual GitHub workflow (https://help.github.com/categories/managing-large-files/) (S1 File, Section 1).

Due to its distributed design, each up-to-date local Git repository is an entire exact historical copy of everything that was committed—file changes, commit message logs, etc. These copies act as independent backups as well, present on each user’s storage device. Git can be considered to be fault-tolerant because of this, which is a win over centralized version control systems.
If the remote GitHub server is unavailable, collaboration and work can continue between users, as opposed to centralized alternatives. The web interface offered by GitHub provides friendly tools to perform many basic operations and a gentle introduction to a richer and more complex set of functionalities. Various graphical user-interface-driven clients for managing Git and GitHub repositories are also available (https://www.git-scm.com/downloads/guis). Many editors and development environments, such as the popular RStudio editor for the R programming language [8], directly integrate with code versioning using Git and GitHub. In addition, for remote Git repositories, GitHub provides its own features that will be described in subsequent rules (Fig 1).

Rule 2: GitHub for Single Users, Teams, and Organizations

Public projects on GitHub are visible to everyone, but write permission, i.e., the ability to directly modify the content of a repository, needs to be granted explicitly. As a repository owner, you can grant this right to other GitHub users. In addition to being owned by users, repositories can also be created and managed as part of teams and organizations. Project managers can structure projects to manage permissions at different levels: users, teams, and organizations. Users are the central element of GitHub as in any other social network. Every user has a profile listing their GitHub projects and activities, which can optionally be populated with personal information including name, email address, image, and webpage. To stay up to date with the activity of other users, one can follow their accounts (see also Rule 10). Collaboration can be achieved by simply adding a trusted Collaborator, thereby granting write access. However, development in large projects is usually done by teams of people within a larger organization. GitHub organizations are a great way to manage team-based access permissions for the individual projects of institutes, research labs, and large open-source projects that need multiple owners and administrators (Fig 1). We recommend that you, as an individual researcher, make your profile visible to other users and display all of the projects and organizations you are working in.

Rule 3: Developing and Collaborating on New Features: Branching and Forking

Anyone with a GitHub account can fork any repository they have access to. This will create a complete copy of the content of the repository, while retaining a link to the original “upstream” version. One can then start working on the same code base in one’s own fork (https://help.github.com/articles/fork-a-repo/) under their username (see, for example, https://github.com/ypriverol/github-paper/network/members for this work) or organization (see Rule 2). Forking a repository allows users to freely experiment with changes without affecting the original project and forms the basis of social coding. It allows anyone to develop and test novel features with existing code and offers the possibility of contributing novel features, bug fixes, and improvements to documentation back into the original upstream project repository (requested by opening a pull request) and becoming a contributor. Forking a repository and providing pull requests constitutes a simple method for collaboration inside loosely defined teams and over more formal organizational boundaries, with the original repository owner(s) retaining control over which external contributions are accepted.
Once a pull request is opened for review and discussion, it usually results in additional insights and increased code quality [7]. Many contributors can work on the same repository at the same time without running into edit conflicts. There are multiple strategies for this, and the most common way is to use Git branches to separate different lines of development. Active development is often performed on a development branch, and stable versions, i.e., those used for a software release, are kept in a master or release branch (see for example https://github.com/OpenMS/OpenMS/branches). In practice, developers often work concurrently on one or several features or improvements. To keep commits of the different features logically separated, distinct branches are typically used. Later, when development is complete and verified to work (i.e., none of the tests fail, see Rule 5), new features can be merged back into the development line or master branch. In addition, one can always pull the currently up-to-date master branch into a feature branch to adapt the feature to the changes in the master branch.

When developing different features in parallel, there is a risk of applying incompatible changes in different branches/forks; these are said to become out of sync. Branches are just short-term departures from master. If you pull frequently, you will keep your copy of the repository up to date and you will have the opportunity to merge your changed code with other contributors’ changes, ideally without requiring you to manually address conflicts to bring the branches in sync again.

Rule 4: Naming Branches and Commits: Tags and Semantic Versions

Tags can be used to label versions during the development process. Version numbering should follow “semantic versioning” practice, with the format X.Y.Z, with X being the major, Y the minor, and Z the patch version of the release, including possible meta information, as described in http://semver.org/. This semantic versioning scheme provides users with coherent version numbers that document the extent (bug fixes or new functionality) and backwards compatibility of new releases. Correct labeling allows developers and users to easily recover older versions, compare them, or simply use them to reproduce results described in publications (see Rule 8). This approach also helps to define a coherent software publication strategy.
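As a small illustration of why the X.Y.Z format is convenient, the Python sketch below parses hypothetical tag names into (major, minor, patch) tuples and compares them correctly, which plain string comparison does not; a real project might instead rely on a dedicated library such as the semver package.

def parse_semver(tag):
    # "v1.2.10" -> (1, 2, 10): major, minor, patch as integers
    major, minor, patch = tag.lstrip("v").split(".")[:3]
    return int(major), int(minor), int(patch)

print(parse_semver("v1.2.10") > parse_semver("v1.2.9"))  # True: 10 > 9 at the patch level
print("v1.2.10" > "v1.2.9")                              # False: string comparison misleads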
Rule 5: Let GitHub Do Some Tasks for You: Integrate

The first rule of software development is that the code needs to be ready to use as soon as possible [10], to remain so during development, and that it should be well-documented and tested. In 2005, Martin Fowler defined the basic principles for continuous integration in software development [11]. These principles have become the main reference for best practices in continuous integration, providing the framework needed to deploy software and, in some way, also data. In addition to mere error-free execution, dedicated code testing is aimed at detecting possible bugs introduced by new features or changes in the code or dependencies, as well as detecting wrong results, often known as logic errors, in which the source code produces a different result than what was intended. Continuous integration provides a way to automatically and systematically run a series of tests to check integrity and performance of code, a task that can be automated through GitHub.

GitHub offers a set of hooks (automatically executed scripts) that are run after each push to a repository, making it easier to follow the basic principles of continuous integration. The GitHub web hooks allow third-party platforms to access and interact with a GitHub repository and thus to automate post-processing tasks. Continuous integration can be achieved by Travis CI, a hosted continuous integration platform that is free for all open-source projects. Travis CI builds and tests the source code using a plethora of options such as different platforms and interpreter versions (S1 File, Section 2). In addition, it offers notifications that allow your team and contributors to know if the new changes work and to prevent the introduction of errors in the code (for instance, when merging pull requests), making the repository always ready to use.

Rule 6: Let GitHub Do More Tasks for You: Automate

More than just code compilation and testing can be integrated into your software project: GitHub hooks can be used to automate numerous tasks to help improve the overall quality of your project. An important complement to successful test completion is to demonstrate that the tests sufficiently cover the existing code base. For this, the integration of Codecov is recommended. This service will report how much of the code base and which lines of code are being executed as part of your code tests. The Bioconductor project, for example, highly recommends that packages implement unit testing (S1 File, Section 2) to support developers in their package development and maintenance (http://bioconductor.org/developers/unitTesting-guidelines/) and systematically tests the coverage of all of its packages (https://codecov.io/github/Bioconductor-mirror/). One might also consider generating the documentation upon code/documentation modification (S1 File, Section 3). This implies that your projects provide comprehensive documentation so others can understand and contribute back to them. For Python or C/C++ code, automatic documentation generation can be done using Sphinx and subsequently integrated into GitHub using “Read the Docs.” All of these platforms will create reports and badges (sometimes called shields) that can be included on your GitHub project page, helping to demonstrate that the content is of high quality and well-maintained.
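What a continuous integration or coverage service actually runs is simply the project’s own test suite. As a hedged illustration, a minimal Python test file that such a service could execute on every push might look like this; the function, file name, and values are hypothetical, and the tests can be run locally with a runner such as pytest.

# test_fold_change.py -- a minimal unit test file that a continuous
# integration service could run automatically on every push.
import math

def log2_fold_change(treated, control):
    # Toy function under test; in a real project this lives in the package.
    return math.log2(treated / control)

def test_doubling_gives_fold_change_of_one():
    assert math.isclose(log2_fold_change(20.0, 10.0), 1.0)

def test_no_change_gives_fold_change_of_zero():
    assert math.isclose(log2_fold_change(5.0, 5.0), 0.0)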
Labels, milestones, and assignees help developers to filter and prioritize tasks and turn an issue tracker into a planning tool for their project. Repository administrators can also create issue and pull request templates (https://help.github.com/articles/helping-people-contribute-to-your-project/) (see Rule 3) to customize and standardize the information contributors are asked to include when they open issues. GitHub issues are thus dynamic, and they pose a low entry barrier for users to report bugs and request features. A well-organized and well-tagged issue tracker also helps new contributors and users to understand a project more deeply. As an example, one issue in the OpenMS repository (https://github.com/OpenMS/OpenMS/issues/1095) involved eight developers and attracted more than one hundred comments. Contributors can add figures, comments, and references to other issues and pull requests in the repository, as well as direct references to code. As another illustration of the generic and wide applicability of issues, we (https://github.com/ypriverol/github-paper/issues) and others (https://github.com/ropensci/RNeXML/issues/121) have used GitHub issues to discuss and comment on changes in manuscripts and to address reviewers' comments.

Rule 8: Make Your Code Easily Citable, and Cite Source Code!

It is good research practice to ensure permanent and unambiguous identifiers for citable items like articles, datasets, or biological entities such as proteins, genes, and metabolites (see also Box 3). Digital Object Identifiers (DOIs) have been used for many years as unique and unambiguous identifiers enabling the citation of scientific publications. More recently, a trend has started to mint DOIs for other types of scientific products, such as datasets [12] and training materials (for example, [13]). A key motivation is to build a framework that gives scientists broader credit for their work [14,15] while simultaneously supporting clearer, more persistent ways to cite and track it. Helping to drive this change are funding agencies such as the National Institutes of Health (NIH) and National Science Foundation (NSF) in the United States and the Research Councils in the United Kingdom, which increasingly recognize the importance of research products such as publicly available datasets and software.

Box 3. Every repository should ideally have the following three files.

The first and arguably most important file in a repository is a LICENSE file (see also Rule 8) that clearly defines the permissions and restrictions attached to the code and other files in your repository. The second important file is a README file, which provides, for example, a short description of the project, a quick start guide, information on how to contribute, a TODO list, and links to additional documentation. Such README files are typically written in Markdown, a simple markup language that is automatically rendered on GitHub. Finally, a CITATION file in the repository informs your users how to cite and credit your project.

A common issue with software is that it normally evolves at a different speed than the text published in the scientific literature. In fact, it is common to find software with novel features and functionality that were not described in the original publication. GitHub now integrates with archiving services such as Zenodo and Figshare, enabling DOIs to be assigned to code repositories.
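The following sketch, which is not prescribed by the article, shows one way the three files of Box 3 might be added and a release prepared so that a linked archiving service can mint a DOI for it; the file names beyond the three recommended ones and the version number are placeholders.

    # Add the three recommended top-level files (their contents are project specific)
    git add LICENSE README.md CITATION
    git commit -m "Add license, readme, and citation information"
    # Tag a version and push it; creating a GitHub release from this tag lets a
    # linked Zenodo (or Figshare) account archive the snapshot and assign a DOI
    git tag -a v1.0.0 -m "First citable release"
    git push origin master --tags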
The archiving procedure is relatively straightforward (see https://guides.github.com/activities/citable-code/), requiring only the provision of metadata and a series of administrative steps. By default, Zenodo creates an archive of a repository each time a new release is created in GitHub, ensuring that the cited code remains up to date. Once the DOI has been assigned, it can be added to literature information resources such as Europe PubMed Central [16].

As already mentioned in the introduction, reproducibility of scientific claims should be enabled by providing the software, the datasets, and the process leading to interpretable results that were used in a particular study. As much as possible, publications should highlight that the code is freely available in, for example, GitHub, together with any other relevant outputs that may have been deposited. In our experience, this openness substantially increases the chances of getting the paper accepted for publication. Journal editors and reviewers get the opportunity to reproduce findings during the manuscript review process, increasing confidence in the reported results. In addition, once the paper is published, your work can be reproduced by other members of the scientific community, which can increase citations and foster opportunities for further discussion and collaboration.

The availability of a public repository containing the source code does not make the software open source per se. You should use an Open Source Initiative (OSI)-approved license that defines how the software can be freely used, modified, and shared. Common licenses such as those listed on http://choosealicense.com are preferred. Note that the LICENSE file in the repository should be a plain-text file containing the contents of an OSI-approved license, not just a reference to the license.

Rule 9: Promote and Discuss Your Projects: Web Page and More

The traditional way to promote scientific software is to publish an associated paper in the peer-reviewed scientific literature, although, as pointed out by Buckheit and Donoho, this is just advertising [17]. Additional steps can boost the visibility of a project or organization. For example, GitHub Pages are simple websites freely hosted by GitHub. Users can create and host blogs, help pages, manuals, tutorials, and websites related to specific projects. Pages comes with a powerful static site generator called Jekyll that can be integrated with other frameworks such as Bootstrap or platforms such as Disqus to support and moderate comments.

In addition, several real-time communication platforms, such as Gitter and Slack, have been integrated with GitHub. Real-time communication systems allow the user community, developers, and project collaborators to exchange ideas, discuss issues, report bugs, and get support. For example, Gitter is a GitHub-based chat tool that enables developers and users to share aspects of their work. Gitter inherits the network of social groups operating around GitHub repositories, organizations, and issues; it relies on identities within GitHub, creating Internet Relay Chat (IRC)-like chat rooms for public and private projects. Within a Gitter chat, members can reference issues, comments, and pull requests. GitHub also supports wikis (which are version-controlled repositories themselves) for each repository, in which users can create and edit pages for documentation, examples, or general support.
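As a rough sketch of the GitHub Pages route (assuming Ruby and the Jekyll gem are available locally; the site and repository names are placeholders, not part of the article), a project website can be prototyped and then published from a gh-pages branch:

    # Scaffold and preview a Jekyll site locally
    gem install jekyll bundler
    jekyll new project-docs
    cd project-docs
    bundle install
    bundle exec jekyll serve          # preview at http://localhost:4000
    # Publish it as the project's GitHub Pages site from a gh-pages branch
    git init
    git checkout -b gh-pages
    git add .
    git commit -m "Initial project website"
    git remote add origin https://github.com/OWNER/REPO.git
    git push origin gh-pages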
A different service is Gist, which offers a simple way to share code snippets, single files, parts of files, or full applications. Gists come in two flavors: public gists, which can be browsed and searched through Discover, and secret gists, which are hidden from search engines. One of the main features of Gist is the possibility of embedding code snippets in other applications, enabling users to embed gists in any text field that supports JavaScript.

Rule 10: Use GitHub to Be Social: Follow and Watch

In the same way that researchers follow developments in their field, scientific programmers can follow publicly available projects that might benefit their research. GitHub enables this by letting you follow other GitHub users (see also Rule 2) or watch the activity of projects, a feature common to many social media platforms. Take advantage of it as much as possible!

Conclusions

If you are involved in scientific research and have not used Git and GitHub before, we recommend that you explore their potential as soon as possible. As with many tools, a learning curve lies ahead, but several basic yet powerful features are accessible even to the beginner and may be applied to many different use cases [6]. We anticipate the reward will be worth your effort. To conclude, we would like to recommend some examples of bioinformatics repositories on GitHub (Table 1) and some useful training materials, including workshops, online courses, and manuscripts (Table 2).

Supporting Information

S1 File. Supplementary information comprising three sections: Git Large File Storage (LFS); Testing Levels of the Source Code and Continuous Integration; and Source Code Documentation. (PDF)
Author and article information

Journal: PLoS Computational Biology (PLoS Comput Biol), Public Library of Science, San Francisco, CA, USA. ISSN: 1553-734X, 1553-7358.
Published: 4 January 2018 (January 2018 issue); Volume 14, Issue 1: e1005871.
Affiliations:
[1] Department of Microbiology, Immunology, and Cancer Biology, University of Virginia School of Medicine, Charlottesville, Virginia, United States of America
[2] Department of Biomedical Engineering, University of Virginia, Charlottesville, Virginia, United States of America
Editor affiliation: Dassault Systemes BIOVIA, United States
Author notes: The authors have declared that no competing interests exist. Jason A. Papin is co-Editor-in-Chief of PLOS Computational Biology.
ORCID: http://orcid.org/0000-0003-2890-5445
Article identifiers: PCOMPBIOL-D-17-01424; DOI: 10.1371/journal.pcbi.1005871; PMC 5754048; PMID 29300745.
© 2018 Carey, Papin. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Page count: Figures: 6, Tables: 1, Pages: 11.
Funding: The authors received no specific funding for this work.