
      Ten simple rules to increase computational skills among biologists with Code Clubs

      research-article


          Abstract

Introduction

For most biologists, the ability to generate data has outpaced the ability to analyze those data. High-throughput data come to us from DNA and RNA sequencing, flow cytometry, metabolomics, molecular screens, and more. Although some accept the approach of compartmentalizing data generation and data analysis, we have found that scientists feel empowered when they can both ask and answer their own biological questions [1]. In our experience performing microbiome research, it is more common to find exceptional bench scientists who are inexperienced at analyzing large data sets than to find the reverse. Of course, this raises a challenge: how do we train bench scientists to effectively answer biological questions with these larger data sets?

The standard undergraduate and graduate training in the biological sciences is typically insufficient for developing these skills [2]. To fill this void, there has been growth in online data science training resources through companies including DataCamp (https://www.datacamp.com) and Codecademy (https://www.codecademy.com), massive open online courses (MOOCs), and in-person workshops hosted by organizations including The Carpentries (https://carpentries.org). Each of these formats is highly popular with its audience (e.g., [3]); however, recent analyses of workshops suggest that they have little impact on the long-term retention of the material that is taught [4]. This result should not be surprising: these learning formats ask participants to engage in massed practice, an approach that is known to be ineffective [5–8]. Instead, participants need regular opportunities to engage in repeated, deliberate practice that encourages them to retrieve and apply the material [9,10]. There is clearly a need for a framework that can build upon material introduced in coursework and workshops and that can help researchers stay up to date on the latest best practices, algorithms, and tools.

To address this need, we noted the similarity between the feeling of unease in analyzing data and the struggle scientists face in engaging with the voluminous scientific literature. A common strategy for keeping up with the literature is to participate in a Journal Club. Regardless of the discipline, a Journal Club involves group discussion of a preselected paper and can range from informal discussions to PowerPoint presentations and course credit. In addition to building upon material from traditional coursework and staying current on the literature, Journal Clubs help strengthen skills in critical thinking, communication, and integrating the literature [11]. Because most Journal Clubs occur on a regular schedule, they are effective by virtue of repeated practice.

With this model in mind, over the past four years we have experimented with a Code Club model with the goal of improving reproducible data analysis skills in a laboratory environment. Our Code Club sessions are an hour long and alternate with our lab's Journal Club as the second part of weekly two-hour lab meetings. Initially, Code Club was used to review code from trainee projects. Instead of giving a presentation, the presenter would project their code onto a screen and the participants would go through the code, stating the logic behind each line. This approach emphasized the importance of code readability and gave beginners the opportunity to see the real-life, messy code of more experienced peers.
Unfortunately, the format only allowed us to review a fraction of a project's code, making it difficult to integrate the programmer's logic across the full project. A major issue with this model was that beginners sometimes could not contribute to improving the code, and even when they could, more experienced group members would eventually dominate the discussion. This led to a lack of participation by beginners, who would sometimes mentally withdraw, and to an adversarial environment between the presenter and more experienced members. As a result, presenters were reluctant to offer to present again.

From these experiences, we began a collective conversation to improve our Code Club model, and we have identified two successful approaches. The first is a more constructive version of a group code review. The presenter clearly states the problem they want to solve, breaks the participants into smaller groups, and asks each group to solve the problem, or a portion of it. For example, someone may have an R script with a repeating chunk of code. The challenge for the session would be to convert the code chunk into a function that can be called throughout the script to make it "DRY" (i.e., Don't Repeat Yourself [12]); a minimal sketch of this kind of refactoring appears just before Table 1. The presenter leaves the session with several partial or working solutions to their problem, and the importance of writing DRY code is reinforced.

The second approach is a tutorial. The presenter introduces a new package or technique and assigns an activity to practice the new approach. For example, at one Code Club, participants were given raw data and a finished plot. Paired participants were tasked with generating the plot from the data using R syntax from either the base language or the ggplot2 package; for this exercise, base R users had to use ggplot2 and vice versa. In either approach, the Code Club ends with a report back to the larger group describing the approach each pair took. We generally find that preparing a Code Club session takes an effort similar to preparing a Journal Club presentation. Our Code Clubs typically have 7 to 10 participants, but the inherent "think-pair-share" approach should allow the format to scale to groups of variable sizes [13].

We continue to experiment with approaches for running Code Club. For example, during the COVID-19 pandemic in 2020, when many research labs were shuttered, author PDS started posting Code Club materials (https://www.riffomonas.org/code_club/) that individuals or research groups can use to practice their coding skills. During this time, we continued our group's Code Club sessions using the breakout room and screen sharing features in Zoom. Regardless of our approach to leading a Code Club, we have learned that it is critical for the presenter to clearly articulate their goals and facilitate participant engagement. Although some Code Club sessions may be more experimental than others, on the whole they are a critical tool for training bench scientists in reproducible data analysis practices. We have provided examples of successful topics in Table 1.

We have summarized the results of our own experimentation as Code Club presenters and participants into Ten Rules. The first three rules apply to all participants, Rules 4 through 8 are targeted at presenters, and the last two at non-presenters. As the rules describe, this model can easily be adapted to groups of non-biologists or to those with stronger backgrounds in computer science.
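To make the first approach concrete, here is a minimal sketch of the kind of DRY refactoring such a session might produce. It assumes a dplyr-based workflow; the microbiome data frame, column names, and abundance cutoff are hypothetical placeholders rather than code from an actual session.

```r
# Before the session, the same filter-and-summarize chunk was copied for each body site.
# The exercise collapses that repetition into a single reusable function.
library(dplyr)

summarize_site <- function(data, site, min_abundance = 0.01) {
  # Keep one body site, drop rare taxa, and report the mean abundance of each taxon.
  data %>%
    filter(body_site == site, abundance >= min_abundance) %>%
    group_by(taxon) %>%
    summarize(mean_abundance = mean(abundance), .groups = "drop")
}

# 'microbiome' is a hypothetical data frame with body_site, taxon, and abundance columns.
# The repeated chunks in the original script become three short calls.
stool_summary  <- summarize_site(microbiome, "stool")
saliva_summary <- summarize_site(microbiome, "saliva")
skin_summary   <- summarize_site(microbiome, "skin")
```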
Table 1. Examples of successful Code Club topics.

base versus ggplot2: Given input data and a figure, recreate the figure using R's base graphics or ggplot2 syntax (a minimal sketch follows the Conclusion below)
Snakemake: Given a bash script that contains an analysis pipeline, convert it to a Snakemake workflow (https://github.com/SchlossLab/snakemake_riffomonas_tutorial)
DRYing code: Given a script with repeated code, create functions to remove the repetition
vegan R package: Compare microbial communities using the adonis function in the vegan R package (https://github.com/SchlossLab/Code_Review_42717)
tidy data: Given a wide-formatted data table, convert it to a long, tidy-formatted data table using tools from R's tidyverse
GitFlow: Participants file and claim an issue to add their name to a README file in a GitHub-hosted repository and file a pull request to complete the issue
googlesheets4 R package: Scrape a Google Sheets workbook and clean the data to identify previous Code Club presenters
Develop an R package: Convert a lab member's collection of scripts into an R package over a series of sessions
Documenting R code: Use roxygen2 to supplement comments in R code and improve documentation (https://github.com/SchlossLab/documenting-R)
gganimate R package: Convert static plots generated with the ggplot2 package into GIFs (https://github.com/SchlossLab/2020-04-12-CodeClub_PlotAnimation/)

Rule 1: Reciprocate respect

It is critical that the presenter and participants respect each other and that a designated individual (e.g., the lab director) enforces a code of conduct. Each member of the Code Club must have the humility to acknowledge that they have more to learn about any given topic. We have found that many problems are avoided when the presenter takes charge of the session with a clear lesson plan, thoughtfully creates groups, and gives encouragement. Similarly, participants foster a positive environment by remembering that the task is not a competition, focusing instead on the presenter's goals, allowing their partner to contribute, asking clarifying questions when appropriate, and avoiding distractions (e.g., email, social media). Learning to program is challenging, and too often it is attempted in an environment of nonconstructive criticism. All parties in a Code Club are responsible for preventing this by demonstrating respect for themselves and their colleagues.

Rule 2: Let the material change you

Part of the humility required to participate in a Code Club is acknowledging that your training is incomplete and that everyone can learn something new. For participants, assume that the presenter has a plan and follow their presented approach. After the Code Club, try to incorporate the material into new code or use it to refactor old code; by practicing the material in a different context, you will learn it better. For presenters, incorporate suggested changes into your code. Either party may identify concepts that they are unsure of, presenting opportunities for further conversation and learning.

Rule 3: Experiment!

The selection of content and structure for each Code Club works best when it is democratic and distributed. If someone thinks a technique is worth learning and wants to teach it, they have that power. If they want to experiment with a different format, they are free to try it out. Members of the Code Club need to feel that they have the power to shape the direction of the group. If members are following these Rules, they will naturally reflect on the skills and interests of the other members of the group.
For example, there is always turnover in a research group, making it important to revisit basic concepts to teach new people and provide a refresher to others. One successful experiment found that we could keep basic content interesting to more experienced group members by creating related problems that varied in their difficulty. The group should also feel free to experiment with the format and to incorporate group feedback by ending with a debrief on the pros and cons of new formats.

Rule 4: Set specific goals

In our early Code Clubs, we noticed that if the presenter did not clearly state their goals for the session, both the presenter and the participants were often frustrated. If the presenter shared their own code, did they want participants to focus on their coding style, or did they want help incorporating a new package into their workflow? Participants will always notice or ask about code concepts that are not the focus of the exercise; a presenter with a specific goal can bring tangential conversations back to the planned task. Where possible, presenters should create a simplified scenario (i.e., a minimal, reproducible example [14]), which can help focus the participants. The presenter should verify ahead of time that the simplified example works and behaves the way they expect. Beyond the content, clear goals for participant activities will help both parties stay on task and avoid frustration. For more advanced learners, the presenter can create stretch goals or give an activity with multiple stopping points at which participants would feel successful (e.g., commenting code, creating a function, implementing the function, refactoring the function). Accomplishing specific goals is more likely to result in a positive outcome for presenters and participants.

Rule 5: Keep it simple

Our Code Club needs to fit within an hour-long time slot. When planning their Code Club activity, the presenter should allow for an introduction and brief instruction, time for participants to engage the material, and time for everyone to report back within that hour. A typical schedule is 10 minutes of introduction and instruction, 30 minutes of paired programming, 5 minutes for groups to wrap up, and 10 minutes to report back to the group. We once had a presenter try to teach basic, but unfamiliar, Julia syntax; unfortunately, the time was up before the participants had even installed the interpreter. One tip for keeping it simple is to limit the presented code chunks to fewer than 50 lines or, conversely, to consider the number of lines that might be required to accomplish a solution. Remember that learners may need up to three times as long to complete a task that is straightforward for the presenter, so Code Club is best kept simple.

Rule 6: Give participants time to prepare

As with a Journal Club, the presenter should give participants a few days (ideally a week) to prepare for the Code Club. Considering the compressed schedule described in Rule 5, asking participants to download materials beforehand helps ensure a quick start. The presenter should provide the participants with instructions on how to install dependencies, download data, and get the initial code. This might also uncover weak points in the presenter's plan and let them confirm that the materials work as intended before the Code Club. We have found that using a GitHub repository for each Code Club can make information, scripts, and data easily available to participants.
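As an illustration of Rule 6, a presenter might circulate a short setup script along with the session materials. The sketch below is only an example of that idea; the package list and the data URL are hypothetical placeholders for whatever a particular session actually requires.

```r
# setup.R: run before Code Club so that the session can start on time.

# Install any missing packages the session depends on (hypothetical list).
needed <- c("dplyr", "ggplot2")
missing <- needed[!needed %in% rownames(installed.packages())]
if (length(missing) > 0) {
  install.packages(missing)
}

# Download the practice data set into a local data/ directory.
# The URL is a placeholder for the file the presenter actually shares.
dir.create("data", showWarnings = FALSE)
download.file("https://example.org/code_club/session_data.csv",
              destfile = "data/session_data.csv")

# Quick check that everything loads before the session.
library(dplyr)
library(ggplot2)
head(read.csv("data/session_data.csv"))
```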
Although a GitHub repository can be convenient, introducing GitHub on top of the session's activities can impose a significant cognitive load, and considerable frustration, on those not already comfortable with it. Perhaps a first Code Club could introduce using git and GitHub to engage in collaborative coding. A lower-barrier entry point for the presenter is to post their code, data, and information in a dedicated lab-meeting Slack channel or to share them via email. Whatever method is used, the presenter should be sure to communicate the topic and necessary materials to the participants ahead of time.

Rule 7: Don't give participants busywork

Participants want to learn topics that will either be useful to them or help their colleague (i.e., the presenter). Presenters should do their best to satisfy those motivations, whether through the relevance of the concept or of the data. It does not make sense to present a Code Club on downloading stock market data if doing so is not useful or interesting to the group. Similarly, participants should not be tasked with improving the presenter's code if the presenter has no intention of incorporating the suggestions. A list of packages or tasks that group members are interested in, along with a log of previous topics, can help a presenter struggling to find a topic choose one that will make for a rewarding Code Club.

Rule 8: Include all levels of participants

As Rule 7 suggests, a significant challenge in presenting at Code Club is selecting topics and activities that appeal to a critical mass of the participants. This is particularly difficult if participants have a wide range of coding experience, which can happen after turnover in group membership. Beginners will benefit from sessions that cover fundamental concepts and functions. More experienced participants benefit from the improved understanding that comes from teaching and from breaking problems down into simpler elements; they can also give back to the learning environment that they previously benefited from. In addition, instead of focusing only on core functions, the group can balance sessions that cover basics with those that introduce new methods and packages.

We have also identified several strategies to overcome the challenges presented by participants at varying skill levels. Central to the Code Club format is the use of paired programming [15]. Instead of letting participants form their own pairs, the presenter can select pairs of participants with either similar or differing skill levels, depending on the presenter's goals. Partnerships between those with similar skill levels require the presenter to design appropriate activities for each skill level. We have found that commenting code is a good activity for beginners because it forces them to dissect and understand a code chunk line by line; it also reinforces the value of commenting as they develop their skills and independence. An advantage of forming partnerships between people with disparate skill levels is that the groups are more likely to provide the presenter with a diverse range of methods that achieve the same result. This approach to pairing also helps to integrate new members who have emerging programming skills into the group. Regardless of how partners are selected, consider asking the pairs to identify a navigator and a driver [16]. The driver types at the computer while the navigator tells them what to type, ensuring that both partners participate. Midway through the activity, the presenter can have the partners switch roles.
Intentionally forming pairs can also be used to engineer group interactions, for example by avoiding potentially disruptive partnerships or by pairing reliable role models with new group members.

Rule 9: Prepare in advance to maximize participation

It is not possible to fully participate in a Journal Club discussion of a paper that one has not read. In that context, coming to Code Club without having installed a dependency is similar to asking a simple question about the Journal Club paper without first reading it: both show a lack of preparation. Just as a presenter must follow Rule 6 and provide materials ahead of time, participants must review the code in advance, download the data sets, install the necessary packages, and perhaps read up on the topic. If the Code Club is based on a paper or on chapters in a data science book (e.g., [17–19]), participants should read them before the session and consider how they might incorporate the concepts into their own work.

Rule 10: Participate

An essential ingredient of any Code Club is active participation from all parties. Having an open laptop on the table, and permission to use it, can feel like an invitation to get distracted by other work, email, and browsing the internet; fight that urge and focus on the presenter's goals. Be respectful and allow your partner to contribute, and speak up for yourself so that your partner lets you contribute. If the material seems too advanced for you, it can be frustrating and tempting to mentally check out. Fortunately, programming languages like R and Python are generally expressive, which should allow you to engage with the logic even if the syntax is too advanced. Oftentimes, understanding when to use one modeling approach over another is more important than knowing how to use it; if you understand the "why," the "how" will quickly follow. More experienced participants should aim to communicate feedback and coding suggestions at a level that all participants can understand and engage with. Regardless of skill level, your partner and the presenter put themselves in a vulnerable position by revealing what they do or do not know. Encourage them, and show your gratitude for helping you learn something new by fully participating in each Code Club.

Conclusion

The most important rules are the first and the last. Members of the Code Club need to feel comfortable with other group members and sufficiently empowered to try something new or ask for help. Aside from expanding our programming skills, we have noticed two other benefits that help create a positive culture. First, we have intentionally interviewed postdoctoral candidates on the days we hold Code Club. We make it clear that they are not being assessed on their coding skills; instead, we use the session as an opportunity to see how the candidate interacts with other members of the research group. At the same time, the candidate can learn about the culture of the research group through active participation. Second, members of other research groups have integrated themselves into our Code Club to minimize the isolation they feel in growing their skills within smaller research groups. This speaks both to the broader need for Code Club and to its likelihood of success when expanded to include a larger group of individuals with broader research interests. There is no reason that the Code Club format would not work for groups outside a single research group, as long as everyone follows the Rules.
Finding data sets and applications that are interesting to a critical mass of people is essential to starting and sustaining such a group. Ultimately, Code Club has improved the overall data analysis skills, community, and research success of our lab by empowering researchers to seek help from their colleagues.
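As one concrete illustration of the first exercise listed in Table 1 ("base versus ggplot2"), here is a minimal sketch. The data are simulated placeholders rather than materials from an actual session.

```r
# "base versus ggplot2": draw the same figure with base graphics and with ggplot2.
library(ggplot2)

set.seed(1)
weights <- data.frame(
  diet = rep(c("control", "treatment"), each = 20),
  gain = c(rnorm(20, mean = 5), rnorm(20, mean = 7))
)

# Base R version.
boxplot(gain ~ diet, data = weights, xlab = "Diet", ylab = "Weight gain (g)")

# ggplot2 version of the same figure.
ggplot(weights, aes(x = diet, y = gain)) +
  geom_boxplot() +
  labs(x = "Diet", y = "Weight gain (g)")
```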

          Related collections

          Most cited references (12)


          Using Spacing to Enhance Diverse Forms of Learning: Review of Recent Research and Implications for Instruction


            Unmet needs for analyzing biological big data: A survey of 704 NSF principal investigators

            In a 2016 survey of 704 National Science Foundation (NSF) Biological Sciences Directorate principal investigators (BIO PIs), nearly 90% indicated they are currently or will soon be analyzing large data sets. BIO PIs considered a range of computational needs important to their work, including high performance computing (HPC), bioinformatics support, multistep workflows, updated analysis software, and the ability to store, share, and publish data. Previous studies in the United States and Canada emphasized infrastructure needs. However, BIO PIs said the most pressing unmet needs are training in data integration, data management, and scaling analyses for HPC—acknowledging that data science skills will be required to build a deeper understanding of life. This portends a growing data knowledge gap in biology and challenges institutions and funding agencies to redouble their support for computational training in biology.

              A Quick Guide to Organizing Computational Biology Projects

Introduction

Most bioinformatics coursework focuses on algorithms, with perhaps some components devoted to learning programming skills and learning how to use existing bioinformatics software. Unfortunately, for students who are preparing for a research career, this type of curriculum fails to address many of the day-to-day organizational challenges associated with performing computational experiments. In practice, the principles behind organizing and documenting computational experiments are often learned on the fly, and this learning is strongly influenced by personal predilections as well as by chance interactions with collaborators or colleagues. The purpose of this article is to describe one good strategy for carrying out computational experiments. I will not describe profound issues such as how to formulate hypotheses, design experiments, or draw conclusions. Rather, I will focus on relatively mundane issues such as organizing files and directories and documenting progress. These issues are important because poor organizational choices can lead to significantly slower research progress. I do not claim that the strategies I outline here are optimal. These are simply the principles and practices that I have developed over 12 years of bioinformatics research, augmented with various suggestions from other researchers with whom I have discussed these issues.

Principles

The core guiding principle is simple: Someone unfamiliar with your project should be able to look at your computer files and understand in detail what you did and why. This "someone" could be any of a variety of people: someone who read your published article and wants to try to reproduce your work, a collaborator who wants to understand the details of your experiments, a future student working in your lab who wants to extend your work after you have moved on to a new job, or your research advisor, who may be interested in understanding your work or evaluating your research skills. Most commonly, however, that "someone" is you. A few months from now, you may not remember what you were up to when you created a particular set of files, or you may not remember what conclusions you drew. You will either have to spend time reconstructing your previous experiments or lose whatever insights you gained from those experiments.

This leads to the second principle, which is actually more like a version of Murphy's Law: Everything you do, you will probably have to do over again. Inevitably, you will discover some flaw in your initial preparation of the data being analyzed, or you will get access to new data, or you will decide that your parameterization of a particular model was not broad enough. This means that the experiment you did last week, or even the set of experiments you've been working on over the past month, will probably need to be redone. If you have organized and documented your work clearly, then repeating the experiment with the new data or the new parameterization will be much, much easier. To see how these two principles are applied in practice, let's begin by considering the organization of directories and files with respect to a particular project.

File and Directory Organization

When you begin a new project, you will need to decide upon some organizational structure for the relevant directories. It is generally a good idea to store all of the files relevant to one project under a common root directory. The exception to this rule is source code or scripts that are used in multiple projects. Each such program might have a project directory of its own. Within a given project, I use a top-level organization that is logical, with chronological organization at the next level, and logical organization below that. A sample project, called msms, is shown in Figure 1. At the root of most of my projects, I have a data directory for storing fixed data sets, a results directory for tracking computational experiments performed on that data, a doc directory with one subdirectory per manuscript, and directories such as src for source code and bin for compiled binaries or scripts.

Figure 1. Directory structure for a sample project. Directory names are in large typeface, and filenames are in smaller typeface. Only a subset of the files are shown here. Note that the dates are formatted year-month-day so that they can be sorted in chronological order. The source code src/ms-analysis.c is compiled to create bin/ms-analysis and is documented in doc/ms-analysis.html. The README files in the data directories specify who downloaded the data files from what URL on what date. The driver script results/2009-01-15/runall automatically generates the three subdirectories split1, split2, and split3, corresponding to three cross-validation splits. The bin/parse-sqt.py script is called by both of the runall driver scripts.

Within the data and results directories, it is often tempting to apply a similar, logical organization. For example, you may have two or three data sets against which you plan to benchmark your algorithms, so you could create one directory for each of them under data. In my experience, this approach is risky, because the logical structure of your final set of experiments may look drastically different from the form you initially designed. This is particularly true under the results directory, where you may not even know in advance what kinds of experiments you will need to perform. If you try to give your directories logical names, you may end up with a very long list of directories with names that, six months from now, you no longer know how to interpret. Instead, I have found that organizing my data and results directories chronologically makes the most sense. Indeed, with this approach, the distinction between data and results may not be useful. Instead, one could imagine a top-level directory called something like experiments, with subdirectories with names like 2008-12-19. Optionally, the directory name might also include a word or two indicating the topic of the experiment therein. In practice, a single experiment will often require more than one day of work, and so you may end up working a few days or more before creating a new subdirectory. Later, when you or someone else wants to know what you did, the chronological structure of your work will be self-evident.

Below a single experiment directory, the organization of files and directories is logical, and depends upon the structure of your experiment. In many simple experiments, you can keep all of your files in the current directory. If you start creating lots of files, then you should introduce some directory structure to store files of different types. This directory structure will typically be generated automatically from a driver script, as discussed below.

The Lab Notebook

In parallel with this chronological directory structure, I find it useful to maintain a chronologically organized lab notebook. This is a document that resides in the root of the results directory and that records your progress in detail. Entries in the notebook should be dated, and they should be relatively verbose, with links or embedded images or tables displaying the results of the experiments that you performed. In addition to describing precisely what you did, the notebook should record your observations, conclusions, and ideas for future work. Particularly when an experiment turns out badly, it is tempting simply to link the final plot or table of results and start a new experiment. Before doing that, it is important to document how you know the experiment failed, since the interpretation of your results may not be obvious to someone else reading your lab notebook.

In addition to the primary text describing your experiments, it is often valuable to transcribe notes from conversations as well as e-mail text into the lab notebook. These types of entries provide a complete picture of the development of the project over time. In practice, I ask members of my research group to put their lab notebooks online, behind password protection if necessary. When I meet with a member of my lab or a project team, we can refer to the online lab notebook, focusing on the current entry but scrolling up to previous entries as necessary. The URL can also be provided to remote collaborators to give them status updates on the project. Note that if you would rather not create your own "home-brew" electronic notebook, several alternatives are available. For example, a variety of commercial software systems have been created to help scientists create and maintain electronic lab notebooks [1]–[3]. Furthermore, especially in the context of collaborations, storing the lab notebook on a wiki-based system or on a blog site may be appealing.

Carrying Out a Single Experiment

You have now created your directory structure, and you have created a directory for the current date, with the intention of carrying out a particular experiment in that directory. How do you proceed? The general principle is that you should record every operation that you perform, and make those operations as transparent and reproducible as possible. In practice, this means that I create either a README file, in which I store every command line that I used while performing the experiment, or a driver script (I usually call this runall) that carries out the entire experiment automatically. The choices that you make at this point will depend strongly upon what development environment you prefer. If you are working in a language such as Matlab or R, you may be able to store everything as a script in that language. If you are using compiled code, then you will need to store the command lines separately. Personally, I work in a combination of shell scripts, Python, and C. The appropriate mix of these three languages depends upon the complexity of the experiment. Whatever you decide, you should end up with a file that is parallel to the lab notebook entry. The lab notebook contains a prose description of the experiment, whereas the driver script contains all the gory details. Here are some rules of thumb that I try to follow when developing the driver script:

Record every operation that you perform.

Comment generously. The driver script typically involves little in the way of complicated logic, but often invokes various scripts that you have written, as well as a possibly eclectic collection of Unix utilities. Hence, for this type of script, a reasonable rule of thumb is that someone should be able to understand what you are doing solely from reading the comments. Note that I am refraining from advocating a particular mode of commenting for compiled code or more complex scripts: there are many schools of thought on the correct way to write such comments.

Avoid editing intermediate files by hand. Doing so means that your script will only be semi-automatic, because the next time you run the experiment, you will have to redo the editing operation. Many simple editing operations can be performed using standard Unix utilities such as sed, awk, grep, head, tail, sort, cut, and paste.

Store all file and directory names in this script. If the driver script calls other scripts or functions, then file and directory names should be passed from the driver script to these auxiliary scripts. Forcing all of the file and directory names to reside in one place makes it much easier to keep track of and modify the organization of your output files.

Use relative pathnames to access other files within the same project. If you use absolute pathnames, then your script will not work for people who check out a copy of your project in their local directories (see "The Value of Version Control" below).

Make the script restartable. I find it useful to guard long-running steps of the experiment with a check of the form "if (the output file does not already exist) then (run the step and create it)". If I want to rerun selected parts of the experiment, then I can simply delete the corresponding output files.

For experiments that take a long time to run, I find it useful to be able to obtain a summary of the experiment's progress thus far. In these cases, I create two driver scripts, one to run the experiment (runall) and one to summarize the results (summarize). The final line of runall calls summarize, which in turn creates a plot, table, or HTML page that summarizes the results of the experiment. The summarize script is written in such a way that it can interpret a partially completed experiment, showing how much of the computation has been performed thus far.

Handling and Preventing Errors

During the development of a complicated set of experiments, you will introduce errors into your code. Such errors are inevitable, but they are particularly problematic if they are difficult to track down or, worse, if you don't know about them and hence draw invalid conclusions from your experiment. Here are three suggestions for error handling. First, write robust code to detect errors. Even in a simple script, you should check for bogus parameters, invalid input, etc. Whenever possible, use robust library functions to read standard file formats rather than writing ad hoc parsers. Second, when an error does occur, abort. I typically have my program print a message to standard error and then exit with a non-zero exit status. Such behavior might seem to make your program brittle; however, if you try to skip over the problematic case and continue on to the next step in the experiment, you run the risk that you will never notice the error. A corollary of this rule is that your code should always check the return codes of commands executed and functions called, and abort when a failure is observed. Third, whenever possible, create each output file using a temporary name, and then rename the file after it is complete. This allows you to easily make your scripts restartable and, more importantly, prevents partial results from being mistaken for full results.
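As a minimal illustration of three of these habits (skipping a step whose output already exists, aborting on bad input, and writing output under a temporary name), here is a sketch in R; the article itself works in shell, Python, and C, and the file names and columns below are hypothetical placeholders.

```r
# Sketch of a restartable, error-checking step in a driver script (hypothetical files).
input  <- "data/2009-01-15/measurements.tsv"
output <- "results/2009-01-15/summary.tsv"

if (!file.exists(output)) {            # restartable: skip the step if its output already exists
  if (!file.exists(input)) {
    stop("input file not found: ", input)   # abort loudly rather than continuing silently
  }
  dat <- read.delim(input)
  if (!all(c("group", "value") %in% names(dat))) {
    stop("input is missing the expected 'group' and 'value' columns")
  }

  result <- aggregate(value ~ group, data = dat, FUN = mean)

  # Write to a temporary name, then rename, so partial results are never mistaken for full ones.
  tmp <- paste0(output, ".tmp")
  write.table(result, tmp, sep = "\t", quote = FALSE, row.names = FALSE)
  file.rename(tmp, output)
}
```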
Command Lines versus Scripts versus Programs

The design question that you will face most often as you formulate and execute a series of computational experiments is how much effort to put into software engineering. Depending upon your temperament, you may be tempted to execute a quick series of commands in order to test your hypothesis immediately, or you may be tempted to over-engineer your programs to carry out your experiment in a pleasingly automatic fashion. In practice, I find that a happy medium between these two often involves iterative improvement of scripts. An initial script is designed with minimal functionality and without the ability to restart in the middle of partially completed experiments. As the functionality of the script expands and the script is used more often, it may need to be broken into several scripts, or it may get "upgraded" from a simple shell script to Python or, if memory or computational demands are too high, from Python to C or a mix thereof. In practice, therefore, the scripts that I write tend to fall into these four categories:

Driver script. This is a top-level script; hence, each directory contains only one or two scripts of this type.

Single-use script. This is a simple script designed for a single use. For example, the script might convert an arbitrarily formatted file associated with this project into a format used by some of your existing scripts. This type of script resides in the same directory as the driver script that calls it.

Project-specific script. This type of script provides a generic functionality used by multiple experiments within the given project. I typically store such scripts in a directory immediately below the project root directory (e.g., the msms/bin/parse-sqt.py file in Figure 1).

Multi-project script. Some functionality is generic enough to be useful across many projects. I maintain a set of these generic scripts, which perform functions such as extracting specified sequences from a FASTA file, generating an ROC curve, splitting a file for n-fold cross-validation, etc.

Regardless of how general a script is supposed to be, it should have a clearly documented interface. In particular, every script or program, no matter how simple, should be able to produce a fairly detailed usage statement that makes it clear what the inputs and outputs are and what options are available.

The Value of Version Control

Version control software was originally developed to maintain and coordinate the development of complex software engineering projects. Modern version control systems such as Subversion are based on a central repository that stores all versions of a given collection of related files. Multiple individuals can "check out" a working copy of these files into their local directories, make changes, and then check the changes back into the central repository. I find version control software to be invaluable for managing computational experiments, for three reasons.

First, the software provides a form of backup. Although our university computer systems are automatically backed up on a nightly basis, my laptop's backup schedule is more erratic. Furthermore, after mistakenly overwriting a file, it is often easier to retrieve yesterday's version from Subversion than to send an e-mail to the system administrator. Indeed, one of my graduate students told me he would breathe a sigh of relief after typing svn commit, because that command stores a snapshot of his working directory in the central repository.

Second, version control provides a historical record that can be useful for tracking down bugs or understanding old results. Typically, a script or program will evolve throughout the course of a project. Rather than storing many copies of the script with slightly different names, I rely upon the version control system to keep track of those versions. If I need to reproduce exactly an experiment that I performed three months ago, I can use the version control software to check out a copy of the state of my project at that time. Note that most version control software can also assign a logical "tag" to a particular state of the repository, allowing you to easily retrieve that state later.

Third, and perhaps most significantly, version control is invaluable for collaborative projects. The repository allows collaborators to work simultaneously on a collection of files, including scripts, documentation, or a draft manuscript. If two individuals edit the same file in parallel, then the version control software will automatically merge the two versions and flag lines that were edited by both people. It is not uncommon, in the hours before a looming deadline, for me to talk by phone with a remote collaborator while we both edit the same document, checking in changes every few minutes.

Although the basic idea of version control software seems straightforward, using a system such as Subversion effectively requires some discipline. First, version control software is most useful when it is used regularly. A good rule of thumb is that changes should be checked in at least once a day. This ensures that your historical record is complete and that a recent backup is always available if you mistakenly overwrite a file. If you are in the midst of editing code and you have caused a once-compilable program to no longer work, you can check in your changes on a "branch" of the project, effectively stating that this is a work in progress. Once the new functionality is implemented, the branch can be merged back into the "trunk" of the project. Only then will your changes be propagated to other members of the project team.

Second, version control should only be used for files that you edit by hand. Automatically generated files, whether they are compiled programs or the results of a computational experiment, do not belong under version control. These files tend to be large, so checking them into the project wastes disk space, both because they will be duplicated in the repository and in every working copy of the project, and also because these files will tend to change as you redo your experiment multiple times. Binary files are particularly wasteful: because version control software operates on a line-by-line basis, the version history of a binary file is simply a complete copy of all versions of that file. There are exceptions to this rule, such as relatively small data files that will not change through the experiment, but these exceptions are rare. One practical difficulty with not checking in automatically generated files is that each time you issue an update command, the version control software is likely to complain about all of these files in your working directory that have not been checked in. To avoid scrolling through multiple screens of filenames at each update, Subversion and CVS provide functionality to tell the system to ignore certain files or types of files.

Conclusion

Many of the ideas outlined above have been described previously, either in the context of computational biology or in general scientific computation. In particular, much has been written about the need to adopt sound software engineering principles and practices in the context of scientific software development. For example, Baxter et al. [4] propose a set of five "best practices" for scientific software projects, and Wilson [5] describes a variety of standard software engineering tools that can be used to make a computational scientist's life easier. Although many of the practical issues described above apply generally to any type of scientific computational research, working with biologists and biological data does present some issues of its own. For example, many biological data sets are stored in central data repositories. Basic record keeping (recording in the lab notebook the URL as well as the version number and download date for a given data set) may be sufficient to track simpler data sets, but for very large or dynamic data it may be necessary to use a more sophisticated approach; for example, Boyle et al. [6] discuss how best to manage complex data repositories in the context of a scientific research program. In addition, the need to make results accessible to and understandable by wet lab biologists may have practical implications for how a project is managed. For example, to make the results more understandable, significant effort may need to go into the prose descriptions of experiments in the lab notebook, rather than simply including a figure or table with a few lines of text summarizing the major conclusion. More practically, differences in operating systems and software may cause logistical difficulties; for example, computer scientists may prefer to write their documents in the LaTeX typesetting language, whereas biologists may prefer Microsoft Word. As I mentioned in the Introduction, I intend this article to be more descriptive than prescriptive. Although I hope that some of the practices I describe above will prove useful for many readers, the most important take-home message is that the logistics of efficiently performing accurate, reproducible computational experiments is a subject worthy of consideration and discussion. Many relevant topics have not been covered here, including good coding practices, methods for automation of experiments, the logistics of writing a manuscript based on your experimental results, etc. I therefore encourage interested readers to post comments, suggestions, and critiques via the PLoS Computational Biology Web site.

                Author and article information

                Journal: PLoS Computational Biology (PLoS Comput Biol)
                Publisher: Public Library of Science (San Francisco, CA, USA)
                ISSN: 1553-734X, 1553-7358
                Published: 27 August 2020
                PLoS Comput Biol 16(8): e1008119
                Affiliations
                [1 ] Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan, United States of America
                [2 ] Division of Medicinal Chemistry, Department of Pharmaceutical Sciences, University of Connecticut, Storrs, Connecticut, United States of America
                [3 ] Department of Internal Medicine, University of Michigan, Ann Arbor, Michigan, United States of America
                [4 ] Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
                Carnegie Mellon University, UNITED STATES
                Author notes

                The authors have declared that no competing interests exist.

                [¤a] Current address: Alliance SciComm & Consulting, Linden, Michigan, United States of America
                [¤b] Current address: Benaroya Research Institute, Seattle, Washington, United States of America
                [¤c] Current address: Exploratory Science Center, Merck & Co., Inc., Cambridge, Massachusetts, United States of America
                [¤d] Current address: Roche Diagnostics, Clinical Operations Services and eSystems, Indianapolis, Indiana, United States of America
                [¤e] Current address: Department of Biomedical Engineering, University of Virginia, Charlottesville, Virginia, United States of America
                [¤f] Current address: Genomics and Pharmacogenomics, Merck & Co., Inc., Cambridge, Massachusetts, United States of America

                Author information
                http://orcid.org/0000-0001-8481-1457
                http://orcid.org/0000-0001-9359-5194
                http://orcid.org/0000-0003-2374-4048
                http://orcid.org/0000-0001-5829-6754
                http://orcid.org/0000-0001-5284-5521
                http://orcid.org/0000-0003-4782-1802
                http://orcid.org/0000-0002-8248-1631
                http://orcid.org/0000-0003-1884-3543
                http://orcid.org/0000-0003-2322-4085
                http://orcid.org/0000-0003-3488-4169
                http://orcid.org/0000-0003-3283-829X
                http://orcid.org/0000-0002-3532-9653
                http://orcid.org/0000-0003-1638-5307
                http://orcid.org/0000-0003-3140-537X
                http://orcid.org/0000-0002-6935-4275
                Article
                PCOMPBIOL-D-20-00711
                10.1371/journal.pcbi.1008119
                7451508
                32853198
                © 2020 Hagan et al

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                Page count
                Figures: 0, Tables: 1, Pages: 7
                Funding
                This work was supported, in part, by a grant from the US National Institutes of Health to PDS (R25GM116149). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Education; Science Policy; Science and Technology Workforce; Careers in Research; Scientists; Biologists; People and Places; Population Groupings; Professions; Social Sciences; Sociology; Workshops; Linguistics; Grammar; Syntax; Biology and Life Sciences; Neuroscience; Cognitive Science; Cognitive Psychology; Learning; Psychology; Learning and Memory; Computer and Information Sciences; Engineering and Technology; Software Engineering; Programming Languages; Molecular Biology; Molecular Biology Techniques; Sequencing Techniques; RNA Sequencing; Research and Analysis Methods; Research Assessment; Reproducibility; Quantitative & Systems Biology
