Blog
About

4
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Ten simple rules for documenting scientific software

      *

      PLoS Computational Biology

      Public Library of Science

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Introduction Science, and especially biology, is increasingly relying on software tools to enable research. However, if you are a biologist, you likely received no training in software development best practices. Because of this lack of training, scientific software often has minimal or even nonexistent documentation, making the lives of researchers significantly harder than they need to be, with precious research time being spent figuring out how to use poorly documented software rather than performing the actual science. As the field matures, software documentation will become even more important as software stops being maintained and original authors are unable to be reached for support. Prior work has focused on various aspects of open software development [1–7], but documenting software has been underemphasized. I present these 10 simple rules in the hope that, by applying software engineering best practices to research tool documentation, you can create software that is maximally usable and impactful. Rule 1: Write comments as you code Comments are the single most important aspect of software documentation. At the end of the day, people (yourself included) need to be able to read and understand your source code. Good variable and function names can help immensely with readability, although they are no complete replacement for comments. Although it may be perfectly obvious to you what your code does without comments, other readers will likely not be so fortunate. Indeed, you yourself may not even be able to understand your own code after you've moved on to another project. Think of comments as your lab notebook: they help you remember your train of thought long after the fact. The best way to write comments is to do it as you code. That way you never have the problem of forgetting what your thought process was, and you never forget to go back and write the comments that you promised yourself you'd do (we're all guilty of this). Modern integrated development environments (IDEs) will often automatically generate documentation strings as you write code, which removes the burden of having to remember to write comments. One common argument against thorough code commenting is that it slows you down. In fact, good commenting can help you write code faster because you have a better understanding of your software. This understanding is especially useful when you run into bugs because you can compare what your code is doing to what your comments say it should be doing. Don’t forget that, at the end of the day, your code has the final word on what your software will do, not your comments. Proper code commenting is as much an art as it is a science. If you write too few comments, people won't be able to figure out what your code is doing. Write too many and readers will get lost in the sea of comments [4]. As a guiding principle, aim to write code that readers can understand purely by reading your comments [7]. If you remember one thing from this section, when in doubt, err on the side of more comments. To get a feel for the right amount of commenting for code, let’s examine some examples. Bad (no comments): for sequence in parsed_sequences:         analyze(sequence) Bad (too much commenting): # iterate over the genes in the genome for sequence in parsed_sequences:         # call the analyze function, passing it each gene as its argument         analyze(sequence) Good (just enough): # analyze the genome for sequence in parsed_sequences:         analyze(sequence) The key takeaway here is to keep your comments in the Goldilocks zone—not too many and not too few. Rule 2: Include examples (and lots of them) When it comes to software documentation, showing takes precedence over telling. There are several important reasons to include examples in your documentation beyond simple instruction. Examples provide a starting point for experimentation. By starting from a piece of code that works, your users can attempt to change it for their own uses with minimal difficulty. Unlike with comments, there isn’t such a thing as too many examples if they all show off different aspects of your software. If you find that your main documentation is getting too laden with examples, feel free to move them to a special section or directory so long as you keep your examples organized and easily discoverable. Keras, a machine learning framework, has 35 full example scripts as of the time of this writing (github.com/keras-team/keras/tree/master/examples) with a README (see Rule 4 for more) explaining what each example demonstrates. Although you are by no means under any obligation to provide that many examples, do take the time to at least write examples showing off the main functionality of your software [2]. You can even make your examples do double duty as unit tests (or vice versa), thereby verifying functionality while providing instruction. Rule 3: Include a quickstart guide Going from idea to experimentation to results as quickly as possible enables the progress of science. If people must spend a long time figuring out how to use your software, they're likely to give up. Conversely, if people can immediately start playing with your tool, they're vastly more likely to use it as a part of their research. It is therefore crucial to include a quickstart guide aimed at helping people begin using your software as quickly as possible. This can take the form of an example (see Rule 2), a tutorial, a video, or anything else you can imagine. For example, let’s look at the TPOT machine learning tool’s quickstart guide [8]: it has an animated graphic image file (GIF) showing the software’s functionality, diagrams explaining how it works, and a minimal code stub, perfect for copy-pasting into your own project. To tell whether your quickstart guide is working as intended, show it to someone who hasn't used your software and see if they can figure out how to start using it. Consider your quickstart guide to be a dating profile for your project: it should show off its strengths, give people a feel for it, and entice people into choosing it. Rule 4: Include a README file with basic information Your README file acts like a homepage for your project. On code-sharing sites like GitHub, Bitbucket, and GitLab, your README file is shown on your project's main page. README files should be easily readable from the raw source, so human-readable markup languages such as Markdown or reStructuredText (or plain text) are preferable to less readable formats like hypertext markup language (HTML). In fact, code-sharing sites will usually render your markup language on your repository's page, giving you the best of both worlds. Take advantage of this—free hosting is hard to come by and the fact that your hosted README page is on your repository makes the arrangement even sweeter. A good rule of thumb is to assume that the information contained within the README will be the only documentation your users read. For this reason, your README should include how to install and configure your software, where to find its full documentation, under what license it’s released, how to test it to ensure functionality, and acknowledgments. Furthermore, you should include your quickstart guide (as introduced in Rule 3) in your README. Often, the top of your README files will include badges that, when rendered, show the status of the software. One common source of badges is shields.io, which can dynamically generate badges for your project. Common badges include ones that show whether automated tests are passing (such those from travis-ci.org), what percentage of the code that the tests cover, whether the documentation is up to date, and more. Although not necessary, these badges instill confidence in the quality of your project and convey important information at a glance and are therefore highly recommended. Rule 5: Include a help command for command line interfaces Many scientific software tools have command line interfaces (CLIs). Not having a graphical interface saves development time and makes the software more flexible. However, one challenge that CLI software has is that it can be hard to figure out how to use. The best way to document CLIs is to have a “help” command that will print out how to use the software. That way, users don't need to try to find your documentation to get basic tasks done. It should include usage (how to use the command), subcommands (if applicable), options and/or arguments, environment variables (if applicable), and maybe even some examples (Rule 2 strikes again!). A help command can be tedious to make and difficult to maintain, but luckily there are numerous software packages that can do it for you. In Python, software such as click (click.pocoo.org) can not only make your help command but can also even help you make your interface, saving you time and effort. An example of a good CLI is the one included in Magic-BLAST. It has a short help command, “-h,” which provides basic information on what the tool is and how to use it. It also includes instructions on how to access the full help documentation, which include a list of every option as well a description of the option’s arguments and what it does. An arrangement like this is particularly good because it requires minimal effort to find just the most useful information via the short help page, thereby reducing information overload and reducing the cognitive load of using the software by providing a reminder of how to access the full CLI reference. Rule 6: Version control your documentation A previous Ten Simple Rules article has described the virtues of using Git for your code [1]. Because your documentation is such an integral part of your code, it must be version controlled as well. To start, you should keep your documentation inside your Git repository along with the rest of your files. This makes it possible to view your documentation at any point in the project's history. Services such as Read the Docs (readthedocs.org) and Zenodo (zenodo.org) make doing this even easier because they will archive a complete rendered version of your documentation every time you make a new release of your software. To illustrate why this is such an important rule, consider what would happen if you change a default setting in a new release of your software. When users of previous versions go to look at your documentation, they will see the documentation that is incompatible with the version that they have installed. Worse still, because you changed a default, the software could fail inexplicably. This can be incredibly aggravating to users (and even dangerous if the software is for mission-critical applications), so it is extra important to use version control for your documentation. A changelog in your documentation can make this task much easier. If you are using informative commit messages, creating a changelog is a straightforward task at worst and a trivial task at best. As an example of a bioinformatics library that is doing a particularly good job at version controlling their documentation, look at khmer, which has a thorough changelog containing new features, fixed bugs (separated by whether they are relevant to users or developers), known issues, and a list of the contributors to the release [9]. In addition, previous versions of the documentation website are easily accessible and labeled clearly. By providing this information, the authors have ensured that users of any version of the software can get the right version of the documentation, see what’s going on in the project, and make sure they’re aware of any issues with their version. If you take one thing away from this rule, make it very clear which version of your software your documentation is for and preserve previous versions of your documentation—your users will thank you. Rule 7: Fully document your application programming interface Your application programming interface (API) is how people who are using your software interact with your code. It is imperative that it be fully documented in the source code. In all honesty, probably nobody will read your entire API documentation, and that's perfectly fine. The goal of API documentation is to prevent users from having to dig into your (well commented, right?) source code to use your API. At the very least, each function should have its inputs and input types noted, its output and output type noted, and any errors it can raise documented. Objects should have their methods and attributes described. It’s best to use a consistent style for your API documentation. The Google style guide (google.github.io/styleguide) has API documentation suggestions for numerous languages such as Python, Java, R, C++, and Shell. You spent a lot of time developing your API; don’t let that time go to waste by not telling your users how to use it. Rule 8: Use automated documentation tools The best type of documentation is documentation that writes itself. Although no software package can do all your documentation for you (yet), there are tools that can do much of the heavy lifting, such as making a website, keeping it in sync with your code, and rendering it to a portable document file (PDF). Software such as Sphinx (sphinx-doc.org), perldoc, Javadoc, and Roxygen (https://github.com/klutometis/roxygen) for R can generate documentation and even read your comments and use those to generate detailed API documentation. Although Sphinx was developed to host Python’s documentation, it is language agnostic, meaning that it can work for whatever language your project is in. Similarly, Doxygen (doxygen.nl) and MkDocs (mkdocs.org) are language-agnostic documentation tools. Read the Docs, introduced in Rule 6, is a language-agnostic documentation hosting platform that can rebuild your documentation every time that you push to your repository, ensuring that your documentation is always up to date. There are many other ways automation can make your documentation smarter: in Python, software like doctest (sphinx-doc.org/en/stable/ext/doctest.html) can automatically pull examples from your documentation and ensure that your code does what you say it should be doing in the documentation. To help you follow Rule 7, there are tools such as Napoleon (github.com/sphinx-contrib/napoleon) that can generate your API documentation for you. It’s even possible to automatically generate interactive representational state transfer (REST) API documentation using free tools such as Swagger (swagger.io). At this point, there is almost no reason not to be using automated documentation tools. Rule 9: Write error messages that provide solutions or point to your documentation Error messages are part of life when developing software. As a developer, you should be doing your best to make your error messages as informative as possible. Good error messages should have three parts: they should state what the error is, what the state of the software was when it generated the error, and either how to fix it or where to find information relevant to fixing it. In the spirit of Rule 2, let's look at an example. Bad: Error: Translation failed. Good: Error: Translation failed because of an invalid codon ("IQT") in position 1 in sequence 41. Ensure that this is a valid DNA sequence and not a protein sequence. By showing what exactly went wrong and proposing a fix for it, your users will spend less time debugging and more time doing science. Since you know your software better than anyone else, providing guidance in error messages can be invaluable. If for no other reason, do it to save yourself the hassle of being tech support for your users (most of whom have barely read your documentation, if at all) when they run into easily fixable usage mistakes. Furthermore, it is important to say what the state of the software was when the error was generated, especially if it takes a long time to run or you don’t save logs by default. If your software fails, seemingly at random, after 12 hours of execution, your users will be thankful to know what was going on when the error was thrown rather than having to wait another 12 hours to reproduce the error with logging enabled. Rule 10: Tell people how to cite your software Of all the rules in this guide, odds are that this is the one you need the least. However, it must be said that, if you publish scientific software, you need to include the information required to properly provide attribution to your work. I recommend providing the digital object identifier (DOI), a BibTeX entry, and a written reference for your publication in your README, as well as using a “CITATION” file in citation file format (CFF) format, which is a human- and machine-readable file format designed for specifying citation information for scientific software [10]. Including citation information in your documentation is especially important for software that has not been published in a traditional academic journal, which would assign it a DOI. Just because your software is unpublished doesn't mean that you can't get a DOI for it—you deserve credit for your work. If you're using Zenodo to archive your releases (see Rule 6), it will mint a new DOI for each release as well as a DOI for the entire project. Another great, free way to get a DOI for your project is to submit it to the Journal of Open Source Software (joss.theoj.org), a peer-reviewed open-access academic journal designed for software developers. Both even provide a badge for your README (see Rule 4) so that the entire world can tell how to cite your software at a glance. Conclusion I hope that this guide will help you improve the quality of your software documentation. Documenting software is not always as exciting as doing original software development, but it is nonetheless as important. Software documentation is in many ways like writing a paper in that it is a required step in the dissemination of your ideas. It is a critical step for ensuring reproducibility, not to mention the fact that many bioinformatics journals now require that software submitted be well documented. Automated documentation tools such as Sphinx can diminish some of the effort required for good documentation, perhaps even making the work enjoyable. Finally, because documentation can make or break a project’s adoption in the real world, by following these 10 simple rules you can give your project its best chance of wide adoption and possibly even end up as an example of good documentation in a Ten Simple Rules article!

          Related collections

          Most cited references 8

          • Record: found
          • Abstract: found
          • Article: found
          Is Open Access

          The khmer software package: enabling efficient nucleotide sequence analysis

          The khmer package is a freely available software library for working efficiently with fixed length DNA words, or k-mers. khmer provides implementations of a probabilistic k-mer counting data structure, a compressible De Bruijn graph representation, De Bruijn graph partitioning, and digital normalization. khmer is implemented in C++ and Python, and is freely available under the BSD license at  https://github.com/dib-lab/khmer/.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: found
            Is Open Access

            Ten Simple Rules for Taking Advantage of Git and GitHub

            Introduction Bioinformatics is a broad discipline in which one common denominator is the need to produce and/or use software that can be applied to biological data in different contexts. To enable and ensure the replicability and traceability of scientific claims, it is essential that the scientific publication, the corresponding datasets, and the data analysis are made publicly available [1,2]. All software used for the analysis should be either carefully documented (e.g., for commercial software) or, better yet, openly shared and directly accessible to others [3,4]. The rise of openly available software and source code alongside concomitant collaborative development is facilitated by the existence of several code repository services such as SourceForge, Bitbucket, GitLab, and GitHub, among others. These resources are also essential for collaborative software projects because they enable the organization and sharing of programming tasks between different remote contributors. Here, we introduce the main features of GitHub, a popular web-based platform that offers a free and integrated environment for hosting the source code, documentation, and project-related web content for open-source projects. GitHub also offers paid plans for private repositories (see Box 1) for individuals and businesses as well as free plans including private repositories for research and educational use. Box 1 By default, GitHub repositories are freely visible to all. Many projects decide to share their work publicly and openly from the start of the project in order to attract visibility and to benefit from contributions from the community early on. Some other groups prefer to work privately on projects until they are ready to share their work. Private repositories ensure that work is hidden but also limit collaborations to just those users who are given access to the repository. These repositories can then be made public at a later stage, such as, for example, upon submission, acceptance, or publication of corresponding journal articles. In some cases, when the collaboration was exclusively meant to be private, some repositories might never be made publicly accessible. GitHub relies, at its core, on the well-known and open-source version control system Git, originally designed by Linus Torvalds for the development of the Linux kernel and now developed and maintained by the Git community. One reason for GitHub’s success is that it offers more than a simple source code hosting service [5,6]. It provides developers and researchers with a dynamic and collaborative environment, often referred to as a social coding platform, that supports peer review, commenting, and discussion [7]. A diverse range of efforts, ranging from individual to large bioinformatics projects, laboratory repositories, as well as global collaborations, have found GitHub to be a productive place to share code and ideas and to collaborate (see Table 1). 10.1371/journal.pcbi.1004947.t001 Table 1 Bioinformatics repository examples with good practices of using GitHub. The table contains the name of the repository, the type of example (issue tracking, branch structure, unit tests), and the URL of the example. All URLs are prefixed with https://github.com/. Name of the Repository Type URL Adam Community Project, Multiple forks https://github.com/bigdatagenomics/adam BioPython [18] Community Project, Multiple contributors https://github.com/biopython/biopython/graphs/contributors Computational Proteomics Unit Lab Repository https://github.com/ComputationalProteomicsUnit Galaxy Project [19] Community Project, Bioinformatics Repository https://github.com/galaxyproject/galaxy GitHub Paper Manuscript, Issue discussion, Community Project https://github.com/ypriverol/github-paper MSnbase [20] Individual project repository https://github.com/lgatto/MSnbase/ OpenMS [21] Bioinformatics Repository, Issue discussion, branches https://github.com/OpenMS/OpenMS/issues/1095 PRIDE Inspector Toolsuite [22] Project Organization, Multiple projects https://github.com/PRIDE-Toolsuite Retinal wave data repository [23] Individual project, Manuscript, Binary Data organized https://github.com/sje30/waverepo SAMtools [24] Bioinformatics Repository, Project Organization https://github.com/samtools rOpenSci Community Project, Issue discussion https://github.com/ropensci The Global Alliance For Genomics and Health Community Project https://github.com/ga4gh Some of the recommendations outlined below are broadly applicable to repository hosting services. However, our main aim is to highlight specific GitHub features. We provide a set of recommendations that we believe will help the reader to take full advantage of GitHub’s features for managing and promoting projects in bioinformatics as well as in many other research domains. The recommendations are ordered to reflect a typical development process: learning Git and GitHub basics, collaboration, use of branches and pull requests, labeling and tagging of code snapshots, tracking project bugs and enhancements using issues, and dissemination of the final results. Rule 1: Use GitHub to Track Your Projects The backbone of GitHub is the distributed version control system Git. Every change, from fixing a typo to a complete redesign of the software, is tracked and uniquely identified. Although Git has a complex set of commands and can be used for rather complex operations, learning to apply the basics requires only a handful of new concepts and commands and will provide a solid ground to efficiently track code and related content for research projects. Many introductory and detailed tutorials are available (see Table 2 below for a few examples). In particular, we recommend A Quick Introduction to Version Control with Git and GitHub by Blischak et al. [5]. 10.1371/journal.pcbi.1004947.t002 Table 2 Online courses, tutorials, and workshops about GitHub and Git for scientists. Name of the Material URL Git help and Git help -a Document, installed with Git Karl Broman’s Git/Github Guide http://kbroman.org/github_tutorial/ Version Control with GitVersion Control with Git http://swcarpentry.github.io/git-novice/ Introduction to Git http://git-scm.com/book/ch1-3.html Github Training https://training.github.com/ Github Guides https://guides.github.com/ Good Resources for Learning Git and GitHub https://help.github.com/articles/good-resources-for-learning-git-and-github/ Software Carpentry: Version Control with Git http://swcarpentry.github.io/git-novice/ In a nutshell, initializing a (local) repository (often abbreviated as repo) marks a directory as one to be tracked (Fig 1). All or parts of its content can be added explicitly to the list of files to track. 10.1371/journal.pcbi.1004947.g001 Fig 1 The structure of a GitHub-based project illustrating project structure and interactions with the community. cd project ## move into directory to be tracked git init ## initialize local repository ## add individual files such as project description, reports, source code git add README project.md code.R git commit -m "initial commit" ## saves the current local snapshot Subsequently, every change to the tracked files, once committed, will be recorded as a new revision, or snapshot, uniquely identifying the changes in all the modified files. Git is remarkably effective and efficient in archiving the complete history of a project by, among other things, storing only the differences between files. In addition to local copies of the repository, it is straightforward to create remote repositories on GitHub (called origin, with default branch master—see below) using the web interface, and then synchronize local and remote repositories. git push origin master ## push local changes to the remote repository git pull origin master ## pull remote changes into the local repository Following Tony Rossini’s advice in 2005 to “commit early, commit often, and commit in a repository from which we can easily roll-back your mistakes,” one can organize one’s work in small incremental changes. At any time, it is possible to go back to a previous version. In larger projects, multiple users are able to work on the same remote repository, with all contributions being recorded, restorable, and attributed to the author. Users usually track source code, text files, images, and small data files inside their repositories and generally do not track derived files such as build logs or compiled binaries (read Box 2 to see how to handle large binary files in GitHub). And, although the majority of GitHub repositories are used for software development, users can also keep text documents such as analysis reports and manuscripts (see, for example, the repository for this manuscript at https://github.com/ypriverol/github-paper). Box 2 Using GitHub or any similar versioning/tracking system is not a replacement for good project management; it is an extension, an improvement for good project and file managing (see for example [9]). One practical consideration when using GitHub, for example, is dealing with large binary files. Binary files such as images, videos, executable files, or many raw data used in bioinformatics, are stored as a single large entity in Git. As a result, every change, even if minimal, leads to a complete new copy of the file in the repository, producing large size increments and the inability to search (see https://help.github.com/articles/searching-code/) and compare file content across revisions. Git offers a Large File Storage (LFS) module that replaces such large files with pointers while the large binary file can be stored remotely, which results in smaller and faster repositories. Git LFS is also supported by GitHub, albeit with a space quota or for a fee, to retain your usual GitHub workflow (https://help.github.com/categories/managing-large-files/) (S1 File, Section 1). Due to its distributed design, each up-to-date local Git repository is an entire exact historical copy of everything that was committed—file changes, commit message logs, etc. These copies act as independent backups as well, present on each user’s storage device. Git can be considered to be fault-tolerant because of this, which is a win over centralized version control systems. If the remote GitHub server is unavailable, collaboration and work can continue between users, as opposed to centralized alternatives. The web interface offered by GitHub provides friendly tools to perform many basic operations and a gentle introduction to a more rich and complex set of functionalities. Various graphical user-interface-driven clients for managing Git and GitHub repositories are also available (https://www.git-scm.com/downloads/guis). Many editors and development environments such as, for example, the popular RStudio editor for the R programming language [8], directly integrate with code versioning using Git and GitHub. In addition, for remote Git repositories, GitHub provides its own features that will be described in subsequent rules (Fig 1). Rule 2: GitHub for Single Users, Teams, and Organizations Public projects on GitHub are visible to everyone, but write permission, i.e., the ability to directly modify the content of a repository, needs to be granted explicitly. As a repository owner, you can grant this right to other GitHub users. In addition to being owned by users, repositories can also be created and managed as part of teams and organizations. Project managers can structure projects to manage permissions at different levels: users, teams, and organizations. Users are the central element of GitHub as in any other social network. Every user has a profile listing their GitHub projects and activities, which can optionally be populated with personal information including name, email address, image, and webpage. To stay up to date with the activity of other users, one can follow their accounts (see also Rule 10). Collaboration can be achieved by simply adding a trusted Collaborator, thereby granting write access. However, development in large projects is usually done by teams of people within a larger organization. GitHub organizations are a great way to manage team-based access permissions for the individual projects of institutes, research labs, and large open-source projects that need multiple owners and administrators (Fig 1). We recommend that you, as an individual researcher, make your profile visible to other users and display all of the projects and organizations you are working in. Rule 3: Developing and Collaborating on New Features: Branching and Forking Anyone with a GitHub account can fork any repository they have access to. This will create a complete copy of the content of the repository, while retaining a link to the original “upstream” version. One can then start working on the same code base in one’s own fork (https://help.github.com/articles/fork-a-repo/) under their username (see, for example, https://github.com/ypriverol/github-paper/network/members for this work) or organization (see Rule 2). Forking a repository allows users to freely experiment with changes without affecting the original project and forms the basis of social coding. It allows anyone to develop and test novel features with existing code and offers the possibility of contributing novel features, bug fixes, and improvements to documentation back into the original upstream project (requested by opening an pull request) repository and becoming a contributor. Forking a repository and providing pull requests constitutes a simple method for collaboration inside loosely defined teams and over more formal organizational boundaries, with the original repository owner(s) retaining control over which external contributions are accepted. Once a pull request is opened for review and discussion, it usually results in additional insights and increased code quality [7]. Many contributors can work on the same repository at the same time without running into edit conflicts. There are multiple strategies for this, and the most common way is to use Git branches to separate different lines of development. Active development is often performed on a development branch and stable versions, i.e., those used for a software release, are kept in a master or release branch (see for example https://github.com/OpenMS/OpenMS/branches). In practice, developers often work concurrently on one or several features or improvements. To keep commits of the different features logically separated, distinct branches are typically used. Later, when development is complete and verified to work (i.e., none of the tests fail, see Rule 5), new features can be merged back into the development line or master branch. In addition, one can always pull the currently up-to-date master branch into a feature branch to adapt the feature to the changes in the master branch. When developing different features in parallel, there is a risk of applying incompatible changes in different branches/forks; these are said to become out of sync. Branches are just short-term departures from master. If you pull frequently, you will keep your copy of the repository up to date and you will have the opportunity to merge your changed code with others’ contributors, ideally without requiring you to manually address conflicts to bring the branches in sync again. Rule 4: Naming Branches and Commits: Tags and Semantic Versions Tags can be used to label versions during the development process. Version numbering should follow “semantic versioning” practice, with the format X.Y.Z., with X being the major, Y the minor, and Z the patch version of the release, including possible meta information, as described in http://semver.org/. This semantic versioning scheme provides users with coherent version numbers that document the extent (bug fixes or new functionality) and backwards compatibility of new releases. Correct labeling allows developers and users to easily recover older versions, compare them, or simply use them to reproduce results described in publications (see Rule 8). This approach also help to define a coherent software publication strategy. Rule 5: Let GitHub Do Some Tasks for You: Integrate The first rule of software development is that the code needs to be ready to use as soon as possible [10], to remain so during development, and that it should be well-documented and tested. In 2005, Martin Fowler defined the basic principles for continuous integration in software development [11]. These principles have become the main reference for best practices in continuous integration, providing the framework needed to deploy software and, in some way, also data. In addition to mere error-free execution, dedicated code testing is aimed at detecting possible bugs introduced by new features or changes in the code or dependencies, as well as detecting wrong results, often known as logic errors, in which the source code produces a different result than what was intended. Continuous integration provides a way to automatically and systematically run a series of tests to check integrity and performance of code, a task that can be automated through GitHub. GitHub offers a set of hooks (automatically executed scripts) that are run after each push to a repository, making it easier to follow the basic principles of continuous integration. The GitHub web hooks allow third-party platforms to access and interact with a GitHub repository and thus to automate post-processing tasks. Continuous integration can be achieved by Travis CI, a hosted continued integration platform that is free for all open-source projects. Travis CI builds and tests the source code using a plethora of options such as different platforms and interpreter versions (S1 File, Section 2). In addition, it offers notifications that allow your team and contributors to know if the new changes work and to prevent the introduction of errors in the code (for instance, when merging pull requests), making the repository always ready to use. Rule 6: Let GitHub Do More Tasks for You: Automate More than just code compilation and testing can be integrated into your software project: GitHub hooks can be used to automate numerous tasks to help improve the overall quality of your project. An important complement to successful test completion is to demonstrate that the tests sufficiently cover the existing code base. For this, the integration of Codecov is recommended. This service will report how much of the code base and which lines of code are being executed as part of your code tests. The Bioconductor project, for example, highly recommends that packages implement unit testing (S1 File, Section 2) to support developers in their package development and maintenance (http://bioconductor.org/developers/unitTesting-guidelines/) and systematically tests the coverage of all of its packages (https://codecov.io/github/Bioconductor-mirror/). One might also consider generating the documentation upon code/documentation modification (S1 File, Section 3). This implies that your projects provide comprehensive documentation so others can understand and contribute back to them. For Python or C/C++ code, automatic documentation generation can be done using sphinx and subsequently integrated into GitHub using “Read the Docs.” All of these platforms will create reports and badges (sometimes called shields) that can be included on your GitHub project page, helping to demonstrate that the content is of high quality and well-maintained. Rule 7: Use GitHub to Openly and Collaboratively Discuss, Address, and Close Issues GitHub issues are a great way to keep track of bugs, tasks, feature requests, and enhancements. While classical issue trackers are primarily intended to be used as bug trackers, in contrast, GitHub issue trackers follow a different philosophy: each tracker has its own section in every repository and can be used to trace bugs, new ideas, and enhancements by using a powerful tagging system. The main objective of issues in GitHub is promoting collaboration and providing context by using cross-references. Raising an issue does not require lengthy forms to be completed. It only requires a title and, preferably, at least a short description. Issues have very clear formatting and provide space for optional comments, which allow anyone with a Github account to provide feedback. For example, if the developer needs more information to be able to reproduce a bug, he or she can simply request it in a comment. Additional elements of issues are (i) color-coded labels that help to categorize and filter issues, (ii) milestones, and (iii) one assignee responsible for working on the issue. They help developers to filter and prioritize tasks and turn an issue tracker into a planning tool for their project. It is also possible for repository administrators to create issue and pull request templates (https://help.github.com/articles/helping-people-contribute-to-your-project/) (see Rule 3) to customize and standardize the information to be included when contributors open issues. GitHub issues are thus dynamic, and they pose a low entry barrier for users to report bugs and request features. A well-organized and tagged issue tracker helps new contributors and users to understand a project more deeply. As an example, one issue in the OpenMS repository (https://github.com/OpenMS/OpenMS/issues/1095) allowed the interaction of eight developers and attracted more than one hundred comments. Contributors can add figures, comments, and references to other issues and pull requests in the repository, as well as direct references to code. As another illustration of issues and their generic and wide application, we (https://github.com/ypriverol/github-paper/issues) and others (https://github.com/ropensci/RNeXML/issues/121) used GitHub issues to discuss and comment on changes in manuscripts and address reviewers’ comments. Rule 8: Make Your Code Easily Citable, and Cite Source Code! It is a good research practice to ensure permanent and unambiguous identifiers for citable items like articles, datasets, or biological entities such as proteins, genes, and metabolites (see also Box 3). Digital Object Identifiers (DOIs) have been used for many years as unique and unambiguous identifiers for enabling the citation of scientific publications. More recently, a trend has started to mint DOIs for other types of scientific products such as datasets [12] and training materials (for example [13]). A key motivation for this is to build a framework for giving scientists broader credit for their work [14,15] while simultaneously supporting clearer, more persistent ways to cite and track it. Helping to drive this change are funding agencies such as the National Institutes of Health (NIH) and National Science Foundation (NSF) in the United States and Research Councils in the United Kingdom, which are increasingly recognizing the importance of research products such as publicly available datasets and software. Box 3 Every repository should ideally have the following three files. The first and arguably most important file in a repository is a LICENCE file (see also Rule 8) that clearly defines the permissions and restrictions attached to the code and other files in your repository. The second important file is a README file, which provides, for example, a short description of the project, a quick start guide, information on how to contribute, a TODO list, and links to additional documentation. Such README files are typically written in markdown, a simple markup language that is automatically rendered on GitHub. Finally, a CITATION file to the repository informs your users how to cite and credit your project. A common issue with software is that it normally evolves at a different speed than text published in the scientific literature. In fact, it is common to find software having novel features and functionality that were not described in the original publication. GitHub now integrates with archiving services such as Zenodo and Figshare, enabling DOIs to be assigned to code repositories. The procedure is relatively straightforward (see https://guides.github.com/activities/citable-code/), requiring only the provision of metadata and a series of administrative steps. By default, Zenodo creates an archive of a repository each time a new release is created in GitHub, ensuring the cited code remains up to date. Once the DOI has been assigned, it can be added to literature information resources such as Europe PubMed Central [16]. As already mentioned in the introduction, reproducibility of scientific claims should be enabled by providing the software, the datasets, and the process leading to interpretable results that were used in a particular study. As much as possible, publications should highlight that the code is freely available in, for example, GitHub, together with any other relevant outputs that may have been deposited. In our experience, this openness substantially increases the chances of getting the paper accepted for publication. Journal editors and reviewers receive the opportunity to reproduce findings during the manuscript review process, increasing confidence in the reported results. In addition, once the paper is published, your work can be reproduced by other members of the scientific community, which can increase citations and foster opportunities for further discussion and collaboration. The availability of a public repository containing the source code does not make the software open-source per se. You should use an Open Source Initiative (OSI)-approved license that defines how the software can be freely used, modified, and shared. Common licenses such as those listed on http://choosealicense.com are preferred. Note that the LICENSE file in the repository should be a plain-text file containing the contents of an OSI-approved license, not just a reference to the license. Rule 9: Promote and Discuss Your Projects: Web Page and More The traditional way to promote scientific software is by publishing an associated paper in the peer-reviewed scientific literature, though, as pointed out by Buckheir and Donoho, this is just advertising [17]. Additional steps can boost the visibility of an organization. For example, GitHub Pages are simple websites freely hosted by GitHub. Users can create and host blog websites, help pages, manuals, tutorials, and websites related to specific projects. Pages comes with a powerful static site generator called Jekyll that can be integrated with other frameworks such as Bootstrap or platforms such as Disqus to support and moderate comments. In addition, several real-time communication platforms have been integrated with GitHub such as Gitter and Slack. Real-time communication systems allow the user community, developers, and project collaborators to exchange ideas and issues and to report bugs or get support. For example, Gitter is a GitHub-based chat tool that enables developers and users to share aspects of their work. Gitter inherits the network of social groups operating around GitHub repositories, organizations, and issues. It relies on identities within GitHub creating Internet Relay Chat (IRC)-like chat rooms for public and private projects. Within a Gitter chat, members can reference issues, comments, and pull requests. GitHub also supports wikis (which are version-controlled repositories themselves) for each repository, in which users can create and edit pages for documentation, examples, or general support. A different service is Gist, which represents a unique way to share code snippets, single files, parts of files, or full applications. Gists can be generated in two different ways: public gists that can be browsed and searched through Discover and secret gists that are hidden from search engines. One of the main features of Gist is the possibility of embedding code snippets in other applications, enabling users to embed gists in any text field that supports JavaScript. Rule 10: Use GitHub to Be Social: Follow and Watch In the same way researchers are following developments in their field, scientific programmers could follow publicly available projects that might benefit their research. GitHub enables this functionality by following other GitHub users (see also Rule 2) or watching the activity of projects, which is a common feature in many social media platforms. Take advantage of it as much as possible! Conclusions If you are involved in scientific research and have not used Git and GitHub before, we recommend that you explore its potential as soon as possible. As with many tools, a learning curve lays ahead, but several basic yet powerful features are accessible even to the beginner and may be applied to many different use-cases [6]. We anticipate the reward will be worth your effort. To conclude, we would like to recommend some examples of bioinformatics repositories in GitHub (Table 1) and some useful training materials, including workshops, online courses, and manuscripts (Table 2). Supporting Information S1 File Supplementary Information including three sections: Git Large File Storage (LFS), Testing Levels of the Source Code and Continuous integration, and Source code documentation. (PDF) Click here for additional data file.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: found
              Is Open Access

              A Quick Introduction to Version Control with Git and GitHub

              “This is part of the PLOS Computational Biology Education collection.” Introduction to Version Control Many scientists write code as part of their research. Just as experiments are logged in laboratory notebooks, it is important to document the code you use for analysis. However, a few key problems can arise when iteratively developing code that make it difficult to document and track which code version was used to create each result. First, you often need to experiment with new ideas, such as adding new features to a script or increasing the speed of a slow step, but you do not want to risk breaking the currently working code. One often-utilized solution is to make a copy of the script before making new edits. However, this can quickly become a problem because it clutters your file system with uninformative filenames, e.g., analysis.sh, analysis_02.sh, analysis_03.sh, etc. It is difficult to remember the differences between the versions of the files and, more importantly, which version you used to produce specific results, especially if you return to the code months later. Second, you will likely share your code with multiple lab mates or collaborators, and they may have suggestions on how to improve it. If you email the code to multiple people, you will have to manually incorporate all the changes each of them sends. Fortunately, software engineers have already developed software to manage these issues: version control. A version control system (VCS) allows you to track the iterative changes you make to your code. Thus, you can experiment with new ideas but always have the option to revert to a specific past version of the code you used to generate particular results. Furthermore, you can record messages as you save each successive version so that you (or anyone else) reviewing the development history of the code is able to understand the rationale for the given edits. It also facilitates collaboration. Using a VCS, your collaborators can make and save changes to the code, and you can automatically incorporate these changes to the main code base. The collaborative aspect is enhanced with the emergence of websites that host version-controlled code. In this quick guide, we introduce you to one VCS, Git (https://git-scm.com), and one online hosting site, GitHub (https://github.com), both of which are currently popular among scientists and programmers in general. More importantly, we hope to convince you that although mastering a given VCS takes time, you can already achieve great benefits by getting started using a few simple commands. Furthermore, not only does using a VCS solve many common problems when writing code, it can also improve the scientific process. By tracking your code development with a VCS and hosting it online, you are performing science that is more transparent, reproducible, and open to collaboration [1,2]. There is no reason this framework needs to be limited only to code; a VCS is well-suited for tracking any plain-text files: manuscripts, electronic lab notebooks, protocols, etc. Version Your Code The first step is to learn how to version your own code. In this tutorial, we will run Git from the command line of the Unix shell. Thus, we expect readers are already comfortable with navigating a filesystem and running basic commands in such an environment. You can find directions for installing Git for the operating system running on your computer by following one of the links provided in Table 1. There are many graphical user interfaces (GUIs) available for running Git (Table 1), which we encourage you to explore, but learning to use Git on the command line is necessary for performing more advanced operations and using Git on a remote machine. 10.1371/journal.pcbi.1004668.t001 Table 1 Resources. Resource Options Distributed VCS Git (https://git-scm.com) Mercurial (https://mercurial.selenic.com) Bazaar (http://bazaar.canonical.com) Online hosting site GitHub (https://github.com) Bitbucket (https://bitbucket.org) GitLab (https://about.gitlab.com) Source Forge (http://sourceforge.net) Git installation https://git-scm.com/downloads Git tutorials Software Carpentry (https://swcarpentry.github.io/git-novice) Pro Git (https://git-scm.com/book) A Visual Git Reference (https://marklodato.github.io/visual-git-guide) tryGit (https://try.github.io) Graphical User Interface for Git https://git-scm.com/downloads/guis To follow along, first create a folder in your home directory named thesis. Next, download the three files provided in Supporting Information and place them in the thesis directory. Imagine that, as part of your thesis, you are studying the transcription factor CTCF, and you want to identify high-confidence binding sites in kidney epithelial cells. To do this, you will utilize publicly available ChIP-seq data produced by the ENCODE consortium [3]. ChIP-seq is a method for finding the sites in the genome where a transcription factor is bound, and these sites are referred to as peaks [4]. process.sh downloads the ENCODE CTCF ChIP-seq data from multiple types of kidney samples and calls peaks (S1 Data); clean.py filters peaks with a fold change cutoff and merges peaks from the different kidney samples (S2 Data); and analyze.R creates diagnostic plots on the length of the peaks and their distribution across the genome (S3 Data). If you have just installed Git, the first thing you need to do is provide some information about yourself, since it records who makes each change to the file(s). Set your name and email by running the following lines, but replacing “First Last” and “user@domain” with your full name and email address, respectively. $ git config --global user.name "First Last" $ git config --global user.email "user@domain" To start versioning your code with Git, navigate to your newly created directory, ~/thesis. Run the command git init to initialize the current folder as a Git repository (Figs 1 and 2A). A repository (or repo, for short) refers to the current version of the tracked files as well as all the previously saved versions (Box 1). Only files that are located within this directory (and any subdirectories) have the potential to be version controlled, i.e., Git ignores all files outside of the initialized directory. For this reason, projects under version control tend to be stored within a single directory to correspond with a single Git repository. For strategies on how to best organize your own projects, see Noble, 2009 [5]. $ cd ~/thesis $ ls analyze.R clean.py process.sh $ git init Initialized empty Git repository in ~/thesis/.git/ 10.1371/journal.pcbi.1004668.g001 Fig 1 The git add/commit process. To store a snapshot of changes in your repository, first git add any files to the staging area you wish to commit (for example, you’ve updated the process.sh file). Second, type git commit with a message. Only files added to the staging area will be committed. All past commits are located in the hidden .git directory in your repository. 10.1371/journal.pcbi.1004668.g002 Fig 2 Working with a local repository. (A) To designate a directory on your computer as a Git repo, type the command git init. This initializes the repository and will allow you to track the files located within that directory. (B) Once you have added a file, follow the git add/commit cycle to place the new file first into the staging area by typing git add to designate it to be committed, and then git commit to take the shapshot of that file. The commit is assigned a commit identifier (d75es) that can be used in the future to pull up this version or to compare different committed versions of this file. (C) As you continue to add and change files, you should regularly add and commit those changes. Here, an additional commit was done, and the commit log now shows two commit identifiers: d75es (from step B) and f658t (the new commit). Each commit will generate a unique identifier, which can be examined in reverse chronological order using git log. Box 1. Definitions Version Control System (VCS): (noun) a program that tracks changes to specified files over time and maintains a library of all past versions of those files Git: (noun) a version control system repository (repo): (noun) folder containing all tracked files as well as the version control history commit: (noun) a snapshot of changes made to the staged file(s); (verb) to save a snapshot of changes made to the staged file(s) stage: (noun) the staging area holds the files to be included in the next commit; (verb) to mark a file to be included in the next commit track: (noun) a tracked file is one that is recognized by the Git repository branch: (noun) a parallel version of the files in a repository (Box 7) local: (noun) the version of your repository that is stored on your personal computer remote: (noun) the version of your repository that is stored on a remote server; for instance, on GitHub clone: (verb) to create a local copy of a remote repository on your personal computer fork: (noun) a copy of another user’s repository on GitHub; (verb) to copy a repository; for instance, from one user’s GitHub account to your own merge: (verb) to update files by incorporating the changes introduced in new commits pull: (verb) to retrieve commits from a remote repository and merge them into a local repository push: (verb) to send commits from a local repository to a remote repository pull request: (noun) a message sent by one GitHub user to merge the commits in their remote repository into another user’s remote repository Now you are ready to start versioning your code (Fig 1). Conceptually, Git saves snapshots of the changes you make to your files whenever you instruct it to. For instance, after you edit a script in your text editor, you save the updated script to your thesis folder. If you tell Git to save a shapshot of the updated document, then you will have a permanent record of the file in that exact version even if you make subsequent edits to the file. In the Git framework, any changes you have made to a script but have not yet recorded as a snapshot with Git reside in the working directory only (Fig 1). To follow what Git is doing as you record the initial version of your files, use the informative command git status. $ git status On branch master Initial commit Untracked files:     (use "git add …" to include in what will be committed)         analyze.R         clean.py         process.sh nothing added to commit but untracked files present (use "git add" to track) There are a few key things to notice from this output. First, the three scripts are recognized as untracked files because you have not told Git to start tracking anything yet. Second, the word “commit” is Git terminology for a snapshot. As a noun, it means “a version of the code,” e.g., “the figure was generated using the commit from yesterday” (Box 1). This word can also be used as a verb, meaning “to save,” e.g., “to commit a change.” Lastly, the output explains how you can track your files using git add. Start tracking the file process.sh. $ git add process.sh And check its new status. $ git status On branch master Initial commit Changes to be committed:     (use "git rm --cached …" to unstage)         new file: process.sh Untracked files:     (use "git add …" to include in what will be committed)         analyze.R         clean.py Since this is the first time that you have told Git about the file process.sh, two key things have happened. First, this file is now being tracked, which means Git recognizes it as a file you wish to be version controlled (Box 1). Second, the changes made to the file (in this case the entire file, because it is the first commit) have been added to the staging area (Fig 1). Adding a file to the staging area will result in the changes to that file being included in the next commit, or snapshot, of the code (Box 1). As an analogy, adding files to the staging area is like putting things in a box to mail off, and committing is like putting the box in the mail. Since this will be the first commit, or first version, of the code, use git add to begin tracking the other two files and add their changes to the staging area as well. Then create the first commit using the command git commit. $ git add clean.py analyze.R $ git commit -m "Add initial version of thesis code." [master (root-commit) 660213b] Add initial version of thesis code. 3 files changed, 154 insertions(+) create mode 100644 analyze.R create mode 100644 clean.py create mode 100644 process.sh Notice the flag -m was used to pass a message for the commit. This message describes the changes that have been made to the code and is required. If you do not pass a message at the command line, the default text editor for your system will open so you can enter the message. You have just performed the typical development cycle with Git: make some changes, add updated files to the staging area, and commit the changes as a snapshot once you are satisfied with them (Fig 2). Since Git records all of the commits, you can always look through the complete history of a project. To view the record of your commits, use the command git log. For each commit, it lists the unique identifier for that revision, author, date, and commit message. $ git log commit 660213b91af167d992885e45ab19f585f02d4661 Author: First Last Date: Fri Aug 21 14:52:05 2015–0500     Add initial version of thesis code. The commit identifier can be used to compare two different versions of a file, restore a file to a previous version from a past commit, and even retrieve tracked files if you accidentally delete them. Now you are free to make changes to the files knowing that you can always revert them to the state of this commit by referencing its identifier. As an example, edit clean.py so that the fold change cutoff for filtering peaks is more stringent. Here is the current bottom of the file. $ tail clean.py # Filter based on fold-change over control sample fc_cutoff = 10 epithelial = epithelial.filter(filter_fold_change, fc = fc_cutoff).saveas() proximal_tube = proximal_tube.filter(filter_fold_change, fc = fc_cutoff).saveas() kidney = kidney.filter(filter_fold_change, fc = fc_cutoff).saveas() # Identify only those sites that are peaks in all three tissue types combined = pybedtools.BedTool().multi_intersect(     i = [epithelial.fn, proximal_tube.fn, kidney.fn]) union = combined.filter(lambda x: int(x[3]) = = 3).saveas() union.cut(range(3)).saveas(data + "/sites-union.bed") Using a text editor, increase the fold change cutoff from 10 to 20. $ tail clean.py # Filter based on fold-change over control sample fc_cutoff = 20 epithelial = epithelial.filter(filter_fold_change, fc = fc_cutoff).saveas() proximal_tube = proximal_tube.filter(filter_fold_change, fc = fc_cutoff).saveas() kidney = kidney.filter(filter_fold_change, fc = fc_cutoff).saveas() # Identify only those sites that are peaks in all three tissue types combined = pybedtools.BedTool().multi_intersect(     i = [epithelial.fn, proximal_tube.fn, kidney.fn]) union = combined.filter(lambda x: int(x[3]) = = 3).saveas() union.cut(range(3)).saveas(data + "/sites-union.bed") Because Git is tracking clean.py, it recognizes that the file has been changed since the last commit. $ git status # On branch master # Changes not staged for commit: #    (use "git add …" to update what will be committed) #    (use "git checkout -- …" to discard changes in working directory) # #    modified: clean.py # no changes added to commit (use "git add" and/or "git commit -a") The report from git status indicates that the changes to clean.py are not staged, i.e., they are in the working directory (Fig 1). To view the unstaged changes, run the command git diff. $ git diff diff --git a/clean.py b/clean.py index 7b8c058.76d84ce 100644 --- a/clean.py +++ b/clean.py @@ -28,7 +28,7 @@ def filter_fold_change(feature, fc = 1):     return False # Filter based on fold-change over control sample -fc_cutoff = 10 +fc_cutoff = 20 epithelial = epithelial.filter(filter_fold_change, fc = fc_cutoff).saveas() proximal_tube = proximal_tube.filter(filter_fold_change, fc = fc_cutoff).saveas() kidney = kidney.filter(filter_fold_change, fc = fc_cutoff).saveas() Any lines of text that have been added to the script are indicated with a +, and any lines that have been removed with a -. Here, we altered the line of code that sets the value of fc_cutoff. git diff displays this change as the previous line being removed and a new line being added with our update incorporated. You can ignore the first five lines of output, because they are directions for other software programs that can merge changes to files. If you wanted to keep this edit, you could add clean.py to the staging area using git add and then commit the change using git commit, as you did above. Instead, this time undo the edit by following the directions from the output of git status to “discard changes in the working directory” using the command git checkout. $ git checkout -- clean.py $ git diff Now git diff returns no output, because git checkout undid the unstaged edit you had made to clean.py. This ability to undo past edits to a file is not limited to unstaged changes in the working directory. If you had committed multiple changes to the file clean.py and then decided you wanted the original version from the initial commit, you could replace the argument -- with the commit identifier of the first commit you made above (your commit identifier will be different; use git log to find it). The -- used above was simply a placeholder for the first argument because, by default, git checkout restores the most recent version of the file from the staging area (if you haven’t staged any changes to this file, as is the case here, the version of the file in the staging area is identical to the version in the last commit). Instead of using the entire commit identifier, use only the first seven characters, which is simply a convention, since this is usually long enough for it to be unique. $ git checkout 660213b clean.py At this point, you have learned the commands needed to version your code with Git. Thus, you already have the benefits of being able to make edits to files without copying them first, to create a record of your changes with accompanying messages, and to revert to previous versions of the files if needed. Now you will always be able to recreate past results that were generated with previous versions of the code (see the command git tag for a method to facilitate finding specific past versions) and see the exact changes you have made over the course of a project. Share Your Code Once you have your files saved in a Git repository, you can share it with your collaborators and the wider scientific community by putting your code online (Fig 3). This also has the added benefit of creating a backup of your scripts and provides a mechanism for transferring your files across multiple computers. Sharing a repository is made easier if you use one of the many online services that host Git repositories (Table 1), e.g., GitHub. Note, however, that any files that have not been tracked with at least one commit are not included in the Git repository, even if they are located within the same directory on your local computer (see Box 2 for advice on the types of files that should not be versioned with Git and Box 3 for advice on managing large files). 10.1371/journal.pcbi.1004668.g003 Fig 3 Working with both a local and remote repository as a single user. (A) On your computer, you commit to a Git repository (commit d75es). (B) On GitHub, you create a new repository called thesis. This repository is currently empty and not linked to the repo on your local machine. (C) The command git remote add connects your local repository to your remote repository. The remote repository is still empty, however, because you have not pushed any content to it. (D) You send all the local commits to the remote repository using the command git push. Only files that have been committed will appear in the remote repository. (E) You repeat several more rounds of updating scripts and committing on your local computer (commit f658t and then commit xv871). You have not yet pushed these commits to the remote repository, so only the previously pushed commit is in the remote repo (commit d75es). (F) To bring the remote repository up to date with your local repository, you git push the two new commits to the remote repository. The local and remote repositories now contain the same files and commit histories. Box 2. What Not to Version Control You can version control any file that you put in a Git repository, whether it is text-based, an image, or a giant data file. However, just because you can version control something, does not mean you should. Git works best for plain, text-based documents such as your scripts or your manuscript if written in LaTeX or Markdown. This is because for text files, Git saves the entire file only the first time you commit it and then saves just your changes with each commit. This takes up very little space, and Git has the capability to compare between versions (using git diff). You can commit a non-text file, but a full copy of the file will be saved in each commit that modifies it. Over time, you may find the size of your repository growing very quickly. A good rule of thumb is to version control anything text-based: your scripts or manuscripts if they are written in plain text. Things not to version control are large data files that never change, binary files (including Word and Excel documents), and the output of your code. In addition to the type of file, you need to consider the content of the file. If you plan on sharing your commits publicly using GitHub, ensure you are not committing any files that contain sensitive information, such as human subject data or passwords. To prevent accidentally committing files you do not wish to track, and to remove them from the output of git status, you can create a file called .gitignore. In this file, you can list subdirectories and/or file patterns that Git should ignore. For example, if your code produced log files with the file extension .log, you could instruct Git to ignore these files by adding *.log to .gitignore. In order for these settings to be applied to all instances of the repository, e.g., if you clone it onto another computer, you need to add and commit this file. Box 3. Managing Large Files Many biological applications require handling large data files. While Git is best suited for collaboratively writing small text files, nonetheless, collaboratively working on projects in the biological sciences necessitates managing this data. The example analysis pipeline in this tutorial starts by downloading data files in BAM format that contain the alignments of short reads from a ChIP-seq experiment to the human genome. Since these large, binary files are not going to change, there is no reason to version them with Git. Thus, hosting them on a remote http (as ENCODE has done in this case) or ftp site allows each collaborator to download it to her machine as needed, e.g., using wget, curl, or rsync. If the data files for your project are smaller, you could also share them via services like Dropbox (www.dropbox.com) or Google Drive (https://www.google.com/drive/). However, some intermediate data files may change over time, and the practical necessity to ensure all collaborators are using the same data set may override the advice to not put code output under version control, as described in Box 2. Again, returning to the ChIP-seq example, the first step calling the peaks is the most difficult computationally because it requires access to a Unix-like environment and sufficient computational resources. Thus, for collaborators that want to experiment with clean.py and analyze.R without having to run process.sh, you could version the data files containing the ChIP-seq peaks (which are in BED format). But since these files are larger than those typically used with Git, you can instead use one of the solutions for versioning large files within a Git repository without actually saving the file with Git, e.g., git-annex (https://git-annex.branchable.com/) or git-fat (https://github.com/jedbrown/git-fat/). Recently, GitHub has created their own solution for managing large files called Git Large File Storage (LFS) (https://git-lfs.github.com/). Instead of committing the entire large file to Git, which quickly becomes unmanageable, it commits a text pointer. This text pointer refers to a specific file saved on a remote GitHub server. Thus, when you clone a repository, it only downloads the latest version of the large file. If you check out an older version of the repository, it automatically downloads the old version of the large file from the remote server. After installing Git LFS, you can manage all the BED files with one command: git lfs track "*.bed". Then you can commit the BED files just like your scripts, and they will automatically be handled with Git LFS. Now, if you were to change the parameters of the peak calling algorithm and re-run process.sh, you could commit the updated BED files, and your collaborators could pull the new versions of the files directly to their local Git repositories. Below, we focus on the technical aspects of sharing your code. However, there are also other issues to consider when deciding if and how you are going to make your code available to others. For quick advice on these subjects, see Box 4 on how to license your code, Box 5 on concerns about being scooped, and Box 6 on the increasing trend of journals to institute sharing policies that require authors to deposit code in a public archive upon publication. Box 4. Choosing a License Putting software and other material in a public place is not the same as making it publicly usable. In order to do that, the authors must also add a license, since copyright laws in some jurisdictions require people to treat anything that isn’t explicitly open as being proprietary. While dozens of open licenses have been created, the two most widely used are the GNU Public License (GPL) and the MIT/BSD family of licenses. Of these, the MIT/BSD-style licenses put the fewest requirements on re-use, and thereby make it easier for people to integrate your software into their projects. For an excellent short discussion of these issues, and links to more information, see Jake Vanderplas’s blog post from March 2014 at http://www.astrobetter.com/blog/2014/03/10/the-whys-and-hows-of-licensing-scientific-code/. For a more in-depth discussion of the legal implications of different licenses, see Morin et al., 2012 [6]. Box 5. Being Scooped One concern scientists frequently have about putting work in progress online is that they will be scooped, e.g., that someone will analyze their data and publish a result that they themselves would have, but hadn’t yet. In practice, though, this happens rarely, if at all: in fact, the authors are not aware of a single case in which this has actually happened, and would welcome pointers to specific instances. In practice, it seems more likely that making work public early in something like a version control repository, which automatically adds timestamps to content, will help researchers establish their priority. Box 6. Journal Policies Sharing data, code, and other materials is quickly moving from “desired” to “required.” For example, PLOS’s sharing policy (http://journals.plos.org/plosone/s/materials-and-software-sharing) already says, “We expect that all researchers submitting to PLOS will make all relevant materials that may be reasonably requested by others available without restrictions upon publication of the work.” Its policy on software is more specific: We expect that all researchers submitting to PLOS submissions in which software is the central part of the manuscript will make all relevant software available without restrictions upon publication of the work. Authors must ensure that software remains usable over time regardless of versions or upgrades… It then goes on to specify that software must be based on open source standards, and that it must be put in an archive which is large or long-lived. Granting agencies, philanthropic foundations, and other major sponsors of scientific research are all moving in the same direction, and, to our knowledge, none has relaxed or reduced sharing requirements in the last decade. To begin using GitHub, you will first need to sign up for an account. For the code examples in this tutorial, you will need to replace username with the username of your account. Next, choose the option to “Create a new repository” (Fig 3B, see https://help.github.com/articles/create-a-repo/). Call it “thesis,” because that is the directory name containing the files on your computer, but note that you can give it a different name on GitHub if you wish. Also, now that the code will exist in multiple places, you need to learn some more terminology (Box 1). A local repository refers to code that is stored on the machine you are using, e.g., your laptop; whereas a remote repository refers to the code that is hosted online. Thus, you have just created a remote repository. Now you need to send the code on your computer to GitHub. The key to this is the URL that GitHub assigns your newly created remote repository. It will have the form https://github.com/username/thesis.git (see https://help.github.com/articles/cloning-a-repository/). Notice that this URL is using the HTTPS protocol, which is the quickest to begin using. However, it requires you to enter your username and password when communicating with GitHub, so you’ll want to consider switching to the SSH protocol once you are regularly using Git and GitHub (see https://help.github.com/articles/generating-ssh-keys/ for directions). In order to link the local thesis repository on your computer to the remote repository you just created, in your local repository, you need to tell Git the URL of the remote repository using the command git remote add (Fig 3C). $ git remote add origin https://github.com/username/thesis.git The name “origin” is a bookmark for the remote repository so that you do not have to type out the full URL every time you transfer your changes (this is the default name for a remote repository, but you could use another name if you like). Send your code to GitHub using the command git push (Fig 3D). $ git push origin master You first specify the remote repository, “origin.” Second, you tell Git to push to the “master” copy of the repository—we will not go into other options in this tutorial, but Box 7 discusses them briefly. Box 7. Branching Do you ever make changes to your code, but are not sure you will want to keep those changes for your final analysis? Or do you need to implement new features while still providing a stable version of the code for others to use? Using Git, you can maintain parallel versions of your code that you can easily bounce between while you are working on your changes. You can think of it like making a copy of the folder you keep your scripts in, so that you have your original scripts intact but also have the new folder where you make changes. Using Git, this is called branching, and it is better than separate folders because (1) it uses a fraction of the space on your computer, (2) it keeps a record of when you made the parallel copy (branch) and what you have done on the branch, and (3) there is a way to incorporate those changes back into your main code if you decide to keep your changes (and a way to deal with conflicts). By default, your repository will start with one branch, usually called “master.” To create a new branch in your repository, type git branch new_branch_name. You can see what branches a current repository has by typing git branch, with the branch you are currently in being marked by a star. To move between branches, type git checkout branch_to_move_to. You can edit files and commit them on each branch separately. If you want to combine the changes in your new branch with the master branch, you can merge the branches by typing git merge new_branch_name while in the master branch. Pushing to GitHub also has the added benefit of backing up your code in case anything were to happen to your computer. Also, it can be used to manually transfer your code across multiple machines, similar to a service like Dropbox (www.dropbox.com) but with the added capabilities and control of Git. For example, what if you wanted to work on your code on your computer at home? You can download the Git repository using the command git clone. $ git clone https://github.com/username/thesis.git By default, this will download the Git repository into a local directory named “thesis.” Furthermore, the remote “origin” will automatically be added so that you can easily push your changes back to GitHub. You now have copies of your repository on your work computer, your GitHub account online, and your home computer. You can make changes, commit them on your home computer, and send those commits to the remote repository with git push, just as you did on your work computer. Then the next day back at your work computer, you could update the code with the changes you made the previous evening using the command git pull. $ git pull origin master This pulls in all the commits that you had previously pushed to the GitHub remote repository from your home computer. In this workflow, you are essentially collaborating with yourself as you work from multiple computers. If you are working on a project with just one or two other collaborators, you could extend this workflow so that they could edit the code in the same way. You can do this by adding them as Collaborators on your repository (Settings -> Collaborators -> Add collaborator; see https://help.github.com/articles/adding-collaborators-to-a-personal-repository/). However, with projects with lots of contributors, GitHub provides a workflow for finer-grained control of the code development. With the addition of a GitHub account and a few commands for sending and receiving code, you can now share your code with others, transfer your code across multiple machines, and set up simple collaborative workflows. Contribute to Other Projects Lots of scientific software is hosted online in Git repositories. Now that you know the basics of Git, you can directly contribute to developing the scientific software you use for your research (Fig 4). From a small contribution like fixing a typo in the documentation to a larger change such as fixing a bug, it is empowering to be able to improve the software used by yourself and many other scientists. 10.1371/journal.pcbi.1004668.g004 Fig 4 Contributing to open source projects. We would like you to add an empty file that is named after your GitHub username to the repo used to write this manuscript. (A) Using your internet browser, navigate to https://github.com/jdblischak/git-for-science. (B) Click on the “Fork” button to create a copy of this repo on GitHub under your username. (C) On your computer, type git clone https://github.com/username/git-for-science.git, which will create a copy of git-for-science on your local machine. (D) Navigate to the readers directory by typing cd git-for-science/readers/. Create an empty file that is titled with your GitHub username by typing touch username.txt. Commit that new file by adding it to the staging area (git add username.txt) and committing with a message (git commit -m "Add username to directory of readers."). Note that your commit identifier will be different than what is shown here. (E) You have committed your new file locally, and the next step is to push that new commit up to the git-for-science repo under your username on GitHub. To do so, type git push origin master. (F) To request to add your commits to the original git-for-science repo, issue a pull request from the git-for-science repo under your username on GitHub. Once your Pull Request is reviewed and accepted, you will be able to see the file you committed with your username in the original git-for-science repository. When contributing to a larger project with many contributors, you will not be able to push your changes with git push directly to the project’s remote repository. Instead, you will first need to create your own remote copy of the repository, which on GitHub is called a fork (Box 1). You can fork any repository on GitHub by clicking the button “Fork” on the top right of the page (see https://help.github.com/articles/fork-a-repo/). Once you have a fork of a project’s repository, you can clone it to your computer and make changes just like a repository you created yourself. As an exercise, you will add a file to the repository that we used to write this paper. First, go to https://github.com/jdblischak/git-for-science and choose the “Fork” option to create a git-for-science repository under your GitHub account (Fig 4B). In order to make changes, download it to your computer with the command git clone from the directory you wish the repo to appear in (Fig 4C). $ git clone https://github.com/username/git-for-science.git Now that you have a local version, navigate to the subdirectory readers and create a text file named as your GitHub username (Fig 4D). $ cd git-for-science/readers $ touch username.txt Add and commit this new file (Fig 4D), and then push the changes back to your remote repository on GitHub (Fig 4E). $ git add username.txt $ git commit -m "Add username to directory of readers." $ git push origin master Currently, the new file you created, readers/username.txt, only exists in your fork of git-for-science. To merge this file into the main repository, send a pull request using the GitHub interface (Pull request -> New pull request -> Create pull request; Fig 4F; see https://help.github.com/articles/using-pull-requests/). After the pull request is created, we can review your change and then merge it into the main repository. Although this process of forking a project’s repository and issuing a pull request seems like a lot of work to contribute changes, this workflow gives the owner of a project control over what changes get incorporated into the code. You can have others contribute to your projects using the same workflow. The ability to use Git to contribute changes is very powerful because it allows you to improve the software that is used by many other scientists and also potentially shape the future direction of its development. Conclusion Git, albeit complicated at first, is a powerful tool that can improve code development and documentation. Ultimately, the complexity of a VCS not only gives users a well-documented “undo” button for their analyses, but it also allows for collaboration and sharing of code on a massive scale. Furthermore, it does not need to be learned in its entirety to be useful. Instead, you can derive tangible benefits from adopting version control in stages. With a few commands (git init, git add, git commit), you can start tracking your code development and avoid a file system full of copied files (Fig 2). Adding a few additional commands (git push, git clone, git pull) and a GitHub account, you can share your code online, transfer your changes across machines, and collaborate in small groups (Fig 3). Lastly, by forking public repositories and sending pull requests, you can directly improve scientific software (Fig 4). Methods We collaboratively wrote the article in LaTeX (http://www.latex-project.org/) using the online authoring platform Authorea (https://www.authorea.com). Furthermore, we tracked the development of the document using Git and GitHub. The Git repo is available at https://github.com/jdblischak/git-for-science, and the rendered LaTeX article is available at https://www.authorea.com/users/5990/articles/17489. Supporting Information S1 Data process.sh. This Bash script downloads the ENCODE CTCF ChIP-seq data from multiple types of kidney samples and calls peaks. See https://github.com/jdblischak/git-for-science/tree/master/code for instructions on running it. (SH) Click here for additional data file. S2 Data clean.py. This Python script filters peaks with a fold change cutoff and merges peaks from the different kidney samples. See https://github.com/jdblischak/git-for-science/tree/master/code for instructions on running it. (PY) Click here for additional data file. S3 Data analyze.R. This R script creates diagnostic plots on the length of the peaks and their distribution across the genome. See https://github.com/jdblischak/git-for-science/tree/master/code for instructions on running it. (R) Click here for additional data file.
                Bookmark

                Author and article information

                Contributors
                Role: Editor
                Journal
                PLoS Comput Biol
                PLoS Comput. Biol
                plos
                ploscomp
                PLoS Computational Biology
                Public Library of Science (San Francisco, CA USA )
                1553-734X
                1553-7358
                20 December 2018
                December 2018
                : 14
                : 12
                Affiliations
                School of Engineering and Applied Sciences, Harvard University, Cambridge, Massachusetts, United States of America
                Dassault Systemes BIOVIA, UNITED STATES
                Author notes

                The author has declared that no competing interests exist.

                Article
                PCOMPBIOL-D-18-01117
                10.1371/journal.pcbi.1006561
                6301674
                30571677
                © 2018 Benjamin D. Lee

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                Page count
                Figures: 0, Tables: 0, Pages: 6
                Product
                Funding
                The author received no specific funding for writing this article.
                Categories
                Editorial
                Computer and Information Sciences
                Software Engineering
                Software Tools
                Engineering and Technology
                Software Engineering
                Software Tools
                Computer and Information Sciences
                Software Engineering
                Software Development
                Engineering and Technology
                Software Engineering
                Software Development
                Computer and Information Sciences
                Software Engineering
                Software Design
                Engineering and Technology
                Software Engineering
                Software Design
                Computer and Information Sciences
                Computer Software
                Research and Analysis Methods
                Research Assessment
                Citation Analysis
                Computer and Information Sciences
                Software Engineering
                Source Code
                Engineering and Technology
                Software Engineering
                Source Code
                Research and Analysis Methods
                Database and Informatics Methods
                Bioinformatics
                Sequence Analysis
                Computer and Information Sciences
                Software Engineering
                Engineering and Technology
                Software Engineering

                Quantitative & Systems biology

                Comments

                Comment on this article