
      Ten simple rules for starting (and sustaining) an academic data science initiative

      editorial


          Abstract

Introduction

Data science has emerged as a new paradigm for research. Readers of this journal might be tempted to say this is the research we have been doing all along. However, we contend that there is something fundamentally different from what has gone before in the dimensions of the data, the diversity of disciplines involved, and the role of the private sector. We take this position based on our collective experiences in, and observations of, calls to action around data science over the past 10 years. Those calls have resulted in many notable and successful responses from US universities.

A working definition of data science

Defining data science is like defining the internet—ask 10 people and you get 10 different answers. What most would likely agree on, at a high level of abstraction, is that it draws from statistics, computer science, and applied mathematics to operate on data from one or more domains, leading to outcomes not achieved otherwise. The extent to which domain knowledge is incorporated in the work of data science varies, but it is essential for achieving meaningful outcomes. These outcomes have implications for us as humans and for our collective communities and society that, in turn, need to be addressed as part of the data life cycle [1]. In short, data science transcends traditional disciplinary boundaries to discover new insights not owned by any one existing discipline, driven by endless streams of digital data with the promise of translation to societal benefit.

Ten years of academic data science

The past decade has seen an explosion of data science centers, institutes, and programs across the United States as universities increasingly recognize the importance and promise of data science to university research and education. It has been, and continues to be, an exciting time. But these initiatives face systemic challenges in the context of the higher education system. Some, but not all, of these challenges center on funding. Campuses fortunate enough to receive initial funding, often as a result of philanthropy or private sector investment, have some measure of sustainability, especially if these funds are in the form of an endowment. However, at most smaller colleges and universities, or those without a wealthy alumnus or local industry investor, just getting started with very limited funding can be daunting. And yet, every school faces the reality that to truly prepare its student body for the expectations of 21st century employers, it must find a way to incorporate core critical thinking and data-intensive skills into nearly every discipline. This call to action challenges traditional disciplinary silos and calls for new models of higher education. The expectations of our future society will not leave this readership immune. How best to engage with this new paradigm?

In 2013, two foundations (the Gordon and Betty Moore Foundation and the Alfred P. Sloan Foundation) and three universities (the University of California, Berkeley, New York University, and the University of Washington) established a partnership to experiment with creating supportive environments for researchers using and developing data-intensive practices. Known as the Moore-Sloan Data Science Environments (MSDSEs), funding from the foundations helped to establish data science centers on each of the campuses: the Berkeley Institute for Data Science, the Center for Data Science at NYU, and the eScience Institute at UW.
The partnership organized around working groups on cross-cutting topics viewed as critical to advancing data science in academia: career paths and alternative metrics, software development, education, reproducibility and open science, reflexive and reflective ethnography, and the role of physical space in collaboration. After 5 years, the partnership authored a lessons-learned paper, “Creating Institutional Change in Data Science” [2], which includes the key elements that contributed to their successes and draws out some of the challenges along the way. This paper was followed by a formal evaluation of the MSDSE partnership [3], which includes a landscape survey of 17 other data science initiatives to find commonalities in approaches [4]. Since their inception, the MSDSEs have been joined by countless other universities launching initiatives to grapple with the integration of data science in academia (examples in S1 Table). A handful of data science initiatives even pre-date, or emerged at the same time as, the three Moore-Sloan partners. These many initiatives, and many more not cited, continue to face a unique set of challenges shaped by their campus’s political, financial, and structural environments. Our consideration of their challenges, and of those we have faced directly ourselves, leads us to recognize a set of global commonalities. We capture them here in the familiar Ten Simple Rules (TSR) format for simplicity, recognizing that much more could be said. We do so as representatives of these developments: MSP was formerly an Executive Director of the eScience Institute at the University of Washington and now works with data science centers nationwide through her leadership of the new Academic Data Science Alliance, AEB is Chief of Staff, and PEB is the Dean of the School of Data Science at the University of Virginia.

Names and definitions

What do we mean by “academic data science initiative”? Data science is typically housed in a cross-departmental unit on academic campuses, such as an Institute, Center, or Program. More recently, Departments and Schools of data science are emerging where a degree is offered. While such formal academic units are necessary for many campuses, they risk not fully recognizing data science as cross-cutting, making it harder to move away from a sense of silos and hierarchies. The University of Virginia speaks to this kind of model for higher education by conceiving of data science as a “School Without Walls,” an organization that has autonomy yet remains connected to other disciplines. Embracing this concept, we refer to all these efforts and organizations as “initiatives,” emphasizing that data science serves everyone and recognizing that everyone can contribute to the evolution of data-intensive research practices.

Assumptions

We assume you, our readers, have already determined that a data science initiative would be a valuable addition to your campus, or that you are indeed already part of a new initiative.
Compelling reasons to start a data science initiative have been highlighted elsewhere (e.g., [5]) and include: having an organizational unit to handle the emerging challenges of research software development and sustainability [6,7]; incentivizing the use of, and providing training on, tools for reproducible and open research, data, and scholarship, e.g., ReproZip [8] and Git/GitHub, highlighted in an earlier TSR [9]; establishing curricula for non-data science departments to integrate into courses [10]; developing and running informal training programs for the researchers on campus who already feel left behind by the data revolution [11–13]; creating a home for cross-disciplinary interest groups to work together on common challenges around data, e.g., [14], including scholarly discussion and actionable progress toward responsible data science, e.g., [15], and diversity, equity, and inclusion (DEI) in data science [16]; and, last but certainly not least, providing a space and culture for the data scientists, research software engineers, and consultants who will make this all happen [6,17]. Importantly, the reasons, and thus the approaches, must take into account your campus community’s needs and strengths, your university’s bureaucracy, and the political landscape of your campus. With this TSR, we offer what we believe are the most important considerations, some philosophical, some practical, as you develop and contribute to your initiative’s future.

Ten simple rules

Rule 1: Don’t try to own everything

Building a new data science initiative is all about partnerships and harnessing energy. Don’t try to steal the thunder of existing data science efforts on your campus. You will create undue competition and fail, or at the least dilute an institution-wide effort. Start by building relationships with the groups you are aware of and finding areas where you could collaborate (recognizing that you may not even be aware of all the data science groups on your campus). Support existing work where added partners multiply impact, and find the unmet needs you can tackle as a leading organization, inviting others to join you. Show them what they have to gain: a positive and inclusive culture that draws people in [6] and serves students, researchers, and faculty through partnerships.

Researchers faced with an avalanche of data from increasingly sophisticated instruments, models, and algorithms gravitate to data science. Where possible, direct them to other researchers for potential collaborations. In this way, your initiative becomes a network hub to and for researchers and a means to share best practices. Partner with the information science, statistics, computer science, and other departments to offer courses or develop curricula, especially if you don’t want to start out with the heavy lift of organizing and offering a formal degree (assuming your institution would permit it). But be careful not to take tuition revenue away from existing programs by creating competing courses. While the negative effects of this will depend on the institutional funding model, it is likely where the major resistance will lie. Rather, share in the burden and the rewards by partnering on curricula and teaching loads. Done right, your work will increase, not decrease, tuition revenue for everyone. Already there are examples of data science degree programs with four or more joint departments sharing the revenue and expenses. Recognize the strengths of these departments and partners, and build upon them.
Praise and elevate the work of people in your institution who are doing data science. Give them a platform and they will be your advocates; ignore their work and they will resent your efforts. Be the glue.

Rule 2: Leverage champions to get buy-in from stakeholders

Have a faculty champion: a senior faculty member who has political sway, can bring in other faculty from diverse schools and colleges across the university, and has the ear of the university leadership (e.g., the Provost). Ideally, you have several champions who may form the basis for an executive or steering committee. Establish this committee and an external advisory board (EAB) at the beginning. The former brings ties to departments and schools across campus; the latter invests both intellectually and possibly financially. EAB members from outside academia can create bonds to other sectors, such as industry. Leverage their networks to engage with external communities (Rule #9).

Reach out to champions who have time and energy to invest in your initiative. They should be willing and able to do work, at least in the beginning. Often, associate deans are well connected to their peers and can contribute more than college deans or department chairs, who have so many competing areas to manage. And, as noted in Rule #1, recognize those who put in the time and thought with you.

Use your champions to get as much buy-in from across the university as you can: reach out to the president, provost, deans, faculty, and administration. You are choosing to do something very different from anything they have worked on before—it’s more inclusive and more connected than anything on campus. Recognize that you will push against the policies that they have worked very hard to write and interpret. Getting buy-in from the administration early means that they are also invested in your success and that they know you understand the complications in achieving your goals. Changing university policy is a slow, complex, political process—you will need all the allies you can get.

At the outset, or as interest fades, incentivize engagement from faculty with 1 or 2 months of summer salary or a spot on the executive/steering committee (with clear term limits). Consider negotiating a term of teaching release with their department to do X, where X could be heading up a new Special Interest Group, organizing a career fair or cross-campus data science summit, or developing new teaching tools around data science. This negotiation may be difficult unless you can demonstrate a win for the department from X (e.g., developing a data science curriculum for them). Show them what they have to gain and give them the wins when things succeed. Maintaining goodwill with departments across campus will make it easier for your champions to promote your initiative in their departments. Have some pilot projects with these champions that you can point to when you make your case for formalizing your efforts into a campus-wide initiative. Early success stories will make the pitch for engagement from faculty easier. Then, socialize what you propose with the faculty senate and other governance bodies.

Finally, realize that champions are not only senior faculty. They are also enthusiastic early-career faculty, data-savvy staff, students, and postdocs across your campus. If you bring them in, they will be a huge win for your initiative. They are the bridges connecting everyone and tireless advocates.
And they can help make the case for data science in departments that remain skeptical of the utility and impact of data-driven approaches in their fields, or of the value that a data science initiative adds to their own efforts in this space. Set clear and realistic expectations for their engagement. Postdocs, especially, are at a very time-limited and career-critical stage. Be sensitive to their needs and don’t expect them to set aside their research and career demands for your initiative’s needs. As much as possible, find ways for them to contribute that also benefit their future careers. Give them something meaningful they can put on their CV. Data science opens new career doors with substantial rewards, so your efforts face something of a Catch-22: the more you develop your people, the more marketable they become and the harder they are to keep.

Rule 3: Have a sustainability plan (and find funding)

Once you have secured champions and buy-in from stakeholders, but before you begin hiring the staff you’ll need, have at least a rudimentary sustainability plan to share with them. You will more quickly attract expert staff when you demonstrate that you understand and are planning for sustainability as well as a strong start. We presume you have some initial balance of support from tuition revenue, philanthropy, private or public sector investment, indirect cost return from grants, or, if you are lucky, core budget lines. In short, build on these initial resources, demonstrate value, and sustainability will follow as it does in a high-demand marketplace.

When you first pitch the establishment of your initiative, show the demand and back it up with budget projections. Don’t reinvent the wheel by doing all of your own research. Talk to leaders of data science initiatives on other campuses (e.g., S1 Table; the Academic Data Science Alliance can help connect you). There are over 450 data science degree or certificate programs in the US alone [18], and collaborative, cross-sector groups focusing on data are everywhere. Leverage the work they have done to show the benefit of a data science initiative and how you will model your initiative to meet the demand. If your administration is still skeptical about data science as rigorous scholarship and about its impact on research across campus, bring them examples of funding successes from similar types of institutions—peer pressure can be very effective!

Once your initiative is up and running, the question will shift from “what will you do” to “what have you done.” One approach to consider is providing consulting and training services to the campus. Data-intensive practices touch nearly every discipline, but the degree to which students, faculty, and staff researchers are trained in these skill sets varies hugely across campus. Focus some of your efforts on serving these researchers with consultations, trainings, or as research partners. Faculty and staff, especially, don’t have the bandwidth to learn data science through formal curricula. Invite your staff to develop their teaching skills by becoming Carpentries instructors (https://carpentries.org/) or by developing short, informal courses on common tools like Git/GitHub. These training opportunities demonstrate the value-add of your initiative to the greater campus community, and your efforts may be rewarded with some core university funding.
Partner with the Libraries (Rule #10) or other entities that are also creating these opportunities, but be careful that the role of your data science effort isn’t blurred with the mandate of university IT services or of a campus research computing organization. The term “Research Software Engineer” (RSE) is sometimes used to describe professionals on campus who provide consultation services (though their primary role is software development for research). RSEs are often housed in university IT, but many are also skilled researchers. Data scientists are distinct from RSEs, though there can be enough overlap in skill sets to make the distinction difficult. Partner with your IT unit and RSEs to fill the cyberinfrastructure and software engineering needs of your projects (see Rule #10). As part of this partnership, establish an understanding early on of the infrastructure needs, aside from people, that the IT unit and the data science initiative will each provide. The IT department should be a collaborator and its staff your colleagues if data science is to reach its full potential on your campus.

Of course, there are other sources of funding that can be pursued in tandem and with partners: tuition revenue from new courses, fees from professional certificate programs, standard research grants to your staff, opportunities to partner with philanthropic organizations, and funding from the private sector (start an industry affiliates program). Regardless of your sources of funding, from Day 1, track everything you do or support: every grant where your initiative is listed as a resource, where you’ve provided a letter of support, or where your core affiliates or staff are listed as PIs (Principal Investigators). Tally these dollars regularly. Track engagements with students, with researchers across campus, and externally. Tally these “touches” to demonstrate the reach of your initiative and its impact on campus. Before asking for more university or college funding, collect kudos from everyone you’ve ever helped—a stack of anecdotes that say “we couldn’t have done this without you” from across campus has a big impact. And remember, none of this work can be done without support staff. Include them in the process and credit them as equal contributors to your efforts. Budget for and contract with professional evaluators to help you track your successes and identify challenge areas. Do this sooner rather than later, or you won’t have the baseline data to show progress.

Rule 4: Hire a team, and support them

Data science has the power to change many, many people’s lives for the better, and for the worse (see Rule #7). It cannot be emphasized enough that data science initiatives must prioritize hiring a diverse workforce whose backgrounds and lived experiences ensure that the development and application of new technologies appropriately consider sensitive data and marginalized groups, and “do no harm.”

Don’t underestimate the staffing needs of a new initiative, for both professional administrative staff and research staff. At minimum, plan to hire three professional administrative staff right at the beginning: a head of operations (Program Manager, or similar) who can grow into an Executive/Managing Director role and provide strategic advice in the years to come, a communications and/or events person, and a fiscal specialist. Consider joint hires, but recognize that the larger entity will always dominate the bandwidth of a shared staffer.
For the head of operations and communications positions, select people who are intellectually curious about data science and want to make it accessible to all of your stakeholders. In particular, for the communications position, bring in someone with science writing experience who can tell stories. Stakeholders—especially donors—need stories.

Equally important, target some initial funds to hire data science research and/or consulting staff. They can provide a return on your investment both through consultation support to the university, demonstrating the value of your initiative to campus for future provostial funding requests, and through their own impactful research, which can attract grant dollars. This means giving them PI status, or changing policies in departments and schools where this isn’t allowed. (Yes, really. The authors are baffled by departments that continue to stand in the way of grant dollars.) But have a clear plan for how these appointments will be sustained (Rule #3). Do they receive some initial funding but take responsibility for all or some of their own salaries over time? What are the expectations around fractional or full support?

Data-intensive research is by necessity a team sport. The team usually includes one or more members strong in the skill sets that define data science, in addition to subject matter experts. Rarely does one person, or even one “lab” group, have all the skills needed to complete a successful data-intensive project. This is where your initiative comes in. Help domain researchers build the right team by matching them with data scientists who have the needed skill sets. Depending on the size of the project, this could start as a new joint hire. Importantly, recognize that the skills that define one data scientist are not necessarily the same as the skills that define another. A data scientist can be someone formally trained in statistics who picked up programming in their free time and data management from practical experience. They can be the social scientist who received some formal data science training (a minor, certificate, or advanced degree) and learned reproducibility tools from colleagues or informal training events. They can be the computer scientist who specializes in data visualization and has training in data ethics. The combinations are endless. Finding the right team members means not just advertising for a “data scientist”—it means knowing what you need and finding the right people (or, if you are simply looking to expand your initiative, advertising more generally and building from whom you find). Be careful not to look only in your own discipline—biologists can learn a lot about image analysis from astronomers, just as political scientists can learn and apply algorithms from genomicists [19].

As soon as you are able to grow the team, consider hiring an ethnographer. Typically, they are data scientists themselves, doing the data science of data science, and they are invaluable for providing thoughtful, real-time feedback on your programs. Often they have their own consultations and original research to contribute (e.g., [20,21]).

Be transparent about your expectations and about how the staff should contribute to the mission of your initiative. Recognize that different people will have different balances of job duties (development and maintenance of research, educational, and service projects) that work best for them. Manage and evaluate their work accordingly.
Nurture passion projects, and clearly articulate their ties to your initiative’s strategic focus in public-facing documents and your internal groups (see Rule #8). This fuels collaboration internally as well as externally and will assist in your recruiting efforts when people know they will be valued and all aspects of their work will be recognized.

Hiring data science staff means you also need to think about their career paths. Promotion pathways and sustainability for data scientists and RSEs have been a topic of discussion for years (reviewed in [17]), yet until very recently there were disappointingly few examples of universities that have made real strides in this area [7]. Revisit the descriptions of current payroll titles and/or create new titles for data scientists that recognize and elevate the knowledge they contribute to campus. Similarly, career mentorship for staff data scientists is typically lacking, despite these relatively new positions needing more mentoring, not less [3,22]. We hope that, with increasing successes at the universities that have chosen to value and fund careers for data scientists and RSEs, more universities will follow and increase both funding and recognition for these staff, who enable so much of the research on campus.

Rule 5: Recognize and elevate data, software, and workflow contributions

Hand in hand with career paths are the metrics by which all data-intensive researchers, faculty, and staff are evaluated for hiring, tenure, and promotion. Major pain points for faculty and staff working specifically in data science have to do with the current overemphasis on first-author journal publications. Data- and software-intensive research can involve months or years of data curation, software design, and data management/analysis workflow development that are not easily published in traditional journals. And much of the work of a data scientist is “invisible” [23], especially within a new and growing organization. “Invisible” work includes maintenance of software (which is often underbudgeted, if budgeted for at all) and training or consultations for a seemingly endless flow of people who drop in, none of which counts as much toward academic advancement as grants and papers. This is ironic: without such collaboration the research would not happen, so why not measure and reward it? If professional staff advancement is the goal, be sure to include all of the invisible work in the evaluation process.

Joint faculty hires across departments are already challenging because different fields can have different expectations and measures of success: a few first-author papers in high-impact journals is expected in some departments, whereas others look for many submissions to conference proceedings, and some place more or less value on single authorship. These pain points are amplified for joint hires in data science, where trying to explain the importance of data science work to a domain committee member (and vice versa) makes attaining tenure or promotion incredibly difficult. Policies and metrics of success in higher education need to change so that open science, software, and data citations become recognized as equal partners to publications on CVs, and so that team members are recognized for their contributions, not their position on the author list [7]. Your champions can help by steering the metrics on hiring/tenure/promotion committees, but changes will happen faster if these ideas are echoed from above.
It is increasingly important for university and college leadership to recognize that the metrics of success for research in academia are changing, and not just around data science. It’s time for university policies around hiring/tenure/promotion to reflect how research and discovery actually get done. Data science can lead the way.

And finally, colleges and universities must signal their support by prioritizing research software in core budgets. Development and maintenance of the software that drives discovery is essential in our current research landscape. It is unreasonable and unsustainable to expect individual PIs to earmark grant funds for software maintenance (if the granting agency even allows it) when this software supports and drives discoveries across campus. Institutional funds for campus-wide resources, such as expensive journal subscriptions and the infrastructure needed to support and promote them, ought to be redistributed to include basic software maintenance, in collaboration with the Libraries and IT (Rule #10).

Rule 6: Focus on interdisciplinarity, but don’t overdilute

Interdisciplinarity is the essence of data science. Be sincere about your interdisciplinary efforts, but recognize when your core researchers have strength in a particular area and double down on it. Trying to help everyone in every field across your campus at once is neither realistic nor tenable. Start with the expertise of your staff and affiliates and build from there. It’s okay to be known for a subset of disciplines where data science is applied, for example, Earth science, biomedicine, or sociology. Focusing on your research strengths will help attract grant money. And as a university, you are ultimately part of a supply chain feeding data-savvy researchers and thought leaders into society. You will never be able to meet all of the demand in every discipline, so figure out what to supply based on the strengths of your staff and institutional collaborations. As your successes grow, hire in the areas where you have gaps in expertise or domain knowledge and grow into new collaborations over time.

One avenue to broaden your community of collaborators, and thus your interdisciplinarity, is a postdoc fellowship program. Setting aside funds for postdocs to propose projects that cross disciplines will be rewarded with increased visibility for your initiative and increased collaborations, creating more champions (Rule #2). But be aware that data-savvy postdocs will typically command a higher salary than their peers (setting aside for a moment that all postdocs are underpaid). Budget accordingly, and be prepared to negotiate with departments, especially in disciplines without large budgets that try to ensure equity across their postdoc population. In the more lucrative fields (e.g., computer science), you may need to arrange an additional contribution from the department to bring salaries up to the level of their peers. Match a domain mentor and a methods mentor for the postdoc project (or better, have the postdoc identify and engage the faculty mentors themselves). This kind of relationship building creates bridges between departments, with your initiative at the nexus. Faculty in departments across your campus will see the value of engaging with your initiative. And some who have been reluctant to bring data science into their work in the past may learn to appreciate data-driven approaches.
Rule 7: Emphasize responsible data science

If interdisciplinarity is the essence of data science, responsibility should be its pillar. Responsible data science isn’t just about providing an ethics course or discussion group. Ethical thinking and societal perspectives should be infused in your culture and in every project you work on. It is a way of thinking that covers all data science research, from conceiving of a project to disseminating the final product. This will become increasingly important as data science further captures the imagination of all stakeholders, while at the same time high-profile nefarious activities continue to threaten the integrity of the field. The potential for biases to propagate in algorithms and artificial intelligence (AI), as in facial recognition with its intrusive and racially biased outcomes, and the recent surge in “bad” data and public misinformation during the Coronavirus Disease 2019 (COVID-19) pandemic, highlight the need for much more rigorous training in ethics and social contexts throughout the data life cycle [1]. While this kind of training and context is often referred to as data science for the public good, it should be front and center in all data science projects and initiatives that have the potential to use data from, or impact the lives of, individuals or groups. These goals can only be achieved if the team reflects the diversity of the communities their work will impact, and with partnerships and all stakeholders at the table, bringing together STEM and the humanities such that a virtuous cycle of human impact is fed back into how data are collected and analyzed. And while data science moves quickly, projects must “move at the speed of trust” to carefully apply techniques and incorporate feedback and input from diverse groups (https://www.blackspace.org/manifesto).

As you think about the programs your initiative can offer campus, use the knowledge base of your research staff, postdocs, and students to focus on opportunities that play to their strengths and allow your projects to model responsibility. Bring people together around a tool, challenge, event, or idea that emphasizes these values and demonstrates practical applications of responsible and ethical approaches to data science. Some examples: XDs such as ImageXD (https://bids.berkeley.edu/research/image-xd) and TextXD (https://bids.berkeley.edu/research/textxd), Hackweeks [11] (https://uwescience.github.io/HackWeek-Toolkit/), Datapaloozas (e.g., Health Datapalooza https://academyhealth.org/events/2020-02/2020-health-datapalooza, MassCUE Datapalooza https://www.masscue.org/event/masscue-datapalooza-2020/), and Women in Data Science events (https://www.widsconference.org/).

Rule 8: Establish a set of guiding principles

As a new initiative, you will be tempted to say “yes” to every request. You will quickly get a sense of the needs of your campus and rapidly expand engagement and goodwill. But early on, be sure to establish your MVV (mission, vision, values) and revisit it annually. From your MVV, develop a set of guiding principles that create boundaries and help you recognize when a project or request is out of scope.
A few examples: transparency (encourage data and software projects to be open source), reproducibility (support and train for reproducible workflows on all projects), public good (emphasize projects that serve the public good; Rule #7), and data science for all (steer resources toward domains outside of computer science and statistics). While every project may not address every one of your guiding principles, bringing everything that you do back to those principles will help you shape a voice for your initiative that all your staff and affiliates understand, support, and echo across the campus. Knowing your focus will also help you manage workloads and job expectations, building in the time and guidance your people need to develop and nurture new data science ideas and programs. And down the road, it will help you and everyone on your team say “no” to requests that fall too far outside these principles. With finite resources, your growing reputation will bring more demand than you can meet (see Rule #6). Be selective in what you do to match your MVV, and partner with other groups on campus so you can direct requests elsewhere.

Rule 9: Engage with external communities

Data, and thus data-intensive research, are pervasive. Data drive business, shape policy-making, and influence how we see each other and ourselves in society. Not surprisingly, the private sector, government, and nongovernmental organizations (NGOs) are all looking for talent and partnerships. Engage with them. They have a lot to offer and are often ahead of where we are in academia. But recognize that external communities move at different paces and have different agendas; align yourself appropriately. Done correctly, these partnerships can be a win-win, bringing in funding and new data sets while supporting the development of new technologies and launching the careers of your students and postdocs. From the partner’s perspective, they gain access to a talent pipeline, a host of research expertise, and the ability to align with projects in keeping with a university’s mission, notably projects for the public good. One great example is the Data Science for Social Good program, pioneered at the University of Chicago (http://www.datasciencepublicpolicy.org/) and then picked up (with modifications) at Georgia Tech (https://ptc.gatech.edu/dssg), Carnegie Mellon (https://www.dssgfellowship.org), and the eScience Institute at the University of Washington (https://escience.washington.edu/dssg/).

Rule 10: Leverage core service groups

Libraries have been in the information business for centuries, long before computing and longer still before the data revolution. Leverage their expertise and the physical spaces they occupy (e.g., [24]). Consider an easy lift: jointly funding some data services support staff. Or be bold and work together to use library spaces as a one-stop shop for all data and information needs. Many libraries are converting to collaboration spaces as physical resources (such as print media) move online or offsite, and library administrators seek creative uses of their physical space. Repurposed library spaces are a great option for a data science initiative: libraries are considered “politically neutral” and are habitual places where students and researchers seek information and help. Having a neutral location can be critical to getting buy-in from multiple departments, and your initiative will be less likely to be seen as a territory grab by one or more existing departments.
Location, location, location.

Similarly, before campus IT groups were tasked with managing email servers, they spent a substantial amount of staff time providing consultations and supporting research computing needs. Now, as email management moves to outsourced services like Gmail, many IT groups have diversified (enterprise IT, research IT, educational IT). While research computing isn’t always under a campus IT group, nearly every school now provides some amount of research support again, whether through on-premises high-performance computing (HPC) consultations or by connecting researchers to and enabling the use of cloud services (e.g., https://cloudmaven.github.io/documentation/index.html). Reach out to them early to involve them in the collaborative process and do what you can do together. Excellent examples of such partnerships include DS3 at NYU (Data Science and Software Services; https://cds.nyu.edu/ds3/) and Northwestern IT Research Computing Services (http://www.it.northwestern.edu/research). Here again, depending on the structure of the IT department and its mandate, partnerships may need nurturing. Build relationships and develop a common purpose. Offering to support shared staff hires (e.g., cloud computing support) can get you a seat at the table with access to collaborators and pilot programs. Some IT groups may be in a position to support maintenance of more mature software (with additional budget from the university). Consider developing a pipeline by which software developed within your initiative can be maintained by IT, with university funding.

Data science initiatives in higher education, together with the Libraries and IT, can serve to match-make research partners and provide training to accelerate impactful research across campus. Perhaps most importantly, data science initiatives can promote responsible data science projects and products so that data-intensive researchers are recognized by the greater research community for their critical roles in establishing just outcomes.

Conclusion

The role that data and data science are playing, and will continue to play, in society is beyond question. This readership, while savvy when it comes to data, should not ignore the influence that data science will have on the future of our fields. These rules are intended to help you engage in this maelstrom. Some in the field of biology may argue that data science is just a new word for bioinformatics. However, the emergence of data science is very different from that of bioinformatics in its scope, touching nearly every discipline and reaching nearly every organization across sectors, from industry to government to academia. What is also clear is that data science represents a higher degree of interdisciplinarity and opportunity for collaboration than anything that has gone before. Institutions of higher education are coming to realize both the potential and the challenges that data science brings to their campuses. Starting and sustaining a data science initiative is not an easy task, but the reward is a path that leads to coordinated and deeper integration of responsible and thoughtful data-intensive practices across campus. The authors acknowledge that there are many paths to success, some of which we inevitably didn’t cover. We hope these rules offer a starting point and some guidance for our readers to learn more.

Supporting information

S1 Table. Examples of data science initiatives launched over the past 10 years. (DOCX)


Most cited references (13)


          Ten simple rules for responsible big data research

Introduction

The use of big data research methods has grown tremendously over the past five years in both academia and industry. As the size and complexity of available datasets have grown, so too have the ethical questions raised by big data research. These questions become increasingly urgent as data and research agendas move well beyond those typical of the computational and natural sciences to more directly address sensitive aspects of human behavior, interaction, and health. The tools of big data research are increasingly woven into our daily lives, including mining digital medical records for scientific and economic insights, mapping relationships via social media, capturing individuals’ speech and action via sensors, tracking movement across space, shaping police and security policy via “predictive policing,” and much more.

The beneficial possibilities for big data in science and industry are tempered by new challenges facing researchers that often lie outside their training and comfort zone. Social scientists now grapple with data structures and cloud computing, while computer scientists must contend with human subject protocols and institutional review boards (IRBs). While the connection between an individual datum and an actual human being can appear quite abstract, the scope, scale, and complexity of many forms of big data create a rich ecosystem in which human participants and their communities are deeply embedded and susceptible to harm. This complexity challenges any normative set of rules and makes devising universal guidelines difficult.

Nevertheless, the need for direction in responsible big data research is evident, and this article provides a set of “ten simple rules” for addressing the complex ethical issues that will inevitably arise. Modeled on PLOS Computational Biology’s ongoing collection of rules, the recommendations we outline involve more nuance than the words “simple” and “rules” suggest. This nuance is inevitably tied to our paper’s starting premise: all big data research on social, medical, psychological, and economic phenomena engages with human subjects, and researchers have the ethical responsibility to minimize potential harm. The variety of data sources, research topics, and methodological approaches in big data belies a one-size-fits-all checklist; as a result, these rules are less specific than some might hope. Rather, we exhort researchers to recognize the human participants and complex systems contained within their data and to make grappling with ethical questions part of their standard workflow. Towards this end, we structure the first five rules around how to reduce the chance of harm resulting from big data research practices; the second five rules focus on ways researchers can contribute to building best practices that fit their disciplinary and methodological approaches.

At the core of these rules, we challenge big data researchers who consider their data disentangled from the ability to harm to reexamine their assumptions. The examples in this paper show how often even seemingly innocuous and anonymized data have produced unanticipated ethical questions and detrimental impacts. This paper is a result of a two-year National Science Foundation (NSF)-funded project that established the Council for Big Data, Ethics, and Society, a group of 20 scholars from a wide range of social, natural, and computational sciences (http://bdes.datasociety.net/).
The Council was charged with providing guidance to the NSF on how best to encourage ethical practices in scientific and engineering research that utilizes big data research methods and infrastructures [1].

1. Acknowledge that data are people and can do harm

One of the most fundamental rules of responsible big data research is the steadfast recognition that most data represent or impact people. Simply starting with the assumption that all data are people until proven otherwise places the difficulty of disassociating data from specific individuals front and center. This logic is readily evident for “risky” datasets, e.g., social media with inflammatory language, but even seemingly benign data can contain sensitive and private information; e.g., it is possible to extract the exact heart rates of people from YouTube videos [2]. Even data that seemingly have nothing to do with people might impact individuals’ lives in unexpected ways, e.g., oceanographic data that change the risk profiles of communities and their property values, or Exchangeable Image File Format (EXIF) records from photos that contain location coordinates and reveal the photographer’s movement or even home location. Harm can also result when seemingly innocuous datasets about population-wide effects are used to shape the lives of individuals or stigmatize groups, often without procedural recourse [3,4]. For example, social network maps for services such as Twitter can determine credit-worthiness [5], opaque recidivism scores can shape criminal justice decisions in a racially disparate manner [6], and categorization based on zip codes resulted in less access to Amazon Prime same-day delivery for African-Americans in United States cities [7]. These high-profile cases show that apparently neutral data can yield discriminatory outcomes, thereby compounding social inequities. Other cases show that “public” datasets are easily adapted for highly invasive research by incorporating other data, such as Hague et al.’s [8] use of property records and geographic profiling techniques to allegedly identify the pseudonymous artist Banksy [9]. In particular, data ungoverned by substantive consent practices, whether social media or the residual DNA we continually leave behind us, may seem public but can cause unintentional breaches of privacy and other harms [9,10].

Start with the assumption that data are people (until proven otherwise), and use it to guide your analysis. No one gets an automatic pass on ethics.

2. Recognize that privacy is more than a binary value

Breaches of privacy are a key means by which big data research can do harm, and it is important to recognize that privacy is contextual [11] and situational [12], not reducible to a simple public/private binary. Just because something has been shared publicly does not mean any subsequent use would be unproblematic. Looking at a single Instagram photo by an individual has different ethical implications than looking at someone’s full history of social media posts. Privacy depends on the nature of the data, the context in which they were created and obtained, and the expectations and norms of those who are affected. Understand that your attitude towards acceptable use and privacy may not correspond with the attitudes of those whose data you are using, as privacy preferences differ across and within societies.
For example, Tene and Polonetsky [13] explore how pushing past social norms, particularly in novel situations created by new technologies, is perceived by individuals as “creepy” even when the practices do not violate data protection regulations or privacy laws. Social media apps that use users’ locations to push information, corporate tracking of individuals’ social media and private communications to gain customer intelligence, and marketing based on search patterns have all been perceived by some as “creepy” or even as outright breaches of privacy. Likewise, distributing health records is a necessary part of receiving health care, but this same sharing brings new ethical concerns when it goes beyond providers to marketers.

Privacy also goes beyond single individuals and extends to groups [10]. This is particularly resonant for communities that have historically been on the receiving end of discriminatory data-driven policies, such as the practice of redlining [14,15]. Other examples include community maps, made to identify problematic properties or to assert land rights, being reused by others to identify opportunities for redevelopment or exploitation [16]. Thus, reusing a seemingly public dataset could run counter to the original privacy intents of those who created it and raise questions about whether it represents responsible big data research.

Situate and contextualize your data to anticipate privacy breaches and minimize harm. The availability or perceived publicness of data does not guarantee lack of harm, nor does it mean that data creators consent to researchers using their data.

3. Guard against the reidentification of your data

It is problematic to assume that data cannot be reidentified. There are numerous examples of researchers with good intentions and seemingly good methods failing to anonymize data sufficiently to prevent the later identification of specific individuals [17]; in other cases, these efforts were extremely superficial [18,19]. When datasets thought to be anonymized are combined with other variables, the result may be unexpected reidentification, much like a chemical reaction triggered by the addition of a final ingredient. While the identificatory power of birthdate, gender, and zip code is well known [20], a number of other parameters, particularly the metadata associated with digital activity, may be as or even more useful for identifying individuals [21]. Surprising to many, unlabeled data such as network graphs of location and movement, DNA profiles, call records from mobile phone data, and even high-resolution satellite images of the earth can be used to reidentify people [22]. More important than specifying the variables that allow for reidentification, however, is the realization that it is difficult to recognize these vulnerable points a priori [23]. Factors discounted today as irrelevant or inherently harmless, such as battery usage, may very well prove to be a significant vector of personal identification tomorrow [24]. For example, the addition of spatial location can turn social media posts into a means of identifying home location [25], and Google’s reverse image search can connect previously separate personal activities, such as dating and professional profiles, in unanticipated ways [26]. Even data about groups (“aggregate statistics”) can have serious implications if they reveal that certain communities, for example, suffer from stigmatized diseases or social behavior much more than others [27].
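The identificatory power of quasi-identifier combinations described above is easy to screen for before a dataset is released. The following is a minimal sketch of such a pre-release check (our illustration, not part of the original article), assuming a pandas DataFrame with hypothetical column names; a check like this flags risk, but passing it is no guarantee against reidentification.

    # Minimal k-anonymity spot check: flag any combination of quasi-identifiers
    # shared by fewer than k records before a dataset is shared or published.
    import pandas as pd

    def smallest_group(df: pd.DataFrame, quasi_identifiers: list) -> int:
        """Size of the smallest group of records sharing the same quasi-identifier values."""
        return int(df.groupby(quasi_identifiers).size().min())

    # Hypothetical release candidate: birth year, 3-digit zip prefix, gender.
    df = pd.DataFrame({
        "birth_year": [1980, 1980, 1991, 1991, 1991],
        "zip3":       ["981", "981", "100", "100", "100"],
        "gender":     ["F",   "F",   "M",   "M",   "F"],
    })

    k = 2  # minimum acceptable group size
    if smallest_group(df, ["birth_year", "zip3", "gender"]) < k:
        # The last record above is unique on all three fields, so this fires.
        print("Reidentification risk: coarsen or suppress quasi-identifiers before sharing.")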
Identify possible vectors of reidentification in your data. Work to minimize them in your published results to the greatest extent possible.

4. Practice ethical data sharing

For some projects, sharing data is an expectation of the human participants involved and thus a key part of ethical research. For example, in rare genetic disease research, biological samples are shared in the hope of finding cures, making dissemination a condition of participation. In other projects, questions of the larger public good (an admittedly difficult category to define) provide compelling arguments for sharing data, e.g., the NIH-sponsored database of Genotypes and Phenotypes (dbGaP), which makes deidentified genomic data widely available to researchers, democratizing access, or the justice claim made by the Institute of Medicine about the value of mandating that individual-level data from clinical trials be shared among researchers [28]. Asking participants for broad, as opposed to narrowly structured, consent for downstream data management makes it easier to share data. Careful research design and guidance from IRBs can help clarify consent processes. However, we caution that even when broad consent was obtained upfront, researchers should consider the best interests of the human participants, proactively considering the likelihood of privacy breaches and reidentification issues. This is of particular concern for human DNA data, which are uniquely identifiable.

These types of projects, however, in which rules of use and sharing are well governed by informed consent and the right of withdrawal, are increasingly the exception rather than the rule for big data. In our digital society, we are followed by data clouds composed of the trace elements of daily life (credit card transactions, medical test results, closed-circuit television (CCTV) images and video, smart phone apps, and so on) collected under mandatory terms of service rather than responsible research design overseen by university compliance officers. While we might wish for the standards of informed consent and right of withdrawal, these informal big data sources are gathered by agents other than the researcher: private software companies, state agencies, and telecommunications firms. These data are only accessible to researchers after their creation, making it impossible to gain informed consent a priori, and contacting the human participants retroactively for permission is often forbidden by the owner of the data or is impossible to do at scale. Of course, researchers within the software companies and state institutions collecting these data have a special responsibility to address the terms under which data are collected, but that does not exempt the end user of shared data. In short, the burden of ethical use (see Rules 1 to 3) and sharing is placed on the researcher, since the terms of service under which human subjects’ data were produced can be extremely broad, with little protection against breaches of privacy. In these circumstances, researchers must balance the requirements from funding agencies to share data [29] with their responsibilities to the human beings behind the data they acquired. A researcher needs to inform funding agencies about possible ethical concerns before the research begins and guard against reidentification before sharing.

Share data as specified in research protocols, but proactively address concerns of potential harm from informally collected big data.
5. Consider the strengths and limitations of your data; big does not automatically mean better

To do both accurate and responsible big data research, it is important to ground datasets in their proper context, including conflicts of interest. Context also affects every stage of research, from data acquisition, to cleaning, to interpretation of findings, to dissemination of the results. During data acquisition, it is crucial to understand both the source of the data and the rules and regulations under which they were gathered. This is especially important for research conducted in relatively loose regulatory environments, in which the uses to which data are put may conflict with the expectations of those who provided them. One possible model is the set of ethical norms employed to track the provenance of artifacts, often in cooperation and collaboration with the communities from which they come (e.g., archaeologists working in indigenous communities to determine the disposition of material culture). In a similar manner, computer scientists use data lineage techniques to track the evolution of a dataset, and often to trace bugs in the data. Being mindful of the data’s context provides the foundation for clarifying when your data and analysis are working and when they are not.

While it is tempting to interpret findings based on big data as a clear outcome, a key step within scientific research is clearly articulating what data or an indicator represent and what they do not. Are your findings as clear-cut if your interpretation of a social media posting switches from a recording of fact to the performance of a social identity? Given the messy, almost organic nature of many datasets derived from social actions, it is fundamental that researchers be sensitive to the potential multiple meanings of data. For example, is a Facebook post or an Instagram photo best interpreted as approval or disapproval of a phenomenon, a simple observation, or an effort to improve status within a friend network? While any of these interpretations is potentially valid, the lack of context makes it difficult to justify the choice of one understanding over another. Reflecting on the potential multiple meanings of data fosters greater clarity in research hypotheses and also makes researchers aware of the other potential uses of their data. Again, the act of interpretation is a human process, and because the judgments of those (re)using your data may differ from your own, it is essential to clarify both the strengths and the shortcomings of the data.

Document the provenance and evolution of your data. Do not overstate clarity; acknowledge messiness and multiple meanings.

6. Debate the tough ethical choices

Research involving human participants at federally funded institutions is governed by IRBs charged with preventing harm through well-established procedures that are familiar to many researchers. IRBs, however, are not the sole arbiter of ethics; many ethical issues involving big data fall outside their governance mandate. Precisely because big data researchers often encounter situations that are foreign to, or outside of, the mandate of IRBs, we emphasize the importance of debating the issues within groups of peers. The lack of clear-cut solutions and governance protocols should be understood not as a bug but as a feature, one that researchers should embrace within their own work.
Discussion and debate of ethical issues are an essential part of professional development—both within and between disciplines—as they can establish a mature community of responsible practitioners. Bringing these debates into coursework and training can produce peer reviewers who are particularly well placed to raise ethical questions and spur recognition of the need for these conversations. A precondition of any formal ethics rules or regulations is the capacity to have such open-ended debates. As digital social scientist and ethicist Annette Markham [30] writes, "we can make [data ethics] an easier topic to broach by addressing ethics as being about choices we make at critical junctures; choices that will invariably have impact." Given the nature of big data, bringing technical, scientific, social, and humanistic researchers together on projects enables this debate to emerge as a strength because, if done well, it provides the means to understand the ethical issues from a range of perspectives and to disrupt disciplinary silos [31]. There are a number of good models for interdisciplinary ethics research, such as the trainings offered by the Science and Justice research center at the University of California, Santa Cruz [32] and the Values in Design curricula [33]. Research ethics consultation services, available at some universities as a result of the Clinical and Translational Science Award (CTSA) program of the National Institutes of Health (NIH), can also be resources for researchers [34]. Some of the better-known big data ethical cases—e.g., the Facebook emotional contagion study [35]—provide extremely productive venues for cross-disciplinary discussion. Why might one set of scholars see such a study as relatively benign while other groups see significant ethical shortcomings? Where do researchers differ in drawing the line between responsible and irresponsible research, and why? Understanding the different ways people discuss these challenges and processes provides an important check for researchers, especially those from disciplines not focused on human subject concerns. Moreover, the high visibility surrounding these events means that (for better or worse) they represent the "public" view of big data research, and becoming an active member of this conversation ensures that researchers can give voice to their insights rather than simply being on the receiving end of policy decisions. To help these debates along, the Council for Big Data, Ethics, and Society has produced a number of case studies focused specifically on big data research, as well as a white paper with recommendations to start these important conversations ( http://bdes.datasociety.net/output/ ). Engage your colleagues and students about ethical practice for big data research.

7. Develop a code of conduct for your organization, research community, or industry

The process of debating tough choices inserts ethics directly into the workflow of research, making "faking ethics" as unacceptable as faking data or results. Internalizing these debates, rather than treating them as an afterthought or a problem to outsource, is key to successful research, particularly when using trace data produced by people. This is relevant for all researchers, including those within industry who have privileged access to the data streams of digital daily life.
Public attention to the ethical use of these data should not be avoided; after all, these datasets are based on an infrastructure that billions of people use to live their lives, and there is a compelling public interest in research being done responsibly. One of the best ways to cement this in daily practice is to develop codes of conduct for use in your organization or research community and for inclusion in formal education and ongoing training. Such codes can provide guidance in the peer review of publications and in funding considerations. In practice, a highly visible case of unethical research brings problems to an entire field, not just to those directly involved. Moreover, designing codes of conduct makes researchers more successful. Issues that might otherwise be ignored until they blow up—e.g., Are we abiding by the terms of service or users' expectations? Does the general public consider our research "creepy"? [13]—can be addressed thoughtfully rather than in a scramble for damage control. This is particularly relevant to public-facing private businesses interested in avoiding potentially unfavorable attention. An additional, longer-term advantage of developing codes of conduct is that change is clearly coming to big data research. The NSF funded the Council for Big Data, Ethics, and Society as a means of getting in front of a developing issue and of pending regulatory changes within the federal rules for the protection of human subjects that are currently under review [1]. Actively developing rules for responsible big data research within a research community is a key way researchers can join this ongoing process. Establish appropriate codes of ethical conduct within your community. Make industry researchers and representatives of affected communities active contributors to this process.

8. Design your data and systems for auditability

Although codes of conduct will vary depending on the topic and research community, a particularly important element is designing data and systems for auditability. Responsible internal auditing processes flow naturally into external audit systems and also keep track of factors that might contribute to problematic outcomes. Developing automated testing processes for assessing problematic outcomes, and mechanisms for auditing others' work during review processes, can help strengthen research as a whole. The goal of auditability is to clearly document when decisions are made and, if necessary, to backtrack to an earlier dataset and address an issue at its root (e.g., if strategies for anonymizing data are compromised). Designing for auditability also brings direct benefits to researchers by providing a mechanism for double-checking work and forcing oneself to be explicit about decisions, thereby increasing understandability and replicability. For example, many types of social media and other trace data are unstructured, and answers to even basic questions such as network ties, location, and randomness depend on the steps taken to collect and collate the data. Systems of auditability clarify how different datasets (and the subsequent analyses) differ from each other, aiding understanding and producing better research. Plan for and welcome audits of your big data practices.

9. Engage with the broader consequences of data and analysis practices

It is also important for responsible big data researchers to think beyond the traditional metrics of success in business and the academy.
For example, the energy demands of digital daily life, a key source of big data for social science research, are significant in this era of climate change [36]. How might big data research lessen the environmental impact of data analytics work? For example, should researchers take the lead in asking cloud storage providers and data processing centers to shift to sustainable and renewable energy sources? As important and publicly visible users of the cloud, big data researchers collectively represent an interest group that could rally behind such a call for change. The pursuit of citations, reputation, or money is a key incentive for pushing research forward, but it can also result in unintended and undesirable outcomes. In contrast, we might ask to what extent a research project is focused on enhancing the public good or serving the underserved of society. Are questions about equity or the promotion of other public values being addressed in your data streams, or is a big data focus rendering them invisible or irrelevant to your analysis [37]? How can increasingly vulnerable yet fundamentally important public resources—such as state-mandated cancer registries—be protected? How might research aid or inhibit different business and political actors? While not all big data research need take up social and cultural questions, a fundamental aim of research goes beyond understanding the world to considering ways to improve it. Recognize that doing big data research has society-wide effects.

10. Know when to break these rules

The final (and counterintuitive) rule is the charge to recognize when it is appropriate to stray from these rules. For example, in times of natural disaster or public health emergency, it may be important to temporarily set aside questions of individual privacy in order to serve a larger public good. Likewise, the use of genetic or other biological data collected without informed consent might be vital in managing an emerging disease epidemic. Moreover, be sure to review the regulatory expectations and legal demands associated with the protection of privacy in your dataset. Breaking rules is an exceedingly slippery slope, so before following this rule (to break others), be cautious that the "emergency" is not simply a convenient justification. The best way to ensure this is to build experience in engaging in the tough debates (Rule 6), constructing codes of conduct (Rule 7), and developing systems for auditing (Rule 8). The more mature a community of researchers is about its processes, checks, and balances, the better equipped it is to assess when breaking the rules is acceptable. It may very well be that you do not arrive at a final, clear set of practices. After all, just as privacy is not binary (Rule 2), neither is responsible research. Ethics is often about finding a good or better, but not perfect, answer, and it is important to ask (and try to answer) the challenging questions. Only through this engagement can a culture of responsible big data research emerge. Understand that responsible big data research depends on more than meeting checklists.

Conclusion

The goal of this set of ten rules is to help researchers do better work and ultimately become more successful while avoiding larger complications, including public mistrust. To achieve this, however, scholars must shift from a mindset that is rigorous when focused on techniques and methodology but naïve when it comes to ethics.
Statements to the effect that "Data is [sic] already public" [38] are unjustified simplifications of much more complex data ecosystems embedded in even more complex and contingent social practices. Data are people, and maintaining a rigorously naïve definition to the contrary [18] will end up harming research efforts in the long run as pushback comes from the people whose actions and utterances are subject to analysis. In short, responsible big data research is not about preventing research but about making sure that the work is sound and accurate and that it maximizes good while minimizing harm. The problems and choices researchers face are real, complex, and challenging, and so too must be our responses. We must treat big data research with the respect it deserves and recognize that unethical research undermines the production of knowledge. Fantastic opportunities to better understand society and our world exist, but with these opportunities comes the responsibility to consider the ethics of our choices in the everyday practices and actions of our research. The Council for Big Data, Ethics, and Society ( http://bdes.datasociety.net/ ) provides an initial set of case studies, papers, and even ten simple rules for guiding this process; it is now incumbent on you to use and improve these in your research.

            Ten Simple Rules for Taking Advantage of Git and GitHub

Introduction

Bioinformatics is a broad discipline in which one common denominator is the need to produce and/or use software that can be applied to biological data in different contexts. To enable and ensure the replicability and traceability of scientific claims, it is essential that the scientific publication, the corresponding datasets, and the data analysis are made publicly available [1,2]. All software used for the analysis should be either carefully documented (e.g., for commercial software) or, better yet, openly shared and directly accessible to others [3,4]. The rise of openly available software and source code, alongside concomitant collaborative development, is facilitated by the existence of several code repository services such as SourceForge, Bitbucket, GitLab, and GitHub, among others. These resources are also essential for collaborative software projects because they enable the organization and sharing of programming tasks between different remote contributors. Here, we introduce the main features of GitHub, a popular web-based platform that offers a free and integrated environment for hosting the source code, documentation, and project-related web content of open-source projects. GitHub also offers paid plans for private repositories (see Box 1) for individuals and businesses, as well as free plans including private repositories for research and educational use.

Box 1

By default, GitHub repositories are freely visible to all. Many projects decide to share their work publicly and openly from the start in order to attract visibility and to benefit from contributions from the community early on. Other groups prefer to work privately on projects until they are ready to share their work. Private repositories ensure that work is hidden but also limit collaboration to just those users who are given access to the repository. These repositories can then be made public at a later stage, for example, upon submission, acceptance, or publication of a corresponding journal article. In some cases, when a collaboration was exclusively meant to be private, repositories might never be made publicly accessible.

GitHub relies, at its core, on the well-known open-source version control system Git, originally designed by Linus Torvalds for the development of the Linux kernel and now developed and maintained by the Git community. One reason for GitHub's success is that it offers more than a simple source code hosting service [5,6]. It provides developers and researchers with a dynamic and collaborative environment, often referred to as a social coding platform, that supports peer review, commenting, and discussion [7]. A diverse range of efforts, from individual and large bioinformatics projects to laboratory repositories and global collaborations, have found GitHub to be a productive place to share code and ideas and to collaborate (see Table 1).

Table 1. Bioinformatics repository examples with good practices of using GitHub. Each entry lists the name of the repository, the type of example (e.g., issue tracking, branch structure, unit tests), and the URL of the example.
Adam | Community Project, Multiple forks | https://github.com/bigdatagenomics/adam
BioPython [18] | Community Project, Multiple contributors | https://github.com/biopython/biopython/graphs/contributors
Computational Proteomics Unit | Lab Repository | https://github.com/ComputationalProteomicsUnit
Galaxy Project [19] | Community Project, Bioinformatics Repository | https://github.com/galaxyproject/galaxy
GitHub Paper | Manuscript, Issue discussion, Community Project | https://github.com/ypriverol/github-paper
MSnbase [20] | Individual project repository | https://github.com/lgatto/MSnbase/
OpenMS [21] | Bioinformatics Repository, Issue discussion, branches | https://github.com/OpenMS/OpenMS/issues/1095
PRIDE Inspector Toolsuite [22] | Project Organization, Multiple projects | https://github.com/PRIDE-Toolsuite
Retinal wave data repository [23] | Individual project, Manuscript, Binary data organized | https://github.com/sje30/waverepo
SAMtools [24] | Bioinformatics Repository, Project Organization | https://github.com/samtools
rOpenSci | Community Project, Issue discussion | https://github.com/ropensci
The Global Alliance For Genomics and Health | Community Project | https://github.com/ga4gh

Some of the recommendations outlined below are broadly applicable to repository hosting services. However, our main aim is to highlight specific GitHub features. We provide a set of recommendations that we believe will help the reader to take full advantage of GitHub's features for managing and promoting projects in bioinformatics as well as in many other research domains. The recommendations are ordered to reflect a typical development process: learning Git and GitHub basics, collaboration, use of branches and pull requests, labeling and tagging of code snapshots, tracking project bugs and enhancements using issues, and dissemination of the final results.

Rule 1: Use GitHub to Track Your Projects

The backbone of GitHub is the distributed version control system Git. Every change, from fixing a typo to a complete redesign of the software, is tracked and uniquely identified. Although Git has a complex set of commands and can be used for rather complex operations, learning to apply the basics requires only a handful of new concepts and commands and will provide a solid grounding for efficiently tracking code and related content for research projects. Many introductory and detailed tutorials are available (see Table 2 below for a few examples). In particular, we recommend A Quick Introduction to Version Control with Git and GitHub by Blischak et al. [5].

Table 2. Online courses, tutorials, and workshops about Git and GitHub for scientists.

git help and git help -a | documentation installed with Git
Karl Broman's Git/Github Guide | http://kbroman.org/github_tutorial/
Introduction to Git | http://git-scm.com/book/ch1-3.html
GitHub Training | https://training.github.com/
GitHub Guides | https://guides.github.com/
Good Resources for Learning Git and GitHub | https://help.github.com/articles/good-resources-for-learning-git-and-github/
Software Carpentry: Version Control with Git | http://swcarpentry.github.io/git-novice/

In a nutshell, initializing a (local) repository (often abbreviated as repo) marks a directory as one to be tracked (Fig 1). All or parts of its content can be added explicitly to the list of files to track.
Fig 1. The structure of a GitHub-based project, illustrating the project structure and interactions with the community.

cd project   ## move into the directory to be tracked
git init     ## initialize the local repository
## add individual files such as project description, reports, source code
git add README project.md code.R
git commit -m "initial commit"   ## save the current local snapshot

Subsequently, every change to the tracked files, once committed, will be recorded as a new revision, or snapshot, uniquely identifying the changes in all the modified files. Git is remarkably effective and efficient in archiving the complete history of a project by, among other things, storing only the differences between files. In addition to local copies of the repository, it is straightforward to create remote repositories on GitHub (called origin, with default branch master—see below) using the web interface, and then to synchronize the local and remote repositories.

git push origin master   ## push local changes to the remote repository
git pull origin master   ## pull remote changes into the local repository

Following Tony Rossini's advice in 2005 to "commit early, commit often, and commit in a repository from which we can easily roll-back your mistakes," one can organize one's work in small incremental changes. At any time, it is possible to go back to a previous version. In larger projects, multiple users are able to work on the same remote repository, with all contributions being recorded, restorable, and attributed to their author. Users usually track source code, text files, images, and small data files inside their repositories and generally do not track derived files such as build logs or compiled binaries (see Box 2 for how to handle large binary files on GitHub). And, although the majority of GitHub repositories are used for software development, users can also keep text documents such as analysis reports and manuscripts (see, for example, the repository for this manuscript at https://github.com/ypriverol/github-paper).

Box 2

Using GitHub or any similar versioning/tracking system is not a replacement for good project management; it is an extension and improvement of good project and file management (see, for example, [9]). One practical consideration when using GitHub is dealing with large binary files. Binary files, such as images, videos, executables, and many of the raw data formats used in bioinformatics, are stored as single large entities in Git. As a result, every change, even a minimal one, leads to a complete new copy of the file in the repository, producing large size increments and making it impossible to search (see https://help.github.com/articles/searching-code/) and compare file content across revisions. Git offers a Large File Storage (LFS) module that replaces such large files with small pointers while the binary content itself is stored remotely, resulting in smaller and faster repositories. Git LFS is also supported by GitHub, albeit with a space quota or for a fee, so that you can retain your usual GitHub workflow (https://help.github.com/categories/managing-large-files/) (S1 File, Section 1).
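Box 2's large-file workflow can be summarized in a short command sketch (the tracked file pattern and data file name below are hypothetical):

git lfs install          ## set up the Git LFS hooks, once per machine
git lfs track "*.raw"    ## store files matching this pattern as LFS pointers
git add .gitattributes   ## the tracking rule itself is a versioned file
git add data/run1.raw    ## staged as a small pointer; content goes to LFS storage
git commit -m "add raw data file via Git LFS"

On push, the pointer files travel with the repository while the binary content is uploaded separately to the LFS storage endpoint, subject to GitHub's quotas.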
Due to its distributed design, each up-to-date local Git repository is an entire, exact historical copy of everything that was committed—file changes, commit message logs, etc. These copies act as independent backups, present on each user's storage device. Git can therefore be considered fault-tolerant, a win over centralized version control systems: if the remote GitHub server is unavailable, collaboration and work can continue between users. The web interface offered by GitHub provides friendly tools to perform many basic operations and a gentle introduction to a richer and more complex set of functionalities. Various graphical user-interface-driven clients for managing Git and GitHub repositories are also available (https://www.git-scm.com/downloads/guis). Many editors and development environments, such as the popular RStudio editor for the R programming language [8], directly integrate code versioning using Git and GitHub. In addition, for remote Git repositories, GitHub provides its own features, which are described in subsequent rules (Fig 1).

Rule 2: GitHub for Single Users, Teams, and Organizations

Public projects on GitHub are visible to everyone, but write permission, i.e., the ability to directly modify the content of a repository, needs to be granted explicitly. As a repository owner, you can grant this right to other GitHub users. In addition to being owned by users, repositories can also be created and managed as part of teams and organizations. Project managers can structure projects to manage permissions at different levels: users, teams, and organizations. Users are the central element of GitHub, as in any other social network. Every user has a profile listing their GitHub projects and activities, which can optionally be populated with personal information including name, email address, image, and webpage. To stay up to date with the activity of other users, one can follow their accounts (see also Rule 10). Collaboration can be achieved by simply adding a trusted collaborator, thereby granting write access. However, development in large projects is usually done by teams of people within a larger organization. GitHub organizations are a great way to manage team-based access permissions for the individual projects of institutes, research labs, and large open-source projects that need multiple owners and administrators (Fig 1). We recommend that you, as an individual researcher, make your profile visible to other users and display all of the projects and organizations you are working in.

Rule 3: Developing and Collaborating on New Features: Branching and Forking

Anyone with a GitHub account can fork any repository they have access to. Forking creates a complete copy of the content of the repository, while retaining a link to the original "upstream" version. One can then start working on the same code base in one's own fork (https://help.github.com/articles/fork-a-repo/) under one's username (see, for example, https://github.com/ypriverol/github-paper/network/members for this work) or organization (see Rule 2). Forking a repository allows users to freely experiment with changes without affecting the original project and forms the basis of social coding. It allows anyone to develop and test novel features with existing code and offers the possibility of contributing novel features, bug fixes, and improvements to the documentation back into the original upstream repository (by opening a pull request), thereby becoming a contributor. Forking a repository and providing pull requests constitutes a simple method for collaboration inside loosely defined teams and across more formal organizational boundaries, with the original repository owner(s) retaining control over which external contributions are accepted.
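As a minimal command-line sketch of this fork-and-pull-request workflow (the user names, repository URL, and branch name below are hypothetical; the pull request itself is opened in the GitHub web interface):

## clone your fork, not the upstream repository
git clone https://github.com/yourname/project.git
cd project
## keep a named reference to the original ("upstream") repository
git remote add upstream https://github.com/upstream-owner/project.git
## develop the contribution on a dedicated feature branch
git checkout -b fix-readme-typo
git add README
git commit -m "fix typo in README"
## publish the branch to your fork, then open a pull request on GitHub
git push origin fix-readme-typo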
Once a pull request is opened for review and discussion, it usually results in additional insights and increased code quality [7]. Many contributors can work on the same repository at the same time without running into edit conflicts. There are multiple strategies for this, and the most common is to use Git branches to separate different lines of development. Active development is often performed on a development branch, while stable versions, i.e., those used for a software release, are kept in a master or release branch (see, for example, https://github.com/OpenMS/OpenMS/branches). In practice, developers often work concurrently on one or several features or improvements. To keep commits of the different features logically separated, distinct branches are typically used. Later, when development is complete and verified to work (i.e., none of the tests fail; see Rule 5), new features can be merged back into the development line or master branch. In addition, one can always pull the currently up-to-date master branch into a feature branch to adapt the feature to changes in the master branch. When developing different features in parallel, there is a risk of applying incompatible changes in different branches/forks; these are said to become out of sync. Ideally, branches are short-term departures from master. If you pull frequently, you will keep your copy of the repository up to date and have the opportunity to merge your changed code with others' contributions, ideally without having to manually resolve conflicts to bring the branches back in sync.

Rule 4: Naming Branches and Commits: Tags and Semantic Versions

Tags can be used to label versions during the development process. Version numbering should follow "semantic versioning" practice, with the format X.Y.Z, where X is the major, Y the minor, and Z the patch version of the release, possibly followed by metadata, as described at http://semver.org/. This semantic versioning scheme provides users with coherent version numbers that document the extent (bug fixes or new functionality) and backwards compatibility of new releases. Correct labeling allows developers and users to easily recover older versions, compare them, or simply use them to reproduce results described in publications (see Rule 8). This approach also helps to define a coherent software publication strategy.
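As a minimal sketch, tagging and publishing a release with a semantic version number looks as follows (the version number and message are hypothetical):

git tag -a v1.2.3 -m "release 1.2.3"   ## annotated tag naming the current snapshot
git push origin v1.2.3                 ## tags are not pushed by default
git checkout v1.2.3                    ## later: check out exactly this released version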
Rule 5: Let GitHub Do Some Tasks for You: Integrate

The first rule of software development is that the code needs to be ready to use as soon as possible [10], to remain so during development, and to be well documented and tested. In 2005, Martin Fowler defined the basic principles for continuous integration in software development [11]. These principles have become the main reference for best practices in continuous integration, providing the framework needed to deploy software and, in some ways, also data. In addition to mere error-free execution, dedicated code testing is aimed at detecting possible bugs introduced by new features or by changes in the code or dependencies, as well as detecting wrong results, often known as logic errors, in which the source code produces a different result than what was intended. Continuous integration provides a way to automatically and systematically run a series of tests to check the integrity and performance of code, a task that can be automated through GitHub. GitHub offers a set of hooks (automatically executed scripts) that are run after each push to a repository, making it easier to follow the basic principles of continuous integration. The GitHub web hooks allow third-party platforms to access and interact with a GitHub repository and thus to automate post-processing tasks. Continuous integration can be achieved with Travis CI, a hosted continuous integration platform that is free for all open-source projects. Travis CI builds and tests the source code using a plethora of options, such as different platforms and interpreter versions (S1 File, Section 2). In addition, it offers notifications that let your team and contributors know whether the new changes work, preventing the introduction of errors in the code (for instance, when merging pull requests) and keeping the repository ready to use at all times.
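As an illustrative sketch only (the project language, interpreter version, and test commands below are hypothetical; the full range of options is described in the Travis CI documentation), enabling continuous integration amounts to committing a .travis.yml configuration file at the root of the repository:

## create a minimal Travis CI configuration for a hypothetical Python project
cat > .travis.yml <<'EOF'
language: python
python:
  - "3.6"
install:
  - pip install -r requirements.txt   # project dependencies
script:
  - pytest                            # the test suite run on every push
EOF
git add .travis.yml
git commit -m "enable continuous integration"
git push origin master   ## once Travis CI is enabled for the repository, each push triggers a build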
Rule 6: Let GitHub Do More Tasks for You: Automate

More than just code compilation and testing can be integrated into your software project: GitHub hooks can be used to automate numerous tasks to help improve the overall quality of your project. An important complement to successful test completion is demonstrating that the tests sufficiently cover the existing code base. For this, the integration of Codecov is recommended. This service reports how much of the code base, and which lines of code, are executed as part of your code tests. The Bioconductor project, for example, highly recommends that packages implement unit testing (S1 File, Section 2) to support developers in their package development and maintenance (http://bioconductor.org/developers/unitTesting-guidelines/), and it systematically tests the coverage of all of its packages (https://codecov.io/github/Bioconductor-mirror/). One might also consider generating the documentation upon code/documentation modification (S1 File, Section 3). This implies that your projects provide comprehensive documentation so others can understand and contribute back to them. For Python or C/C++ code, automatic documentation generation can be done using Sphinx and subsequently integrated into GitHub using Read the Docs. All of these platforms create reports and badges (sometimes called shields) that can be included on your GitHub project page, helping to demonstrate that the content is of high quality and well maintained.

Rule 7: Use GitHub to Openly and Collaboratively Discuss, Address, and Close Issues

GitHub issues are a great way to keep track of bugs, tasks, feature requests, and enhancements. While classical issue trackers are primarily intended to be used as bug trackers, GitHub issue trackers follow a different philosophy: each tracker has its own section in every repository and can be used to trace bugs, new ideas, and enhancements by using a powerful tagging system. The main objective of issues in GitHub is to promote collaboration and provide context by using cross-references. Raising an issue does not require lengthy forms to be completed; only a title and, preferably, at least a short description are needed. Issues have very clear formatting and provide space for optional comments, which allow anyone with a GitHub account to provide feedback. For example, if a developer needs more information to be able to reproduce a bug, he or she can simply request it in a comment. Additional elements of issues are (i) color-coded labels that help to categorize and filter issues, (ii) milestones, and (iii) one assignee responsible for working on the issue. These elements help developers to filter and prioritize tasks and turn an issue tracker into a planning tool for their project. It is also possible for repository administrators to create issue and pull request templates (https://help.github.com/articles/helping-people-contribute-to-your-project/) (see Rule 3) to customize and standardize the information to be included when contributors open issues. GitHub issues are thus dynamic, and they pose a low entry barrier for users to report bugs and request features. A well-organized and tagged issue tracker helps new contributors and users to understand a project more deeply. As an example, one issue in the OpenMS repository (https://github.com/OpenMS/OpenMS/issues/1095) allowed the interaction of eight developers and attracted more than one hundred comments. Contributors can add figures, comments, and references to other issues and pull requests in the repository, as well as direct references to code. As another illustration of the generic and wide application of issues, we (https://github.com/ypriverol/github-paper/issues) and others (https://github.com/ropensci/RNeXML/issues/121) have used GitHub issues to discuss and comment on changes in manuscripts and to address reviewers' comments.

Rule 8: Make Your Code Easily Citable, and Cite Source Code!

It is good research practice to ensure permanent and unambiguous identifiers for citable items like articles, datasets, or biological entities such as proteins, genes, and metabolites (see also Box 3). Digital Object Identifiers (DOIs) have been used for many years as unique and unambiguous identifiers enabling the citation of scientific publications. More recently, a trend has started to mint DOIs for other types of scientific products, such as datasets [12] and training materials (for example, [13]). A key motivation for this is to build a framework for giving scientists broader credit for their work [14,15] while simultaneously supporting clearer, more persistent ways to cite and track it. Helping to drive this change are funding agencies such as the National Institutes of Health (NIH) and the National Science Foundation (NSF) in the United States and the Research Councils in the United Kingdom, which increasingly recognize the importance of research products such as publicly available datasets and software.

Box 3

Every repository should ideally have the following three files. The first, and arguably most important, is a LICENSE file (see also Rule 8) that clearly defines the permissions and restrictions attached to the code and other files in your repository. The second is a README file, which provides, for example, a short description of the project, a quick start guide, information on how to contribute, a TODO list, and links to additional documentation. Such README files are typically written in Markdown, a simple markup language that is automatically rendered on GitHub. Finally, a CITATION file informs your users how to cite and credit your project.

A common issue with software is that it normally evolves at a different speed than the text published in the scientific literature. In fact, it is common to find software with novel features and functionality that were not described in the original publication. GitHub now integrates with archiving services such as Zenodo and Figshare, enabling DOIs to be assigned to code repositories.
The procedure is relatively straightforward (see https://guides.github.com/activities/citable-code/), requiring only the provision of metadata and a series of administrative steps. By default, Zenodo creates an archive of a repository each time a new release is created in GitHub, ensuring that the cited code remains up to date. Once the DOI has been assigned, it can be added to literature information resources such as Europe PubMed Central [16]. As already mentioned in the introduction, the reproducibility of scientific claims should be enabled by providing the software, the datasets, and the process leading to interpretable results that were used in a particular study. As much as possible, publications should highlight that the code is freely available in, for example, GitHub, together with any other relevant outputs that may have been deposited. In our experience, this openness substantially increases the chances of getting the paper accepted for publication. Journal editors and reviewers are given the opportunity to reproduce findings during the manuscript review process, increasing confidence in the reported results. In addition, once the paper is published, your work can be reproduced by other members of the scientific community, which can increase citations and foster opportunities for further discussion and collaboration. The availability of a public repository containing the source code does not by itself make the software open source. You should use an Open Source Initiative (OSI)-approved license that defines how the software can be freely used, modified, and shared. Common licenses, such as those listed on http://choosealicense.com, are preferred. Note that the LICENSE file in the repository should be a plain-text file containing the contents of an OSI-approved license, not just a reference to the license.

Rule 9: Promote and Discuss Your Projects: Web Page and More

The traditional way to promote scientific software is by publishing an associated paper in the peer-reviewed scientific literature, though, as pointed out by Buckheit and Donoho, this is just advertising [17]. Additional steps can boost the visibility of an organization. For example, GitHub Pages are simple websites freely hosted by GitHub. Users can create and host blogs, help pages, manuals, tutorials, and websites related to specific projects. Pages comes with a powerful static site generator called Jekyll that can be integrated with frameworks such as Bootstrap or platforms such as Disqus to support and moderate comments.
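A minimal sketch of the classic branch-based GitHub Pages setup is shown below (the page content is a hypothetical placeholder):

git checkout --orphan gh-pages   ## new branch with no history, reserved for the website
git rm -rf .                     ## start the branch from an empty tree
echo "<h1>Project documentation</h1>" > index.html   ## a minimal landing page
git add index.html
git commit -m "initial project page"
git push origin gh-pages         ## served at https://<username>.github.io/<repository>/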
In addition, several real-time communication platforms, such as Gitter and Slack, have been integrated with GitHub. Real-time communication systems allow the user community, developers, and project collaborators to exchange ideas and issues and to report bugs or get support. For example, Gitter is a GitHub-based chat tool that enables developers and users to share aspects of their work. Gitter inherits the network of social groups operating around GitHub repositories, organizations, and issues. It relies on identities within GitHub, creating Internet Relay Chat (IRC)-like chat rooms for public and private projects. Within a Gitter chat, members can reference issues, comments, and pull requests. GitHub also supports wikis (which are themselves version-controlled repositories) for each repository, in which users can create and edit pages for documentation, examples, or general support. A different service is Gist, which offers a simple way to share code snippets, single files, parts of files, or full applications. Gists can be created in two ways: public gists, which can be browsed and searched through Discover, and secret gists, which are hidden from search engines. One of the main features of Gist is the possibility of embedding code snippets in other applications, enabling users to embed gists in any text field that supports JavaScript.

Rule 10: Use GitHub to Be Social: Follow and Watch

In the same way that researchers follow developments in their field, scientific programmers can follow publicly available projects that might benefit their research. GitHub enables this by letting you follow other GitHub users (see also Rule 2) or watch the activity of projects, a common feature of many social media platforms. Take advantage of it as much as possible!

Conclusions

If you are involved in scientific research and have not used Git and GitHub before, we recommend that you explore their potential as soon as possible. As with many tools, a learning curve lies ahead, but several basic yet powerful features are accessible even to beginners and may be applied to many different use cases [6]. We anticipate the reward will be worth your effort. To conclude, we would like to recommend some examples of bioinformatics repositories in GitHub (Table 1) and some useful training materials, including workshops, online courses, and manuscripts (Table 2).

Supporting Information

S1 File. Supplementary information including three sections: Git Large File Storage (LFS); Testing Levels of the Source Code and Continuous Integration; and Source Code Documentation. (PDF)
              Software Carpentry: lessons learned.

              Since its start in 1998, Software Carpentry has evolved from a week-long training course at the US national laboratories into a worldwide volunteer effort to improve researchers' computing skills. This paper explains what we have learned along the way, the challenges we now face, and our plans for the future.

Author and article information

Journal: PLoS Computational Biology (Public Library of Science, San Francisco, CA, USA). ISSN: 1553-734X (print); 1553-7358 (electronic).
Published: 18 February 2021; Volume 17, Issue 2: e1008628. Pages: 12. Article type: Editorial.
Affiliations: [1] Academic Data Science Alliance, Seattle, Washington, United States of America; [2] School of Data Science, University of Virginia, Charlottesville, Virginia, United States of America. Editor: Carnegie Mellon University, UNITED STATES.
Author notes: The authors have declared that no competing interests exist.
Author information: https://orcid.org/0000-0003-1007-4612; https://orcid.org/0000-0002-7618-7292
DOI: 10.1371/journal.pcbi.1008628
© 2021 Parker et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: Portions of this work were supported by the Gordon and Betty Moore Foundation (grant #8432 to MSP), http://www.moore.org/, the Alfred P. Sloan Foundation (grant #G-2019-11447 to MSP), http://www.sloan.org/, and the University of Virginia (PEB, AEB). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.