      Ten simple rules for collaboratively writing a multi-authored paper



Introduction

Science is increasingly done in large teams [1], making it more likely that papers will be written by several authors from different institutes, disciplines, and cultural backgrounds. A few "Ten simple rules" papers have been written on collaboration [2, 3] and on writing [4, 5], but not on combining the two. Collaborative writing with multiple authors poses additional challenges, including varied levels of engagement among coauthors, provision of fair credit through authorship or acknowledgements, acceptance of a diversity of work styles, and the need for clear communication. Miscommunication, a lack of leadership, and inappropriate tools or writing approaches can lead to frustration, delayed publication, or even the termination of a project. To provide insight into collaborative writing, we draw on our experience from the Global Lake Ecological Observatory Network (GLEON) [6] to frame 10 simple rules for collaboratively writing a multi-authored paper.

We consider a collaborative multi-authored paper to have three or more authors from at least two different institutions. A multi-authored paper can result from a single discrete research project or be one outcome of a larger research program that includes other papers based on common data or methods. The writing of a multi-authored paper is embedded within a broader context of planning and collaboration among team members. Our recommended rules cover both the planning and the writing of a paper, and although we have listed them in numerical order, they can be applied iteratively; it helps to revisit them frequently throughout the writing process. With the 10 rules outlined below, we aim to provide a foundation for writing multi-authored papers and conducting exciting and influential science.

Rule 1: Build your writing team wisely

The writing team is formed at the beginning of the writing process, which can happen at different stages of a research project.
Your writing team should be built upon the expertise and interest of your coauthors. A good way to start is to review the initial goal of the research project and to gather everyone's expectations for the paper, allowing all team members to decide whether they want to be involved in the writing. This step is normally initiated by the research project leader(s). When appointing the writing team, ensure that it has the collective expertise required to write the paper, and stay open to bringing in new people if required. If you need to add a coauthor at a later stage, discuss this first with the team (Rule 8) and be clear about how the person can contribute to the paper and qualify as a coauthor (Rules 4 and 10). When in doubt about selecting coauthors, we generally suggest opting for inclusiveness. A shared list with the contact information and contributions of all active coauthors is useful for keeping track of who is involved throughout the writing process. To share the workload and increase the involvement of all coauthors during the writing process, you can distribute specific roles within the team (e.g., a team leader and a facilitator [see Rule 2] and a note taker [see Rule 8]).

Rule 2: If you take the lead, provide leadership

Leadership is critical for a multi-authored paper to be written in a timely and satisfactory manner. This is especially true for large, joint projects. The leader of the writing process and the first author are typically the same person, but they don't have to be. The leader is the contact person for the group, keeps the writing moving forward, and generally manages the writing process through to publication. It is key that the leader provides strong communication and feedback and acknowledges contributions from the group. The leader should remain flexible with respect to timelines and group decisions. For different leadership styles, refer to [7, 8].
When developing collaborative multi-authored papers, the leader should allow time for all voices to be heard. In general, we recommend leading multi-authored papers through consensus building rather than hierarchically, because the manuscript should represent the views of all authors (Rule 9). At the same time, the leader needs to be able to make difficult decisions about manuscript structure, content, and author contributions by maintaining oversight of the project as a whole. Finally, a good leader must know when to delegate tasks and share the workload, e.g., by delegating facilitators for a meeting or assigning responsibilities and subleaders for sections of a manuscript. At times, this may include recognizing that something has changed, e.g., a shift in a coauthor's work commitments or in the paper's focus. In such a case, it may be appropriate for someone else to step in as leader, and possibly also as first author, while the previous leader's work is acknowledged in the manuscript or through coauthorship (Rule 4).

Rule 3: Create a data management plan

If one was not already implemented at the start of the research project, we recommend that you implement a data management plan (DMP) that is circulated at an early stage of the writing process and agreed upon by all coauthors (see also [9], https://dmptool.org/, and https://dmponline.dcc.ac.uk/). The DMP should outline how project data will be shared, versioned, stored, and curated, and also detail who within the team will have access to the (raw) data during and after publication. Multi-authored papers often use and/or produce large datasets originating from a variety of sources or data contributors. Each of these sources may have different demands about how data and code are used and shared during analysis and writing and after publication.
Previous articles published in the "Ten simple rules" series provide guidance on the ethics of big-data research [10], how to enable multi-site collaborations through open data sharing [3], how to store data [11], and how to curate data [12]. Because many journals now require datasets to be shared through an open access platform as a prerequisite to publication, the DMP should detail how this will be achieved and what data (including metadata) will be included in the final dataset. Your DMP need not be a complicated, detailed document and can often be summarized in a couple of paragraphs. Once your DMP is finalized, all data providers and coauthors should confirm that they agree with the plan and that their institutional and/or funding agency obligations are met. In our experience within GLEON, these obligations vary widely across the research community, particularly at an intercontinental scale.

Rule 4: Jointly decide on authorship guidelines

Defining authorship and author order are longstanding issues in science [13]. To avoid conflict, you should be clear early in the research project about what level of participation is required for authorship. You can do this by creating a set of guidelines that define the contributions and tasks worthy of authorship. For an authorship policy template, see [14], and check your institute's and the journal's authorship guidelines. For example, generating ideas, funding acquisition, data collection or provision, analyses, drafting figures and tables, and writing sections of text are discrete tasks that can constitute contributions toward authorship (see, e.g., the CRediT system: http://docs.casrai.org/CRediT [15]). All authors are expected to participate in multiple tasks, in addition to editing and approving the final document. Whether merely providing data qualifies for coauthorship is debated.
If data provision is not felt to be grounds for coauthorship, you should credit the data provider in the Acknowledgments [16]. Your authorship guidelines can also increase transparency and help to clarify author order. If coauthors have contributed to the paper at different levels, tracking tasks and indicating author activity on various tasks can help establish author order, with the person who contributed most listed first. Other options include groupings based on level of activity [17] or listing the core group first and all other authors alphabetically. If every coauthor contributed equally, you can use alphabetical [18] or randomly assigned [19] order. Joint first authorship should be considered when appropriate. We encourage you to make a statement about author order (e.g., [19]) and to generate authorship attribution statements; many journals will include these as part of the Acknowledgments if a separate statement is not formally required. For those who do not meet the expectations for authorship, an alternative is to list them as contributors in the Acknowledgments [15]. Be aware of coauthors' expectations and of disciplinary, cultural, and other norms in what constitutes author order. For example, in some disciplines, the last author position indicates the academic advisor or team leader. We recommend revisiting the definitions of authorship and author order frequently because roles and responsibilities may change during the writing process.

Rule 5: Decide on a writing strategy

The writing strategy should be adapted to the needs of the team (white shapes in Fig 1) and based on the framework set by external factors (gray shapes in Fig 1).
For example, a research paper that uses wide-ranging data might have several coauthors but one principal writer (e.g., a PhD candidate) who conducted the analysis, whereas a comment or review in a specific research field might be written jointly by all coauthors based on parallel discussion. In most cases, having everyone write on everything is not possible and is very inefficient. Most commonly, the paper is split into subsections based on the aspects of the research for which the coauthors have been responsible or on the expertise and interests of the coauthors. Regardless of which writing strategy you choose, the importance of engaging all team members in defining the narrative, format, and structure of the paper cannot be overstated; this will preempt having to rewrite or delete sections later.

Fig 1. Decision chart for writing strategy. Different writing strategies range from very inclusive to minimally inclusive: group writing = everyone writes on everything; subgroup writing = the document is split up into expertise areas, and each individual contributes to a subsection; core writing group = a subgroup of a few coauthors writes the paper; scribe writing = one person writes based on previous group discussions; principal writer = one person drafts and writes the paper (writing styles adapted from [20]). Which writing strategy you choose depends on external factors (filled, gray shapes), such as the interdisciplinarity of the study or the time pressure to publish, and affects the payback (dashed, white shapes). An increasing height of a shape indicates an increasing quantity of the decision criterion, such as interdisciplinarity, diversity, feasibility, etc.

For an efficient writing process, try to use the active voice in suggestions and make direct edits rather than simply stating that a section needs revision.
For all writing strategies, the lead author(s) must ensure that the completed text is cohesive.

Rule 6: Choose digital tools to suit your needs

A suitable technology for writing your multi-authored paper depends upon your chosen writing approach (Rule 5). For projects in which the whole group writes together, synchronous technologies such as Google Docs or Overleaf work well, allowing for interactive writing and facilitating version control (see also [21]). In contrast, papers written sequentially, in parallel by subsections, or by only one author may allow for conventional programs such as Microsoft Word or LibreOffice. In any case, you should create a plan early on for version control, comments, and tracking changes. Regularly mark the version of the document, e.g., by including the current date in the file name. When working offline and distributing the document, add initials to the file name to indicate the progress and the most recent editor.

High-quality communication is important for efficient discussion of the paper's content. When picking a virtual meeting technology, consider the number of participants permitted in a single group call, the ability to record the meeting, audio and visual quality, and the need for additional features such as screencasting or real-time notes. Especially in large groups, it can be helpful for people who are not currently speaking to mute their microphones (blocking background noise), to use video for nonverbal communication (e.g., to show approval or rejection and to help nonnative speakers), or to switch off video when internet speeds are slow. More guidelines for effective virtual meetings are available in Hampton and colleagues [22]. Between virtual meetings, virtual technologies can help to streamline communication (e.g., https://slack.com) and can facilitate the writing process through shared to-do lists and task boards with calendar features (e.g., http://trello.com).
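The date-plus-initials file-naming convention for offline drafts can be sketched as a small helper. This is only an illustration; the separator, date format, and extension are assumptions, not a prescribed standard:

```python
from datetime import date

def versioned_filename(stem, initials="", ext=".docx"):
    """Build a date-stamped file name for an offline manuscript draft.

    When the draft is passed around by email, appending the most recent
    editor's initials marks who touched the document last.
    """
    parts = [stem, date.today().isoformat()]  # ISO dates sort chronologically
    if initials:
        parts.append(initials.upper())        # most recent editor
    return "_".join(parts) + ext
```

For example, `versioned_filename("lake_ms", initials="ab")` yields a name like `lake_ms_2019-03-01_AB.docx`, so drafts sort by date and the last editor is visible at a glance.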
With all technologies, accessibility, ease of use, and cost are important decision criteria. Note that some coauthors will be very comfortable with new technologies, whereas others may not be; both groups should be ready to compromise in order to be as efficient and inclusive as possible. Basic training in unfamiliar technologies will likely pay off in the long term.

Rule 7: Set clear timelines and adhere to them

As for the overall research project, setting realistic and effective deadlines maintains the group's momentum and facilitates on-schedule paper completion [23]. Before deciding to become a coauthor, consider your own time commitments. As a coauthor, commit to set deadlines, recognize the importance of meeting them, and notify the group early if you realize that you will not be able to meet a deadline or attend a meeting. Building consensus around deadlines will ensure that internally imposed deadlines are reasonably timed [23] and will increase the likelihood that they are met. Keeping to deadlines and staying on task require developing a positive culture of encouragement within the team [14]. Respect people's time by being punctual for meetings, sending out drafts and the meeting agenda on schedule, and ending meetings on time.

To develop a timeline, we recommend starting by defining the "final" deadline. Occasionally, this date will be set externally (e.g., by an editorial request), but in most cases, you can set an internal consensus deadline. Thereafter, define intermediate milestones with clearly defined tasks and the time required to fulfill them. Look for and prioritize strategies that allow multiple tasks to be completed simultaneously, because this makes the timeline more efficient. Keep in mind that "however long you give yourself to complete a task is how long it will take" [24] and that group scheduling will vary depending on the selected writing strategy (Rule 5).
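This backward-planning step, fixing the final deadline first and then spacing milestones by the time each task needs, can be sketched as follows. The milestone names and durations are hypothetical:

```python
from datetime import date, timedelta

def backward_schedule(final_deadline, milestones):
    """Assign due dates by working backwards from the final deadline.

    ``milestones`` is a chronological list of (name, days_needed) pairs,
    where days_needed is the time a milestone requires after the previous
    milestone's due date.
    """
    schedule = []
    due = final_deadline
    for name, days in reversed(milestones):
        schedule.append((name, due))
        due -= timedelta(days=days)  # the preceding milestone must finish this much earlier
    return list(reversed(schedule))

# Hypothetical plan for a manuscript due at the end of June:
plan = backward_schedule(
    date(2019, 6, 30),
    [("full draft circulated", 30),
     ("coauthor comments returned", 14),
     ("final confirmation by all coauthors", 7)],
)
```

Here the final confirmation lands on the deadline itself, comments are due 7 days before that, and the full draft 14 days earlier still, making visible at a glance whether the internal deadlines are realistic.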
Generally, collaborative manuscripts need more drafting and revision rounds than a "solo" article.

Rule 8: Be transparent throughout the process

This rule is important for the overall research project but becomes especially important when it comes to publishing and coauthorship. Being as open as possible about deadlines (Rule 7) and expectations (including authorship, Rule 4) helps to avoid misunderstandings and conflict. Be clear about the consequences if someone does not follow the group's rules, but also be open to rediscussing the rules if needed. Potential consequences of not following the group's rules include a change in author order or removal of authorship. It should also be clear that a coauthor's edits might not be included in the final text if he or she does not contribute on time. Bad experiences from past collaborations can lead to exclusion from further research projects. As for collaboration in general [2], communication is key. During meetings, appoint a note taker who keeps track of the group's discussions and decisions in meeting notes. This will help coauthors who could not attend the meeting and help the whole group follow up on decisions later. Encourage everyone to provide feedback, and be sincere and clear if something is not working; writing a multi-authored paper is a learning process. If you feel someone is frustrated, try to address the issue promptly within the group rather than waiting and letting the problem escalate. When resolving a conflict, it is important to listen actively and focus the conversation on reaching a solution that benefits the group as a whole [25]. Democratic decisions can often help to resolve differing opinions.

Rule 9: Cultivate equity, diversity, and inclusion

Multi-authored papers will likely have a team of coauthors with diverse demographics and cultural values, which usually broadens the scope of knowledge, experience, and background.
While the benefit of a diverse team is clear [14], successfully integrating diversity into a collaborative team effort requires increased awareness of differences and proactive conflict management [25]. You can cultivate diversity by holding members accountable to equity, diversity, and inclusivity guidelines (e.g., https://www.ryerson.ca/edistem/). If working across cultures, you will need to select the working language (for both verbal and written communications); this is most commonly the publication language. When team members are not native speakers of the working language, speak slowly, enunciate clearly, and avoid local expressions and acronyms; likewise, listen closely and ask questions if you do not understand. Beyond language, be empathetic when listening to others' opinions in order to genuinely understand your coauthors' points of view [26]. When giving verbal or written feedback, be constructive, but also be aware of how different cultures receive and react to feedback [27]. Inclusive writing and speaking foster engagement, e.g., "we could do that," and acknowledge input between peers. In addition, you can create opportunities for the expression of different personalities and opinions by adopting a participatory group model (e.g., [28]).

Rule 10: Consider the ethical implications of your coauthorship

Being a coauthor is both a benefit and a responsibility: having your name on a publication implies that you have contributed substantially, that you are familiar with the content of the paper, and that you have checked the accuracy of the content as best you can. To self-assess whether your contributions merit coauthorship, start by revisiting the authorship guidelines for your group (Rule 4).
Be sure to verify the scientific accuracy of your contributions; e.g., if you contributed data, it is your responsibility that the data are correct, and if you performed laboratory or data analyses, it is your responsibility that the analyses are correct. If an author is accused of scientific misconduct, there are likely to be consequences for all the coauthors. Although there are currently no clear rules for coauthor responsibility [29], be aware of your responsibility and find a balance between trust and control. One of the final steps before submission of a multi-authored paper is for all coauthors to confirm that they have contributed to the paper, agree upon the final text, and support its submission. This final confirmation, initiated by the lead author, ensures that all coauthors have considered their role in the work and can affirm their contributions. Repeat the confirmation step each time the paper is revised and resubmitted. Set deadlines for the confirmation steps and make clear that coauthorship cannot be guaranteed if confirmations are not completed.

Conclusion

When writing collaborative multi-authored papers, communication is more complex, and consensus can be more difficult to achieve. Our experience shows that structured approaches can help to promote optimal solutions and resolve problems around authorship as well as data ownership and curation. Clear structures are vital to establishing a safe and positive environment that generates trust and confidence among the coauthors [14]; the latter is especially challenging when collaborating over large distances without meeting face-to-face. Since there is no single "right approach," our rules can serve as a starting point to be adapted to your own team and project needs. Revisit these rules frequently and progressively adopt what works best for your team and the project.
We believe that the benefits of working in diverse groups outweigh the transaction costs of coordinating many people, resulting in greater diversity of approaches, novel scientific outputs, and ultimately better papers. If you bring curiosity, patience, and openness to team science projects and act with consideration and empathy, especially when writing, the experience will be fun, productive, and rewarding.

          Related collections

          Most cited references22

          • Record: found
          • Abstract: not found
          • Article: not found

          Procrastination, Deadlines, and Performance: Self-Control by Precommitment

            Bookmark
            • Record: found
            • Abstract: not found
            • Article: not found

            Beyond authorship: attribution, contribution, collaboration, and credit

              Bookmark
              • Record: found
              • Abstract: found
              • Article: not found

              Ten simple rules for responsible big data research

              Introduction The use of big data research methods has grown tremendously over the past five years in both academia and industry. As the size and complexity of available datasets has grown, so too have the ethical questions raised by big data research. These questions become increasingly urgent as data and research agendas move well beyond those typical of the computational and natural sciences, to more directly address sensitive aspects of human behavior, interaction, and health. The tools of big data research are increasingly woven into our daily lives, including mining digital medical records for scientific and economic insights, mapping relationships via social media, capturing individuals’ speech and action via sensors, tracking movement across space, shaping police and security policy via “predictive policing,” and much more. The beneficial possibilities for big data in science and industry are tempered by new challenges facing researchers that often lie outside their training and comfort zone. Social scientists now grapple with data structures and cloud computing, while computer scientists must contend with human subject protocols and institutional review boards (IRBs). While the connection between individual datum and actual human beings can appear quite abstract, the scope, scale, and complexity of many forms of big data creates a rich ecosystem in which human participants and their communities are deeply embedded and susceptible to harm. This complexity challenges any normative set of rules and makes devising universal guidelines difficult. Nevertheless, the need for direction in responsible big data research is evident, and this article provides a set of “ten simple rules” for addressing the complex ethical issues that will inevitably arise. Modeled on PLOS Computational Biology’s ongoing collection of rules, the recommendations we outline involve more nuance than the words “simple” and “rules” suggest. 
This nuance is inevitably tied to our paper’s starting premise: all big data research on social, medical, psychological, and economic phenomena engages with human subjects, and researchers have the ethical responsibility to minimize potential harm. The variety in data sources, research topics, and methodological approaches in big data belies a one-size-fits-all checklist; as a result, these rules are less specific than some might hope. Rather, we exhort researchers to recognize the human participants and complex systems contained within their data and make grappling with ethical questions part of their standard workflow. Towards this end, we structure the first five rules around how to reduce the chance of harm resulting from big data research practices; the second five rules focus on ways researchers can contribute to building best practices that fit their disciplinary and methodological approaches. At the core of these rules, we challenge big data researchers who consider their data disentangled from the ability to harm to reexamine their assumptions. The examples in this paper show how often even seemingly innocuous and anonymized data have produced unanticipated ethical questions and detrimental impacts. This paper is a result of a two-year National Science Foundation (NSF)-funded project that established the Council for Big Data, Ethics, and Society, a group of 20 scholars from a wide range of social, natural, and computational sciences (http://bdes.datasociety.net/). The Council was charged with providing guidance to the NSF on how to best encourage ethical practices in scientific and engineering research, utilizing big data research methods and infrastructures [1]. 1. Acknowledge that data are people and can do harm One of the most fundamental rules of responsible big data research is the steadfast recognition that most data represent or impact people. 
Simply starting with the assumption that all data are people until proven otherwise places the difficulty of disassociating data from specific individuals front and center. This logic is readily evident for “risky” datasets, e.g., social media with inflammatory language, but even seemingly benign data can contain sensitive and private information, e.g., it is possible to extract data on the exact heart rates of people from YouTube videos [2]. Even data that seemingly have nothing to do with people might impact individuals’ lives in unexpected ways, e.g., oceanographic data that change the risk profiles of communities’ and properties’ values or Exchangeable Image Format (EXIF) records from photos that contain location coordinates and reveal the photographer’s movement or even home location. Harm can also result when seemingly innocuous datasets about population-wide effects are used to shape the lives of individuals or stigmatize groups, often without procedural recourse [3,4]. For example, social network maps for services such as Twitter can determine credit-worthiness [5], opaque recidivism scores can shape criminal justice decisions in a racially disparate manner [6], and categorization based on zip codes resulted in less access to Amazon Prime same-day delivery service for African-Americans in United States cities [7]. These high-profile cases show that apparently neutral data can yield discriminatory outcomes, thereby compounding social inequities. Other cases show that “public” datasets are easily adapted for highly invasive research by incorporating other data, such as Hague et al.’s [8] use of property records and geographic profiling techniques to allegedly identify the pseudonymous artist Banksy [9]. In particular, data ungoverned by substantive consent practices, whether social media or the residual DNA we continually leave behind us, may seem public but can cause unintentional breaches of privacy and other harms [9,10]. 
Start with the assumption that data are people (until proven otherwise), and use it to guide your analysis. No one gets an automatic pass on ethics. 2. Recognize that privacy is more than a binary value Breaches of privacy are key means by which big data research can do harm, and it is important to recognize that privacy is contextual [11] and situational [12], not reducible to a simple public/private binary. Just because something has been shared publicly does not mean any subsequent use would be unproblematic. Looking at a single Instagram photo by an individual has different ethical implications than looking at someone’s full history of all social media posts. Privacy depends on the nature of the data, the context in which they were created and obtained, and the expectations and norms of those who are affected. Understand that your attitude towards acceptable use and privacy may not correspond with those whose data you are using, as privacy preferences differ across and within societies. For example, Tene and Polonetsky [13] explore how pushing past social norms, particularly in novel situations created by new technologies, is perceived by individuals as “creepy” even when they do not violate data protection regulations or privacy laws. Social media apps that utilize users’ locations to push information, corporate tracking of individuals’ social media and private communications to gain customer intelligence, and marketing based on search patterns have been perceived by some to be “creepy” or even outright breaches of privacy. Likewise, distributing health records is a necessary part of receiving health care, but this same sharing brings new ethical concerns when it goes beyond providers to marketers. Privacy also goes beyond single individuals and extends to groups [10]. This is particularly resonant for communities who have been on the receiving end of discriminatory data-driven policies historically, such as the practice of redlining [14, 15]. 
Other examples include community maps—made to identify problematic properties or an assertion of land rights—being reused by others to identify opportunities for redevelopment or exploitation [16]. Thus, reusing a seemingly public dataset could run counter to the original privacy intents of those who created it and raise questions about whether it represents responsible big data research. Situate and contextualize your data to anticipate privacy breaches and minimize harm. The availability or perceived publicness of data does not guarantee lack of harm, nor does it mean that data creators consent to researchers using their data. 3. Guard against the reidentification of your data It is problematic to assume that data cannot be reidentified. There are numerous examples of researchers with good intentions and seemingly good methods failing to anonymize data sufficiently to prevent the later identification of specific individuals [17]; in other cases, these efforts were extremely superficial [18, 19]. When datasets thought to be anonymized are combined with other variables, it may result in unexpected reidentification, much like a chemical reaction resulting from the addition of a final ingredient. While the identificatory power of birthdate, gender, and zip code is well known [20], there are a number of other parameters—particularly the metadata associated with digital activity—that may be as or even more useful for identifying individuals [21]. Surprising to many, unlabeled network graphs—such as location and movement, DNA profiles, call records from mobile phone data, and even high-resolution satellite images of the earth—can be used to reidentify people [22]. More important than specifying the variables that allow for reidentification, however, is the realization that it is difficult to recognize these vulnerable points a priori [23]. 
Factors discounted today as irrelevant or inherently harmless—such as battery usage—may very well prove to be a significant vector of personal identification tomorrow [24]. For example, the addition of spatial location can turn social media posts into a means of identifying home location [25], and Google’s reverse image search can connect previously separate personal activities—such as dating and professional profiles—in unanticipated ways [26]. Even data about groups—“aggregate statistics”—can have serious implications if they reveal that certain communities, for example, suffer from stigmatized diseases or social behavior much more than others [27]. Identify possible vectors of reidentification in your data. Work to minimize them in your published results to the greatest extent possible. 4. Practice ethical data sharing For some projects, sharing data is an expectation of the human participants involved and thus a key part of ethical research. For example, in rare genetic disease research, biological samples are shared in the hope of finding cures, making dissemination a condition of participation. In other projects, questions of the larger public good—an admittedly difficult to define category—provide compelling arguments for sharing data, e.g., the NIH-sponsored database of Genotypes and Phenotypes (dbGaP), which makes deidentified genomic data widely available to researchers, democratizing access, or the justice claim made by the Institute of Medicine about the value of mandating that individual-level data from clinical trials be shared among researchers [28]. Asking participants for broad, as opposed to narrowly structured consent for downstream data management makes it easier to share data. Careful research design and guidance from IRBs can help clarify consent processes. 
However, we caution that even when broad consent was obtained up front, researchers should consider the best interests of the human participants, proactively considering the likelihood of privacy breaches and reidentification issues. This is of particular concern for human DNA data, which is uniquely identifiable.

These types of projects, however—in which rules of use and sharing are well governed by informed consent and the right of withdrawal—are increasingly the exception rather than the rule for big data. In our digital society, we are followed by data clouds composed of the trace elements of daily life—credit card transactions, medical test results, closed-circuit television (CCTV) images and video, smartphone apps, etc.—collected under mandatory terms of service rather than responsible research design overseen by university compliance officers. While we might wish to have the standards of informed consent and right of withdrawal, these informal big data sources are gathered by agents other than the researcher—private software companies, state agencies, and telecommunications firms. These data are only accessible to researchers after their creation, making it impossible to gain informed consent a priori, and contacting the human participants retroactively for permission is often forbidden by the owner of the data or is impossible to do at scale.

Of course, researchers within the software companies and state institutions collecting these data have a special responsibility to address the terms under which data are collected; but that does not exempt the end user of shared data. In short, the burden of ethical use (see Rules 1 to 3) and sharing is placed on the researcher, since the terms of service under which the human subjects’ data were produced are often extremely broad, with little protection against breaches of privacy.
In these circumstances, researchers must balance the requirements of funding agencies to share data [29] with their responsibilities to the human beings behind the data they acquired. A researcher needs to inform funding agencies about possible ethical concerns before the research begins and guard against reidentification before sharing.

Share data as specified in research protocols, but proactively address concerns of potential harm from informally collected big data.

5. Consider the strengths and limitations of your data; big does not automatically mean better

To do both accurate and responsible big data research, it is important to ground datasets in their proper context, including conflicts of interest. Context affects every stage of research: from data acquisition, to cleaning, to interpretation of findings, and dissemination of the results. During data acquisition, it is crucial to understand both the source of the data and the rules and regulations under which they were gathered. This is especially important for research conducted in relatively loose regulatory environments, in which the use of data to answer research questions may conflict with the expectations of those who provided the data. One possible model is the set of ethical norms used to track the provenance of artifacts, often developed in cooperation and collaboration with the communities from which they come (e.g., archaeologists working with indigenous communities to determine the disposition of material culture). In a similar manner, computer scientists use data lineage techniques to track the evolution of a dataset, often to trace bugs in the data. Being mindful of the data’s context provides the foundation for clarifying when your data and analysis are working and when they are not.
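The data lineage idea mentioned above can be approximated with a small amount of bookkeeping. The sketch below is a minimal, hypothetical illustration (the `LineageTracker` class and the file name are invented for this example, not a real library): each transformation is logged with its parameters, the resulting row count, and a fingerprint of the data, so a result can be traced back through the steps that produced it.

```python
import hashlib
import json

class LineageTracker:
    """Minimal data-lineage log: record each transformation applied to a
    dataset, along with a fingerprint of the result, so findings can be
    traced back through the steps (and bugs) that produced them."""

    def __init__(self, data, source):
        self.data = data
        self.log = []
        self._record("load", {"source": source})

    def _fingerprint(self):
        # Stable hash of the current data (assumes JSON-serializable rows).
        return hashlib.sha256(
            json.dumps(self.data, sort_keys=True).encode()
        ).hexdigest()[:12]

    def _record(self, step, params):
        self.log.append({"step": step, "params": params,
                         "rows": len(self.data),
                         "sha256": self._fingerprint()})

    def apply(self, step_name, fn, **params):
        """Apply a transformation and log it."""
        self.data = fn(self.data, **params)
        self._record(step_name, params)
        return self

tracker = LineageTracker([{"age": 34}, {"age": -1}, {"age": 29}],
                         source="survey_2016.json")
tracker.apply("drop_invalid_ages",
              lambda rows: [r for r in rows if r["age"] >= 0])
# tracker.log now holds two entries: the load (3 rows) and the filter
# (2 rows), each with its parameters and a hash of the data at that point.
```

Real pipelines would use dedicated provenance tooling, but even a log this simple makes it possible to say exactly which cleaning decisions stand between the raw data and a published figure.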
While it is tempting to interpret findings based on big data as a clear outcome, a key step within scientific research is clearly articulating what data or an indicator represent and what they do not. Are your findings as clear-cut if your interpretation of a social media posting switches from a recording of fact to the performance of a social identity? Given the messy, almost organic nature of many datasets derived from social actions, it is fundamental that researchers be sensitive to the potential multiple meanings of data. For example, is a Facebook post or an Instagram photo best interpreted as approval or disapproval of a phenomenon, a simple observation, or an effort to improve status within a friend network? While each of these interpretations is potentially valid, the lack of context makes it even more difficult to justify the choice of one understanding over another. Reflecting on the potential multiple meanings of data fosters greater clarity in research hypotheses and also makes researchers aware of the other potential uses of their data. Again, the act of interpretation is a human process, and because the judgments of those (re)using your data may differ from your own, it is essential to clarify both the strengths and the shortcomings of the data.

Document the provenance and evolution of your data. Do not overstate clarity; acknowledge messiness and multiple meanings.

6. Debate the tough, ethical choices

Research involving human participants at federally funded institutions is governed by IRBs, which are charged with preventing harm through well-established procedures familiar to many researchers. IRBs, however, are not the sole arbiters of ethics; many ethical issues involving big data fall outside their governance mandate. Precisely because big data researchers often encounter situations that are foreign to or outside the mandate of IRBs, we emphasize the importance of debating these issues within groups of peers.
Rather than a bug, the lack of clear-cut solutions and governance protocols should more appropriately be understood as a feature that researchers can embrace within their own work. Discussion and debate of ethical issues is an essential part of professional development—both within and between disciplines—as it can establish a mature community of responsible practitioners. Bringing these debates into coursework and training can produce peer reviewers who are particularly well placed to raise ethical questions and spur recognition of the need for these conversations. A precondition of any formal ethics rules or regulations is the capacity to have such open-ended debates. As digital social scientist and ethicist Annette Markham [30] writes, “we can make [data ethics] an easier topic to broach by addressing ethics as being about choices we make at critical junctures; choices that will invariably have impact.”

Given the nature of big data, bringing technical, scientific, social, and humanistic researchers together on projects enables this debate to emerge as a strength because, if done well, it provides the means to understand the ethical issues from a range of perspectives and to disrupt disciplinary silos [31]. There are a number of good models for interdisciplinary ethics research, such as the trainings offered by the Science and Justice research center at the University of California, Santa Cruz [32] and the Values in Design curricula [33]. Research ethics consultation services, available at some universities as a result of the Clinical and Translational Science Award (CTSA) program of the National Institutes of Health (NIH), can also be resources for researchers [34]. Some of the better-known big data ethics cases—e.g., the Facebook emotional contagion study [35]—provide extremely productive venues for cross-disciplinary discussion.
Why might one set of scholars see such a study as relatively benign while other groups see significant ethical shortcomings? Where do researchers differ in drawing the line between responsible and irresponsible research, and why? Understanding the different ways people discuss these challenges and processes provides an important check for researchers, especially those from disciplines not focused on human subjects concerns. Moreover, the high visibility surrounding these events means that (for better or worse) they represent the “public” view of big data research, and becoming an active member of this conversation ensures that researchers can give voice to their insights rather than simply being at the receiving end of policy decisions. To help these debates along, the Council for Big Data, Ethics, and Society has produced a number of case studies focused specifically on big data research, as well as a white paper with recommendations to start these important conversations (http://bdes.datasociety.net/output/).

Engage your colleagues and students about ethical practice for big data research.

7. Develop a code of conduct for your organization, research community, or industry

The process of debating tough choices inserts ethics directly into the workflow of research, making “faking ethics” as unacceptable as faking data or results. Internalizing these debates, rather than treating them as an afterthought or a problem to outsource, is key to successful research, particularly when using trace data produced by people. This holds for all researchers, including those within industry who have privileged access to the data streams of digital daily life. Public attention to the ethical use of these data should not be avoided; after all, these datasets are based on an infrastructure that billions of people use to live their lives, and there is a compelling public interest in this research being done responsibly.
One of the best ways to cement this in daily practice is to develop codes of conduct for use in your organization or research community and for inclusion in formal education and ongoing training. Such codes can provide guidance in the peer review of publications and in funding decisions. In practice, a highly visible case of unethical research brings problems to an entire field, not just to those directly involved. Moreover, designing codes of conduct makes researchers more successful: issues that might otherwise be ignored until they blow up—e.g., Are we abiding by the terms of service or users’ expectations? Does the general public consider our research “creepy”? [13]—can be addressed thoughtfully rather than in a scramble for damage control. This is particularly relevant to public-facing private businesses interested in avoiding potentially unfavorable attention. An additional, longer-term advantage of developing codes of conduct is that change is clearly coming to big data research. The NSF funded the Council for Big Data, Ethics, and Society as a means of getting in front of a developing issue and of pending regulatory changes to the federal rules for the protection of human subjects, which are currently under review [1]. Actively developing rules for responsible big data research within a research community is a key way researchers can join this ongoing process.

Establish appropriate codes of ethical conduct within your community. Make industry researchers and representatives of affected communities active contributors to this process.

8. Design your data and systems for auditability

Although codes of conduct will vary by topic and research community, a particularly important element is designing data and systems for auditability. Responsible internal auditing processes flow easily into audit systems and also keep track of factors that might contribute to problematic outcomes.
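As a minimal sketch of what auditability can look like in practice (the class and method names here are hypothetical, and a real system would store diffs or versioned files rather than full in-memory snapshots), each analytical decision can be logged with a timestamp and a snapshot of the data it acted on, so that a step later found to be compromised, such as an insufficient anonymization strategy, can be rolled back:

```python
import copy
from datetime import datetime, timezone

class AuditedDataset:
    """Log every decision applied to a dataset with a timestamp and a
    snapshot, so the analysis can backtrack to the dataset as it stood
    before any given decision."""

    def __init__(self, data):
        self.data = data
        self.audit_trail = []

    def decide(self, description, fn):
        """Record a decision, then apply it."""
        self.audit_trail.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "decision": description,
            "snapshot_before": copy.deepcopy(self.data),
        })
        self.data = fn(self.data)
        return self

    def backtrack(self, index):
        """Restore the dataset to its state before decision `index`,
        discarding that decision and everything after it."""
        self.data = self.audit_trail[index]["snapshot_before"]
        self.audit_trail = self.audit_trail[:index]
        return self

ds = AuditedDataset([{"zip": "53703"}, {"zip": "10001"}])
ds.decide("coarsen zip to 3 digits",
          lambda rows: [{"zip": r["zip"][:3]} for r in rows])
ds.backtrack(0)  # the coarsening proved insufficient; start over
```

The point is less the specific mechanism than the habit: every decision is explicit, timestamped, and reversible, which is what makes an external audit, or your own double-check, possible.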
Developing automated testing processes for assessing problematic outcomes, as well as mechanisms for auditing others’ work during review processes, can help strengthen research as a whole. The goal of auditability is to clearly document when decisions are made and, if necessary, to backtrack to an earlier dataset and address the issue at its root (e.g., if strategies for anonymizing data are compromised). Designing for auditability also brings direct benefits to researchers by providing a mechanism for double-checking work and forcing oneself to be explicit about decisions, increasing understandability and replicability. For example, many types of social media and other trace data are unstructured, and answers to even basic questions about network ties, location, and randomness depend on the steps taken to collect and collate the data. Systems of auditability clarify how different datasets (and the subsequent analyses) differ from each other, aiding understanding and creating better research.

Plan for and welcome audits of your big data practices.

9. Engage with the broader consequences of data and analysis practices

It is also important for responsible big data researchers to think beyond the traditional metrics of success in business and the academy. For example, the energy demands of digital daily life, a key source of big data for social science research, are significant in this era of climate change [36]. How might big data research lessen the environmental impact of data analytics? Should researchers, for example, take the lead in asking cloud storage providers and data processing centers to shift to sustainable and renewable energy sources? As important and publicly visible users of the cloud, big data researchers collectively represent an interest group that could rally behind such a call for change. The pursuit of citations, reputation, or money is a key incentive for pushing research forward, but it can also result in unintended and undesirable outcomes.
In contrast, we might ask to what extent a research project is focused on enhancing the public good or serving the underserved of society. Are questions about equity or other public values being addressed in one’s data streams, or is a big data focus rendering them invisible or irrelevant to the analysis [37]? How can increasingly vulnerable yet fundamentally important public resources—such as state-mandated cancer registries—be protected? How might research aid or inhibit different business and political actors? While not all big data research need take up social and cultural questions, a fundamental aim of research goes beyond understanding the world to considering ways to improve it.

Recognize that doing big data research has society-wide effects.

10. Know when to break these rules

The final (and counterintuitive) rule is the charge to recognize when it is appropriate to stray from these rules. For example, in times of natural disaster or public health emergency, it may be important to temporarily set aside questions of individual privacy in order to serve a larger public good. Likewise, the use of genetic or other biological data collected without informed consent might be vital in managing an emerging disease epidemic. In all cases, be sure to review the regulatory expectations and legal demands associated with the protection of privacy within your dataset. This is, after all, an exceedingly slippery slope, so before following this rule (to break others), be cautious that the “emergency” is not simply a convenient justification. The best safeguard is experience built through engaging in the tough debates (Rule 6), constructing codes of conduct (Rule 7), and developing systems for auditing (Rule 8). The more mature a community of researchers is about its processes, checks, and balances, the better equipped it is to assess when breaking the rules is acceptable. It may very well be that you never arrive at a final, clear set of practices.
After all, just as privacy is not binary (Rule 2), neither is responsible research. Ethics is often about finding a good or better, but not perfect, answer, and it is important to ask (and try to answer) the challenging questions. Only through this engagement can a culture of responsible big data research emerge.

Understand that responsible big data research depends on more than meeting checklists.

Conclusion

The goal of this set of ten rules is to help researchers do better work and ultimately become more successful while avoiding larger complications, including public mistrust. To achieve this, however, scholars must shift away from a mindset that is rigorous when focused on techniques and methodology but naïve when it comes to ethics. Statements to the effect that “Data is [sic] already public” [38] are unjustified simplifications of much more complex data ecosystems embedded in even more complex and contingent social practices. Data are people, and maintaining a rigorously naïve definition to the contrary [18] will end up harming research efforts in the long run as pushback comes from the people whose actions and utterances are subject to analysis. In short, responsible big data research is not about preventing research but about making sure the work is sound and accurate and that it maximizes good while minimizing harm. The problems and choices researchers face are real, complex, and challenging, and so too must be our responses. We must treat big data research with the respect it deserves and recognize that unethical research undermines the production of knowledge. Fantastic opportunities to better understand society and our world exist, but with those opportunities comes the responsibility to consider the ethics of our choices in the everyday practices and actions of our research.
The Council for Big Data, Ethics, and Society ( http://bdes.datasociety.net/ ) provides an initial set of case studies, papers, and even ten simple rules for guiding this process; it is now incumbent on you to use and improve these in your research.

Author and article information

Journal: PLOS Computational Biology (PLoS Comput Biol), Public Library of Science (PLoS), ISSN 1553-7358
Published: November 15 2018
Volume 14, Issue 11: e1006508
DOI: 10.1371/journal.pcbi.1006508
© 2018. Licensed under http://creativecommons.org/licenses/by/4.0/
