
      Open Access: Taking Full Advantage of the Content

Editorial


          Abstract

This Journal and the Public Library of Science (PLoS) at large are standard bearers of the full potential offered through open access publication, but what of you, the reader? For most of you, open access may imply free access to read the journals, but nothing more. There is a far greater potential, but, up to now, little to point to that highlights its tangible benefits. We would argue that, as yet, the full promise of open access has not been realized. There are few persistent applications that collectively use the full on-line corpus, which for the biosciences at least is maintained in PubMed Central (http://www.pubmedcentral.nih.gov/). In short, there are no “killer apps.” Since this readership, beyond any other, would seem to have the ability to change this situation at least in the biosciences, we are issuing a call to action.

While, first and foremost, open access implies downloading and reading full papers for free, additional possibilities exist depending on how the open access material is licensed. PLoS and BioMed Central (BMC), for example, publish under a Creative Commons Attribution License (CCAL). Under this license authors retain ownership of the copyright for their article, but they allow anyone (commercial or non-commercial) to download, reuse, reprint, modify, distribute, and/or copy articles, as long as the original authors and source are cited. No permission is required from the authors or the publishers. Note that, while this is what PLoS and BMC mean by open access, it is not what other publishers mean, such as the National Academy of Sciences (NAS) in publishing the Proceedings of the National Academy of Sciences (PNAS) or Oxford University Press (OUP) in publishing the journal Bioinformatics. In these two examples, it means free to read, but with variation in what is implied by copyright. For PNAS, authors have full rights for print use and readers can freely use figures and tables (with attribution); for Bioinformatics, a Creative Commons license applies, but only for non-commercial use. This issue was recently addressed in more detail in a PLoS Biology Editorial [1]. The key point is that these licenses allow us to go far beyond reading material to manipulating it much like data.

Beyond what the licenses say about how we might use open access materials, there is the format in which these materials are available. Papers published as PDFs do not lend themselves to easy manipulation by computer. HTML is better, but the markup has more to do with presentation on a Web page than with the semantic content of the paper, which is where the great opportunities lie. XML versions of the paper offer the most promise. When publishers make XML versions available, most conform to the National Library of Medicine (NLM) Document Type Definition (DTD) (http://dtd.nlm.nih.gov). In addition, several markup languages have been developed, such as CellML (http://www.cellml.org) and MathML (http://www.w3.org/Math), which can be used alongside the NLM DTD to further describe the semantic content of a paper. Semantically aware markup is elaborated in a systematic fashion in the construction of the semantic Web [2], where the XML tags are related to each other in explicit ontologies. The analogy between an XML file of content offered by a publisher and XML content provided by a database provider should not be missed. As a community, we have been at the forefront of using the latter; will we be at the forefront of using the former?
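To make this concrete, consider what treating a paper “like data” looks like in practice. The short Python sketch below reads a PubMed Central-style full-text XML file conforming to the NLM DTD and pulls out the article title, abstract, and section headings. The file name article.xml is a placeholder; the element names are those used by the NLM journal publishing DTD.

```python
# A minimal sketch, assuming a local full-text file "article.xml"
# (hypothetical name) that follows the NLM journal publishing DTD,
# as PubMed Central full-text XML does.
import xml.etree.ElementTree as ET

tree = ET.parse("article.xml")
root = tree.getroot()  # the <article> element in the NLM DTD

# Title and abstract live under <front>/<article-meta>.
title = root.findtext(".//front/article-meta/title-group/article-title")
print("Title:", title)

abstract = root.find(".//front/article-meta/abstract")
if abstract is not None:
    # itertext() flattens nested markup (italics, links) into plain text.
    print("Abstract:", " ".join(abstract.itertext()).strip())

# Because the body is structured, sections can be enumerated directly:
# each <sec> element carries its own <title>.
for sec in root.iter("sec"):
    print("Section:", sec.findtext("title"))
```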
While the DTD and markup languages provide for extensions to meet the needs of each discipline, publishers and researchers have made little use of them to date. This is something of a chicken-and-egg situation: when significant markup is available, it will be used; then again, why go to the trouble of adding significant markup if there are no applications demanding it? The best way out would seem to be to do something significant with the markup we have, which may then inspire authors, publishers, and others to see the research and commercial potential of the corpus.

The use of such markup is a hallmark of Web 2.0 and is manifest in the idea of a mashup. Simply put, a mashup is an integration of Web content from multiple sources to provide a new and more powerful service beyond what can be achieved by any of the individual sources of information it comprises. This type of integration is facilitated if the semantic content from each information source can be identified, thus allowing meaningful integration to take place. Specifically in relation to publishing, the mashup manifests the blurring of the distinction between databases and journals, which will continue in the future [3],[4]. We already have a significant corpus from a variety of publishers sitting in PubMed Central that is ripe for mashup and other uses. Certainly, the growth rate of the archive hoped for by the NIH has not been met at this time [5], but new laws in the US and elsewhere are changing this situation. Something significant can be done with what we have, so where are the killer apps?

Consider the following applications from our own laboratories. They may not be killer applications, but they begin to illustrate what can be done with this online corpus. The key idea is manipulation of article text as “data” and integration of articles with other bioinformatics information resources.

First, BioLit (http://biolit.ucsd.edu) attempts to bridge the database and journal worlds [6]. Databases are rich in semantics, which are most often manifest in the form of a database schema with associated referential integrity to strictly impose access to those semantics. Journal text, on the other hand, as we have seen above, is generally bereft of controlled access to those semantics. Nevertheless, the results of natural language processing and unique terms such as database identifiers found in full journal text can be used to extract some semantic meaning and impose useful markup. This opens up the possibility of integrating database and literature content, which is one goal of BioLit, using the PLoS corpus and the RCSB Protein Data Bank (PDB; http://www.pdb.org) as a test bed. Of course, the best way to introduce semantic markup into a journal article is to capture it at the time the article is written. Doing so is another goal of the BioLit project, in collaboration with Microsoft. In the same way that a spellchecker examines every word of a written article, suggesting changes as needed, a semantic checker can use existing ontologies and synonym tables to suggest formal definitions and subsequent tagging of semantically relevant content for a variety of uses, for example, integration with database content and more directed searching. Open access literature provides a rich dataset with which to experiment with these ideas.

PubNet visualizes relationships based on the results of a PubMed query (http://pubnet.gersteinlab.org) [7]. Using a standard PubMed-style query, articles can be retrieved and associations developed by further retrievals.
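Both BioLit and PubNet start from the same raw material: XML retrieved programmatically from NCBI. As a minimal sketch of the BioLit idea (not BioLit's actual pipeline), the Python fragment below fetches PubMed records through the public E-utilities service and scans the abstracts for strings shaped like PDB identifiers. The PMID shown is a placeholder, and the regular expression is deliberately naive.

```python
# A minimal sketch of literature-to-database linking: fetch PubMed
# records via NCBI's public E-utilities and look for candidate PDB IDs.
# The PMID below is a placeholder; substitute any PubMed query result.
import re
import urllib.request
import xml.etree.ElementTree as ET

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
pmids = ["18369428"]  # placeholder PubMed IDs

url = f"{EFETCH}?db=pubmed&id={','.join(pmids)}&retmode=xml"
with urllib.request.urlopen(url) as resp:
    root = ET.fromstring(resp.read())

# Deliberately naive PDB ID pattern: a digit followed by three
# alphanumerics. Real text mining needs context to avoid false
# positives (years such as "2004" also match this pattern).
pdb_like = re.compile(r"\b[0-9][A-Za-z0-9]{3}\b")

for article in root.iter("PubmedArticle"):
    pmid = article.findtext(".//PMID")
    abstract = article.find(".//Abstract")
    text = " ".join(abstract.itertext()) if abstract is not None else ""
    candidates = sorted(set(pdb_like.findall(text)))
    # Candidates would then be verified against http://www.pdb.org.
    print(pmid, "->", candidates)
```

PubNet then builds graphs over records such as these, as described next.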
Associations are presented as graphs in which nodes represent terms and edges represent the relationships between them. A favorite query is to construct your own publication net, showing all your co-authors and how they have published with you and each other. A more generic example can be found in a recent article that showed the emergence of the RNAi field and the interrelationship of authors publishing related work in this field [8]. Associations can also be made between data items such as PDB identifiers, UniProt identifiers, and the like. PubNet operates on PubMed XML output, which includes only the publication details and abstract of a paper, so it does not take advantage of the full text; however, it could readily be expanded to do so if the rest of the paper were included in the XML output. It is easy to imagine how connections between results and specific entities (like protein identifiers) across a large body of literature could begin to yield interesting and provocative relationships.

SciVee (http://www.scivee.tv) [6] caters to the YouTube generation of video consumers; after all, they are the next Nobel Laureates. Using PLoS and other content taken from PubMed Central, SciVee provides a video-on-demand service that mashes up video provided by the authors with the paper content into what is called a “pubcast.” As the growing body of scientific literature threatens to overwhelm us, we are faced with either an abstract, which is consumed in a minute or two, or a full paper, which may take two to three hours to absorb in detail. SciVee's notion is that an intermediate amount of content is needed. Who better to provide this intermediate view than one of the authors, in a five-to-ten-minute video presentation of the content of the paper? If only the abstract of the paper is available, the story ends there: a video and abstract side by side. If the full text of an open access article is available, additional possibilities emerge. The paper may be synchronized with the video, so that as the author talks, appropriate tables, figures, and text are brought into view (see http://www.scivee.tv/node/5275 for an example). Alternatively, upon a single click, the author may pop up and explain a particular segment in more detail.

Authors of accepted PLoS papers are invited to make video segments and upload them to the SciVee Web site. This can be done using a webcam and software standard on a PC or Mac, or done more professionally. Our experience has been that such videos cost about US$150 at our home institutions using one of the available media services, a natural evolution from the days when we used to make 35 mm slides for a presentation. However, unlike slides, which were viewed a few times by a select audience, pubcasts are viewable by a worldwide audience at any time. We do know already that the availability of online synchronized open access content generates interest in the online version of a paper, perhaps bringing a new audience to the work; it remains to be seen how it improves the comprehension and learning experience.

Podcasts may be what the reader is seeking when video seems like overkill. Audio tracks could be associated with major figures or other visual elements taken from the open access paper. Perhaps a podcast of the traditional journal issue is desirable: while jogging or walking to the laboratory, you could get an overview of the latest issue of this journal, presented either by the authors of papers in that issue or by a journal editor.
This takes eToCs to a new level and medium. It seems that every student walking around campus has the means in their hands and ears to take advantage of this today. This could also benefit scientists with disabilities. Science, Nature, and other journals are using podcasts regularly, and they seem to be well received.

Certainly open access journals, such as the PLoS journals, have an opportunity to try to develop those killer apps. PLoS is using the TOPAZ application framework for a publication application built on a semantic repository. TOPAZ allows users to add notes directly to the article content and to add comments to the article. The published article then becomes the basis for an evolving discussion within the scientific community rather than a static document. The user notes are also stored as relationships to the article, which can later be mined to uncover new connections in the research. The journals PLoS ONE and PLoS Neglected Tropical Diseases are published using the TOPAZ application framework, and other PLoS journals have just started using the same framework. Another long-term notion at PLoS is that of portals or hubs in which selected materials from across the journals (and from open access literature as a whole) can be brought together by readers to form their own personalized view of the literature, or by special interest groups to share with emerging communities.

Let us consider some other opportunities, hopefully to whet your appetite for creating your own killer apps. So far, open access publications have been viewed by their readership (and often by their publishers) in very traditional ways. That may be changing; consider the ability to comment on a paper. This journal now offers readers the ability to comment on any aspect of a published paper for all to read, and we certainly invite you to comment on this Editorial. Many of you may not think twice about sending a comment to a list server or blog; however, you may perceive that as a different medium with a different social or professional context, and one that may provide anonymity. Perhaps a video about a paper, as described above, can also help overcome the stigma attached to rating the paper itself? Certainly rating a paper would seem reasonable when done by the Faculty of 1000 (http://www.f1000biology.com), but it is not a generally accepted practice. We challenge you to rate this Editorial too. In some ways the reluctance to rate a scientific paper is strange, since we suspect the same person may well rate a book on amazon.com.

Another option would be to add a Digg or del.icio.us button (http://digg.com or http://del.icio.us) to incorporate conventional media ranking tools into an academic journal Web site. If one finds an interesting article, one could immediately flag it with these tools. The New York Times, PNAS, and many other publications already offer this possibility, which would be an interesting vehicle for us authors and readers, both to get quick user feedback on interesting articles and to leverage mainstream tools. Taking this a step further is to introduce the idea of folksonomy, where readers themselves tag articles with semantically useful (and hopefully controlled) terms as a way to provide semantic content. In the life sciences this is simply an extension of what annotators at the National Library of Medicine do in associating Medical Subject Headings (MeSH) with papers, including those in this journal. The difference proposed here is that the content is controlled by the community of readers.
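To make the folksonomy idea concrete, here is a minimal sketch of how reader-applied tags might be stored and mined as simple (article, term, reader) triples. The schema, table name, and the second DOI are illustrative inventions, not the TOPAZ data model.

```python
# A minimal sketch of community tagging: reader-applied terms stored as
# (article, term, reader) triples that can later be mined. The schema
# and the second DOI are hypothetical, for illustration only.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tag (doi TEXT, term TEXT, reader TEXT)")
db.executemany("INSERT INTO tag VALUES (?, ?, ?)", [
    ("10.1371/journal.pcbi.1000037", "open access", "reader-a"),
    ("10.1371/journal.pcbi.1000037", "text mining", "reader-b"),
    ("10.1371/journal.pxxx.0000001", "text mining", "reader-a"),  # hypothetical DOI
])

# Community view of one article: terms ranked by how many readers applied them.
rows = db.execute("""SELECT term, COUNT(*) AS n FROM tag
                     WHERE doi = ? GROUP BY term ORDER BY n DESC""",
                  ("10.1371/journal.pcbi.1000037",))
for term, n in rows:
    print(term, n)

# Mining the triples the other way: articles linked by a shared tag, the
# kind of relationship an annotation store could later uncover.
for (doi,) in db.execute("SELECT DISTINCT doi FROM tag WHERE term = 'text mining'"):
    print(doi)
```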
A related concept, which has been nominally explored by Nature [9] and others, is giving the referee the option to make his or her review public. In addition to communicating comments exclusively to the editor and to the authors (usually anonymously), one could also elect to have one's referee report, or parts of it, made public on the Web with the published article, either in a personalized or in an anonymous fashion. This would create an incentive for referees, allowing them to get recognition for their work, as readers would see directly the referees' names and their comments associated with each article. We could allow authors to post their formal responses to referees on the journal Web site as well. Referees and authors make tremendous efforts putting together reports and responses, and making them publicly available would be a way for the journal and the community as a whole to get additional value from this content, by providing direct commentary on an article's strengths and weaknesses and by giving didactic clues to students and post-docs. We feel open review has the possibility of improving the review process immensely, but we also expect objections from some authors and reviewers.

These are a few ideas that we have come up with for making use of the wealth of knowledge contained in open access articles. We feel that it is now time for the community represented by this readership to act. What say you? It is important that we hear from you on the subject of better use of open access content. At the forthcoming Intelligent Systems in Molecular Biology Conference there will be a session on Scientific Publishing where these views will be discussed, and we also encourage feedback via e-mail, blog, or article comment.


Most cited references (10)


          The mice that warred.

          G Stix (2001)

            Will a Biological Database Be Different from a Biological Journal?

The differences, or otherwise, between biological databases and journals are an important question to consider as we ponder the future dissemination and impact of science. If databases and journals remain discrete, our methods of assimilating information will change relatively little in the years to come. On the other hand, if databases and journals become more integrated, the way we do science could change significantly. As both Editor-in-Chief of PLoS Computational Biology and Codirector of the Protein Data Bank (PDB), one of the oldest and most widely used data resources in molecular biology, I find the question particularly pertinent. Here, I give my perspective on what could and, I believe, should happen in the future.

My vision is that a traditional biological journal will become just one part of various biological data resources as the scientific knowledge in published papers is stored and used more like a database. Conversely, the scientific literature will seamlessly provide annotation of records in the biological databases. Imagine reading a description of an active site of a biological molecule in a paper, being able to access immediately the atomic coordinates specifically for that active site, and then using a tool to explore the intricate set of hydrogen-bonding interactions described in the paper. Not only are the data generated by the experiment immediately available within the context of what you are reading, but specific tools for interpreting these data are provided by the journal. Alternatively, if you are starting with the data, for example, viewing the chromosome location of a human single-nucleotide polymorphism associated with a neurological disorder, you can immediately access a variety of papers ranked in order of relevance to your profile, not just through links to abstracts but also by pinpointing the reference to the single-nucleotide polymorphism in the full-text article. The type and order of articles displayed could differ depending on whether you are, for example, a molecular biologist or a neurosurgeon. At this point, whatever your user profile, the distinction between a database and a journal article disappears.

How could this happen? To answer this question, we must think about the parallels that exist today between biological databases and biological journals. The daily work of any high-throughput scientific journal or biological database consists of information input, information processing, and information output. Consider the parallels between a journal and a database for each of these three steps. On a daily basis, the journal accepts manuscripts; once these have been checked for format compliance and completeness, they undergo review, either by an internal group of scientific editors or, as is the case for PLoS Computational Biology, through peer review by the scientific community. Likewise, a biological database such as the PDB accepts submissions from the community, which are checked for format compliance and reviewed internally by experienced annotators. There are even parallel presubmission steps in journals and databases. For example, potential authors in PLoS Computational Biology may make presubmission inquiries to confirm the suitability of their paper, and depositors to the PDB may run their entries against a validation server to determine whether the data are in compliance, prior to having the same tests run by a PDB annotator.
Once registered with the corresponding online submission system, a journal manuscript receives a permanent manuscript number, while a database entry receives a unique identifier. Subsequent revisions can be mapped to these respective numbers, so that both journals and databases can provide an accurate audit trail of manuscripts and entries, respectively. Once a manuscript or entry is accepted as compliant, both undergo review processes involving one or more iterative steps between institution and author, as the manuscript or entry is refined and finally released. Release cycles of journals and databases have also become similar: journals such as PLoS Computational Biology have an option for early online release as soon as the manuscript is accepted, and biological databases typically release entries on a daily or weekly basis, as soon as they have been processed. Not only are the daily operations of databases and journals similar, but the business models also have parallels (I will not dwell on them here, though).

Certainly from a consumer's perspective, in terms of accessibility, there is no difference between a paper in a PLoS journal and an entry in the PDB: both are freely available to all. In the case of open-access journals and open archives like the PDB, the parallels, from the perspective of the consumer, are even more profound than just free access, yet are frequently overlooked. PLoS articles are published under a Creative Commons Attribution License, which means that the contents (text and images) of all PLoS journals can be used as the consumer sees fit, provided original attribution is given to the appropriate authors and source. So it is with the contents of many biological databases, including the PDB. Consumers are free to take and analyze the contents by any means they see fit, but are expected to attribute information to the authors of the original material, as appropriate. Finally, in the case of PLoS journals, the copyright of the material is not signed over to the publisher but remains with the original author, which is also true of information provided to most biological databases. In both forms of open access, journals and databases alike, the only requirement is to provide an immutable reference to the material. In the case of an online journal article, this reference most often takes the form of a digital object identifier (DOI), and for a database entry, it is usually a unique accession number. Like the contents of manuscripts and database entries, I expect these two forms of immutable identifiers to become indistinguishable from each other, as I will outline subsequently.

Given these parallels, what is the difference, at this point in time, between an entry in a database and an article in a journal? Currently the difference can be characterized as a mix of perception and content. Clearly, no one perceives a database entry of, say, a sequence, or a specimen in a museum collection, as being as valuable as the journal paper that describes it. But, ironically, to the consumer, at least by one measure, the database entry may indeed be more valuable. The structure of human deoxyhemoglobin is one of the most downloaded structures in the PDB: in one year, it was downloaded more times than the original paper has ever been cited.
Yet from the authors' perspective, the Nobel Prize does not come from constructing the PDB entry, but from an eloquent description of the relationship between structure and function that was presented most completely in the literature. A tenure committee does not award tenure based on the number of deposits a faculty member has made to a biological database, but rather on the number of papers they have published in leading journals.

Those of you who have made it this far might be thinking it is ridiculous that I should regard the content of a database entry in the same way that I regard the content of a scientific paper, given these differences in perception and content. It is possible, though, that you are thinking this way based on traditional perceptions of content and not on the way things should be, going forward, given current technologies and social practices. To set the stage for the subsequent discussion, I will highlight three current observations relevant to this assertion.

First, publishers have embraced the Internet as a distribution medium but, for the most part, have not used the medium beyond that, simply distributing material in the same way as in printed form. Hyperlinks in documents and citation indexes are exceptions, but compared to what many biological database developers have achieved in terms of information integration and comprehension through novel display techniques, such added functionality is minimal.

Second, online journals have greatly reduced the necessity for page limits on papers, since the costs of supporting a long versus a short paper are much lower online than in printed form. Journals publishing both online and in print solve this size problem by having short articles in print and placing additional material as supplements in an online form only. This practice has increased dramatically in the past few years: consider the amount of supplementary material in one issue of the Proceedings of the National Academy of Sciences of the United States of America today versus five years ago. Supplementary material can be a valuable addition or, alternatively, can make for a disjointed piece of work. Moreover, supplemental material is ad hoc and cannot be readily queried across all articles, even though a small amount of it is already tagged and comes directly from a database.

Third, the perceived value of both a database entry and a journal article has changed over the years. As high-throughput techniques have become more prevalent, data are produced at an ever-increasing rate, so the value of a unit of data, for example, a sequence or structure, has diminished. Data producers hoard their data less than they did in past years. Similarly, the rate of publication has increased dramatically, brought about by accelerated technologies for manuscript production, large collaborative studies, and an increased emphasis on the notion of “publish or perish.” In short, journal content is already becoming more like database content and vice versa. Can this trend continue?

Consider how the respective content of journals and databases is organized. Both have varying degrees of content organization. Papers have structure, but the organization of their content is less detailed than that found in a database, although this is changing as formal document type definitions are applied, from which database schemas can be generated.
Typically a paper has an introduction, a materials and methods section, a results section, and a discussion section; it possibly uses consistent terms for genes, enzymes, and diseases; and in a post-production step, keywords and/or medical subject headings for indexing the content of the article are added. Databases, on the other hand, frequently have a high level of organization, where data are granular and each granule is described in exquisite detail. The advantage of a paper is that it is relatively easy to input and maintain, but it requires human recall. Machine-based recall of meaningful information is poor, a problem being addressed but certainly not solved by the discipline of natural-language processing. A database, on the other hand, has excellent recall but requires much effort to organize and is best suited to quantitative data, not free text. I would contend that the future offers some middle ground for content organization.

We have taken the first steps toward a middle ground by making both the combined contents of biological databases and the biological literature freely available in electronic form. Is the technology available to support the next steps in integration, and is the scientific community ready for such a change? I believe that the answer to the technology part of the question is yes. I do not know the answer to the second part, but I think it's time for some preliminary experiments to find out. I would be most interested in hearing views on the matter and any suggestions for potential experiments. In the interim, here are a few experiments I am proposing.

As mentioned above, DOIs provide an immutable reference to a scientific document that exists online. The way I think about DOIs is the same way I think about the addresses used to identify computers on the Internet: each is a unique identifier that can be seamlessly resolved to the specific resource it names. So it is with DOIs, which can be resolved not only to find the material referenced by the DOI but, through reverse searching, can also be used to find material that references the DOI. Think of what could happen if such DOIs were assigned not only to papers, as they are now, but also to items of content within biological databases (protein structures, species distributions, neuroimaging datasets, and so on), and if these DOIs were referenced when that content was used or discussed elsewhere. An immediate outcome would be the ability to find all papers that reference a particular sequence motif, for example: a level of detail that is not currently available to someone accessing a sequence database. Conversely, accessing a paper would immediately provide a resolvable list of the sources of data used in the experiments, which could be accessed and further analyzed, a step toward achieving true reproducibility of an experiment, where the paper has become the interface to the data. Unfortunately, DOIs cost money, and providing a fine level of granularity, such as all sequence motifs for every sequence in the Protein Families Database of Alignments and HMMs, would be prohibitively expensive. Publishers should collaborate with the major database providers, so that database providers supply the appropriate immutable references and published articles reference them.
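As a small illustration of DOIs as resolvable references, the minimal sketch below asks the DOI proxy at doi.org to resolve a DOI (the DOI of the editorial citing this article) to the current location of the document it names; the same mechanism would serve database granules if they carried DOIs.

```python
# A minimal sketch of a DOI as a resolvable, immutable reference: the
# DOI proxy at doi.org redirects any DOI to the current location of the
# object it names. The DOI below is real; any valid DOI would do.
import urllib.request

doi = "10.1371/journal.pcbi.1000037"
req = urllib.request.Request(f"https://doi.org/{doi}", method="HEAD")
with urllib.request.urlopen(req) as resp:
    # urllib follows the redirect chain; geturl() is the final location.
    # (Some publisher sites reject HEAD requests; fall back to GET then.)
    print(doi, "resolves to", resp.geturl())
```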
As another experiment, what if the data in an online paper became more alive? Some databases let you download data into spreadsheets or other client-side applications that render and analyze data. Papers could be treated this way, too. The technology is there to create ubiquitous clients that are independent of operating systems and hardware and that are downloadable on demand. New levels of comprehension might be achievable. The first step would be to provide tools that better visualize specific types of biological data, without the need for specialized knowledge of an esoteric tool. Later would come tools for basic analysis, for example, simple statistical tests or principal-component analysis.

Consider one final experiment: what if papers were made to show a higher level of organization than is possible today? Clearly, too much additional work by the author would be resisted, unless it brought clear rewards. Nevertheless, tools can be envisaged that, with minimal work by the author, would further classify the text such that, for example, annotation associated with a particular gene or set of genes is identified, or a set of keywords is generated to be associated with the paper as metadata, and all the author would have to do is confirm their validity. Recent benchmarks indicated that 80% of terms such as gene names could be identified automatically and hence associated with systematic annotation, which could simply be accepted or rejected by the author [1]. Would an author do it, if it led to more rapid citations? I would say so! This type of experiment has already proved successful in the community engaged in small-molecule structure determination, although without the data being publicly accessible in an easy way. With the incentive of more citations, the author would review the proposed systematic nomenclature, and we would then have the potential for a new association between the text of a paper and, say, a gene and the description of that gene in a database. If the connection is transparent to the reader, the paper has thus become a detailed entry point to the database and the database has become a detailed entry point to the literature.

These experiments, if successful, would go a long way toward answering the question posed here: is a biological database any different from a biological journal? I am working toward the answer: no, there is no difference. If you want to help answer this question, I would welcome hearing from you; after all, journals, like databases, should be community resources.

              PubNet: a flexible system for visualizing literature derived networks

Rationale

The amount of widely accessible scientific data has increased dramatically in recent years. There are currently more than 31,000 structures in the Protein Data Bank (PDB) [1], as compared with 3,000 structures 10 years ago. Swiss-Prot [2] now contains more than 178,000 sequence entries, up from 40,000 in 1994. With continual advances and refinements of experimental and computational technologies, data creation promises to accelerate for the foreseeable future. PubMed [3] stands out as a key information resource in the biological sciences in terms of diversity, breadth, and manual curation. PubMed entries comprise an order of magnitude more data than the three billion bases of the human genome. In addition to basic citation and abstract information, PubMed provides rich meta-information, including Medical Subject Headings (MeSH) terms, detailed affiliation, and any secondary source databanks and accession numbers of molecules discussed in each article. By parsing the XML output of a query and performing a few simple operations, it is possible to uncover many interesting relationships among publications.

Previous work has been done to augment or refine the standard PubMed search, including tools to conduct combinatorial searches [4] and to navigate standard search results based on common MeSH terms [5], gene names found in abstracts [6,7], PubMed-assigned 'related articles' [8], and combinations thereof [9-12]. In PubNet we present a unique two-pronged approach in which network graphs are dynamically rendered to provide an intuitive and complete view of search results, while hyperlinking to a textual representation to allow detailed exploration of a point of interest. Multiple simultaneous queries are also supported, greatly increasing the number and types of relationships that can be visualized. The PubNet server, source code, and gallery are available on the World Wide Web [13].

How PubNet works and interpreting the output

Visualizing a publication-extracted network is done by entering at least one PubMed query into the provided textbox, selecting node and edge parameters, and clicking 'Submit' (Figure 1a). Each query is relayed to PubMed, so all standard PubMed syntactical conventions apply. The PubMed XML output is parsed and the network graphs are drawn with the aid of aiSee graph visualization software [14]. The simplest PubNet example is the network relating papers by shared authorship, generated from a single query (Figure 1b and 1c). In this example, there is a one-to-one correspondence between the number of papers returned by the query and the number of nodes drawn on the graph. Each pair of papers is then linked by an edge if they share at least one common author, and edges are scaled in thickness for multiple common authors.

Much more complex networks can be derived by entering two queries and selecting node parameters for which there may be a one-to-many correspondence between papers returned by PubMed and nodes associated with each paper (Figure 1d-f). As is often the case when nodes are set to Author or Databank ID, each publication returned by each query will expand to several nodes in the final network display. Nodes are colored according to the query from which they are derived, allowing for greater information content than an otherwise identical monochrome graph. For example, the degree to which nodes of different colors segregate or overlap can suggest specific relationships between the publications in the query results.
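The single-query, shared-authorship network just described is simple enough to sketch directly. In the Python fragment below, networkx is a stand-in used for illustration (PubNet itself renders graphs with aiSee), and the paper IDs and author names are invented: papers that share at least one author are linked, with each edge weighted by the number of shared authors.

```python
# A minimal sketch of PubNet's shared-authorship network: papers as
# nodes, an edge whenever two papers share at least one author. The
# papers and authors below are invented for illustration.
import itertools
import networkx as nx

papers = {
    "pmid:101": {"Smith J", "Jones A", "Lee K"},
    "pmid:102": {"Jones A", "Lee K"},
    "pmid:103": {"Garcia M"},
}

g = nx.Graph()
g.add_nodes_from(papers)
for (p1, a1), (p2, a2) in itertools.combinations(papers.items(), 2):
    shared = a1 & a2
    if shared:  # edge thickness in PubNet scales with shared-author count
        g.add_edge(p1, p2, weight=len(shared))

for p1, p2, data in g.edges(data=True):
    print(p1, "--", p2, "shared authors:", data["weight"])
```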
The graphical representation of a network is meant to provide a broad overview of the structure of meta-relationships returned by one or two queries. Each graph is downloadable in a variety of formats, including SVG, PS, PDF, and PNG; the vector formats permit image rescaling without loss of quality. Depending on the input queries and parameters, the specific coloring and arrangement of nodes and edges can mean a variety of different things. In all cases, nodes derived from the first query are colored blue, nodes derived from the second query are colored yellow, and nodes derived from papers appearing in both queries are colored green. Figure 2 can be used as a reference for interpreting the meaning of nodes and edges for each of the parameter combinations. Generally speaking, subsets of nodes that are highly connected are drawn together in tight clusters, whereas sparsely connected nodes are spread further apart. If two queries are entered, then the degree to which the two colors overlap on the graph can also be significant. These relationships can be compared quantitatively by exporting the network to TopNet [15], which calculates average degree, clustering coefficient, characteristic path length, and diameter for any network. TopNet automatically scores PubNet networks upon clicking the 'Export to TopNet' icon below any PubNet query result.

Hyperlinks to a textual representation of every graph are provided on its results page. The textual representation provides a summary list of all nodes and edges that comprise the network. Each entry in the summary is a hyperlink to a detailed description. For nodes, a list of outgoing edges as well as a list of all connected neighbors and their respective edges are shown, with common edges highlighted. Relevant external databank links are also provided at the top of the page. The detailed view of an edge shows a list of all nodes connected by that edge. Note that in the SVG graphical format each node is also a hyperlink to its entry in the text version of the network, which allows one to navigate quickly from an interesting region in the graph to a detailed description of its components.

Applications

Recent advances in high throughput techniques have made it possible to conduct biomedical research on a larger scale than was previously possible. These efforts often involve large groups of scientists from multiple institutions working in close collaboration on high throughput experiments, data collection, and analysis. There is little precedent in the biological sciences for executing or evaluating such large scale endeavors, but in the latter case a logical place to start is the product of those endeavors, namely publications. As we demonstrate below, the organization and output of a collaboration are very well reflected by patterns that can be extracted from its publication list (Figure 3).

The Protein Structure Initiative (PSI) is a large-scale effort led by the US National Institutes of Health that is aimed at streamlining the process of three-dimensional protein structure determination, with the long-range goal of providing three-dimensional structures of most proteins in nature. Nine structural genomics research centers are supported by the PSI, each of which has its own expertise, organization, and research focus [16]. To demonstrate the versatility of PubNet, we generated several graphs based on publication lists from each PSI center (Figure 4), including the Northeast Structural Genomics (NESG) consortium.
Structural genomics centers attempt to solve structures at very high throughput, and each center has its own unique approach to accomplishing this task. Because the PSI is still in its pilot stages, it is yet to be determined which approach is the most successful. Here we show how organizational, geographic, and social patterns of large collaborative research efforts are reflected in their publications.

Collaborative organization of a single consortium

We begin by illustrating the types of relationships that can be extracted from a single query (Figure 3). A query consisting of a list of all NESG PubMed IDs was analyzed using four different combinations of node and edge types, and each yielded strikingly different graph structures. Depending on the parameters that were specified to generate the graph, these linkages may correspond to similarity between papers, frequency of copublication between two authors (for a given query), common geographic sources for publications, and so on. The scalable vector graphics formats supported by PubNet allow one to zoom in on specific regions in the graph. Each node in the graph image is hyperlinked to a detailed textual report, which includes a hyperlinked list of all outgoing edges and a list of all neighboring nodes with their respective edges. Thus, starting directly from the graphical output, it is possible to explore specific node-edge linkages in detail.

In the graph shown for the NESG consortium in Figure 3b, nodes are authors (researchers) and edges represent co-authorships on publications. It demonstrates the confederated but coordinated approach used by the NESG consortium, which includes two protein sample production centers, at least six different sites at which three-dimensional structures are determined by nuclear magnetic resonance or X-ray crystallography, and a loosely coupled group of some dozen laboratories working on various aspects of technology development and annotation.

Comparison of several consortia

We also compare the publication authorship patterns of each of the PSI centers in Figure 4, using nodes to represent authors and edges to represent co-authorship. Because a single set of parameters was used across multiple queries, the underlying relationships between nodes are identical for each graph, and so differing graph structures correspond to variations in the global structure of these relationships. A diverse array of graph structures is evident, highlighting significant differences in size, frequency of publication, and degree of cooperation across the consortia. For example, the Tuberculosis Structural Genomics consortium [17] conducts its experiments in small separate groups, whereas the Joint Center for Structural Genomics [18] uses a more centralized approach. Groups such as the NESG [19] and the New York Structural Genomics Research Consortium [20] employ an intermediate approach, in which central groups are tightly clustered but also linked to other groups in a collaborative pipeline.

A simple example with Protein Data Bank IDs

In addition to extracting and rendering authors and papers as nodes, PubNet is able to use databank accession numbers found in PubMed citations, such as PDB, GenBank, or Swiss-Prot IDs. These databanks have tens or hundreds of thousands of entries, so when using databank IDs as nodes it is often useful to limit the scope and date range of queries to avoid overly complex results. Figure 5 shows a basic example using PDB IDs as nodes and MeSH terms as edges.
The first query, 'DNA polymerase 2004[dp]', is limited to a specific type of protein and to papers published in 2004. The second query, 'RNA polymerase 2004[dp]', is similar. Blue nodes cluster tightly together, as do yellow nodes, indicating that they are highly similar. Nodes in separate clusters are connected to each other in some cases. By examining the textual view of the nodes, it is easy to understand the underlying structure. Predictably, blue nodes are highly linked to each other by MeSH terms related to DNA polymerase, such as 'DNA-directed DNA polymerase'. Yellow nodes are linked by terms such as 'RNA polymerase II'. Blue and yellow nodes occasionally link to each other by terms such as 'Models, molecular'. Green nodes, which were extracted from papers returned by both queries, are linked to each other by the term 'DNA primase' and to other blue nodes by 'DNA-directed DNA polymerase'.

Evaluating the output of the Protein Structure Initiative

Figure 5 is an illustrative example; we present Figure 6 as a more practical example of the use of PubNet. To investigate the extent to which PSI structures are representative of all PDB structures, we compared several two-query PubNet graphs based on PSI and non-PSI structure publications. Two representative graphs are shown in Figure 6. To construct the queries, lists of primary citation PubMed IDs were compiled using the PDB search engine. The structural genomics PDB IDs were extracted from TargetDB [21], and sets of 300 regular PDB IDs were selected randomly from a total of 3,112 unique structures released in 2001-2002 that included a primary citation available in PubMed. Nodes were designated as papers and edges as shared MeSH terms. Because only primary citations were used, there is a one-to-one mapping of papers to PDB structures. Each node thus corresponds to a PDB structure, and the associated MeSH terms provide a description of that structure. Functional similarity among a subset of structures results in more common MeSH terms, which is reflected in the graph by greater connectivity of the nodes and tighter clustering relative to dissimilar nodes on the graph.

To compare PSI structures with general PDB structures, two types of graphs were generated. First, a two-query graph was generated with all available PubMed IDs associated with PSI structures comprising the first query, and a random set of 300 PDB IDs comprising the second query (Figure 6a). The second type of graph was generated by running two random sets of 300 PDB IDs against each other (Figure 6b). We have observed that differing patterns in PubNet graphs among ostensibly similar queries can reveal underlying differences derived from the content of the publications returned by each query. Major features that can vary include the degree of aggregation of nodes into different clusters (roughly indicating the subject of the protein structure) and the balance of blue and yellow nodes within the various clusters. If PSI structure publications were indistinguishable from random PDB structure publications, then we would expect graphs based on PSI structure publications versus random PDB structure publications to have a similar character to graphs based on two random sets of PDB structures. However, as shown in Figure 6a, the PSI structure publication nodes do not intersperse with regular PDB structure nodes as much as two sets of random structures do. The PSI nodes clearly tend to aggregate in tighter neighborhoods than do the other nodes.
Although this is by no means definitive, the differential clustering might indicate some underlying differences between the PSI structures and random PDB structures. One obvious source of difference is the fact that many PSI structures are un-annotated 'hypothetical proteins', and so they lack the MeSH terms required for greater dispersal. Another factor might be that similar methods are used to determine PSI structures, and this is reflected in their publications.

Assessing results with TopNet

In addition to examining the textual representation of the graph, qualitative assessments of the network structure can be verified by exporting the results of any PubNet query to TopNet. One particularly useful descriptor is the average degree of a network, which is the average of the degrees of all nodes. In a PubNet graph, node degrees increase with more common edge terms between the nodes, and a high average degree indicates that the nodes are highly connected to each other. Note that the utility of many topological descriptors depends on the connectedness of a graph. For a more detailed explanation of descriptors, see the report by Yu and coworkers [15].

In Table 1, we compare the TopNet-generated graph statistics of several graphs shown in the figures cited above. In Figure 4, the Joint Center for Structural Genomics graph is highly connected, and the Tuberculosis Structural Genomics consortium graph is sparsely connected; this difference is particularly evident in the 'average degree' scores for each graph. In Figure 5, nodes from the two 'polymerase' queries are very similar in layout and connectedness; as expected, their TopNet scores are nearly identical. For Figure 6, PSI nodes have a much higher average degree, lower diameter and average distance, and increased clustering coefficient when compared with random sets of PDB nodes. We note that when looking at a large number of nodes, even small differences in graph statistics are meaningful. Each feature of the graph confirms what is clearly visible in the graphical output: PSI nodes are better connected to each other and cluster more tightly together in comparison with random PDB nodes.
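The four descriptors attributed to TopNet above are straightforward to compute with standard graph software. The sketch below, again using networkx as a stand-in rather than TopNet itself, scores any graph (for example, the co-authorship sketch earlier) by average degree, clustering coefficient, characteristic path length, and diameter, restricting the path-based measures to the largest connected component since they are undefined on disconnected graphs.

```python
# TopNet-style descriptors computed with networkx as a stand-in
# (TopNet itself scores exported PubNet networks). Path-based measures
# are taken over the largest connected component, a common convention.
import networkx as nx

def topnet_style_stats(g: nx.Graph) -> dict:
    giant = g.subgraph(max(nx.connected_components(g), key=len))
    return {
        "average degree": sum(d for _, d in g.degree()) / g.number_of_nodes(),
        "clustering coefficient": nx.average_clustering(g),
        "characteristic path length": nx.average_shortest_path_length(giant),
        "diameter": nx.diameter(giant),
    }

# Demo on a standard benchmark graph shipped with networkx.
print(topnet_style_stats(nx.karate_club_graph()))
```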
Conclusion

In this paper we present PubNet, a web tool that can be used to extract and visualize a variety of relationships between publications indexed by PubMed. Distinguishing features of PubNet include its ability to generate several different types of graphs based on a single query and to accommodate two queries simultaneously, which greatly facilitates graph comparison. The basic functionality of PubNet is demonstrated by its application to publications derived from the PSI, which revealed a diverse array of collaborative patterns in the different research centers as well as increased similarity between primary citations associated with those structures relative to a random sample of PDB structure citations. It is unclear whether these differences will remain once the structures are properly annotated. By focusing on PSI publications we offer only a small glimpse of the possible uses of PubNet. Although only 15 combinations of node and edge parameters are currently supported, the number of different queries that can be entered is unrestricted. We have included a 'save' feature that permanently links any PubNet graph to a user gallery, and we invite the community to submit queries and comments.

                Author and article information

                Journal
                PLoS Comput Biol
                PLoS Computational Biology
Public Library of Science (San Francisco, USA)
                1553-734X
                1553-7358
March 2008
28 March 2008
Volume 4, Issue 3: e1000037
                Affiliations
                [1 ]Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, California, United States of America
                [2 ]Program in Computational Biology and Bioinformatics, Department of Molecular Biophysics and Biochemistry, and Department of Computer Science, Yale University, New Haven, Connecticut, United States of America
                Article
Publisher ID: 08-PLCB-ED-0082
DOI: 10.1371/journal.pcbi.1000037
PMCID: PMC2275780
PMID: 18369428
© 2008 Bourne et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Pages: 3
                Categories
                Editorial

Quantitative & Systems biology
