The Future of Science is Open (Access)
By Bill Hooker
Editor’s Note: APS recently instituted an open access initiative for its journals called “Free to Read” (see the October 2006 APS News, available online). Open Access is a controversial idea, and much of the impetus for it has come from within the biomedical community. This article surveys the issue from the perspective of a molecular biologist.
I’ve never had an idea that couldn't be improved by sharing it with as many people as possible–and I don't think anyone else has, either. That’s why I have become interested in the various “open” movements making increasing inroads into the practice of modern science. The best known of these, apart from the familiar Open Source (Free) Software movement, is the Open Access approach to research literature.
Open Access (OA) entails the freedom to read, use and redistribute the published results of scholarly research and derivative works based on those publications. OA literature is digital, online, free of charge and free of most licensing restrictions. What makes it possible is the consent of the author or copyright-holder (hence the focus on scholarly articles, for which authors are not usually paid), and the internet. Online publishing is much less expensive than its print-only ancestor, but it is not free; the big question of OA is how to pay the bills that do remain without charging access fees. Nearly all current OA models reduce to one of two basic blueprints: OA archives/repositories, and OA journals.
OA archives or repositories simply make their contents freely available to the world. They may contain preprints, refereed postprints, or both. Archiving preprints does not require any form of permission, and a majority of journals already permit authors to archive their postprints. Archives which comply with the metadata harvesting protocol of the Open Archives Initiative are interoperable and can be searched as though they comprised a single virtual database, using services such as OAIster. There are a number of open-source software packages available for building and maintaining OAI-compliant archives; Peter Suber maintains a list of lists of such archives, and SHERPA maintains a database of journal policies regarding pre/post-print archiving. Archives cost very little to set up and maintain, and increasing numbers of universities and research institutions are building their own. PubMed Central, maintained by the NIH, is probably the largest and best-known in biomedical science. ArXiv, run by Cornell University, is the principal means of transfer of research results for many (if not most) mathematicians and physicists.
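To give a feel for what “complying with the metadata harvesting protocol” involves, here is a minimal sketch in Python of how a harvester might ask a repository for its records and read the Dublin Core metadata that comes back. The repository address (`repository.example.org`), the sample record and the function names are all hypothetical, invented for illustration; only the `ListRecords` verb, the `oai_dc` metadata prefix and the XML namespaces come from the OAI-PMH 2.0 specification. No network request is made: the sketch parses a hand-written sample response.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces used by OAI-PMH 2.0 responses carrying Dublin Core metadata.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for an OAI-PMH repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def parse_records(xml_text):
    """Extract identifier, title and creators from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            "creators": [c.text for c in rec.iterfind(".//dc:creator", NS)],
        })
    return records


# Hand-written sample response from a hypothetical repository.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1234</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example postprint</dc:title>
          <dc:creator>Author, A</dc:creator>
          <dc:creator>Author, B</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

if __name__ == "__main__":
    print(list_records_url("https://repository.example.org/oai"))
    for record in parse_records(SAMPLE_RESPONSE):
        print(record["identifier"], "|", record["title"], "|", record["creators"])
```

Because every compliant archive answers the same request in the same format, a service can loop over many base URLs with the same few lines and merge the results, which is what lets separate archives be searched as one virtual database.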
OA journals are in most respects the same sorts of entities as traditional paid-access journals, but without the access fees. They perform peer review, and make the refereed articles available free to all comers. They pay the bills in a number of different ways. About half charge author-side fees, though who actually pays these is widely variable (author, author’s institution, funding body, etc.). The Directory of Open Access Journals (www.doaj.org) currently lists nearly 2500 peer-reviewed OA journals. Three of the most prominent OA journal publishers are the Public Library of Science, Hindawi Publishing and BioMed Central, and a number of traditional publishing companies now offer OA options.
A Personal Example
More than half of my publications to date are not freely available from the journals in which they were published. You cannot read them without paying a fee or relying on a library which carries (and has therefore paid for) the journal and issue in question, and neither can my professional colleagues.
For you as a taxpayer, this means that you are denied access to information for whose production you've already paid (since I’ve always been funded by government grants). For me as a scientist, it means that more than half of my work is, while not useless, certainly of much less use to the world than it might be. Fortunately, all of the journals concerned allow postprint archiving (though they don’t allow use of the published PDF), so I might be able to rescue it. I’ll have to either find a repository that will take the articles, or make one of my own. Whatever I do, I’m going to have to track down the published versions and then reverse-engineer an “unofficial” version. Why would I go to all this trouble? Because OA offers significant benefits and advantages to a variety of stakeholders:
Maximal research efficiency
The usual version of Linus’s Law says that given enough eyeballs, all bugs are shallow–meaning that with enough people co-operating on a development process, nearly every problem will be rapidly discovered and solved. The same is clearly true of complex research problems, and OA provides a powerful framework for co-operation. For instance, Brody et al. showed that, for articles in the high-energy physics section of arXiv, the time between deposit and citation has been decreasing steadily since 1991, and dropped by about half between 1999 and 2003. Alma Swan explains: “the research cycle in high energy physics is approaching maximum efficiency as a result of the early and free availability of articles that scientists in the field can use and build upon rapidly.”
Moreover, the machine readability of a properly formatted body of open access literature opens up immense new possibilities. Paul Ginsparg, founder of arXiv, observes: “True open access permits any third party to aggregate and data mine the articles, themselves treated as computable objects, linkable and interoperable with associated databases. We are still just scratching the surface of what can be done with large and comprehensive full-text aggregations.”
Examples include cheminformatics.org and the family of utilities and tools available through the NIH/NLM’s PubMed interface.
Maximal return on public investment
Just as OA is primarily aimed at literature for which the authors are not paid royalties, so one obvious focus of attention is government-funded research. Why should taxpayers pay twice, once to support the research and then again when the scientists they are funding need access to the literature? Open access to a body of knowledge makes that knowledge more available and useful to researchers, physicians, manufacturers, inventors and others who make of it the various socially desirable outcomes, such as advances in health care, that government funding of research is intended to produce.
Advantages for authors
There are well over 20,000 scholarly journals, and even the best-funded libraries can afford subscriptions to only a fraction of them. OA offers authors a virtually unlimited, worldwide audience: the only barrier is internet access. There is a large and steadily growing body of evidence showing that OA measurably increases citation indices. For instance, of the papers published in the Astrophysical Journal in 2003, 75% are also available in the OA arXiv database; the latter papers account for 90% of the citations to any 2003 Astrophysical Journal article, a 250% citation advantage for OA. Repeating the exercise with other journals returns similar results.
Not only is this of vital importance to academics when it comes to applying for funding or competing for tenure, it’s more or less the whole point of publishing research in the first place: so that other people can read and use it.
Advantages for publishers
The benefits that accrue to authors of OA works also work to the advantage of publishers: more widely read, used and cited articles translate to more submissions and a wider audience for advertising, paid editorials and other value-add schemes.
Advantages for administrators
One of the best available proxy measures for research impact is citation counting: how many times has a given paper been cited by other researchers in their published work? This idea led to the development of the impact factor, a measure of a particular journal’s importance within its own field. These sorts of bibliometric indicators are relied upon heavily by science administrators making decisions about funding, tenure, and so on. Open access, by removing the subscription barriers that splinter the research literature into inaccessible proprietary islands, raises the possibility of vast improvements in our ability to measure and manage scientific productivity.
Scalability
Peter Suber has pointed out that, because it reduces production, distribution, storage and access costs so dramatically, OA “accommodates growth on a gigantic scale and [...] supports more effective tools for searching, sorting, indexing, filtering, mining, and alerting–the tools for coping with information overload.” Online distribution is necessary but not sufficient for scalability, because subscribers to paid-access journals do not have unlimited budgets. For end users to keep pace with the explosive growth of available information, the cost of access has to be kept down to the cost of getting online.
Open Science
There is growing interest in extending the “open” aspect of Open Access to science as a whole. In a 2003 essay, Stephen Maurer noted that: “Open science is variously defined, but tends to connote (a) full, frank, and timely publication of results, (b) absence of intellectual property restrictions, and (c) radically increased pre- and post-publication transparency of data, activities, and deliberations within research groups.”
Peter Murray-Rust recently put together a Wikipedia page on Open Data. He writes: “Open Data is a philosophy and practice requiring that certain data are freely available to everyone, without restrictions from copyright, patents or other mechanisms of control.”
There are (I think) at least two requirements beyond Access and Data: Open Standards, and Open Licensing. Consider the following citation:
Hooker CW, Harrich D. The first strand transfer reaction of HIV-1 reverse transcription is more efficient in infected cells than in cell-free natural endogenous reverse transcription reactions. Journal of Clinical Virology vol 26 pp.229-38 (2003)
You can read that, but a computer cannot do anything really useful with the text string as given: it has no idea which part of the string means me and which means my co-author, where the title begins and ends, which numbers are page numbers and which are a date, and so on. Now remember that PubMed, the database from which I got it, contains millions of such citations (and abstracts, and links between papers that cite each other, and so on). Stored as text strings, they would be impossibly clumsy, but see what happens with the addition of simple metadata (the field labels):
Author/s: Hooker CW, Harrich D.
Title: The first strand transfer reaction of HIV-1 reverse transcription is more efficient in infected cells than in cell-free natural endogenous reverse transcription reactions.
Journal: Journal of Clinical Virology
Volume: 26
Pages: 229-38
Year: 2003
Now the citation is broken down into meaningful fields, each of which can be manipulated separately. The computer can now treat each string after “Author/s:” as a series of comma-delimited substrings (author names), the numbers after “Pages:” as a numerical range, and so on–which means you can ask the database useful questions, like “show me all the papers written by Hooker, CW between the years 2000 and 2006 and published in J Virol.” There you have a very simple example of the two pillars of a semantic web: metadata and standards.
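The step from labelled fields to answerable questions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the “Volume:” and “Year:” labels, the helper names and the way abbreviated page ranges are expanded are my own choices for the example, not a description of how PubMed actually stores or queries its records.

```python
# One citation as labelled text. "Author/s", "Title", "Journal" and "Pages"
# follow the article; "Volume" and "Year" are assumed labels.
RAW_RECORD = """\
Author/s: Hooker CW, Harrich D.
Title: The first strand transfer reaction of HIV-1 reverse transcription is more efficient in infected cells than in cell-free natural endogenous reverse transcription reactions.
Journal: Journal of Clinical Virology
Volume: 26
Pages: 229-38
Year: 2003
"""


def parse_fields(text):
    """Split 'Label: value' lines into a dictionary of raw strings."""
    fields = {}
    for line in text.strip().splitlines():
        label, _, value = line.partition(":")
        fields[label.strip()] = value.strip()
    return fields


def parse_pages(pages):
    """Turn '229-38' into the numerical range (229, 238)."""
    start, _, end = pages.partition("-")
    if not end:
        return int(start), int(start)
    if len(end) < len(start):
        # Abbreviated end page: borrow the leading digits of the start page.
        end = start[: len(start) - len(end)] + end
    return int(start), int(end)


def to_structured(fields):
    """Give each field a type the computer can work with."""
    authors = [a.strip() for a in fields["Author/s"].rstrip(".").split(",")]
    return {
        "authors": authors,
        "title": fields["Title"],
        "journal": fields["Journal"],
        "volume": int(fields["Volume"]),
        "pages": parse_pages(fields["Pages"]),
        "year": int(fields["Year"]),
    }


def find_papers(records, author, first_year, last_year, journal=None):
    """Return records by `author` within a year range, optionally in one journal."""
    hits = []
    for rec in records:
        if author not in rec["authors"]:
            continue
        if not first_year <= rec["year"] <= last_year:
            continue
        if journal is not None and rec["journal"] != journal:
            continue
        hits.append(rec)
    return hits


if __name__ == "__main__":
    database = [to_structured(parse_fields(RAW_RECORD))]
    for paper in find_papers(database, "Hooker CW", 2000, 2006):
        print(paper["year"], paper["journal"], paper["pages"])
```

The query only works because everyone agrees on what the labels mean and how the values are written; change “Author/s” to “Authors” in half the records and the same code silently misses them, which is why standards matter as much as metadata.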
Semantic markup is going to be increasingly necessary to scientific communication and analysis as more and more of it takes place online and as datasets grow ever larger and more complex. Science Commons makes the point using the tumor suppressor TP53: There are 39,136 papers in PubMed on P53. There are almost 9,000 gene sequences [...] 3,800 protein sequences [and] 68,000 data sets available. This is just too much for any one human brain to comprehend.
Quite apart from lack of brainspace, there are answers in those datasets to questions that their creators never thought to ask. In the same way that Open Access accelerates the research cycle and facilitates collaboration, so too does Open Data–and Open Standards is the infrastructure that makes it possible.
Similarly, Open Licensing also provides a kind of infrastructure–in this case, for dealing with intellectual property issues. It's fine to simply put your product on the web and let the world do as it will, but many people prefer to retain some control over what others do with their work. In particular, if you are concerned with openness you may want to ensure that the original and all derivative works remain part of the commons. That means reserving at least some rights, which is where licensing comes in. Open copyright licenses are fairly well established, from software licenses like the GPL to the various Creative Commons deeds. In contrast, efforts to make patent-based licenses “open” are just beginning. Science Commons is working on materials transfer agreements, and PIPRA and CAMBIA offer two working models for technology and data licensing.
Overall, I think “Open Science” is the banner under which the various Open X clans might most profitably assemble. Access and Data are crucial by definition, and although you could do Open Science on proprietary software (provided you made data and publications openly accessible), it is much more efficient to use Open Source software that is available to everyone without intellectual property or cost barriers. Similarly, Open Standards and Open Licensing might not be fundamental to the practice of Open Science, but both make possible such vast increases in efficiency that I would argue for their inclusion in any comprehensive definition or declaration.
In short, Open (Access + Data + Source + Standards + Licensing) = Open Science.
Bill Hooker is a molecular biologist by trade; he lives in Portland, OR and works on Myc-related transcription factors in cancer and development.
Further Reading: