Monday, October 30, 2006
The Future of Science is Open, Part 1: Open Access.
I've never had an idea that couldn't be improved by sharing it with as many people as possible -- and I don't think anyone else has, either. That's why I have become interested in the various "Open" movements making increasing inroads into the practice of modern science. Here I will try to give a brief introduction to Open Access to research literature; in the second instalment I will look at ways in which the same concept of "openness" is being extended to encompass data as well as publications, and beyond that, what a fully Open practice of science might look like.
The original paradigm: Open Source
Although the underlying concept of information as a public good goes back at least to the invention of the printing press and the end of the aristocratic/theocratic duopoly on literacy, programmers were the first people I know of to popularize this sort of "openness" in an academic setting. Richard Stallman started the GNU Project in 1983/4 as a reaction against the rising influence of proprietary software, and a year or so later founded the Free Software Foundation, which "is dedicated to promoting computer users' rights to use, study, copy, modify, and redistribute computer programs." What Stallman and the FSF mean by "free software" is famously summed up by the dictum, "free as in speech, not free as in beer"; more precisely, they mean "free" as in:
- The freedom to run the program, for any purpose
- The freedom to study how the program works, and adapt it to your needs
- The freedom to redistribute copies
- The freedom to improve the program and release your improvements to the public
Access to the source code is a precondition for these freedoms, and many advocates prefer that the "four fundamental freedoms" also be combined with some form of copyleft (basically a licence which explicitly disallows use of the original resource in any way that restricts the four freedoms for anyone else). About a decade later the Open Source Initiative appeared, offering itself as a "more pragmatic" approach to free software. The two definitions are pretty similar, though the OSI version allows some licensing that the FSF considers too restrictive of end users. Today, both the FSF and the OSI are powerhouse advocates for non-proprietary software, code that you can get your hands on and hack to your heart's content. There is a wealth of free software freely available for scientific purposes: for instance, the OpenScience Project maintains a list, as do (inter many alia) the NCEAS, the CBS and Indiana University. The NIH and EBI both maintain extensive services, there's an entire Linux distribution for science, SourceForge lists over 350 projects under "scientific", and a simple Google search finds dozens of free applications for molecular biology.
By analogy with Open Source, Open Access to the research literature entails the freedom to read, use and redistribute the published results of scholarly research and derivative works based on those publications. What follows is a version of Peter Suber's very brief introduction to OA; for more details, see his full Open Access Overview and Timeline of the OA Movement. The bottom line is this:
Open-access (OA) literature is digital, online, free of charge, and free of most copyright and licensing restrictions. What makes it possible is the internet and the consent of the author or copyright-holder.
Most scholarly journals do not pay authors, who therefore do not lose revenue by publishing under OA conditions. Thus the controversies about OA to music and film (was Napster "piracy"? did it cost any actual musicians any money?) do not apply to the scholarly literature, the authors of which are clearly better off if access to their work is not restricted. Online publishing is much less expensive than its print-only ancestor, but it is not free; the big question of OA is how to pay the bills that do remain without charging access fees. Nearly all current OA models reduce to one of two basic blueprints: OA archives/repositories, and OA journals.
OA archives or repositories simply make their contents freely available to the world. They may contain preprints (the author's version prior to peer review), refereed postprints, or both. Archiving preprints does not require any form of permission, and a majority of journals already permit authors to archive their postprints. Archives which comply with the metadata harvesting protocol of the Open Archives Initiative are interoperable and can be searched as though they comprised a single (enormous, virtual) database, using high-level services such as OAIster. There are a number of open-source software packages available for building and maintaining OAI-compliant archives; Peter Suber maintains a list of lists of such archives, and SHERPA maintains a database of journal policies regarding pre/post-print archiving. Archives cost very little to set up and maintain, and increasing numbers of universities and research institutions are building their own. PubMed Central, maintained by the NIH, is probably the largest and best-known in biomedical science. ArXiv, run by Cornell University, is the principal means of transfer of research results for many (if not most) mathematicians and physicists. Stevan Harnad, a leading advocate of self-archiving, maintains a comprehensive self-archiving FAQ file.
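To make the idea of metadata harvesting a little more concrete, here is a minimal Python sketch of what a harvester might do with an OAI-compliant archive: build a ListRecords request asking for Dublin Core metadata (the oai_dc format), then pull the titles out of the XML that comes back. The base URL, the record identifier and the sample response are all made up for illustration, and nothing here touches the network; a real harvester would also need to handle resumption tokens and error responses.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Dublin Core element namespace used inside oai_dc metadata records.
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def extract_titles(xml_text):
    """Return every Dublin Core title found in an OAI-PMH response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("{%s}title" % DC_NS)]


# Hypothetical response, trimmed to the parts this sketch reads.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample postprint</dc:title>
          <dc:creator>Example, A.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
""".strip()

# Hypothetical repository endpoint, used only to show the URL shape.
url = build_list_records_url("https://repository.example.org/oai")
titles = extract_titles(SAMPLE_RESPONSE)
print(url)
print(titles)
```

Because every compliant archive answers the same request in the same format, a service can loop over many base URLs and merge the results, which is what lets the separate archives be searched as one virtual database.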
OA journals are in most respects the same sorts of entities as traditional paid-access journals, but without the access fees. They perform peer review, and make the refereed articles available free to all comers. They pay the bills in a number of different ways. About half charge author-side fees, though who actually pays these is widely variable (author, author's institution, funding body, etc.). Publishing in an OA journal is obviously 100% compatible with self-archiving. The DOAJ currently lists nearly 2500 peer-reviewed OA journals, of which more than 700 are searchable at the article level; for larger lists of OA journals which may or may not be peer-reviewed, see JournalSeek or Yahoo's Free Full Text. Three of the most prominent OA journal publishers are the Public Library of Science, Hindawi Publishing and BioMed Central, and a number of traditional publishing companies now offer OA options.
A personal example
I have yet to publish any data here in the US, but I published a dozen or so articles while I was at the University of Queensland. More than half of these are not freely available from the journals in which they were published (J Clin Virol, Virology, Biochim Biophys Acta, Mol Biochem Parasitol, Acta Tropica -- all Elsevier journals, pfui! -- and Rev Med Virol from Wiley InterScience). I couldn't find any full-text copies online using Google Scholar or PubMed, either. You cannot read these seven papers of mine without paying a fee (usually around $30) or physically going to a library which carries (and has therefore paid for) the journal and issue in question. Neither can my professional colleagues, unless their institution happens to subscribe to the journal or some package which includes it; these subscription fees are commonly extortionate (Elsevier being a particularly egregious offender).
For you as a taxpayer, this means that you are denied access to information you've already paid for (since I've always been funded by government grants). For me as a scientist, it means that more than half of my life's work to date is, while not useless, certainly of much less use to the world than it might be. Given that a large part of why I do what I do is that I want to leave the world a better place than I found it, that is simply not acceptable to me. Fortunately, according to RoMEO, all of the journals concerned allow postprint archiving by authors, so I might be able to rescue it. Searching for "queensland" in DOAR (one of a number of such directories) leads me to ePrints UQ, so there is a relevant archive for me to use, but there's a catch: you have to be a current UQ staff member to deposit. I can (and will) talk to David Harrich, my boss at the time, about archiving all of our HIV papers, since Dave is still at UQ. My schistosomiasis papers, though, have no one on the author lists who could deposit them, so I'll have to contact the staff at ePrints UQ and see whether there's a way for ex-staff to deposit articles. If there isn't, I'll have to either find another repository that will take the articles, or make one of my own. Since my current employers don't have an institutional repository, I'm going to have to make that choice anyway for upcoming papers. Both arXiv and Cogprints will take biology papers, although mine don't seem to fit into any of their categories, and Peter Suber has mentioned building a Universal Repository in collaboration with the Internet Archive, but I'm not sure if anything has come of that endeavour. That leaves me with the option of building my own archive, for the purposes of which there are numerous open-source software packages available. 
Alternatively, at least as a first step, I could simply upload the papers to my own webspace somewhere and try to make sure the Internet Archive and Google Scholar know about them, so that they would be available though not interoperable with other repositories. Finally, there's one last catch: Elsevier won't let me use their pdf versions, and I don't have the original files in most instances. So whatever I do, I'm going to have to track down the published versions and then reverse-engineer an "unofficial" version.
Why would I go to all this trouble? Because OA offers significant benefits and advantages to a variety of stakeholders:
Benefits of Open Access
1. Maximal research efficiency. The usual version of Linus' Law says that given enough eyeballs, all bugs are shallow -- meaning that with enough people co-operating on a development process, nearly every problem will be rapidly discovered and solved. The same is clearly true of complex research problems, and OA provides a powerful framework for co-operation. For instance, Brody et al. showed that, for articles in the high-energy physics section of arXiv (one of the oldest archives available for such study), the time between deposit and citation has been decreasing steadily since 1991, and dropped by about half between 1999 and 2003. Alma Swan explains: "the research cycle in high energy physics is approaching maximum efficiency as a result of the early and free availability of articles that scientists in the field can use and build upon rapidly".
Moreover, the machine readability of a properly formatted body of open access literature opens up immense new possibilities. Paul Ginsparg, founder of arXiv, observes:
True open access permits any third party to aggregate and data mine the articles, themselves treated as computable objects, linkable and interoperable with associated databases. We are still just scratching the surface of what can be done with large and comprehensive full-text aggregations.
...exciting new developments in text-mining and data-mining are beginning to show what can be done to create new, meaningful scientific information from existing, dispersed information using computer technologies. Research articles and accompanying data files can be searched, indexed and mined using semantic technologies to put together pieces of hitherto unrelated information that will further science and scholarship in ways that we have yet to begin imagining. These technologies are just in their infancy at the moment. Real scientific advances will be made using them but the technologies can only be applied effectively to the open access corpus: literature and data hidden behind journal or databank access restrictions are invisible to the computer tools that can do this work...
2. Maximal return on public investment. Just as OA is, at least for now, primarily (though not exclusively) aimed at literature for which the authors are not paid any kind of royalty, so one obvious focus of attention is government-funded research. Why should taxpayers pay twice, once to support the research and then again when the scientists they are funding need access to the literature? More importantly, open access to a body of knowledge makes that knowledge more available and useful to researchers, physicians, manufacturers, inventors and others who turn it into the various socially desirable outcomes, such as advances in health care, that government funding of research is intended to produce. Peter Suber has gone over this intuitive position in some detail here.
3. Advantages for authors. There are well over 20,000 scholarly journals, and even the best-funded libraries can afford subscriptions to only a fraction of them. OA offers authors a virtually unlimited, worldwide audience: the only access barrier is internet access (which is, of course, cheaper to provide in poorer nations than comprehensive libraries of print journals would be!). There is a large and steadily growing body of evidence showing that OA measurably increases citation indices (that is, the number of times other papers refer to a given article). For instance, of the papers published in the Astrophysical Journal in 2003, 75% are also available in the OA arXiv database; the latter papers account for 90% of the citations to any 2003 Astrophysical Journal article, a 250% citation advantage for OA. Repeating the exercise with other journals returns similar results.
Not only is this of vital importance to academics when it comes to applying for funding or competing for tenure, it's more or less the whole damn point of publishing research in the first place: so that other people can read and use it!
4. Advantages for publishers. The benefits that accrue to authors of OA works also work to the advantage of publishers: more widely read, used and cited articles translate to more submissions and a wider audience for advertising, paid editorials and other value-add schemes.
5. Advantages for administrators. One of the best available proxy measures for research impact is citation counting: how many times has a given paper been cited by other researchers in their published work? This idea led to the development of the impact factor, a measure of a particular journal's importance within its own field. These sorts of bibliometric indicators are relied upon heavily by science administrators making decisions about funding, by faculties making decisions about tenure cases, and so on. Open access, by removing the subscription barriers that splinter the research literature into inaccessible proprietary islands, raises the possibility of vast improvements in our ability to measure and manage scientific productivity.
6. Scalability. Peter Suber has pointed out that, because it reduces production, distribution, storage and access costs so dramatically, OA "accommodates growth on a gigantic scale and, best of all, supports more effective tools for searching, sorting, indexing, filtering, mining, and alerting --the tools for coping with information overload." Online distribution is necessary but not sufficient for scalability, because subscribers to paid-access journals do not have unlimited budgets even if they are enormous institutional libraries. For end users to keep pace with the explosive growth of available information, the cost of access has to be kept down to the cost of getting online.
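As a rough check on the Astrophysical Journal figures quoted under point 3 above, here is a small Python sketch that turns "75% of the papers account for 90% of the citations" into a per-paper comparison. The function name is my own, and the inputs are the rounded percentages from the text rather than the underlying data, so treat the result as illustrative only.

```python
from fractions import Fraction


def per_paper_citation_ratio(oa_paper_share, oa_citation_share):
    """Citations per OA paper divided by citations per non-OA paper."""
    oa_rate = oa_citation_share / oa_paper_share
    non_oa_rate = (1 - oa_citation_share) / (1 - oa_paper_share)
    return oa_rate / non_oa_rate


# Rounded figures from the text: 75% of papers are OA, drawing 90% of citations.
ratio = per_paper_citation_ratio(Fraction(75, 100), Fraction(90, 100))
advantage_percent = (ratio - 1) * 100
print(ratio, advantage_percent)
```

With these rounded inputs the OA papers are cited three times as often per paper, which is 200% more than the non-OA papers. The 250% figure quoted above presumably comes from the unrounded numbers or a slightly different definition of "advantage"; that is a guess on my part, not something the rounded percentages can settle.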
Tune in Next Time
In the second instalment, I will look at open access to raw experimental data, cooperation over competition as a research model and the ever-expanding role of the Web in science. In the meantime, if this has piqued anyone's interest in OA (and I hope it has!), here are my Simpy collections of open access and open science links.
One Last Thing
This is an immense topic, and anyone who knows anything much about it will certainly see things I've missed or got wrong. That's what the comments are for! Blogs are conversation tools, and I'd appreciate your feedback.
This work is licensed under a Creative Commons Attribution 3.0 License.
Posted by Bill Hooker at 03:29 AM