Monday, November 27, 2006
The Future of Science is Open, Part 2: Open Science
In Part 1 of this essay, I gave an outline of the scholarly publishing practice/philosophy known as Open Access; here I want to examine ways in which the central concept of OA, the "open" part, is being expanded to encompass all of science.
Though I am adopting the term "Open Science", there are a number of similar and related terms and no clear overriding consensus as to which should prevail. This year's iCommons Summit saw the conception and initiation of the Rio Framework for Open Science. Hosted on the iCommons wiki, the Framework is presently an outline consisting mainly of a useful collection of links and does not offer a formal definition. In a 2003 essay, Stephen Maurer noted that:
Open science is variously defined, but tends to connote (a) full, frank, and timely publication of results, (b) absence of intellectual property restrictions, and (c) radically increased pre- and post-publication transparency of data, activities, and deliberations within research groups.
Jamais Cascio and WorldChanging have been talking about open source science, making a direct analogy to open source software, for some time. Chemists Without Borders follow Cascio's definition in their position statement:
Research already in progress is opened up to allow labs anywhere in the world to contribute experiments. The deeply networked nature of modern laboratories, and the brief down-time that all labs have between projects, make this concept quite feasible. Moreover, such distributed-collaborative research spreads new ideas and discoveries even faster, ultimately accelerating the scientific process.
Richard Jefferson, founder and CEO of CAMBIA, uses the term BiOS (either "Biological Innovation for Open Society" or "Biological Open Source"), and the Intentional Biology group at The Molecular Biosciences Institute talks about Open Source Biology. Peter Murray-Rust has recently put together a Wikipedia page on Open Data; he writes:
Open Data is a philosophy and practice requiring that certain data are freely available to everyone, without restrictions from copyright, patents or other mechanisms of control.
Though Science Commons, which grew out of Creative Commons, doesn't use the term "Open Data", they have a "data project" and the concept is clearly central to their efforts. Best and most open of all (in my opinion), Jean-Claude Bradley has coined the term Open Notebook Science, by which he means:
...there [exists] a URL to a laboratory notebook (like this) that is freely available and indexed on common search engines. It does not necessarily have to look like a paper notebook but it is essential that all of the information available to the researchers to make their conclusions is equally available to the rest of the world. Basically, no insider information.
For what I am calling Open Science to work, there are (I think) at least two further requirements: open standards, and open licensing.
In his introduction to the chemistry-focused Blue Obelisk (group? movement?), Peter Murray-Rust refers to Open Standards as "visible community mechanisms which act as agreed protocols for communicating information". What he is talking about is metadata and a semantic web for science. To see this idea in action, consider the following citation:
Hooker CW, Harrich D. The first strand transfer reaction of HIV-1 reverse transcription is more efficient in infected cells than in cell-free natural endogenous reverse transcription reactions. Journal of Clinical Virology vol 26 pp.229-38 (2003)
You can read that, but a computer cannot do anything really useful with the text string as given: it has no idea which part of the string means me and which means Dave, where the title begins and ends, which numbers are page numbers and which are a date, and so on. Now remember that PubMed, the database from which I got it, contains millions of such citations (and abstracts, and links between papers that cite each other, and so on). Stored as text strings, they would be impossibly clumsy, but with the addition of a little simple metadata:
Author/s: Hooker CW, Harrich D.
Title: The first strand transfer reaction of HIV-1 reverse transcription is more efficient in infected cells than in cell-free natural endogenous reverse transcription reactions.
Journal: Journal of Clinical Virology
Volume: 26
Pages: 229-38
Year: 2003
the citation is broken down into meaningful fields, each of which can be searched or otherwise manipulated separately. The computer can now treat each string after "Author/s:" as a series of substrings (author names) separated by commas and ended with a period, the numbers after "Pages:" as a numerical range, and so on and on -- which means you can ask the database useful questions, like "show me all the papers written by Hooker, CW between the years 2000 and 2006 and published in J Virol". There you have (a very simple example of) the two pillars of a semantic web: metadata and standards. Examples abound: the Proteomics Standards Initiative, MIAPE, MIAME, Flow Cytometry Standards, SBML, CML, another CML, the Open Microscopy Environment and dozens of others. Metadata and associated standards are going to be increasingly necessary to scientific communication and analysis as more and more of it takes place online and as datasets grow ever larger and more complex. Science Commons makes the point using the tumor suppressor TP53:
There are 39,136 papers in PubMed on P53. There are almost 9,000 gene sequences [...] 3,800 protein sequences [and] 68,000 data sets available. This is just too much for any one human brain to comprehend.
Quite apart from lack of brainspace, there are answers in those datasets to questions that their creators never thought to ask. In the same way that Open Access accelerates the research cycle and facilitates collaboration, so too does Open Data -- and Open Standards is the infrastructure that makes it possible.
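To make the citation example above concrete, here is a minimal sketch in Python of what "breaking a citation into fields" buys you. It is an illustration only, not how PubMed actually stores or queries records: the `Citation` class, the `parse_record` and `papers_by` functions, and the "Volume:" and "Year:" labels are all names assumed for this sketch; the values come from the citation quoted earlier.

```python
from dataclasses import dataclass

# A labeled record in the style of the example above.
# The "Volume:" and "Year:" labels are assumptions made for this sketch.
RECORD = """\
Author/s: Hooker CW, Harrich D.
Title: The first strand transfer reaction of HIV-1 reverse transcription is more efficient in infected cells than in cell-free natural endogenous reverse transcription reactions.
Journal: Journal of Clinical Virology
Volume: 26
Pages: 229-38
Year: 2003
"""


@dataclass
class Citation:
    authors: list[str]
    title: str
    journal: str
    volume: int
    first_page: int
    last_page: int
    year: int


def parse_record(text: str) -> Citation:
    """Split 'Label: value' lines into typed fields (hypothetical format)."""
    fields = {}
    for line in text.strip().splitlines():
        # Only the first colon separates the label from the value.
        label, _, value = line.partition(":")
        fields[label.strip().lower()] = value.strip()

    # Authors: comma-separated substrings, ended with a period.
    authors = [a.strip() for a in fields["author/s"].rstrip(".").split(",")]

    # Pages: a numerical range; expand the abbreviated end page (229-38 -> 238).
    first, last = fields["pages"].split("-")
    if len(last) < len(first):
        last = first[: len(first) - len(last)] + last

    return Citation(
        authors=authors,
        title=fields["title"],
        journal=fields["journal"],
        volume=int(fields["volume"]),
        first_page=int(first),
        last_page=int(last),
        year=int(fields["year"]),
    )


def papers_by(records, author, start_year, end_year, journal=None):
    """Answer: 'papers by <author> between two years, optionally in <journal>'."""
    return [
        r
        for r in records
        if author in r.authors
        and start_year <= r.year <= end_year
        and (journal is None or r.journal == journal)
    ]


citation = parse_record(RECORD)
matches = papers_by([citation], "Hooker CW", 2000, 2006)
```

None of this is possible with the bare text string: the query in `papers_by` works only because every record agrees on what "author", "year" and "journal" mean, which is exactly the job of a standard.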
In a similar vein, Open Licensing also provides a kind of infrastructure -- in this case, for dealing with intellectual property issues. It's fine to simply put your product on the web and let the world do as it will, but many people prefer (or, depending on where they work, are legally required) to retain some control over what others do with their work. In particular, if you are concerned with openness you may want to ensure that the original and all derivative works remain part of the commons (e.g. copyleft rather than copyright). That means reserving at least some rights, which is where licensing comes in.
As with Open Access, the original model comes from software licenses. The Free Software Foundation publishes three licenses designed to provide and protect end-user freedoms and maintains a list of other software licenses classified according to compatibility with FSF licenses. The Open Source Initiative also maintains a list of approved licenses which meet their (slightly less restrictive) standards for Open Source. If you are looking for a publishing license (for audio, video, images, text and/or software), Creative Commons is the place to go: they offer six main licenses which provide varying degrees of freedom to end-users, a think-before-you-license guide and a handy tool for choosing which license suits you best. They also offer a number of more specialized licenses and the FSF GPL and LGPL software licenses. Every CC license is provided in three formats: legal code that will stand up in court, a plain-language summary and a machine-readable version (built-in Open Standards!) that CC-savvy search engines can use to filter results by CC end-user freedoms. As with the copyleft protections in the GPL, CC offers "share-alike" licenses that maintain end-user freedoms throughout derivative works. The example that impresses me most strongly with the power of CC licenses is that Public Library of Science journals, collectively the flagship of Open Access publishing, are all released under a CC attribution license.
If you find yourself dealing with someone else's license -- for instance, a publishing company -- and you want to provide Open Access, you can use the SPARC author addendum: simply attach a completed copy of the addendum to the publishing agreement and bring the publisher's attention to it; more than 90% of journal editors will comply.
You can also get an author addendum from Science Commons, who are working with SPARC and will soon offer plain-language and machine-readable versions like those that accompany CC licenses, as well as a web-based tool for choosing and preparing the appropriate addendum.
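The point about machine-readable licenses is easier to see with a toy example. The sketch below is not the format Creative Commons actually publishes; it simply assumes each work carries the URL of its CC license and reads the condition codes (by, sa, nc, nd) out of that URL. The `works` list and the function names are hypothetical, invented for this illustration.

```python
from urllib.parse import urlparse


def license_terms(url: str) -> set[str]:
    """Return the CC condition codes in a license URL, e.g. {'by', 'sa'}.

    Assumes the path has the shape /licenses/<codes>/<version>/.
    """
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) < 2 or parts[0] != "licenses":
        raise ValueError(f"not a recognisable CC license URL: {url}")
    return set(parts[1].split("-"))


def allows_derivatives(url: str) -> bool:
    # "nd" (no derivatives) forbids remixing.
    return "nd" not in license_terms(url)


def allows_commercial_use(url: str) -> bool:
    # "nc" (non-commercial) forbids commercial reuse.
    return "nc" not in license_terms(url)


def requires_share_alike(url: str) -> bool:
    # "sa" (share-alike) keeps derivative works under the same terms.
    return "sa" in license_terms(url)


# Hypothetical catalogue of works, each tagged with its license URL.
works = [
    {"title": "Article A", "license": "https://creativecommons.org/licenses/by/3.0/"},
    {"title": "Photo B", "license": "https://creativecommons.org/licenses/by-nc-nd/3.0/"},
    {"title": "Dataset C", "license": "https://creativecommons.org/licenses/by-sa/3.0/"},
]

# The kind of filter a license-aware search engine could apply.
remixable = [w["title"] for w in works if allows_derivatives(w["license"])]
```

A search engine that understands the license tag can answer "show me only what I am allowed to build on" without a human reading any legal code, which is the sense in which the machine-readable version is built-in Open Standards.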
That covers copyright-based licensing, pretty much; but patenting is a whole different headache for Open practices. Copyright inheres automatically (though there is a registry) in "original works of authorship" as soon as they are created, but patents are granted for inventions by way of a drawn-out administrative process and on a more complex basis than "who made this?". There are also important differences between patent laws in different countries. The primary test-bed for open licensing approaches has been biotechnology and especially genomics, with particular emphasis on specific gene sequence data and databases. The concern is that too much patent protection, combined with patents of too broad a scope, will stifle research and in particular exacerbate the difficulties faced by poorer nations in trying to establish research and development infrastructure.
One possible solution, at least for database information, is offered by the HapMap Project's "click-wrap" license. Rather than assigning property rights, this is an end-user agreement that specifically disallows the patenting of genetic information from the database, unless such claims do not restrict others' free access to the database. This license has since been abandoned by the HapMap project, however, in order to allow integration of HapMap data into other public databases such as GenBank.
Other solutions focus on assigning property rights in such a way as to permit Open practices. Yochai Benkler suggests what he calls publicly minded licensing for universities and academic institutions. This form of licensing would consist primarily of an "open research license", whereby the institutions would reserve the right "to use and nonexclusively sublicense its technology for research and education", and would require a reciprocal license to research such that any (sub)licensee must "grant back a nonexclusive license to the university to use and sublicense all technology that the licensee develops based on university technology, again, for research and education only". There is a model for this sort of scheme in PIPRA, a collaboration among public sector agricultural research institutions which employs licensing language that aims to protect humanitarian use. In a similar vein, Benkler also suggests a second variety of licence, a "developing country license", which would extend the open protections through development and manufacture to end-products such as drugs, so long as distribution was limited to developing countries. Noting that university revenues from government research grants and contracts are at least an order of magnitude greater than those derived from patents, Benkler points out that the loss of certain licensing revenue would be minor at most. The loss of the small possibility of a "gold-mine" patent would be more than compensated by gains in research efficiency and public perception of universities as public interest organizations rather than puppets of big business.
Science Commons has a more specific focus with its biological materials transfer project, which is aimed at retooling materials transfer agreements. These are the contracts under which research laboratories exchange the physical objects of research -- DNA, proteins, chemicals, whole organisms, and so on. There is no standard format, since even the NIH Office of Technology Transfer's Uniform Biological Materials Transfer Agreement (UBMTA), despite wide support, does not cover all eventualities and is frequently modified or replaced with institution-specific MTAs. I can tell you from experience that these things can be a nightmare. The one I remember most clearly came from a large pharmaceutical firm which shall remain nameless; they were willing to send us some of their antiretroviral in pure form, provided we signed over our firstborn children and their children unto the seventh generation. (I exaggerate, but you get the idea. In the end we crushed up pills supplied by friendly clinicians, and the damn drug did nothing in our assay anyway.) Science Commons' efforts in this field have yet to bear fruit (that I know of), but given the Science Commons/Creative Commons track record I have high hopes.
There is also a more fully-developed model available. The international nonprofit organization CAMBIA offers two BiOS licences designed to create and protect a "research commons" (the Plant Enabling Technology License and the Genetic Resource Technology License) and is currently drafting a third license for health-related technology. The essence of these licenses is a reciprocity agreement similar in concept to copyleft or "share-alike", such that
...licensees cannot appropriate the fundamental "kernel" of the technology and improvements exclusively for themselves. The base technology remains the property of whatever entity developed it, but improvements can be shared with others that support the development of a protected commons around the technology, and all those who agree to the same terms of sharing obtain access to improvements, and other information, such as regulatory and biosafety data, shared by others who have agreed.
To maintain legal access to the technology, in other words, you must agree not to prevent others who have agreed to the same terms from using the technology and any improvements in the development of different products.
In addition to the licenses, CAMBIA maintains BioForge, an open-source platform for research collaboration on which the licenses and other open practices can be, as it were, field-tested.
I think "Open Science" is the banner under which the various Open X clans might most profitably assemble. It is punchy, fairly self-explanatory and does not carry any of the potential confusion with related movements in software that might plague "Open Source Science". (Nor, for that matter, will it give rise to daft analogies about what exactly is science's "source code".) Moreover, it seems a natural counterpart to the established term Open Access, and is apparently the term of choice for Science Commons/iCommons, which puts the considerable weight of the Creative Commons behind it. My personal favourite (term and practice) is Open Notebook Science, but this seems better suited to being the name of the most open subset of Open Science practices since, as with Open Access, it is likely that a range of applications will co-exist and co-evolve.
A formal definition will have to wait for future conferences at which scientists and their allies can hammer out the Open Science equivalent of the BBB Declarations. For now, I think the Wikipedia Open Science stub has the right idea in propounding a sort of meta-definition: "a general term representing the application of various Open approaches... to scientific endeavour". Andrés Guadamuz González ventures "the application of open source licensing principles and clauses to protect and distribute the fruits of scientific research". In a recent paper (sorry, subscription only; see how useful OA is?) Ibanez et al. put it this way:
The Open Science movement advances the idea that the results of scientific research must be made available as public resource. Limiting access to scientific information hinders innovation, complicates validation, and wastes valuable socio-economic resources. Open Science is an effective way of overcoming the nearsightedness of the contemporary obsession with intellectual property. The practice of Open Science is based on three pillars: Open Access, Open Data, and Open Source.
It seems to me that Access and Data are crucial by definition; you could do Open Science which relied on proprietary software, provided you made the raw data and your publications openly accessible. It is, of course, more efficient to use software that is available to everyone without intellectual property or cost barriers. Similarly, open standards and open licensing might not be fundamental to the practice of Open Science, but both make possible such vast increases in efficiency that I would argue for their inclusion in any comprehensive definition or declaration.
In short, Open (Access + Data + Source + Standards + Licensing) = Open Science.
Once again, this is an enormous topic and I have given only a brief overview; if you spot anything I have missed or got wrong, please leave a comment. (I am a scientist, after all; I am thoroughly inured to being wrong in public.) This was supposed to be the second of two essays on the future of science, but I have run out of room and time so there will now be a third instalment. In that piece I will try to show what Open Science looks like now, in its infancy, and to sketch some of the directions in which it might grow.
Update: part 3 is here.
This work is licensed under a Creative Commons Attribution 3.0 License.
Posted by Bill Hooker at 03:13 AM