Open Data in Science: a NeuroMat op-­ed

by Kelly Rosa Braghetto*

With the increasing use of computation in several areas of science, data have assumed an increasingly important role in the process of scientific discovery. Computational analysis of data collected in digital format and the generation of data through computer simulation have figured prominently in the list of tools of scientific research nowadays.

Many of the scientific results that are reported today rest upon digital data collected or generated in scientific experiments. So it is quite natural to expect that these data are made publicly accessible, so that the results can be validated and reproduced. However, publicly available raw or derived data collected in scientific experiments are unfortunately not the rule but the exception. In many areas of science, collecting data in experiments is difficult and costly - in terms of time and money - and often poorly recognized by the scientific community. Many scientists fear that by publicly disclosing details of their experiments they might not receive proper credit and even lose a strategic advantage, that is, the exclusive use of such data.

A considerable part of scientific research conducted in the world is funded by public money. Therefore, sharing all products of this research with citizens who actually pay for it is often seen as a civic duty. For example, in the UK there are laws (the Freedom of Information Act – FOIA, and the Environmental Information Regulation - EIR) that guarantee to every citizen the right to access to information that is held by public local institutions - which includes research data from universities and other research institutions funded by public money. To comply with these laws, scientists are told to set, still in the design stage of their research projects, data management and sharing policies. They rely on technical support from the Digital Curation Centre (DCC), one of the most important centers of expertise in research-data management in the world.

In Brazil, Decree No. 7,724, from May 16, 2012, known as the Law on Access to Information, governs the access to information produced or held by governmental, federal organizations and agencies, including federal universities and funding agencies for research (eg, the Coordination for the Improvement of Higher Education Personnel – CAPES, and the National Council for Scientific and Technological Development - CNPq). This regulation covers information relating to scientific and technological development and research projects, with the exception of those whose secrecy is essential to the security of society and the State. An interesting aspect of the Brazilian Law on Access to Information is that it contains a component of "active transparency,” which states that "it is the duty of agencies and public organizations to promote, regardless of application, the disclosure on their websites of information that have collective or general interest that they produce or keep under their custody."

The Law on Access to Information is relatively new and has been used primarily for government transparency. It has not impacted significantly the scientific-data management and sharing policies in Brazil yet. Anyway, the discussion on the importance of Open Science is already spreading across the country and begins to show its first results. For example, the São Paulo Research Foundation (FAPESP) already requires that data from research projects that it funds to be publicly available, and requires computational tools developed in these projects to be available as free software. The Research, Innovation and Dissemination Center for Neuromatemathics (NeuroMat), funded by FAPESP, is an example of a project that supports Open Science. NeuroMat is developing a pioneering work in Brazil with the creation of an open database for neuroscience data (physiological measures and functional assessments), aiming to bolster progress in the understanding of brain function and treatment of neurological diseases.

The structure of NeuroMat's open database, as of June 2014.

The provision of open-access scientific databases involves technical and legal aspects. To guarantee that data can be understood and reused properly, one needs information on the provenance and structure of data in the database, that must also be made available. Data provenance describes how, when, where, why and by whom data were collected, while the structure describes how data are organized. In some cases, one also wants to be assured that scientists and institutions have received the recognition and attribution for databases they provide and protect their interests regarding the use of data. Legally, this involves defining who the holders of the copyright on databases are and writing licenses on "rights and duties" of those who use these databases.

Only recently licenses that were originally conceived for free software and content began to be adapted for use in databases. In the scientific community, licenses that are more widely used for sharing open databases are Creative Commons (CC) and Open Data Commons (ODC). These licenses allow, for example, the limiting of the use of data to those who credit authors or providers and also the establishing that data redistributions can only be made with the same or equivalent license as the original data used. All ODC licenses are specific to databases. ODC has two licenses that apply only to data structure and one that only applies to content. Early versions of the CC licenses only applied to database contents. However, the newest version, Creative Commons 4.0 - CC4.0, has specific elements on databases; its six licenses apply both to content and structure.

When the goal is to create open databases (not just public databases), one must be careful with choosing a license. According to the definition of the Open Knowledge Foundation, “open data is data that can be freely used, re-used and redistributed by anyone - subject only, at most, to the requirement to attribute and sharealike”, that means giving credit to authors and licensing the derivations under equivalent terms. Therefore, one should not, for example, associate a license that restricts using a database for commercial purposes or which only allows the redistribution of the same data features without derivations. All ODC licenses are compatible with the the definition of open data, but some CC are not. One should also avoid creating new specific licenses for certain institutions or projects; the use of standardized licenses such as CC and ODC reduces incompatibility issues. In addition, these licenses are drafted in accordance with international copyright law, which facilitates their use in many countries worldwide. The intent of this is to avoid having several small sets of scientific data publicly available but that cannot be combined into a single repository or analyzed jointly and uniformly by a computer system – which are evidently goals of scientific interest.

To eliminate legal barriers, facilitate sharing and maximize data reuse, projects that are dedicated to open data sharing (like VertNet, from the National Science Foundation, and Canadensys, from the Canada Foundation for Innovation) recommend that scientific datasets should be placed in the public domain by waiving all author's and provider rights. The CC0 – CC Public Domain Dedication, and the ODC Public Domain Dedication and License (PDDL) ODC are instruments that can be used for this purpose.

Lastly, it is worth noticing that copyright laws do not apply to any databases. In general, copyright applies to databases whose content is the result of some intellectual effort or whose compilation or organization consumes significant resources of time or money. For this reason, databases consisting of purely factual information and that do not require an original organizational structure cannot be protected under copyright or licensing. Anyway, even when there is uncertainty about the rights that fall over a database publicly available, it is always interesting to associate it with a license because it is a document that clearly describes the terms of use of the data.

Benefits of opening scientific data to society are undeniable. They foster science of a better quality and greater impact (by allowing the validation of peers and increasing scientific publication) and enable the generation of new knowledge through comparative studies, data mining, etc. - that can be done only when statistically relevant data volumes are available for analysis. It is hoped that a better understanding of the technical and legal issues involved in the provision of data in a public way may serve to encourage initiatives to open scientific data.

* Kelly Rosa Braghetto is a professor at the University of São Paulo Computer Science Department, a specialist on database modeling and an associate investigator at the Research, Innovation and Dissemination Center for Neuromathematics (NeuroMat). This op-ed will appear in Portuguese in the August edition of A Rede: Tecnologia para a Inclusão Digital.

This piece is part of NeuroMat's Newsletter #5. Read more here

Featuring this week:

Stay informed on our latest news!

Previous issues

Podcast A Matemática do Cérebro
Podcast A Matemática do Cérebro
NeuroMat Brachial Plexus Injury Initiative
Logo of the NeuroMat Brachial Plexus Injury Initiative
Neuroscience Experiments System
Logo of the Neuroscience Experiments System
NeuroMat Parkinson Network
Logo of the NeuroMat Parkinson Network
NeuroMat's scientific-dissemination blog
Logo of the NeuroMat's scientific-dissemination blog