Bubble Nebula (NASA, ESA, and the Hubble Heritage Team (STScI/AURA)) | asencis

When we talk about open data, what exactly are we talking about?

Data publishing is often seen as the last step in the research process. Here at asencis, we're aiming to make it the start of your workflow.

Publishing your data is often treated as the final step of your research workflow. You've analyzed your results and submitted the pre-print, so now it's time to publish your data? Perhaps not. Here at asencis, we believe that publishing your data, even under a temporary embargo, is an important step in discovering the qualitative interpretations of your data and in improving your research efficiency, collaboratively.

Written by Michael Roberts

· 24min read

What exactly is open data?

A seemingly simple question to answer. Open data is, well, data that is open. But what exactly do we mean by this? To use the official Open Knowledge Foundation definition:

"Open data and content can be freely used, modified, and shared by anyone for any purpose"

The full Open Definition gives precise details as to what this means. To summarize the most important aspects:

  • Availability and Access: the data must be available as a whole and at no more than a reasonable reproduction cost, preferably by downloading over the internet. The data must also be available in a convenient and modifiable form.
  • Re-use and Redistribution: the data must be provided under terms that permit re-use and redistribution including the intermixing with other datasets.
  • Universal Participation: everyone must be able to use, re-use and redistribute - there should be no discrimination against fields of endeavour or against persons or groups. For example, ‘non-commercial’ restrictions that would prevent ‘commercial’ use, or restrictions of use for certain purposes (e.g. only in education), are not allowed.
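As a rough illustration, the three criteria above can be written down as a simple checklist. This is only a sketch: the `DatasetTerms` class, its field names, and the `openness_problems` function are hypothetical names invented for this example, not part of the Open Definition or of any real tool.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DatasetTerms:
    """Hypothetical summary of how a dataset is offered (illustrative fields)."""
    available_as_whole: bool      # the full dataset can be obtained
    modifiable_format: bool       # convenient, modifiable form
    allows_reuse: bool
    allows_redistribution: bool   # including intermixing with other datasets
    non_commercial_only: bool = False
    permitted_purposes_only: Tuple[str, ...] = ()  # e.g. ("education",)


def openness_problems(terms: DatasetTerms) -> List[str]:
    """Return the criteria, as summarized above, that the terms fail to meet."""
    problems: List[str] = []
    if not (terms.available_as_whole and terms.modifiable_format):
        problems.append("availability and access")
    if not (terms.allows_reuse and terms.allows_redistribution):
        problems.append("re-use and redistribution")
    if terms.non_commercial_only or terms.permitted_purposes_only:
        problems.append("universal participation")
    return problems


open_terms = DatasetTerms(True, True, True, True)
nc_terms = DatasetTerms(True, True, True, True, non_commercial_only=True)

print(openness_problems(open_terms))  # no criteria are flagged
print(openness_problems(nc_terms))    # universal participation is flagged
```

A dataset with a 'non-commercial' restriction fails the third criterion even if it is freely downloadable, which is the point the last bullet makes.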

There are also the FAIR data principles: Findable, Accessible, Interoperable, and Reusable. In our view, by far the most important of these is the "I": interoperability. Interoperability denotes the ability of diverse systems and organisations to work together (to "inter-operate").

Providing a clear definition of openness ensures that when you get two open datasets from two different sources, you will be able to combine them. It also ensures that we avoid our own "Tower of Babel": lots of datasets, but little or no ability to combine them into the larger systems where the real value lies.

However, one question stirs deep ethical debate within the open data publishing world: when should we publish our data openly? Should certain researchers, or research groups, have "first dibs"? Why should one organisation pay for the acquisition of data, only for another organisation to be able to access it? Should we publish before we have had first dibs, or after? The answers to these questions are not so simple.

Publish before or publish after?

In the world of data publishing, the consensus is moving firmly towards the belief that data should be published openly. This is the cornerstone of the open data model. Ethically, there is little alternative to publishing your data. Doing so gives your colleagues the foundations they need to interrogate your research, and also to bolster and improve it.

However, there is no such consensus on when we should publish our data. Should we publish once we have had time to decipher exactly what it means, protecting as much intellectual or institutional property as possible, or should we throw it to the wolves like chunks of meat and hope for the best?

This is a major point of contention within the world of open research, open science, and open data. Many open data purists believe that any data, regardless of its state, should be published as soon as possible: before any analysis has been done, and before we've even had a chance to understand what it could be telling us. Others believe that the data represents something more than just research, such that the institutions funding a major project or experiment should have first dibs on understanding the data, as a right. There may also be legal constraints on publishing the data, privacy issues, or even funder constraints. All of this makes publishing data openly a lot tougher to navigate.

Most data repositories will allow you to place a temporary embargo on the publishing of your dataset. During this embargo period the description of the dataset is published, while the data itself is accessible only to approved collaborators or cross-institutional teams. However, what is the argument for publishing a dataset's record up front, before the data itself is released? Are we purely publishing our data to satisfy our own moralistic stance? To appear holier-than-thou, when in reality we're merely minting a new DOI to add to our portfolio of data?
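To make the embargo mechanics described above concrete, here is a minimal Python sketch of the access rule. It does not reflect how any particular repository implements embargoes: the `EmbargoedDataset` class, its fields, and both functions are hypothetical, and we assume the embargo lifts on the `embargo_until` date itself.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional, Set


@dataclass
class EmbargoedDataset:
    """Hypothetical repository record; not any real repository's data model."""
    description: str
    embargo_until: Optional[date] = None  # None means there is no embargo
    approved_collaborators: Set[str] = field(default_factory=set)


def can_read_description(dataset: EmbargoedDataset) -> bool:
    """The description is published even while the data is embargoed."""
    return True


def can_read_data(dataset: EmbargoedDataset, user_id: str, today: date) -> bool:
    """During the embargo, only approved collaborators may access the data."""
    # Assumption: the data becomes open on the embargo_until date itself.
    embargo_active = (
        dataset.embargo_until is not None and today < dataset.embargo_until
    )
    if not embargo_active:
        return True
    return user_id in dataset.approved_collaborators


survey = EmbargoedDataset(
    description="Example survey data",
    embargo_until=date(2030, 1, 1),
    approved_collaborators={"collaborator-a"},
)
```

With this record, `can_read_data(survey, "collaborator-a", date(2029, 6, 1))` is true, an unapproved user is refused on the same date, and everyone gains access once the embargo date is reached. The description stays visible throughout.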

Concrete examples of when open data has worked

So, hopefully, at this point we agree that it is necessary to publish your data. Here at asencis, we're agnostic about when that data becomes open, as long as it becomes open within a reasonable time frame.

But does publishing your data actually make a real-world difference to real-world research?

Take Purvesh Khatri's active mining of open clinical datasets. This mining of data led to the discovery of a gene-expression pattern that corresponds to active tuberculosis (TB). The pattern suggested a possible diagnostic method for TB that does not require the collection of sputum samples. Purvesh didn't need to run clinical trials himself and, better still, he didn't need to have his own laboratory. He was able to look at what was available publicly and openly, and make some really astonishing discoveries.

Another example of open data making a real-world difference is the Alzheimer's Disease Neuroimaging Initiative (ADNI). ADNI aimed to validate a specific model of how different biomarkers change during the development of Alzheimer's disease (AD). The initiative agreed to assemble common datasets and work collaboratively, making every finding public, and has made massive strides in being able to diagnose and, most importantly, to predict the onset of AD.

In the world of astronomy, both the Sloan Digital Sky Survey (SDSS) and the Hubble Space Telescope (HST) have made large swathes of the data in their repositories open to the public. At the time of writing, there have been over 5600 published papers based on the SDSS open data. Furthermore, a team at Oxford University, led by Prof. Chris Lintott, used the SDSS to provide the data needed to drive forward the Galaxy Zoo project, one of the first citizen science projects. Millions of people, including primary-school and middle-school students, are using the data to classify galaxies for astronomers, allowing the public to participate in several key scientific discoveries.

So, when should we publish?

There are, of course, many successful examples of both approaches: publishing as soon as the data is available, and publishing long after intellectual property dibs have run out. The key, in our mind, to open data publishing is that the data is published and open for all, with meticulous metadata that describes the experimental conditions as accurately as possible, and with the tools and resources necessary for understanding the dataset. A trusted data repository should therefore be responsible for representing the interests of the researchers who create the data, for supporting their research workflow, and for making it easy for them to work collaboratively. It must ensure that the data are well organized and are not maliciously tampered with. A responsible data repository of this nature could reduce the probability of data misuse. Here at asencis, we're aiming to achieve this open data holy grail.