How reliable is the data that we publish?
Data is the new buzzword. But could our desire to publish the next great discovery be leading researchers down a path of academic fictions?
Our expectation is that the data behind many discoveries have gone through rigorous statistical analysis and peer review. However, in a small number of cases, this may not even be half the truth. Results may have been altered, and research fraud may have been committed, when the appeal of an exciting discovery looms. How do we make data "boring" again?
Data rules supreme
In the last ten years, major societal decisions, from government policy to scientific discovery, have become all but guided by data. Data rules supreme. It has become the exciting foreground of discovery; it always has been, but in some sense its value has never been higher. But how can we be sure that what we're accepting as truth actually *is* truth? How much of this data truth is, in fact, just straight-up data dishonesty?
How bad can this "data dishonesty" really be?
Dr. Stuart Ritchie, a psychologist at King's College London, recently highlighted four key areas of "pervasive rot" in the academic world: academic dishonesty, self-aggrandizing biases, researcher error, and the over-hyping of results. Yet there seems to be an element of denial within the scientific and academic communities. As a scientist, if you present data to me and I can infer a statistically significant interpretation from it, I'm convinced. Of course I would be. Data is a collective reality; a measurement of perceivable reality. Yet what if the data I see has been cooked in some way, massaged or manipulated to tell another truth? How could I tell?
When asked, only a small number of academics will admit to committing fraud when presenting their results: fewer than 2 percent (1.97%; Fanelli, 2009). As many as 34% admit to "questionable research practices". However, when asked whether they know of colleagues who have willingly committed fraud, 14% respond with "yes". Perhaps fraud is a strong word, but academic integrity is important. That same study went on to conclude that these results are likely to represent "a conservative estimate of the true prevalence of scientific misconduct". Wow. When did academics become so fraudulent? And why?
Nowhere has this denial of the truth been more evident than during the Covid-19 pandemic. There has been a full litany of bad scientific practices: denial of the data, retractions from top journals, a lack of data collaboration; the list is long and worrying. This ecosystem makes it almost impossible to derive any meaningful truths. Is the data we base our decisions on any more reliable than pub gossip and Facebook disinformation? And therein lies the power of data: we, as societies, rely on good science to direct us towards a better society. Although "better" is subjective, there are objective truths within science, but our own sentience, our pride, our presuppositions, and our subjective biases can and do damage us as a collective.
Academia is often placed on a sacred pedestal, supported by the knowledge that the data is solid and the peer review is as reliable as it is uncompromising. If your data are junk, or your analyses are junk, your research will not get published. So when did things go wrong? The "when" is always hazy. Dr. Ritchie theorizes that the problems crept in due to what he calls perverse incentives: the huge pressure to publish papers, the pressure to bring in research grants, and, perhaps, the researcher's need to confirm the reality that lives only within their own ego. There's perhaps an exaggerated notion that every physicist wants to be the next rockstar Einstein, and every biologist wants to be as revolutionary as Darwin. Academics want to win Nobel prizes; they want to be right; they want to know that their theories are correct. This is as natural for an academic as grinning is to a hyena.
How can we ensure data is reliable?
Much work in the area of data reliability is already being done, but there is much more we can do.
One factor could be quality over quantity. For example, the expectation for students working on their Ph.D.s should focus more on the quality of the papers they produce (it is now expected that by the time you finish your Ph.D. you will have upwards of 10 citable publications). The quality of the analysis is arguably suffering. If we swap the necessity for Ph.D. students to rack up citations and instead focus on the need for more rigorous standards and statistical analyses in their work, then their work will improve, and better research will ultimately be conducted.
Another is the openness of our data. And it is our data. We all pay, through our taxes, a significant contribution to the funding of research. And regardless of whether we engage with this research or not, we all deserve the very best and most rigorous standards to be in place. It is part of a democratic system, and science should therefore be governed by the same principles of fairness that have come to represent the very societies we have forged. Data needs to be open, at least at the point where a researcher publishes their interpretations of it, so that others can interrogate it. The overall goal is to make data as democratic and accessible as possible. The openness of research is making - and has made - great strides in recent years. With the publishing tools we provide here at asencis, as well as other great platforms for sharing data and research, such as Figshare, inroads are being made. Academics, researchers, data scientists, and analysts can't be expected to be open if they don't have the tools available to make data as open as possible.
There is also some argument in the scientific community as to how much data should be shared, and how quickly (just reference the Gallant vs. De Domenico Twitter discussion to see how fraught the opinions can get). Here at asencis, we would fall on the side of De Domenico every time. De Domenico argued that Gallant's paper had given him a series of ideas that he wanted to test but couldn't, because he needed Gallant's data. Gallant argued that since his lab competed for the money to collect the data, they should be entitled to work on it first. Some may see this as a fair argument to make. However, it does not make for good research. If the data used to publish a significant paper is not freely accessible to other researchers, regardless of whether they are a "competing lab" or a direct competitor in the field, then the data cannot be assumed to have any reliability. To conduct better research, academia needs to embrace competition rather than fixate on being scooped: to accept that researchers may be "gazumped", and that being "gazumped" is, fundamentally, a good thing.
Data also needs to be intuitively findable. Adhering to established metadata schemas that describe data makes research findable. Here at asencis we demarcate datasets into academic branches, such as "Chemistry", as well as specific sub-domains, such as "Climatology". This seems logical and intuitive. We go a step further by adding natural language processing and advanced database search techniques to get the right datasets in front of the right people. We believe this can help match up the datasets that should be interrogated with the researchers who can interrogate them.
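As a rough illustration of why structured metadata helps, here is a minimal sketch in Python, using hypothetical field names rather than the actual asencis schema or search stack, showing how a dataset tagged with a branch, sub-domain, and keywords can be matched to a researcher's query with nothing more than simple term scoring:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    # Hypothetical metadata fields; real schemas (e.g. DataCite) carry many more.
    title: str
    branch: str        # e.g. "Chemistry"
    sub_domain: str    # e.g. "Climatology"
    keywords: list[str] = field(default_factory=list)

def score(record: DatasetRecord, query: str) -> int:
    """Naive relevance score: count how many query terms appear in the metadata."""
    terms = query.lower().split()
    haystack = " ".join(
        [record.title, record.branch, record.sub_domain, *record.keywords]
    ).lower()
    return sum(term in haystack for term in terms)

def search(records: list[DatasetRecord], query: str) -> list[DatasetRecord]:
    """Return records with at least one matching term, best matches first."""
    scored = [(score(r, query), r) for r in records]
    return [r for s, r in sorted(scored, key=lambda pair: -pair[0]) if s > 0]

if __name__ == "__main__":
    catalogue = [
        DatasetRecord("Global sea-surface temperatures, 1880-2020",
                      "Earth Science", "Climatology", ["temperature", "ocean"]),
        DatasetRecord("Spectral library of organic compounds",
                      "Chemistry", "Spectroscopy", ["infrared", "spectra"]),
    ]
    for hit in search(catalogue, "ocean temperature records"):
        print(hit.title)
```

A production system would, of course, sit a proper search engine and NLP pipeline behind this, but the point stands: the richer and more consistent the metadata, the easier it is to put the right dataset in front of the right researcher.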
Another tack would be for institutions and organizations involved in research to hire researchers who are committed to contributing good, reliable data to the world, and to opening that data up as soon as possible. In 2015, the Center for Open Science's Transparency and Openness Promotion (TOP) Committee named a dissonance within academia: "transparency, openness, and reproducibility are readily recognized as vital features of science…[and yet we have an] academic reward system that does not sufficiently incentivize open practices." So, adding to the list, we need to incentivize open practices above all.
"Datatopia" through humility
Above all else, to improve the openness of data, researchers need to build a network of intellectual humility. Progress, and career progression, is often defined by large, headline-grabbing discoveries; career-progressing headlines can define the next round of Nobel prizes. However, if we can implement a system that reflects the very best human virtue, humility, we can ultimately create a more open and reliable data ecosystem in which science is incremental, and only transformative when necessary: when the data says it should be. That way, by removing human bias and human ego, we can begin to ensure that what we are accepting as truth is more reliable than ever before.