Interview
Open Science: Prying Open the Black Box
Research culture is in many respects broken. Making this secretive machinery transparent could help fix it.
A conversation with data scientist and open knowledge evangelist Paola Chiara Masuzzo on greedy scientific publishers, the extra cost of keeping data in the closet, and how Open Science can reconcile the fraught relationship between science and society.
Open Science tackles flaws in research culture. What is wrong with current science?
Paola Chiara Masuzzo: That’s a long list. Scientists have focused on quantity instead of quality for too long. That is because funders and universities have incentivized them to publish frequently in as many high-profile journals as possible. In order to do that, they have been trained to be secretive, because competition is high and resources are scarce in academia. That fuels their fear of losing the race to find the final piece of a research puzzle. Everybody wants to be the first to discover something, when in reality that isn’t important. All that counts is that science creates knowledge and fixes issues. And then there is the publishing industry we need to talk about.
What about it?
Masuzzo: One of the big problems is publication bias. If you have positive, “sexy” results, journals say: Sure, we will publish you. Results that do not confirm a hypothesis are far less likely to get published, though they are just as important. Then there is the language barrier. Most of the scientific literature is published in English only, on the assumption that every citizen in the world will be able to understand it, which is clearly not the case. Even when members of the general public do understand English, they are barred from accessing studies that are locked behind paywalls. Any non-scientist citizen would have to pay out of their own pocket to read public research that they have contributed to by paying taxes. They already paid for it. How does that make any sense?
So science is mostly produced for a scientific elite.
Masuzzo: Just look at the publications. There is almost never a plain-language summary or a graphical abstract in any of them. We have nurtured an elitist culture in science that implies: “If you are not part of this elite, then science is not for you.” Ironically, that same science is often not for other scientists either, as many results cannot be replicated. That is another flaw in this secretive black box. Not only can independent researchers not replicate study results, but the very researchers who carried out the studies in the first place cannot reproduce their own results.
And on top of that, you have a parasitic publishing industry.
Masuzzo: I use the following analogy to explain scientific publishing to friends, though the quote’s not mine: It’s like going to a restaurant and bringing your own ingredients. Then you cook the meal and serve it to yourself at the table. But before you can eat it, there is already someone waiting for you with the bill. And we are talking about a three- or four-fold inflated bill. As scientists, we get paid to do research but not to publish it. So we need to pay to publish, and in the process we often lose the copyright, too.
Open Access is the pledge to make journal papers readable for everyone. That solves the readers’ dilemma but does not stop scientists from being exploited, does it?
Masuzzo: No, but Open Access is only one of many Open Science components, and it covers more than just free access to research papers for the public. Sure, I can read Open Access papers without having to pay. But it’s also important that the papers have a proper license attached. That allows me, as a scientist, to reuse what I'm reading, for example for text mining, which is impossible when you waive your copyright and your papers have no open license attached. That is one of the great uses of Creative Commons, an effort to assign licenses to research output and properly credit the people who create that knowledge. That preserves intellectual property. One example is the CC Attribution-ShareAlike license: You can integrate existing work into yours, but you have to credit the original author and share your work under the same conditions. But there are many more.
Why are there so many different licenses?
Masuzzo: Because there is no one-size-fits-all approach to sharing scientific data. We want science to be as open as possible and as closed as need be. For example, you could have sensitive patient data or very specific personal data that, if shared, might even harm part of the population. You can’t share this data. So the CC licenses offer granular ways to keep such data and results closed, depending on what you can disclose. All while opening only part of your work to the public domain.
Do scientists who go the Open Access route still have to pay to get published?
Masuzzo: We have different flavours of Open Access routes, and some require hefty article processing charges. Some routes are free for readers, others are free for authors. Some are free for both. If you look into the Directory of Open Access Journals, you will see that the large majority of the journals, around 70 percent, do not require the payment of article processing charges. These are called diamond open access journals.
As good as that sounds: How many of those are actually glossy journals that scientists would want to be published in to raise their status?
Masuzzo: The most attractive journals in terms of impact factor and other journal metrics do indeed charge high article processing charges. Nature charges a fee of up to 9,500 euros per published article. That is outrageous. We in the Open Science community demand transparency so everybody can see breakdowns of these fees. Of course there are real costs to publishing, but studies estimate that they are on average between 100 and 200 US dollars per article. So how do you get from an estimated 150 dollars to more than 10,000? That is a complete mystery.
Considering these costs, one might expect to get more than just the polished paper, like the underlying data.
Masuzzo: Hiding data is part of this mysterious black box. We often cannot reproduce our own results, and we need to fix that. The minimum standard for scientific validity is reproducibility. If I take the same data from a study and the same analysis pipeline, then I should come to the same results. If I cannot do that in the first place, that says a lot about that piece of research. That’s why Open Data is important: to re-establish the trust of the public, which science has unfortunately lost in recent decades, and indeed trust among scientists.
How so?
Masuzzo: When you publish a study without the data, it is merely a PDF file. It is not necessarily objective; it is a story that shows how you came to a specific conclusion. That story advertises what you did, which is alright. But it is one specific interpretation of the data, and I need to see that data to validate whether what you have done is scientifically sound or not. If you decide to release your data, that opens up science to innovation. Machine-learning algorithms and people can ask the same datasets all sorts of questions and apply different analytical methods to answer more questions, tackling the same problem from many different angles. And if we don’t publish data, we lose money. A lot of it.
We lose money?
Masuzzo: Imagine how much time and how many resources you have to invest in collecting data, wrangling data, cleaning data, crafting the final data product, and then writing, submitting and polishing the paper. If you don't publish your data in a FAIR way, and I'm interested in it but it's nowhere to be found, then I'm going to have to go through the entire process again by myself. It may take me several months to arrive at a point where you have already been. That is such a waste. We have estimates that such redundant scientific work generates extra costs of around ten billion euros per year in Europe alone.
Does FAIR data mean equally accessible to anyone?
Masuzzo: It means more than that: FAIR means findable, accessible, interoperable and reusable. That doesn’t necessarily imply it is fully open data, but if you publish it, then there are ways to find it on the web and to see at least its metadata. It also means that datasets have unique digital identifiers that will help locate them online for a very long time, and clear license terms associated with them. That covers the findable, accessible and reusable parts of the FAIR data acronym. “Interoperable” means the data needs to be machine-readable, so algorithms can analyse large, openly available datasets. That creates knowledge much faster than humans ever could.
Giving people that kind of insight into that black box called science and its inner workings means science has to let down its guard. What’s in it for science and society?
Masuzzo: I believe that is one step toward closing the gap between scientists and the public. When people don't trust science, it's not because they think that scientists are not competent enough, but it's because they are not sure that what we do is in their best interest. Opening up helps. We need to tell it like it is. Science is messy. Science changes. It tries to self-correct, and sometimes it succeeds. Sometimes it fails. Let's not forget that it is people who do science. Science has for way too long given the impression that all results need to be shiny and sexy. That is far from reality, and it conveys the wrong picture. At the end of the day, being a scientist is just a profession like any other.
About the Interviewee
Italian-born Paola Chiara Masuzzo lives in Belgium, where she works as a data scientist. Masuzzo is an Open Science advocate and an independent researcher at IGDORE, where she advocates for and fosters the opening up of research and knowledge. She's a big fan of open data and of Seinfeld, the series. Together with Yasemin Türkyilmaz-van der Velden, Paola Masuzzo is one of the independent judges for the 2021 Open Research Award at Eurac Research. You can follow her on Twitter @pcmasuzzo.
The Winners of the Eurac Research Open Research Award 2021
The two main Open Research Awards go to:
The Group “Language Technologies (LT)” at the Institute of Applied Linguistics, whose purview stretches across disciplines, languages and communities and manifests itself in active participation in, and coordination of, initiatives designed to bring people together, invite them to join in the research and shape best practice. (read the interview)
Johannes Rainer, leader of the Team “Computational Metabolomics” at the Institute for Biomedicine, who has established successful tools and practices for open, collaborative, and reproducible research and whose engagement in a community approach to problem solving is influencing the general attitude of data scientists at the Institute and beyond, in the vast R and Bioconductor communities. (read the interview)
The two Awards for Early Career go to:
Alberto Scotti, Institute for Alpine Environment, whose research on aquatic insects as sentinels of environmental change has followed the ideal of an open research culture and the aim of sharing every research output. (read the interview)
Giulio Genova, Institute for Alpine Environment, and Mattia Rossi, Institute for Earth Observation, who have collaboratively developed open source tools that help and enable not only researchers but also users with minimal programming skills to access and analyze meteorological and environmental data easily and efficiently. (read the article)