If you think fact-checking is hard, which it is, then what would you say about verifying scientific claims, on COVID-19 no less? Hint: it’s also hard — different in some ways, similar in some others.
Fact or Fiction: Verifying Scientific Claims is the title of a research paper published on the preprint server arXiv by a team of researchers from the Allen Institute for Artificial Intelligence (AI2), with data and code available on GitHub. ZDNet connected with David Wadden, lead author of the paper and a visiting researcher at AI2, to discuss the rationale, details, and directions for this work.
What is scientific fact checking?
Although the authors of the paper refer to their work as scientific fact-checking, we believe it’s important to clarify semantics before going any further. Verifying scientific claims refers to the process of proving or disproving (with some degree of certainty) claims made in scientific research papers. It does not refer to a scientific method of doing “regular” fact-checking.
Fact-checking, as defined by the authors, is a task in which the veracity of an input claim is verified against a corpus of documents that support or refute the claim. A claim is defined as an atomic factual statement expressing a finding about one aspect of a scientific entity or process, which can be verified from a single source. This research area has seen increased attention, motivated by the proliferation of misinformation in political news, social media, and on the web.
In turn, interest in fact-checking has spurred the creation of many datasets across different domains to support research and development of automated fact-checking systems. Yet, it seems like up to this point no such dataset exists to facilitate research on another important domain for fact-checking – scientific literature.
The ability to verify claims about scientific concepts, especially those related to biomedicine, is an important application area for fact-checking. Furthermore, this line of research also offers a unique opportunity to explore the capabilities of modern neural models, since successfully verifying most scientific claims requires expert background knowledge, complex language understanding, and reasoning capability.
The AI2 researchers introduce the task of scientific fact-checking. To facilitate research on this task, they constructed SCIFACT, a dataset of 1,409 scientific claims fact-checked against a corpus of 5,183 abstracts that support or refute each claim, and annotated with rationales justifying each support / refute decision.
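To make the shape of such a dataset concrete, here is a minimal sketch of how one annotated claim could be represented. The class and field names are hypothetical and chosen for illustration, and the claim text is an invented placeholder; the actual SCIFACT file format may differ.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Label(Enum):
    SUPPORTS = "SUPPORTS"
    REFUTES = "REFUTES"


@dataclass
class Evidence:
    abstract_id: int                 # hypothetical id of an abstract in the corpus
    label: Label                     # does the abstract support or refute the claim?
    rationale_sentences: List[int]   # indices of the sentences justifying the label


@dataclass
class Claim:
    claim_id: int
    text: str
    evidence: List[Evidence] = field(default_factory=list)


# Invented placeholder record, not taken from the real dataset.
claim = Claim(
    claim_id=1,
    text="Drug X reduces symptom Y in adults.",
    evidence=[Evidence(abstract_id=42, label=Label.SUPPORTS, rationale_sentences=[3, 4])],
)
print(claim.evidence[0].label.value, claim.evidence[0].rationale_sentences)
```

Keeping the rationale as sentence indices rather than copied text ties every decision back to a specific spot in the abstract.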
To curate this dataset, the team used a novel annotation protocol that takes advantage of a plentiful source of naturally occurring claims in the scientific literature: citation sentences, or “citances”.
Why, and how, does one do scientific fact checking?
Wadden, a graduate student at the University of Washington with a background in physics, computational biology, and natural language processing (NLP), shared an interesting story about what motivated him to start this work. In addition to the well-known issue of navigating huge bodies of scientific knowledge, personal experience played its part too.
Wadden was briefly considering a career as an opera singer, when he had a vocal injury. He visited a number of doctors for consultations, and received a number of recommendations for potential treatments. Although they were all good doctors, Wadden observed none of them was able to provide data such as the percentage of patients for which the approach works.
Wadden’s situation was not dramatic, but he could not help but think about what would happen if it was. He felt the information he was given was too incomplete for him to make informed decisions, and he believed this was because finding that information is not easy for doctors.
The work uses a dataset specifically aimed at fact-checking COVID-19-related research. Wadden explained that the team set out to do this work in October 2019, before COVID-19 was a thing. However, they soon realized what was going on, and decided to make COVID-19 their focus.
Besides the SCIFACT dataset, the research also features the SCIFACT task, and the VERISCI baseline model. In a nutshell, they can be summarized as creating a dataset by manually annotating scientific papers and generating claims, evaluating claims, and creating a baseline AI language model for claim evaluation.
The annotation process, described in detail in the paper, is both a necessity, and a limiting factor. It is a necessity because it takes expert knowledge to be able to process citations, ask the right questions, and find the right answers. It is a limiting factor because relying on manual labor makes the process hard to scale, and it introduces bias.
Can there be bias in science?
Today NLP is largely powered by machine learning. The SCIFACT team developed VERISCI based on BERT, Google’s deep-learning language model. Machine learning algorithms need training data, and training data need processing and annotation by humans. This is a labor-intensive task. Relying on people to process large datasets means the process is slow and expensive, and results can be partial.
Large annotated datasets for NLP, and specifically for fact-checking, do exist, but scientific fact-checking is special. When dealing with common sense reasoning, Mechanical Turk workers are typically asked to annotate datasets. In scientific work, however, expert knowledge is needed to be able to understand, evaluate, and process claims contained in research papers.
The SCIFACT team hired Biology undergrad and grad students for this job. Wadden is fully aware of the limitations this poses to scaling the approach up, and is considering crowdsourcing, hiring medical professionals via a recruitment platform, or assigning many Mechanical Turk workers to annotate the same work, and then averaging their answers, knowing each one will be imperfect.
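The idea of combining many imperfect answers can be sketched in a few lines. The example below uses a simple majority vote, which is one common way to "average" categorical labels; it is an assumption for illustration, not a method the team has confirmed. The function name, the label strings, and the agreement threshold are all hypothetical.

```python
from collections import Counter


def aggregate_labels(votes, min_agreement=0.5):
    """Return the majority label, or None if agreement is too low.

    A label wins only if strictly more than `min_agreement` of the
    annotators chose it; otherwise the item is left undecided.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) > min_agreement else None


# Five hypothetical crowd workers label the same claim-abstract pair.
votes = ["SUPPORTS", "SUPPORTS", "REFUTES", "SUPPORTS", "NOT_ENOUGH_INFO"]
print(aggregate_labels(votes))
```

Returning None on low agreement, rather than forcing a winner, lets disputed items be routed to an expert instead of entering the dataset with a shaky label.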
Bias can be introduced in all moving parts of the process: what papers are picked, what claims are checked for each paper, what citations are checked for each claim, and how each citation is ranked. In other words: if research X supports claim A, while research Y contradicts it, what are we to believe? Not to mention, if research Y is not in the dataset, we’ll never know about its findings.
In COVID-19 times, as many people have turned armchair epidemiologists, this is something to keep in mind: Science, and data science, are not always straightforward processes that produce definitive, undisputed results. Wadden, for one, is very aware of the limitations of this research. Although the team has tried to mitigate those limitations, Wadden acknowledges this is just a first step in a long and winding road.
One way the SCIFACT team tried to address bias in selecting claims is that they extracted them from citations: They only considered claims where a paper was cited. Furthermore, they applied a series of techniques to get as high quality results as possible.
The paper selection process is driven by an initial body of seed papers: citations that reference those papers are examined. Only papers that have been cited at least 10 times can be part of the seed set, in an effort to select the most important ones. A technique called citation intent classification is used. The technique tries to identify the reason a paper is cited. Only citations referring to findings were processed.
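The two filters described above can be sketched as follows. This is a minimal illustration under stated assumptions: the class names, the `intent` field, and its values are hypothetical, and in practice the intent would come from a trained classifier rather than being stored on the record. Only the threshold of 10 citations comes from the article.

```python
from dataclasses import dataclass

MIN_CITATIONS = 10  # threshold mentioned in the article


@dataclass
class Paper:
    title: str
    citation_count: int


@dataclass
class Citance:
    text: str
    cited_title: str
    intent: str  # hypothetical classifier output: "finding", "background", "method"


def select_seed_papers(papers):
    """Keep only papers cited at least MIN_CITATIONS times."""
    return [p for p in papers if p.citation_count >= MIN_CITATIONS]


def select_citances(citances, seeds):
    """Keep citation sentences that cite a seed paper and refer to a finding."""
    seed_titles = {p.title for p in seeds}
    return [c for c in citances if c.cited_title in seed_titles and c.intent == "finding"]


# Invented placeholder data.
papers = [Paper("Paper A", 25), Paper("Paper B", 3)]
citances = [
    Citance("Prior work showed X lowers Y.", "Paper A", "finding"),
    Citance("We follow the protocol of Paper A.", "Paper A", "method"),
    Citance("Z was reported to raise W.", "Paper B", "finding"),
]
seeds = select_seed_papers(papers)
kept = select_citances(citances, seeds)
print([c.text for c in kept])
```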
Another important thing to note is that claims are evaluated based on the abstract of the paper they cite. This is done for simplicity, as the underlying assumption seems to be that if a finding is key to a paper, it will be mentioned in the paper’s abstract. It would be hard for a language model to evaluate a claim based on the entire text of a scientific paper.
Claims found in papers may have multiple citations. For example, the claim “The R0 of the novel coronavirus is 2.5” may cite several papers with supporting evidence. In those cases, each citation is processed independently, and a result as to whether it supports or refutes the claim, or a conclusive decision cannot be made, is obtained for each.
Wadden’s team used the SCIFACT dataset and annotation process to develop and train the VERISCI model. VERISCI is a pipeline of three components: abstract retrieval, which retrieves the abstracts with the highest similarity to the claim; rationale selection, which identifies rationales for each candidate abstract; and label prediction, which makes the final label prediction.
Given a claim and a corpus of papers, VERISCI must predict a set of evidence abstracts. For each of those abstracts, it must predict a label and a collection of rationale sentences. Although the annotations provided by the annotators may contain multiple separate rationales, the model must simply predict a single collection of rationale sentences; these sentences may come from multiple annotated rationales.
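The three-stage structure can be illustrated with a toy pipeline. This is a sketch under loud assumptions, not the VERISCI implementation: retrieval uses TF-IDF cosine similarity as one common similarity measure, and the rationale and label steps are crude word-overlap and negation heuristics standing in for the trained BERT-based models. All function names and the sample texts are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def retrieve_abstracts(claim, abstracts, k=1):
    """Step 1: indices of the k abstracts most similar to the claim."""
    vectors = tfidf_vectors(abstracts + [claim])
    claim_vec = vectors[-1]
    order = sorted(range(len(abstracts)), key=lambda i: cosine(claim_vec, vectors[i]), reverse=True)
    return order[:k]


def select_rationales(claim, abstract, min_overlap=4):
    """Step 2 (toy stand-in): sentences sharing enough words with the claim."""
    claim_terms = set(tokenize(claim))
    sentences = [s.strip() for s in abstract.split(".") if s.strip()]
    return [s for s in sentences if len(claim_terms & set(tokenize(s))) >= min_overlap]


def predict_label(rationales):
    """Step 3 (toy stand-in): a trained classifier would replace this heuristic."""
    if not rationales:
        return "NOT_ENOUGH_INFO"
    negated = any("not" in tokenize(r) or "no" in tokenize(r) for r in rationales)
    return "REFUTES" if negated else "SUPPORTS"


# Invented placeholder corpus and claim.
abstracts = [
    "Mice were given compound A. Compound A did not reduce tumor growth in mice.",
    "We surveyed sleep habits in students. Short sleep was linked to lower grades.",
]
claim = "Compound A reduces tumor growth in mice."

top = retrieve_abstracts(claim, abstracts, k=1)
rationales = select_rationales(claim, abstracts[top[0]])
label = predict_label(rationales)
print(top, rationales, label)
```

The point of the sketch is the data flow: each stage narrows the input for the next, so an error in retrieval cannot be recovered by the later stages.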
The team experimented to establish a performance baseline on SCIFACT using VERISCI, analyzed the performance of the three components of VERISCI, and demonstrated the importance of in-domain training data. Qualitative results on verifying claims about COVID-19 using VERISCI were promising.
For roughly half of the claim-abstract pairs, VERISCI correctly identifies whether an abstract supports or refutes a claim, and provides reasonable evidence to justify the decision. Given the difficulty of the task and limited in-domain training data, the team considers this a promising result, while leaving plenty of room for improvement.
Some exploratory experiments to fact-check claims concerning COVID-19 were also conducted. A medical student was tasked with writing 36 COVID-19-related claims. VERISCI was used to predict evidence abstracts. The same medical student annotator assigned a label to each claim-abstract pair.
For the majority of these COVID-related claims (23 out of 36), the rationales produced by VERISCI were deemed plausible by the annotator. The sample is very small; however, the team believes that VERISCI is able to successfully retrieve and classify evidence in many cases.
Complicated process, instructive work
There are a number of future directions for this work. Besides expanding the dataset and generating more annotations, adding support for partial evidence, modeling contextual information, and evidence synthesis are important areas for future research.
Expanding the system to include partial support is an interesting topic. Not all decisions can be clear-cut. A typical example is when we have a claim about drug X’s effectiveness. If a paper reports the effectiveness of the drug in mice, or in limited clinical testing on humans, this may offer inconclusive support for the claim.
Initial experiments showed a high degree of disagreement among expert annotators as to whether certain claims were fully, partially, or not at all supported by certain research findings. Sound familiar? In those gray area scenarios, the goal is to be able to better identify the situation. What the team wants to do is to edit the claim to reflect the inconclusiveness.
Modeling contextual information has to do with identifying implicit references. Initially, annotators were instructed to identify primary and supplemental rationale sentences for each rationale. Primary sentences are those that are needed to verify the claim, while supplemental sentences provide important context that is missing from primary sentences but still necessary to determine whether a claim is supported or refuted.
For example, if a claim mentions “experimental animals” and a rationale sentence mentions “test group”, whether they refer to the same thing is not always straightforward. Again, a high degree of disagreement was noted among human experts in such scenarios. Thus, supplemental rationale sentences were removed from the dataset, and the team continues to work with annotators on improving agreement.
Last but not least: Evidence synthesis basically means that not all evidence is created equal, and that should probably be reflected in the decision-making process somehow. To use an extreme example: currently, a pre-print that has not undergone peer review and a paper with 1000 citations are treated equally. They probably should not be.
An obvious thing to do here would be to use a sort of PageRank for research papers, i.e. an algorithm that does for research what Google does for the web – pick out the relevant stuff. Such algorithms already exist, for example for calculating impact factors. But then again, this is another gray area.
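For readers curious what a "PageRank for research papers" could look like, here is a minimal sketch of the standard PageRank power iteration applied to a tiny citation graph. It illustrates the general algorithm only; it is not something the SCIFACT team has built, and the paper names and the graph are invented.

```python
def pagerank(citations, damping=0.85, iterations=50):
    """Score papers by citation structure.

    `citations` maps each paper to the list of papers it cites.
    """
    papers = set(citations) | {p for cited in citations.values() for p in cited}
    n = len(papers)
    rank = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in papers}
        for p in papers:
            cited = citations.get(p, [])
            if cited:
                # A paper passes its rank on, split evenly among the papers it cites.
                share = damping * rank[p] / len(cited)
                for q in cited:
                    new_rank[q] += share
            else:
                # A paper citing nothing in the graph spreads its rank evenly.
                for q in papers:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank


# Invented toy graph: both other papers cite "landmark".
citations = {
    "preprint": ["landmark"],
    "review": ["landmark", "preprint"],
    "landmark": [],
}
rank = pagerank(citations)
print(sorted(rank, key=rank.get, reverse=True))
```

In this toy graph the heavily cited paper ends up on top, but the gray area remains: a score like this measures attention, not correctness, and a brand-new preprint starts near the bottom regardless of its quality.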
This work is not the only example of what we would call meta-research triggered by COVID-19: research on how to facilitate research, in an effort to speed up the process of understanding and combating COVID-19. We have seen, for example, how other researchers are using knowledge graphs for the same purpose.
Wadden posits that these approaches could complement one another. For example, where knowledge graphs have an edge between two nodes, asserting a type of relationship, SCIFACT could provide the text on the basis of which the assertion was made.
For the time being, the work will be submitted for peer review. It’s instructive, because it highlights the strengths and weaknesses of the scientific process. And despite its shortcomings, it reminds us of the basic premises in science: peer review, and intellectual honesty.