Like all epidemics, the COVID-19 pandemic escalates into a crisis of both individual and collective character. While in some ways crises bring people together, they also expose and stress systemic flaws and vulnerabilities. Case in point: scientific research.
Science is one of the pillars upon which modern society is built. The scientific method is what underlies many achievements, including technology and data-driven decision making. This, however, does not mean science is without its own issues.
Producing results to fight SARS-CoV-2 (the coronavirus) is one of the most pressing issues today, bringing the entire scientific community together. Addressing issues related to scientific research can help produce results under pressure.
ZDNet connected with two prominent researchers to discuss how they are using the state of the art in analytics and AI, graph analytics and knowledge graphs, to facilitate scientific research on COVID-19.
Scientific data is unFAIR, and that hinders COVID-19 research
Dr. Alexander Jarasch is Head of Data Management and Knowledge Management at Germany’s National Centre for Diabetes Research (DZD). Jarasch notes that data is typically scattered across different locations. In addition, for bigger organizations, and because of historical reasons, data is unFAIR — the opposite of FAIR: Findable, Accessible, Interoperable, and Reusable.
“Especially in life science, we have highly connected data, very heterogeneous data, and the entities are connected in a very complex way. And GDPR regulations make working with data a bit more complicated,” said Dr. Jarasch.
Jarasch pointed out that the coronavirus causes an infectious disease, which makes it especially complex. Each virus has its own strategy for getting into a cell to reproduce and infect other cells. Research has to go on, as not enough experiments have been done yet. Many events in this disease are still unknown because there is not enough data. Because of the way the virus replicates and mutates, developing a vaccine can be really complicated:
“There is no one drug that will likely save us from everything. There are many different drugs on many different patient groups that respond to one or the other treatment. I wouldn’t recommend blindly running any algorithm on any data. The number of data points and dependencies between data points is too high for humans to cope with.
That’s why you need computer assisted analysis or AI or other machine learning algorithms in order to analyze the data. Graph enables a new dimension of data analysis by helping us connect highly heterogeneous data from various disciplines. We need to identify connections in our graph to get new hypotheses and new evidence for one or the other problem.”
Dr. Jarasch is involved in the COVID GRAPH project. This is a voluntary initiative of graph enthusiasts and companies aiming to build a knowledge graph with relevant information about COVID-19 and the SARS-CoV-2 virus. As he pointed out, it includes about 44,000 publications, mostly from pre-print servers:
“This is a good example, because nobody can ever read all these papers, understand them, analyze them, and bring them together in a way that makes sense. Then we have coronavirus relevant patents, case studies, genes, functions, molecular data, and each and every day there are more data sources to be integrated.”
COVID GRAPH brings together a diverse team of scientists, developers, data scientists, as well as more than seven companies. It’s mainly intended for scientists in healthcare or life science, but it can also be of interest to others. It’s publicly available, free of charge, and soon, it could also help scientists studying other diseases potentially linked to the coronavirus.
The goal is to provide sources of information that are connected via the fundamental entities in the biomedical domain: genes, proteins, and their functions. Bringing siloed data together can uncover previously unnoticed connections, and this is where knowledge graphs offer advantages.
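As a rough illustration of how shared biomedical entities can link otherwise siloed sources, the following Python sketch indexes documents by the entities they mention and then looks up which documents are connected through a common gene or protein. Every identifier here (`paper_A`, `patent_B`, `GENE_X`, and so on) is a hypothetical placeholder, and the structure is an assumption for illustration, not COVID GRAPH's actual schema or data.

```python
from collections import defaultdict

# Hypothetical (document, entity) mentions; placeholders, not real data.
MENTIONS = [
    ("paper_A", "GENE_X"),
    ("paper_A", "PROTEIN_Y"),
    ("patent_B", "GENE_X"),
    ("trial_C", "PROTEIN_Y"),
    ("paper_D", "GENE_Z"),
]


def build_entity_index(mentions):
    """Map each entity to the set of documents that mention it."""
    index = defaultdict(set)
    for document, entity in mentions:
        index[entity].add(document)
    return index


def linked_documents(mentions, document):
    """Return the documents sharing at least one entity with `document`."""
    index = build_entity_index(mentions)
    linked = set()
    for doc, entity in mentions:
        if doc == document:
            linked |= index[entity]
    linked.discard(document)  # a document is not linked to itself
    return linked


if __name__ == "__main__":
    print(sorted(linked_documents(MENTIONS, "paper_A")))
```

In this toy data, a paper, a patent, and a clinical trial that live in separate sources become connected only because they mention the same gene or protein, which is the kind of link that stays hidden while the data is siloed.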
Making data FAIR with Knowledge Graphs
Making data FAIR is key in facilitating scientific research in general, and coronavirus research in particular. This is also a key goal of the Open Research Knowledge Graph (ORKG) project. ORKG aims to describe research papers in a structured manner, making them easier to find and compare.
Dr. Sören Auer is the Director of TIB, the Leibniz Information Centre for Science and Technology and University Library. TIB acts in the capacity of the German National Library of Science and Technology. Dr. Auer is a Data Science professor with many contributions in Knowledge Graph research and is leading ORKG.
Dr. Auer identified two key issues in scientific research. First, integrating and semantically representing heterogeneous data about patients, diseases, drugs, clinical trials, etc. Second, representing the state-of-the-art from papers in a more comparable and reproducible fashion.
Knowledge graphs help capture the meaning of data, information, and knowledge. The technology is enjoying its moment of hype now, but it has been around for about 20 years and is here to stay. Knowledge graphs enable interlinking, interconnecting, and integrating heterogeneous data from various sources in various formats, modalities, levels of structure, and governance schemes.
As a result, the effort required for preparing and integrating data for answering specific research questions is dramatically reduced, and AI techniques can be readily applied. ORKG focuses on representing scientific contributions from papers semantically. This makes comparing differences and similarities of different approaches easier, by juxtaposing them in tabular views or domain-specific visualizations.
Dr. Auer pointed to an example of representing and comparing the R0 reproductive number of SARS-CoV-2 from several publications. In epidemiology, the R0 basic reproduction number of an infection can be thought of as the expected number of cases directly generated by one case in a population where all individuals are susceptible to infection. R0 expresses how fast infections spread.
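To make the definition concrete, here is a minimal sketch under the idealized assumption behind R0: every individual stays susceptible, so each case generates R0 new cases on average, and generation n holds R0 to the power n expected cases when starting from a single case. The function name is hypothetical, and real epidemiological models are considerably more involved.

```python
def expected_cases_by_generation(r0, generations):
    """Expected new cases in each generation, starting from one case.

    Assumes a fully susceptible population with no interventions, so each
    case produces r0 new cases on average in the next generation.
    """
    if r0 < 0 or generations < 0:
        raise ValueError("r0 and generations must be non-negative")
    return [r0 ** n for n in range(generations + 1)]


if __name__ == "__main__":
    for r0 in (0.5, 1.0, 2.0):
        print(r0, expected_cases_by_generation(r0, 4))
```

Under this simplification, values above 1 grow from generation to generation, a value of exactly 1 stays flat, and values below 1 die out.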
R0 is a key parameter used in epidemiology models and publications, and comparing it across different publications can help researchers be aware of the underlying assumptions of different models. Visualizations offered by ORKG provide a quick overview across different studies, without having to read and manually compare them. This scales far better than reading every paper by hand.
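The idea of juxtaposing structured contributions can be sketched in a few lines of Python. This is an assumption-laden illustration, not ORKG's data model or API: the `R0Estimate` record, its fields, the study names, and all numbers are placeholders invented for the example and do not come from any publication.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class R0Estimate:
    """One structured 'contribution' extracted from a paper (hypothetical)."""
    study: str
    r0: float
    method: str


# Placeholder values for illustration only, not taken from any publication.
ESTIMATES = [
    R0Estimate("Study B", 3.0, "compartmental model"),
    R0Estimate("Study A", 2.0, "exponential growth fit"),
    R0Estimate("Study C", 2.5, "stochastic simulation"),
]


def comparison_table(estimates):
    """Render the estimates as a plain-text table, sorted by R0."""
    rows = [f"{'Study':<10}{'R0':>6}  Method"]
    for e in sorted(estimates, key=lambda e: e.r0):
        rows.append(f"{e.study:<10}{e.r0:>6.1f}  {e.method}")
    return "\n".join(rows)


def r0_range(estimates):
    """Return the smallest and largest R0 across all estimates."""
    values = [e.r0 for e in estimates]
    return min(values), max(values)


if __name__ == "__main__":
    print(comparison_table(ESTIMATES))
    print("range:", r0_range(ESTIMATES))
```

Once each paper's result is a record rather than a sentence buried in prose, sorting, filtering, and spotting differences in method become simple operations over the records.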
In COVID GRAPH, too, there are two aspects. One is the database itself, which stores the data that is connected. There is also a GUI through which users can query and investigate data. Having the result from a query is just the beginning for interactive browsing and discovering new things that are connected with the result.
Knowledge graphs can be stored in any back end, from files to relational databases or document stores. But since they are, well, graphs, it does make sense to store them in a graph database. This greatly facilitates storage and retrieval, as graph databases offer specialized structures, APIs, and query languages tailored for graphs.
Graph databases come in two main flavors, depending on which graph model they support: Property graph and RDF. In general, RDF graph databases emphasize semantics and interoperability, while property graph databases emphasize ease of use and performance.
There is ongoing work to bridge the two approaches in the graph database community, and we have actively been involved in it. So, when it comes to scientific research, especially at a time of crisis, we would expect to see people joining forces to build on this momentum. We were not disappointed.
Auer and Jarasch not only eagerly agreed to provide an overview of their efforts, but they are also making a joint appearance in an online Meetup to elaborate further. There is a common goal (facilitating scientific research for COVID-19) and a common approach (using graph analytics and knowledge graphs). The focus is on describing and structuring publications semantically.
As Dr. Jarasch noted, a property graph is a little bit different from a knowledge graph, in the sense that a property graph stores properties on nodes and edges that you can query. In a knowledge graph, you can integrate more knowledge by creating new relationships between nodes that have specific evidence attached to them.
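A minimal in-memory sketch may help picture what "properties on nodes and edges that you can query" means, and how evidence can be attached to a relationship. The `PropertyGraph` class, the `MENTIONS` relationship type, the `evidence` property, and all node names are hypothetical; a real deployment would use a graph database and its query language rather than this toy class.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: str
    target: str
    rel_type: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Toy property graph: nodes and edges both carry key/value properties."""

    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, node_id, label, **properties):
        self.nodes[node_id] = Node(node_id, label, properties)

    def add_edge(self, source, target, rel_type, **properties):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges.append(Edge(source, target, rel_type, properties))

    def edges_with(self, rel_type, **required):
        """Edges of one type whose properties match all required values."""
        return [
            e for e in self.edges
            if e.rel_type == rel_type
            and all(e.properties.get(k) == v for k, v in required.items())
        ]


# Hypothetical content: two publications mention the same gene, and each
# relationship records the kind of evidence behind it.
graph = PropertyGraph()
graph.add_node("gene_x", "Gene", name="GENE_X")
graph.add_node("paper_a", "Publication", title="Hypothetical paper A")
graph.add_node("paper_b", "Publication", title="Hypothetical paper B")
graph.add_edge("paper_a", "gene_x", "MENTIONS", evidence="text_mining")
graph.add_edge("paper_b", "gene_x", "MENTIONS", evidence="manual_curation")

if __name__ == "__main__":
    for edge in graph.edges_with("MENTIONS", evidence="manual_curation"):
        print(edge.source, "->", edge.target, edge.properties)
```

Filtering relationships by the evidence stored on them is what lets a query distinguish, for example, manually curated links from automatically extracted ones.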
As Dr. Jarasch said:
“COVID GRAPH is, I would say, a little bit of both. It’s more a knowledge graph than a property graph, but since we are integrating fundamental entities like genes, proteins and transcripts and clinical trials, I would also say this is part of a property graph. I would say that the answer is both depending on what you query.
We have the publications and the patents, and some texts extracts from different sources. They have to be structured in a way so that you connect the elements that belong together. On the other hand you divide bigger text chunks into parts that make sense, and then step-by-step analyzing semantically and annotating the texts and connecting them to the different entities.”
Dr. Auer noted that property graph technology can be a basis for building knowledge graphs:
“We use a property graph as a basis, but equip it with unique URI identifiers, vocabularies as well as RDF export and SPARQL querying facilities. In order to facilitate large scale distributed knowledge integration, we need to build on the W3C semantic technology standards like URIs, RDF, OWL, SPARQL, etc.”
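One way to picture equipping a property graph with URIs and an RDF export is the following sketch, which mints a URI for every node, property key, and relationship type and then writes the result as N-Triples lines. It is only an illustration under stated assumptions, not ORKG's implementation: the `http://example.org/` namespace, the helper names, and the sample data are all hypothetical, and edge properties are left out because plain RDF triples need an extra construct (such as reification) to carry them.

```python
BASE = "http://example.org/"  # hypothetical namespace for minted URIs


def uri(local_name):
    """Wrap a local name as an N-Triples URI reference."""
    return f"<{BASE}{local_name}>"


def literal(value):
    """Render a value as a plain N-Triples string literal."""
    escaped = (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return f'"{escaped}"'


def to_ntriples(nodes, edges):
    """Export node properties and relationships as N-Triples lines.

    nodes: {node_id: {property_key: value}}
    edges: [(source_id, relationship_type, target_id)]
    """
    lines = []
    for node_id, props in nodes.items():
        for key, value in props.items():
            lines.append(f"{uri(node_id)} {uri(key)} {literal(value)} .")
    for source, rel_type, target in edges:
        lines.append(f"{uri(source)} {uri(rel_type)} {uri(target)} .")
    return lines


# Hypothetical sample data.
NODES = {
    "paper_a": {"title": "Hypothetical paper A"},
    "gene_x": {"name": "GENE_X"},
}
EDGES = [("paper_a", "mentions", "gene_x")]

if __name__ == "__main__":
    print("\n".join(to_ntriples(NODES, EDGES)))
```

Because every node and relationship type now has a globally unique identifier, data exported this way can in principle be merged with triples from other sources that use shared vocabularies, which is the interoperability argument behind the W3C standards mentioned in the quote.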
ORKG is looking for partners to help develop domain-specific showcases, in particular for virology and epidemiology. The plan is to create domain-specific knowledge observatories, which represent the state of the art in a certain field and allow researchers to get a quick overview. ORKG is open source, open data, and open knowledge, and Dr. Auer noted they are happy to engage in collaborations.
COVID GRAPH is currently integrating more data sources like clinical trials, and connecting entities from potentially related diseases like diabetes, cancer or lung diseases. Other action points are running pattern finding algorithms to find new patterns or relationships, and working more on the GUI and user experience side. There is a public chat forum where you can get involved or contact the team.