As we have seen, COVID-19 has affected 1.5M people globally and caused 100k deaths, and the worldwide AI research community is trying to help the medical community in whatever way it can. The answers to many medical questions lie in research papers. Such questions are well suited to text mining, and to text mining tools built to provide insights into them.
In this blog, I want to share how a BERT-based pre-trained embedding model can be used to build a text-based question answering tool for finding answers about the coronavirus in the COVID-19 research paper repository.
This solution is a retrieval-based question answering model that uses embeddings. The basic idea is to compare the question string against a sentence corpus and return the top-scoring sentences as the answer. I create a vector representation of each sentence with a pre-trained BioSentVec embedding model and use KNN to find the answer sentences.
Below are the steps I took to preprocess the data. Again, you can access the full code in my GitHub.
● Read the JSON files from the dataset
● Parse the JSON files
● Remove hyperlinks and references
● Extract section details and split each article into sentences using NLTK’s sent_tokenize method
● Generate a pandas data frame for the model
Loading the BioSentVec pretrained model
import sent2vec

model_path = 'BioSentVec_PubMed_MIMICIII-bigram_d700.bin'
model = sent2vec.Sent2vecModel()
try:
    model.load_model(model_path)
except Exception as e:
    print(e)
print('model successfully loaded')
Vectorize the sentence corpus with BioSentVec model
embs = model.embed_sentences(convid_sent_df['sentence'])
KNN and Ranking
The k-nearest neighbors algorithm (KNN) is a very simple technique. First, I loaded the entire set of vectorized sentences into the model as training data. To find an answer, we send a vectorized question string as input, and the KNN model outputs the records from the training sentence corpus most similar to it, along with their scores. A summarized answer is then built from these neighbors. Similarity between records can be measured in many different ways; I used scikit-learn’s default here (the Minkowski metric with p=2, i.e. Euclidean distance).
from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(embs)
Test the Model
Pass a question to the trained kneighbors model, e.g. “What is the physical science of the coronavirus?”
emb = model.embed_sentence('Physical science of the coronavirus')
distances, indices = nbrs.kneighbors(emb)
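To turn the returned `indices` and `distances` into ranked answer sentences, the lookup can be sketched like this. Toy 2-D vectors stand in for the 700-dimensional BioSentVec embeddings, and the sentences are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Stand-ins for the real pipeline: `embs` would come from
# model.embed_sentences(...) and `sentences` from the data frame.
sentences = pd.Series([
    "The coronavirus has a lipid envelope.",
    "Vaccines are under development.",
    "Spike proteins bind ACE2 receptors.",
])
embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.2]])

nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(embs)

# Stand-in for model.embed_sentence(question)
question_emb = np.array([[1.0, 0.1]])
distances, indices = nbrs.kneighbors(question_emb)

# Map neighbor indices back to sentences, ranked by distance
# (smaller distance = more similar)
for dist, idx in zip(distances[0], indices[0]):
    print(f"{dist:.3f}  {sentences[idx]}")
```

`kneighbors` already returns the neighbors sorted by distance, so the loop prints the best-matching sentence first.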
Using BioSentVec embeddings and the KNN algorithm is not the only solution; here are a few alternative approaches.
● We can develop a custom model based on a predefined keyword (fuzzy keyword) domain dictionary and predefined rules expressed as regular expressions, to build a rule-based semantic search tool. This kind of model works very well on a narrow knowledge set and can be implemented quickly.
● And many more options
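As a minimal illustration of the rule-based alternative above, a hand-built domain dictionary of keyword patterns can be matched against each sentence with regular expressions. The topics and patterns here are invented for the example:

```python
import re

# Hypothetical domain dictionary: topic -> (fuzzy) keyword pattern
RULES = {
    "transmission": re.compile(r"\b(transmiss\w*|spread\w*|airborne)\b", re.I),
    "structure": re.compile(r"\b(spike|envelope|protein|genome)\b", re.I),
}

def match_topics(sentence):
    """Return the topics whose keyword rules fire on this sentence."""
    return [topic for topic, pattern in RULES.items()
            if pattern.search(sentence)]

print(match_topics("The spike protein drives airborne transmission."))
```

A real tool would layer ranking and answer extraction on top, but this shows why the approach is quick to implement for a narrow knowledge set.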
I am planning to try a similar question answering model with Twitter streaming data.
Final note: I am not a healthcare professional and cannot comment on the output produced by the model; the purpose of this article is to illustrate a possible solution.
● This solution is primarily based on the BioSentVec pre-trained model. Big thanks to its developers, and to the developers of the original BERT model.