Implementing a Question Answering System on Corona: Approach 01

Image 1: Screenshot of QA System on Corona

Background

In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 includes over 40,000 scholarly articles, with full text, about COVID-19, SARS-CoV-2, and related coronaviruses (https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge). Can we use Natural Language Processing (NLP) techniques to make better use of this data?

Objective

Construct a question answering system capable of answering literature-based questions about the corona virus, such as: What are the symptoms of corona? What do we know about COVID-19 risk factors? How can corona be prevented?

What will you learn?

A quick and dirty end-to-end approach for building a QA system on corona.

Idea

This article takes a naive approach to the problem; the idea is simply to start with something and try to connect the dots.

Let’s start with the idea below for our first end-to-end solution:

  • Download all corona articles and simply split them into sentences.
  • Filter the relevant sentences based on the user’s query.
  • Feed each relevant sentence as context to a BERT model fine-tuned on the SQuAD dataset, and show the selected answers based on the score given by the model.

Solution

1. Download data from Kaggle

Use the Kaggle link given in the Background section to download the research articles on corona.

The articles are stored in 4 folders:

  • biorxiv_medrxiv/biorxiv_medrxiv/
  • comm_use_subset/comm_use_subset/
  • custom_license/custom_license/
  • noncomm_use_subset/noncomm_use_subset/

The total number of articles comes to ~33k as of 31-March-2020.

2. Parse data

The downloaded articles are in .json format, so we need to convert all the data into rows and columns, where each row represents one article.

All articles are now stored in a data-frame with the following columns:

  • file_name: Absolute file path (unique)
  • body: Content of the article
  • abstract: Abstract of the article
  • error_count: Numeric value indicating whether any error occurred while parsing the file
  • article_no: Unique identification number for each article, simply a number starting from 0
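The parsing step could be sketched roughly as follows. This is only an illustration, not the original code: it assumes each CORD-19 .json file holds its paragraphs under the keys "body_text" and "abstract" as lists of objects with a "text" field, and the helper names (join_paragraphs, parse_article, parse_folders) are made up for this example. It returns plain row dictionaries; passing them to pandas.DataFrame would give the data-frame described above.

```python
import json
from pathlib import Path


def join_paragraphs(paragraphs):
    """Join the "text" field of a list of paragraph dicts into one string."""
    return " ".join(p.get("text", "") for p in paragraphs)


def parse_article(path):
    """Parse one article file into a row dict; count errors instead of raising."""
    row = {
        "file_name": str(Path(path).resolve()),  # absolute path, unique per file
        "body": "",
        "abstract": "",
        "error_count": 0,
    }
    try:
        with open(path, encoding="utf-8") as f:
            doc = json.load(f)
    except (OSError, json.JSONDecodeError):
        row["error_count"] += 1
        return row
    # Assumed schema: "body_text" and "abstract" are lists of {"text": ...} dicts.
    for column, key in (("body", "body_text"), ("abstract", "abstract")):
        try:
            row[column] = join_paragraphs(doc[key])
        except (KeyError, TypeError, AttributeError):
            row["error_count"] += 1
    return row


def parse_folders(folders):
    """Parse every .json file under the given folders into a list of rows."""
    rows = []
    for folder in folders:
        for path in sorted(Path(folder).rglob("*.json")):
            row = parse_article(path)
            row["article_no"] = len(rows)  # running number starting from 0
            rows.append(row)
    return rows
```

Calling parse_folders with the four folder paths listed in step 1 would produce one row per article.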

3. Preprocessing functions

The pipeline uses some common preprocessing steps: lemmatization and stopword removal.
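A minimal sketch of such preprocessing is given here, with loud caveats: the stopword set is a tiny illustrative sample (the article uses its own stopword.txt), and naive_lemmatize is a crude stand-in that only strips a plural "s". For real use, a proper lemmatizer such as nltk's WordNetLemmatizer().lemmatize can be passed in through the lemmatize parameter.

```python
import re

# Tiny illustrative stopword list (assumption); the article loads its own stopword.txt.
STOPWORDS = {"a", "an", "the", "of", "are", "is", "what", "to", "in",
             "and", "how", "do", "we", "about"}


def naive_lemmatize(word):
    """Crude stand-in lemmatizer: strips a plural "s" only. Not a real lemmatizer."""
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]
    return word


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens (hyphens allowed inside a token)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def preprocess(text, lemmatize=naive_lemmatize, stopwords=STOPWORDS):
    """Return the lemmatized, stopword-free tokens of `text`."""
    return [lemmatize(tok) for tok in tokenize(text) if tok not in stopwords]
```

For example, preprocess("What are symptoms of Corona ?") keeps only the content words and reduces "symptoms" to "symptom".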

4. Convert data into a useful format

  • Sentence dictionary: Split the body and abstract of every article in the above data-frame into sentences using `nltk.sent_tokenize` and store them in a dictionary where the key is the sentence number and the value is the sentence. It will be used to retrieve a sentence given its sentence number.

The total number of unique sentences after combining all articles is ~37 lakh (3.7 million) as of 31-March-2020.

  • Lemmatized Word - Sentence No. inverse dictionary: Create a dictionary with each lemmatized word as key and, as value, the list of sentence numbers where that lemmatized word is present. It will be used to retrieve the list of sentence numbers for a given lemmatized word.
  • Original Word - Sentence No. inverse dictionary: Create a dictionary with each original word as key and, as value, the list of sentence numbers where that original word is present. It will be used to retrieve the list of sentence numbers for a given original word.
  • Corpus.txt: Contains all sentences in a single .txt file. It will be used as the reference for query spelling correction.

After following the above steps, make sure you have the following files: the sentence dictionary, the two inverse dictionaries, and Corpus.txt.

Note: The stopword.txt file contains the list of stopwords and is not generated by the above steps. I used a manual list of stopwords; you can also use the nltk stopword list.
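The three dictionaries could be built along these lines. This is a sketch under stated assumptions rather than the original code: split_sentences is a simple regex stand-in for nltk.sent_tokenize, the lemmatizer is passed in by the caller, and the function names are invented for this example.

```python
import re
from collections import defaultdict


def split_sentences(text):
    """Regex stand-in for nltk.sent_tokenize: split after ., ! or ? plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_indexes(texts, lemmatize, splitter=split_sentences):
    """Build the sentence dictionary and both word -> sentence-number dictionaries."""
    sentences = {}                   # sentence number -> sentence
    seen = set()                     # keeps only unique sentences
    lemma_index = defaultdict(list)  # lemmatized word -> sentence numbers
    word_index = defaultdict(list)   # original word -> sentence numbers
    for text in texts:
        for sentence in splitter(text):
            if sentence in seen:
                continue
            seen.add(sentence)
            number = len(sentences)
            sentences[number] = sentence
            words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
            for word in words:
                word_index[word].append(number)
            for lemma in {lemmatize(w) for w in words}:
                lemma_index[lemma].append(number)
    return sentences, dict(lemma_index), dict(word_index)


def candidate_sentences(query_words, lemma_index, word_index):
    """Numbers of all sentences that contain at least one of the query words."""
    hits = set()
    for word in query_words:
        hits.update(lemma_index.get(word, []))
        hits.update(word_index.get(word, []))
    return sorted(hits)
```

Looking up a word in either dictionary is then a single dictionary access, which is what makes filtering millions of sentences by query word cheap. Writing the values of the sentence dictionary to a file, one per line, gives the corpus file used for spelling correction.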

5. Query Processing System

The idea of this step is to convert a free-text query into a list of the important words in the query.

The list of important words from the query will be {lemmatized words + original words + expanded list of words} minus {stopwords}.

Input Example (Q): What are symptoms of Corona?

Output (L): ['symptoms', 'symptom', 'corona']
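The query-to-keywords step can be sketched as below. Everything here is illustrative: the stopword set is a small sample, naive_lemmatize is a crude plural-stripping stand-in for a real lemmatizer, and expand is a hypothetical hook for word expansion, since the article does not say how the expanded list is produced.

```python
import re

# Small illustrative stopword sample (assumption), not the article's stopword.txt.
STOPWORDS = {"what", "are", "of", "is", "the", "a", "an", "how", "to",
             "do", "we", "about", "know"}


def naive_lemmatize(word):
    """Crude stand-in lemmatizer: strips a plural "s" only."""
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]
    return word


def important_words(query, lemmatize=naive_lemmatize, expand=None, stopwords=STOPWORDS):
    """Return {original + lemmatized + expanded words} minus stopwords, in first-seen order."""
    originals = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", query.lower())
    candidates = []
    for word in originals:
        candidates.append(word)
        candidates.append(lemmatize(word))
        if expand is not None:  # hypothetical expansion hook, e.g. synonyms
            candidates.extend(expand(word))
    result = []
    for word in candidates:
        if word not in stopwords and word not in result:
            result.append(word)
    return result


print(important_words("What are symptoms of Corona ?"))
# prints ['symptoms', 'symptom', 'corona']
```

Keeping both the original and the lemmatized form lets the same word list be looked up in both inverse dictionaries from step 4.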

For spelling correction, I used Peter Norvig’s spell corrector. For how it works, refer to my video 1 and video 2, which explain it in full.
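For readers who have not seen it, the approach follows Norvig's well-known corrector: generate all strings within one or two edits of the word and pick the known one that is most frequent in a reference corpus. The sketch below is a compact adaptation, not the exact code used for this article; build_corrector is a name made up here, and the corpus text would be the contents of the corpus file from step 4.

```python
import re
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def build_corrector(corpus_text):
    """Return a correct(word) function backed by word counts from `corpus_text`."""
    counts = Counter(re.findall(r"[a-z]+", corpus_text.lower()))

    def known(words):
        return {w for w in words if w in counts}

    def correct(word):
        word = word.lower()
        # Prefer the word itself, then known words 1 edit away, then 2 edits away.
        candidates = (known([word])
                      or known(edits1(word))
                      or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                      or {word})
        return max(candidates, key=lambda w: counts[w])

    return correct
```

A corrector built from the corona corpus would, for instance, map a misspelled "symtoms" back to "symptoms" as long as "symptoms" occurs in the corpus.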

6. Filter relevant Sentences based on Query

Let’s formulate this step, as it is quite an important part.

Let’s assume we have a list of sentences S (~37 lakh, i.e. 3.7 million, sentences) and want to find the top K sentences that are likely to contain the answer to question Q.

Any idea how we can approach it? Please write in the comments section.

Finding the relevant sentences for a given query can be approached in several ways.

  • Word Presence Check: Simply search for the important query words (L) from step 5 in the list of sentences (S); the selected sentences S` are those in which words from L are present. Limitation: The filtered sentences may include noisy sentences that are not necessarily relevant.
  • Information retrieval model: Think of this as a search engine (ranking) problem, where you have documents and need to retrieve the top K relevant documents for a query. Here a document is equivalent to a sentence. Use TF-IDF to get vector representations of the sentences and the query and rank by the cosine similarity between them, or rank the sentences by their BM25 scores. Alternatively, use Latent Semantic Indexing (LSI), where each document and term is expressed as a vector with elements corresponding to concepts.
  • Semantic Similarity: Get embeddings at the query level and the sentence level and compute the cosine similarity between them. We can use Google’s Universal Sentence Encoder or Facebook’s InferSent model to get these embeddings. Refer to my Kaggle talk on Semantic Text Similarity or the Jupyter Notebook. Please comment if you can think of any limitation (Hint: Is it right to use this semantic model? Will pre-trained embeddings work on this specific dataset?).
  • Tuning BERT QA: Get some training data specific to this dataset and fine-tune BERT QA on it. The training data should have two independent features (query and sentence), and the output variable will be binary (0 or 1, i.e. whether or not the sentence contains the answer to the query). As far as I know, there is currently no pre-trained model that returns 0 or 1 given a sentence and a query. Limitation: We don’t have training data, and I doubt we can use SQuAD 2.0 here. Also, in my experience BERT is quite slow; maybe we can use ALBERT (a lightweight BERT).
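As one concrete instance of the information retrieval option, here is a small pure-Python sketch of Okapi BM25 ranking over sentences. It is an illustration only: the function name bm25_top_k, the simple regex tokenizer, and the parameter defaults k1=1.5 and b=0.75 are choices made for this example, and the IDF uses the common "+1 inside the log" variant so that scores stay non-negative. On millions of sentences you would first narrow the candidates (for example with the word presence check) rather than score every sentence.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric tokens; no lemmatization in this sketch."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_top_k(query_words, sentences, k=5, k1=1.5, b=0.75):
    """Return up to k (sentence, score) pairs with the highest BM25 score."""
    docs = [Counter(tokenize(s)) for s in sentences]
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(sum(d.values()) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in d)  # sentences containing term
    scored = []
    for i, d in enumerate(docs):
        length = sum(d.values())
        score = 0.0
        for term in set(query_words):
            tf = d.get(term, 0)
            if tf == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * length / avg_len))
        scored.append((score, i))
    # Highest score first; ties keep the original sentence order.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [(sentences[i], score) for score, i in ranked[:k] if score > 0]
```

The query words passed in would be the list L from step 5. Sentences that share no word with the query get a score of 0 and are dropped.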

7. Use BERT Model for QA System

Once you have the K relevant sentences for a given query, you need to find the span of text in each sentence that answers the question. We will use a BERT model fine-tuned on the SQuAD dataset; the Hugging Face transformers library provides an easy way to use it.

Hurray!! You learned how to develop a QA system on the corona dataset.

Please clap if the article helped you, and share it with your friends as well.

Happy Learning!!
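A minimal sketch of this last step, using the question-answering pipeline from the Hugging Face transformers library, might look like the following. The model name is an assumption (one of the published BERT checkpoints fine-tuned on SQuAD); the article does not say which checkpoint was used, and the helper names are made up for this example.

```python
def load_squad_qa(model_name="bert-large-uncased-whole-word-masking-finetuned-squad"):
    """Build a Hugging Face question-answering pipeline.

    Needs the transformers library plus a backend such as PyTorch, and
    downloads the model on first use. The model name is an assumption.
    """
    from transformers import pipeline  # imported lazily; only needed here
    return pipeline("question-answering", model=model_name)


def answer_question(question, sentences, qa, top_n=3):
    """Run `qa` on each candidate sentence and return the top_n answers by score."""
    answers = []
    for sentence in sentences:
        # The pipeline returns a dict with "answer", "score", "start" and "end".
        result = qa(question=question, context=sentence)
        answers.append({
            "answer": result["answer"],
            "score": result["score"],
            "sentence": sentence,
        })
    answers.sort(key=lambda a: a["score"], reverse=True)
    return answers[:top_n]


# Example usage (downloads the model, so it is left commented out):
# qa = load_squad_qa()
# for a in answer_question("What are symptoms of corona?", top_k_sentences, qa):
#     print(round(a["score"], 3), a["answer"])
```

Here top_k_sentences stands for the K sentences selected in step 6. Because each sentence is scored independently, the model score is the only signal used to rank the final answers, which is also why the speed of BERT matters when K is large.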


Senior Data Scientist @ Fractal Analytics