Implement Question Answering System on Corona : Approach — 01

Aakash Goel
8 min read · Apr 14, 2020
Image 1: Screenshot of QA System on Corona

Background

In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 includes over 40,000 scholarly articles, with full text, about COVID-19, SARS-CoV-2, and related coronaviruses (https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge). Can we use Natural Language Processing (NLP) techniques to make better use of this data?

Objective

Construct a Question Answering (QA) system capable of answering questions related to the coronavirus, specifically literature-based questions such as "What are the symptoms of corona?", "What do we know about COVID-19 risk factors?", and "How to prevent corona?".

What will you learn?

A quick-and-dirty, end-to-end approach for building a QA system on the corona dataset.

Idea

This article presents a naive approach to the problem; the idea is to simply start with something and try to connect the dots.

Let's start with the idea below for our first end-to-end solution:

  • Download all corona articles and simply split them into sentences.
  • Filter relevant sentences based on the user's query.
  • Feed each relevant sentence as context to a BERT model fine-tuned on the SQuAD dataset, and show selected answers based on the score given by the model.

Solution

  1. Download data from Kaggle:

Use the link to download the research articles on corona.

Articles are stored in 4 folders:

  • biorxiv_medrxiv/biorxiv_medrxiv/
  • comm_use_subset/comm_use_subset/
  • custom_license/custom_license/
  • noncomm_use_subset/noncomm_use_subset/
## Storing file absolute paths in a list
from tqdm import tqdm
tqdm.pandas()
import pandas as pd
import json, os, gc

path1 = 'biorxiv_medrxiv/biorxiv_medrxiv/'
path2 = 'comm_use_subset/comm_use_subset/'
path3 = 'custom_license/custom_license/'
path4 = 'noncomm_use_subset/noncomm_use_subset/'
paths = [path1, path2, path3, path4]

file_names = []
for path in paths:
    temp_file_names = os.listdir(path)
    file_names.extend([path + file_name for file_name in temp_file_names])

The total number of articles comes to ~33k as of 31-March-2020.

2. Parse data

Downloaded articles are in .json format; we need to convert all the data into rows and columns, where each row represents one article.

## function to parse data
from time import time
from copy import deepcopy
import nltk
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import wordnet, stopwords
from nltk.stem import WordNetLemmatizer
# stopwords = set(stopwords.words('english'))
# nltk.download('all')
import pandas as pd
import numpy as np
from tqdm import tqdm
tqdm.pandas()
import os, re, multiprocessing, joblib
from multiprocessing import Pool
from collections import defaultdict

def file_content(file_path):
    abstract = ''; body_text = ''; error_count = 0
    if os.path.splitext(file_path)[1] == '.json':
        f = open(file_path)
        f_json = json.load(f)
        try:
            abstract = f_json['abstract'][0]['text']
        except:
            error_count += 1
        for i in f_json['body_text']:
            try:
                body_text = body_text + ' ' + i['text']
            except:
                error_count += 1
        body_text = body_text.strip()
        f.close()
        return body_text, abstract, error_count
    else:
        return body_text, abstract, error_count

## Storing article and related information in a data-frame
df = pd.DataFrame({'file_name': [], 'body': [], 'abstract': [], 'error_count': []})
df['file_name'] = file_names
df['article_no'] = list(range(df.shape[0]))
for ind, info in tqdm(df.iterrows(), total=df.shape[0]):
    df.loc[ind, 'body'], df.loc[ind, 'abstract'], df.loc[ind, 'error_count'] = \
        file_content(file_path=info['file_name'])

Now all articles are stored in a data-frame with the following columns:

  • file_name: absolute file path (unique)
  • body: content of the article
  • abstract: abstract of the article
  • error_count: number of errors encountered while parsing the file
  • article_no: unique identification number for each article, simply a number starting from 0
Image2: Articles stored in data-frame

3. Preprocessing functions

Below is the code for some common preprocessing steps used (lemmatization, stopword removal):

corpus_file = 'corpus.txt'
sent_dict_file = 'sent.joblib.compressed'
word_sent_no_dict_file = 'word_sent_no.joblib.compressed'
orig_word_sent_no_dict_file = 'orig_word_sent_no.joblib.compressed'
stopword_file = 'stopword.txt'

## Lemmatization function
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

# 1. Init Lemmatizer
lemmatizer = WordNetLemmatizer()

def get_lemmatize(sent):
    return " ".join([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in word_tokenize(sent)])

def parallelize_dataframe(df, func, num_partitions, num_cores):
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

def fn_lemmatize(data):
    for ind, info in tqdm(data.iterrows(), total=data.shape[0]):
        data.loc[ind, 'sentence_lemmatized'] = get_lemmatize(sent=info['sentence'])
    return data

## Removing stopwords
def words(text):
    return re.findall(r'\w+', text.lower())

stopwords = list(set(words(open(stopword_file).read())))

def remove_stopwords(sent):
    ## case conversion - lower case
    word_tokens = words(text=sent)
    # sent = sent.lower()
    # word_tokens = word_tokenize(sent)
    ## removing stopwords
    filtered_sentence = " ".join([w for w in word_tokens if not w in stopwords])
    ## removing digits
    filtered_sentence = re.sub(r'\d+', '', filtered_sentence)
    ## removing multiple spaces
    filtered_sentence = words(text=filtered_sentence)
    return " ".join(filtered_sentence)

def fn_stopword(data):
    for ind, info in tqdm(data.iterrows(), total=data.shape[0]):
        sent = info['sentence_lemmatized']
        data.loc[ind, 'sentence_lemma_stop'] = remove_stopwords(sent)
    return data

def fn_stopword_orig(data):
    for ind, info in tqdm(data.iterrows(), total=data.shape[0]):
        sent = info['sentence']
        data.loc[ind, 'sentence_stop'] = remove_stopwords(sent)
    return data

4. Convert data into a useful format

  • Sentence dictionary: Split all body and abstract text from the above data-frame into sentences using ```nltk.sent_tokenize``` and store them in a dictionary where the sentence number is the key and the sentence is the value. It will be used to retrieve a sentence given its sentence number.
## creating sentence dictionary
df['article'] = df['body'] + ' ' + df['abstract']
df['article'].fillna('', inplace=True)
article_no_sent_dict = dict()
for ind, info in tqdm(df.iterrows(), total=df.shape[0]):
    article_no_sent_dict[info['article_no']] = sent_tokenize(str(info['article']))

article_no_list = list(); sent_list = list()
df_sent = pd.DataFrame({'article_id': [], 'sentence': []})
for i in tqdm(article_no_sent_dict, total=len(article_no_sent_dict)):
    article_no_list.extend([i] * len(article_no_sent_dict[i]))
    sent_list.extend(article_no_sent_dict[i])
df_sent['article_id'] = article_no_list; df_sent['sentence'] = sent_list
df_sent['sent_no'] = list(range(df_sent.shape[0]))

## sentence level dictionary
sent_dict = dict()
for ind, info in tqdm(df_sent.iterrows(), total=df_sent.shape[0]):
    sent_dict[info['sent_no']] = info['sentence']
sent_dict[-1] = 'NULL'
sent_dict_file = 'sent.joblib.compressed'
joblib.dump(sent_dict, sent_dict_file, compress=True)

The total number of unique sentences after combining all articles comes to ~37 lakh (3.7 million) as of 31-March-2020.

Image3: Sentences stored in data-frame
  • Lemmatized Word — Sentence No. inverse dictionary: Create a dictionary with the lemmatized word as key and the list of sentence numbers where that lemmatized word is present as value. It will be used to retrieve the list of sentence numbers given a lemmatized word.
## lemmatization over sentences
df1 = deepcopy(df_sent)
df1 = parallelize_dataframe(df=df1, func=fn_lemmatize, num_partitions=27, num_cores=27)
## removing stopwords from lemmatized sentences
df1 = parallelize_dataframe(df=df1, func=fn_stopword, num_partitions=30, num_cores=35)
## saving inverse dictionary on lemmatized sentences i.e. word and sentence no.
word_sent_no_dict = defaultdict(list)
for ind, info in tqdm(df1.iterrows(), total=df1.shape[0]):
    sent_words = words(info['sentence_lemma_stop'])
    for w in sent_words:
        word_sent_no_dict[w].append(info['sent_no'])
joblib.dump(word_sent_no_dict, word_sent_no_dict_file, compress=True)
  • Original Word — Sentence No. inverse dictionary: Create a dictionary with the original word as key and the list of sentence numbers where that original word is present as value. It will be used to retrieve the list of sentence numbers given an original word.
## saving inverse dictionary on original sentences i.e. word and sentence no.
df1 = deepcopy(df_sent)
df1 = parallelize_dataframe(df=df1, func=fn_stopword_orig, num_partitions=35, num_cores=35)
orig_word_sent_no_dict = defaultdict(list)
for ind, info in tqdm(df1.iterrows(), total=df1.shape[0]):
    sent_words = words(info['sentence_stop'])
    for w in sent_words:
        orig_word_sent_no_dict[w].append(info['sent_no'])
joblib.dump(orig_word_sent_no_dict, orig_word_sent_no_dict_file, compress=True)
  • Corpus.txt: Contains all sentences in a single .txt file; it will be used as the reference corpus for query spelling correction (a minimal spell-corrector sketch follows the code below).
## Corpus - for spelling correction model
outF = open(corpus_file, "w")
for line in tqdm(df_sent['sentence'], total=df_sent.shape[0]):
    # write line to output file
    outF.write(line)
    outF.write("\n")
outF.close()
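
As a rough illustration of how corpus.txt can drive spelling correction, here is a minimal sketch of a Norvig-style corrector built from word frequencies in that file, reusing the `words` helper and `corpus_file` defined above. This is an assumption about the setup, not the exact code used in the project, and the two-edit candidate generation is omitted for brevity.

## Minimal Norvig-style spell corrector over corpus.txt (illustrative sketch)
from collections import Counter

WORDS = Counter(words(open(corpus_file).read()))  # word frequencies from corpus.txt

def P(word, N=sum(WORDS.values())):
    # relative frequency of `word` in the corpus
    return WORDS[word] / N

def edits1(word):
    # all strings one edit away from `word`
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(candidate_words):
    # subset of candidates that actually appear in the corpus
    return set(w for w in candidate_words if w in WORDS)

def correction(word):
    # most probable spelling correction for `word`
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=P)

# e.g. correction('symptons') is expected to return 'symptoms'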

So, after following the above steps, make sure we have the following files with us:

Image4: Output files from above step

Note: the stopword.txt file contains a list of stopwords and is not generated by the above steps. I used a manually curated list of stopwords; you can also use the NLTK stopword list (a short snippet for that is shown below).
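
If you prefer the NLTK list, a minimal sketch (assuming the stopword corpus has been downloaded) to write it out as stopword.txt:

## Writing NLTK's English stopword list to stopword.txt (optional alternative)
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stopword corpus
with open('stopword.txt', 'w') as f:
    f.write("\n".join(sorted(set(stopwords.words('english')))))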

5. Query Processing System

The idea of this step is to convert the free-text query into a list of important words.

Image5: Query Processing System

The list of important words from the query will be {lemmatized words + original words + expanded words} minus {stopwords}.

Input Example (Q): What are symptoms of Corona ?

Output (L): [‘symptoms’,’symptom’,’corona’]

For spelling correction, I used the Peter Norvig spell corrector; for how it works, refer to my video 1 and video 2, which explain it in detail. A minimal sketch of the query-processing step is shown below.
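
To make the step concrete, here is a minimal sketch of such a query processor, reusing the `get_lemmatize`, `words`, and `stopwords` objects from step 3. The WordNet-based synonym expansion and the `process_query` / `expand_word` names are illustrative assumptions, not the project's exact code, and the spelling correction above would be applied to the raw query before this step.

## Illustrative query processor: {lemmatized + original + expanded words} minus {stopwords}
from nltk.corpus import wordnet

def expand_word(word, max_synonyms=3):
    # a few WordNet synonyms as a simple form of query expansion (assumption, not the article's exact method)
    synonyms = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            synonyms.add(lemma.name().replace('_', ' ').lower())
    return list(synonyms)[:max_synonyms]

def process_query(query):
    original = words(query)                   # lower-cased original tokens
    lemmatized = words(get_lemmatize(query))  # lemmatized tokens
    expanded = [s for w in original for s in expand_word(w)]
    important = set(original + lemmatized + expanded) - set(stopwords)
    return list(important)

# e.g. process_query("What are symptoms of Corona ?") -> ['symptom', 'symptoms', 'corona', ...]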

6. Filter Relevant Sentences Based on Query

Let's formulate this step, as it is quite an important part.

Let's assume we have a list of sentences S (~37 lakh sentences) and want to find the top K sentences which are likely to contain the answer to question Q.

Any idea how we can approach it? Please write in the comments section.

Finding relevant sentences given a query can be approached in several ways:

  • Word Presence Check: Simply search the important query words (L) from step 5 in the list of sentences (S); the selected sentences S` are those where words from L are present (a minimal sketch using the inverse dictionaries follows this list). Limitation: noisy sentences may also end up among the filtered sentences, and a sentence containing the words is not necessarily relevant.
  • Information Retrieval Model: Think of this problem as a search-engine (ranking) problem, where you have documents and need to retrieve the top K relevant documents for a query; here a document is a sentence. Use BM25 or TF-IDF to get vector representations for the sentence/query and use cosine similarity between them. Or use Latent Semantic Indexing (LSI), where each document and term is expressed as a vector with elements corresponding to concepts.
  • Semantic Similarity: Get embeddings at the query level and at the sentence level and compute the cosine similarity between them. We can use Google's Universal Sentence Encoder or Facebook's InferSent model to get the query and sentence embeddings. Refer to my Kaggle talk on Semantic Text Similarity or the Jupyter Notebook. Please comment if you can think of any limitation (Hint: is it right to use such a semantic model, and will pre-trained embeddings work well on this specific dataset?).
  • Tuning BERT for QA sentence selection: Get some training data specific to this dataset and fine-tune BERT on it. The training data should have two input features (query and sentence), and the output variable would be binary (0 or 1, i.e. whether the sentence contains the answer to the query). Currently, my understanding is that there is no pre-trained model that, given a sentence and a query, returns 0 or 1. Limitation: we don't have training data, and I doubt we can use SQuAD 2.0 here. Also, in my experience BERT is quite slow; maybe we can use ALBERT (a lighter-weight BERT).
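
As a concrete illustration of the word-presence approach (the one the inverse dictionaries from step 4 support), below is a minimal sketch that looks up each important query word in those dictionaries and ranks sentences by how many query words they contain. The `filter_sentences` name and the overlap-count ranking are assumptions for illustration, not a definitive implementation.

## Illustrative word-presence filter using the inverse dictionaries from step 4
from collections import Counter

def filter_sentences(important_words, top_k=10):
    # count, for every sentence number, how many important query words it contains
    hit_counter = Counter()
    for w in important_words:
        sent_nos = set(word_sent_no_dict.get(w, [])) | set(orig_word_sent_no_dict.get(w, []))
        for sent_no in sent_nos:
            hit_counter[sent_no] += 1
    # keep the top K sentences with the largest word overlap
    top = hit_counter.most_common(top_k)
    return [(sent_dict[sent_no], count) for sent_no, count in top]

# e.g. filter_sentences(process_query("What are symptoms of Corona ?"), top_k=5)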

7. Use BERT Model for QA System

Once you have the K relevant sentences for a given query, you need to find the span of text within each sentence that answers the question. We will use a BERT model fine-tuned on the SQuAD dataset; the Hugging Face transformers library provides an easy way to use it.

!pip install transformers

from transformers import pipeline

nlp = pipeline('question-answering', model='bert-large-cased-whole-word-masking-finetuned-squad')
query_sample = "How to prevent Corona ?"
relevant_sentence = 'When asked why they were wearing masks, several students answered that they were "preventing corona".'
nlp(question=query_sample, context=relevant_sentence)
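
To connect this with step 6, here is a minimal sketch (under the same assumptions as the earlier sketches, i.e. the illustrative `process_query` and `filter_sentences` helpers) that runs the QA pipeline over each relevant sentence and sorts the answers by the model's score:

## Illustrative end-to-end answer selection: run the QA model on each relevant sentence
def answer_question(query, top_k=5):
    relevant = filter_sentences(process_query(query), top_k=top_k)
    answers = []
    for sentence, _ in relevant:
        result = nlp(question=query, context=sentence)  # returns {'score', 'start', 'end', 'answer'}
        answers.append((result['answer'], result['score'], sentence))
    # highest-scoring answers first
    return sorted(answers, key=lambda x: x[1], reverse=True)

# e.g. answer_question("What are symptoms of Corona ?")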
Image6: Demo video of QA System on Corona

Hurray! You have learned how to develop a QA system on the corona dataset.

Please clap if this article helps you, and share it with your friends as well.

Happy Learning !!

References

  1. https://www.fedscoop.com/coronavirus-database-natural-language-processing/
  2. https://code.ihub.org.cn/projects/763/repository/revisions/master/entry/docs/source/pretrained_models.rst
  3. https://www.youtube.com/watch?v=k5yxT21D9bs
  4. https://huggingface.co/transformers/pretrained_models.html
  5. https://www.kaggle.com/jonathanbesomi/a-qa-model-to-answer-them-all
