Implement Question Answering System on Corona : Approach — 01

Image 1: Screenshot of QA System on Corona

Background

In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 includes over 40,000 scholarly articles with full text about COVID-19, SARS-CoV-2, and related coronaviruses (https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge). Can we use Natural Language Processing (NLP) techniques to make better use of this data?

Objective

Construct a Question Answering System capable of answering questions related to the corona virus, specifically questions like "What are the symptoms of corona?", "What do we know about COVID-19 risk factors?", "How to prevent corona?", etc., i.e. questions grounded in the corona literature.

What will you learn?

A quick-and-dirty, end-to-end approach for building a QA system on the corona dataset.

Idea

This article presents a naive approach to the problem; the idea is to simply start with something and try to connect the dots.

Let's start with the idea below for our first end-to-end solution:

Solution

1. Download data

Use the Kaggle link above to download the research articles on corona.
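If you are working in a notebook with the Kaggle CLI already installed and configured, one way to fetch the dataset (my own sketch; you can equally download it manually from the Kaggle page) is:

## my own sketch: assumes the Kaggle CLI is installed and ~/.kaggle/kaggle.json is configured
!kaggle datasets download -d allen-institute-for-ai/CORD-19-research-challenge
!unzip -q CORD-19-research-challenge.zip   # adjust the zip file name if yours differs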

The articles are stored in 4 folders:

## Storing file absolute path in list
from tqdm import tqdm
tqdm.pandas()
import pandas as pd
import json, os, gc

path1 = 'biorxiv_medrxiv/biorxiv_medrxiv/'
path2 = 'comm_use_subset/comm_use_subset/'
path3 = 'custom_license/custom_license/'
path4 = 'noncomm_use_subset/noncomm_use_subset/'
paths = [path1, path2, path3, path4]

file_names = []
for path in paths:
    temp_file_names = os.listdir(path)
    file_names.extend([path + file_name for file_name in temp_file_names])

The total number of articles comes to around 33k as of 31-March-2020.

2. Parse data

The downloaded articles are in .json format; we need to convert the data into rows and columns, where each row represents one article.

## function to parse data
from time import time
from copy import deepcopy
import nltk
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import wordnet, stopwords
from nltk.stem import WordNetLemmatizer
# stopwords = set(stopwords.words('english'))
# nltk.download('all')
import pandas as pd
import numpy as np
from tqdm import tqdm
tqdm.pandas()
import os, re, multiprocessing, joblib
from multiprocessing import Pool
from collections import defaultdict

def file_content(file_path):
    abstract = ''; body_text = ''; error_count = 0
    if os.path.splitext(file_path)[1] == '.json':
        f = open(file_path)
        f_json = json.load(f)
        try:
            abstract = f_json['abstract'][0]['text']
        except:
            error_count += 1
        for i in f_json['body_text']:
            try:
                body_text = body_text + ' ' + i['text']
            except:
                error_count += 1
        body_text = body_text.strip()
        f.close()
        return body_text, abstract, error_count
    else:
        return body_text, abstract, error_count

## Storing article and related information in data-frame
df = pd.DataFrame({'file_name': [], 'body': [], 'abstract': [], 'error_count': []})
df['file_name'] = file_names
df['article_no'] = list(range(df.shape[0]))
for ind, info in tqdm(df.iterrows(), total=df.shape[0]):
    df.loc[ind, 'body'], df.loc[ind, 'abstract'], df.loc[ind, 'error_count'] = \
        file_content(file_path=info['file_name'])

Now all articles are stored in a data-frame with the following columns:

Image2: Articles stored in data-frame

3. Preprocessing functions

Below is the code for some common preprocessing steps used (lemmatization, stop-word removal):

corpus_file = 'corpus.txt'
sent_dict_file = 'sent.joblib.compressed'
word_sent_no_dict_file = 'word_sent_no.joblib.compressed'
orig_word_sent_no_dict_file = 'orig_word_sent_no.joblib.compressed'
stopword_file = 'stopword.txt'

## Lemmatization function
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

# 1. Init Lemmatizer
lemmatizer = WordNetLemmatizer()

def get_lemmatize(sent):
    return " ".join([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in word_tokenize(sent)])

def parallelize_dataframe(df, func, num_partitions, num_cores):
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

def fn_lemmatize(data):
    for ind, info in tqdm(data.iterrows(), total=data.shape[0]):
        data.loc[ind, 'sentence_lemmatized'] = get_lemmatize(sent=info['sentence'])
    return data

## removing stopwords
def words(text): return re.findall(r'\w+', text.lower())

stopwords = list(set(words(open(stopword_file).read())))

def remove_stopwords(sent):
    ## case conversion - lower case
    word_tokens = words(text=sent)
    #sent = sent.lower()
    #word_tokens = word_tokenize(sent)
    ## removing stopwords
    filtered_sentence = " ".join([w for w in word_tokens if not w in stopwords])
    ## removing digits
    filtered_sentence = re.sub(r'\d+', '', filtered_sentence)
    ## removing multiple space
    filtered_sentence = words(text=filtered_sentence)
    return " ".join(filtered_sentence)

def fn_stopword(data):
    for ind, info in tqdm(data.iterrows(), total=data.shape[0]):
        sent = info['sentence_lemmatized']
        data.loc[ind, 'sentence_lemma_stop'] = remove_stopwords(sent)
    return data

def fn_stopword_orig(data):
    for ind, info in tqdm(data.iterrows(), total=data.shape[0]):
        sent = info['sentence']
        data.loc[ind, 'sentence_stop'] = remove_stopwords(sent)
    return data
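As a quick sanity check, here is a small illustrative snippet (the sample sentence is my own) showing what these helpers produce; the exact output of remove_stopwords depends on what your stopword.txt contains.

## quick illustrative check of the preprocessing helpers (sample sentence is my own)
sample = "The symptoms of corona viruses include fever and coughing."
lemmatized = get_lemmatize(sample)       # plural nouns and inflected verbs reduced to base forms
cleaned = remove_stopwords(lemmatized)   # lower-cased, stopwords and digits removed
print(lemmatized)
print(cleaned)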

4. Convert data into a useful format

## creating sentence dictionary
df['article'] = df['body'] + ' ' + df['abstract']
df['article'].fillna('', inplace=True)

article_no_sent_dict = dict()
for ind, info in tqdm(df.iterrows(), total=df.shape[0]):
    article_no_sent_dict[info['article_no']] = sent_tokenize(str(info['article']))

article_no_list = list(); sent_list = list()
df_sent = pd.DataFrame({'article_id': [], 'sentence': []})
for i in tqdm(article_no_sent_dict, total=len(article_no_sent_dict)):
    article_no_list.extend([i] * len(article_no_sent_dict[i]))
    sent_list.extend(article_no_sent_dict[i])
df_sent['article_id'] = article_no_list; df_sent['sentence'] = sent_list
df_sent['sent_no'] = list(range(df_sent.shape[0]))

## sentence level dictionary
sent_dict = dict()
for ind, info in tqdm(df_sent.iterrows(), total=df_sent.shape[0]):
    sent_dict[info['sent_no']] = info['sentence']
sent_dict[-1] = 'NULL'
sent_dict_file = 'sent.joblib.compressed'
joblib.dump(sent_dict, sent_dict_file, compress=True)

The total number of unique sentences after combining all articles comes to ~3.7 million (37 lakh) as of 31-March-2020.

Image3: Sentences stored in data-frame
## lemmatization over sentence
df1 = deepcopy(df_sent)
df1 = parallelize_dataframe(df=df1, func=fn_lemmatize, num_partitions=27, num_cores=27)

## removing stopword from lemmatized sentence
df1 = parallelize_dataframe(df=df1, func=fn_stopword, num_partitions=30, num_cores=35)

## saving inverse dictionary on lemmatized sentence i.e. word and sentence no
word_sent_no_dict = defaultdict(list)
for ind, info in tqdm(df1.iterrows(), total=df1.shape[0]):
    sent_words = words(info['sentence_lemma_stop'])
    for w in sent_words:
        word_sent_no_dict[w].append(info['sent_no'])
joblib.dump(word_sent_no_dict, word_sent_no_dict_file, compress=True)

## saving inverse dictionary on original sentence i.e. word and sentence no
df1 = deepcopy(df_sent)
df1 = parallelize_dataframe(df=df1, func=fn_stopword_orig, num_partitions=35, num_cores=35)
orig_word_sent_no_dict = defaultdict(list)
for ind, info in tqdm(df1.iterrows(), total=df1.shape[0]):
    sent_words = words(info['sentence_stop'])
    for w in sent_words:
        orig_word_sent_no_dict[w].append(info['sent_no'])
joblib.dump(orig_word_sent_no_dict, orig_word_sent_no_dict_file, compress=True)

## Corpus - for spelling correction model
outF = open(corpus_file, "w")
for line in tqdm(df_sent['sentence'], total=df_sent.shape[0]):
    # write line to output file
    outF.write(line)
    outF.write("\n")
outF.close()

So, after following the above steps, make sure we have the following files with us:

Image4: Output files from above step

Note: stopword.txt is a file containing a list of stopwords and is not generated by the above steps. I used a manual list of stopwords; you can also use the NLTK stopword list, as sketched below.
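If you prefer NLTK's list, a minimal sketch (my own addition, not part of the original pipeline) for writing the NLTK English stopwords into stopword.txt, which the preprocessing code above reads, could look like this:

## optional alternative: build stopword.txt from NLTK's English stopword list (my own sketch)
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords as nltk_stopwords

with open('stopword.txt', 'w') as f:
    f.write("\n".join(sorted(set(nltk_stopwords.words('english')))))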

5. Query Processing System

The idea of this step is to convert a free-text query into a list of the important words in the query.

Image5: Query Processing System

The list of important words from the query will be {lemmatized words + original words + expanded list of words} minus {stopwords}.

Input Example (Q): What are symptoms of Corona?

Output (L): ['symptoms', 'symptom', 'corona']

For spelling correction, I used the Peter Norvig spell corrector; for how it works, refer to my video 1 and video 2, which explain it in detail.
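Below is a minimal sketch of such a query-processing step, reusing the helpers defined earlier. The process_query name and the WordNet-synonym expansion are my own illustrative choices, not necessarily the author's exact implementation.

## illustrative query-processing sketch (process_query and the WordNet expansion are my assumptions)
def process_query(query):
    original = words(remove_stopwords(query))                   # original words, lower-cased, stopwords/digits removed
    lemmatized = words(remove_stopwords(get_lemmatize(query)))  # lemmatized words, cleaned the same way
    expanded = []
    for w in lemmatized:                                        # optional expansion with WordNet synonyms
        for syn in wordnet.synsets(w):
            expanded.extend(l.name().lower().replace('_', ' ') for l in syn.lemmas())
    keywords = set(original) | set(lemmatized) | set(expanded)
    return [w for w in keywords if w not in stopwords]

print(process_query("What are symptoms of Corona ?"))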

6. Filter relevant Sentences based on Query

Let's formulate this step, as it is quite an important part.

Let's assume we have a list of sentences S (~3.7 million sentences) and want to find the top K sentences that are likely to contain the answer to question Q.

Any idea how we can approach it? Please write in the comments section.

Finding relevant sentences for a given query can be experimented with in several ways; one simple possibility is sketched below.
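As one simple possibility (an assumption on my part, not necessarily the approach used in the original system), the inverted index word_sent_no_dict built in step 4 can be used to score sentences by how many query keywords they contain, keeping the top K; process_query is the sketch from the previous step.

## illustrative sentence filtering via the inverted index (one possible approach)
from collections import Counter

def top_k_sentences(keywords, k=10):
    counts = Counter()
    for w in keywords:
        for sent_no in word_sent_no_dict.get(w, []):   # sentence ids containing this keyword
            counts[sent_no] += 1
    # rank sentences by number of keyword hits and keep the top k
    return [(sent_no, sent_dict[sent_no]) for sent_no, _ in counts.most_common(k)]

candidates = top_k_sentences(process_query("What are symptoms of Corona ?"), k=5)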

7. Use BERT Model for QA System

Once you have the K relevant sentences for a given query, you need to find the span of text within a sentence that answers the question. We will use a BERT model fine-tuned on the SQuAD dataset; the Hugging Face transformers library provides an easy way to use it.

!pip install transformers
from transformers import pipeline

nlp = pipeline('question-answering', model='bert-large-cased-whole-word-masking-finetuned-squad')
query_sample = "How to prevent Corona ?"
relevant_sentence = 'When asked why they were wearing masks, several students answered that they were "preventing corona".'
nlp(question=query_sample, context=relevant_sentence)
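The pipeline call returns a Python dict with the predicted answer span plus its confidence score and character offsets into the context (keys 'answer', 'score', 'start', 'end'); for the toy example above you would expect it to extract something like "preventing corona".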
Image6: Demo video of QA System on Corona

Hurray!! You learned how to develop a QA system on the corona dataset.

Please clap if this article helps you, and share it with your friends as well.

Happy Learning !!

