The intuition behind this project was to build a chatbot that mimics a Retriever-Generator approach for Open-Domain Question-Answering (QA). Our QA pipeline, thus, unfolds in three key stages:
1. YAKE does Keyword Extraction 🔑
- Leveraging YAKE, our Keyword Extractor, we identify important keywords from the questions.
- Configurations, like the number of keywords and maximum words per keyword, shape the process.
2. Wikipedia provides the Context
- Unearthing answers requires context. We adopt a two-pronged strategy:
- We pinpoint Wikipedia articles related to the selected keywords, essentially turning Wikipedia into our knowledge base, i.e., our domain.
- A straightforward similarity mechanism filters and merges the matching articles into a relevant context for the question.
3. RoBERTa finds the Answer 🤖
- Armed with the question and context, RoBERTa steps in to predict the answer.
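The first two stages can be sketched roughly as follows. This is a minimal illustration, not the code in `qamodel.py`: the function names, default values, and the token-overlap score are all assumptions, since the actual similarity mechanism is not specified here. The YAKE call relies on the third-party `yake` package and is imported lazily so that the context-building part runs with the standard library alone.

```python
import re


def extract_keywords(question, max_ngram_size=2, num_keywords=3):
    """Return the top keyword phrases for a question using YAKE.

    The defaults are illustrative, not SciBot's configured values.
    """
    import yake  # third-party; imported lazily so the rest runs without it

    extractor = yake.KeywordExtractor(lan="en", n=max_ngram_size, top=num_keywords)
    # YAKE returns (keyword, score) pairs; a lower score means more relevant.
    return [keyword for keyword, _score in extractor.extract_keywords(question)]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def question_coverage(question, article):
    """Fraction of the question's tokens that also occur in the article."""
    question_tokens = tokenize(question)
    if not question_tokens:
        return 0.0
    return len(question_tokens & tokenize(article)) / len(question_tokens)


def build_context(question, articles, min_score=0.3):
    """Keep articles that cover enough of the question, best match first."""
    scored = [(question_coverage(question, text), text) for text in articles]
    kept = sorted(
        (pair for pair in scored if pair[0] >= min_score),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return "\n\n".join(text for _score, text in kept)


# Hypothetical article snippets standing in for fetched Wikipedia content.
question = "Through which process are plants able to make their own food?"
articles = [
    "Photosynthesis is the process plants use to make their own food from light.",
    "Turquoise is a blue-green mineral used in jewelry.",
]
context = build_context(question, articles)
print(context)
```

Here `max_ngram_size` and `num_keywords` correspond to the "maximum words per keyword" and "number of keywords" settings mentioned above. The unrelated article shares no tokens with the question, so it is dropped from the merged context.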
- Defines and initializes the `ScienceChatBot` class with configuration data from a YAML file (using `config.yaml`).
- Implements functions for keyword extraction, fetching Wikipedia articles, filtering and combining article content, and predicting answers based on user questions.
NOTE: Detailed documentation about the individual functions can be found within the `qamodel.py` file.
- Renders the HTML template for the chatbot interface.
- Initializes `ScienceChatBot` and predicts answers for user questions using the `predict_answer` method.
- Processes user input, gets the predicted answers, and returns the responses in JSON format.
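As a rough sketch of the request flow described above, the handler below validates the incoming payload, calls `predict_answer`, and returns a JSON-ready reply. The `/predict` route, the `question` field, the `build_reply` and `create_app` helpers, and the assumption that `predict_answer` takes a single question string are all hypothetical; the real app may be wired differently. Flask is imported inside `create_app` so the helper can be exercised without it.

```python
def build_reply(bot, payload):
    """Turn a request payload into a (body, status) pair ready for JSON output."""
    if not isinstance(payload, dict):
        payload = {}
    question = str(payload.get("question", "")).strip()
    if not question:
        return {"error": "Please enter a question."}, 400
    # Assumes predict_answer accepts the question text and returns a string.
    return {"question": question, "answer": bot.predict_answer(question)}, 200


def create_app(bot):
    """Wrap build_reply in a small Flask app (requires the flask package)."""
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])  # hypothetical route name
    def predict():
        body, status = build_reply(bot, request.get_json(silent=True))
        return jsonify(body), status

    return app
```

Keeping the validation in a plain function makes it easy to unit test with a stub bot, without starting a server.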
- HTML template for the custom SciBot UI.
- Uses internal styling to render the interface and AJAX requests to send user input to, and receive chatbot output from, the backend.
- Specifies configuration settings for SciBot.
- Contains unit test cases to test the basic functionalities of SciBot.
Before diving into the project, let's set up the groundwork. First, activate the project's virtual environment using poetry:
poetry shell
Afterward, install the essential dependencies:
poetry install
With these steps complete, the project can be run using the Custom UI!
As a full-stack engineer, I couldn't help but include a basic UI built using Flask for a richer experience. Follow these steps to interact with the SciBot UI:
Point Flask to the app file and set the development environment:
export FLASK_APP=chatbot
export FLASK_ENV=development
Then, launch the Flask app:
flask run
The app will then be up and running at http://127.0.0.1:5000/
A demo featuring the Custom UI: (The gif may render slower than the actual speed!)
The chatbot was tasked with answering some of the questions from the SciQ dataset. Two aspects of the responses can be evaluated:
- Context Quality Assessment: Examining the effectiveness of the chatbot in retrieving relevant context following keyword extraction, and
- Model Accuracy Assessment: Evaluating the precision of the model in predicting accurate answers based on the retrieved context.
NOTE: Since SciBot itself fetches the context for the question through keyword extraction, the context supplied with the SciQ dataset is NOT used.
Question | Actual Answer | Predicted Answer |
---|---|---|
Through which process are plants able to make their own food? | photosynthesis | photosynthesis |
Each specific polypeptide has a unique linear sequence of which acids? | amino | amino acids |
What is the most common type of anemia? | iron-def | Iron-deficiency anemia |
What is the process by which the nucleus of a eukaryotic cell divides? | mitosis | mitosis |
What mineral is used in jewelry because of its striking greenish-blue color? | turquoise | malachite |
What are hydrocarbons most important use? | fuel | fuels and chemicals |
When a hypothesis is repeatedly confirmed, what can it then become? | theory | part of a theory |
The effect of acetylcholine in heart muscle is inhibitory rather than what? | excitatory | excitatory |
What is process of producing eggs in the ovary called? | oogenesis | meiosis |
A phase diagram plots pressure and what else? | temperature | temperature |
Energy resources can be put into two categories — renewable or? | nonrenewable | non-renewable |
Who proposed the theory of evolution by natural selection? | darwin | Charles Darwin & Alfred Russel Wallace |
What is the term for the secretion of saliva? | salivation | spit |
Caffeine and alcohol are two examples of what type of drug? | psychoactive | stimulant |
Sometimes referred to as air, what do we call the mixture of gases that surrounds the planet? | atmosphere | The atmosphere of Earth |
Who was the first person known to use a telescope to study the sky? | galileo | Galileo Galilei |
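One way to put a number on the model accuracy aspect is to grade each prediction against the actual answer. The sketch below is an assumption about how that could be done, not a metric used by the project: it labels a prediction `exact` when the normalized answers match, `partial` when every token of the actual answer appears in the prediction, and `miss` otherwise. The rows are copied from the table above.

```python
import re
from collections import Counter


def normalize(answer):
    """Lowercase, drop punctuation (so "non-renewable" becomes "nonrenewable"), tokenize."""
    return re.sub(r"[^a-z0-9\s]", "", answer.lower()).split()


def grade(actual, predicted):
    """Label a prediction as 'exact', 'partial', or 'miss'."""
    actual_tokens, predicted_tokens = normalize(actual), normalize(predicted)
    if actual_tokens == predicted_tokens:
        return "exact"
    if actual_tokens and set(actual_tokens) <= set(predicted_tokens):
        return "partial"
    return "miss"


# (actual answer, predicted answer) pairs taken from the results table.
rows = [
    ("photosynthesis", "photosynthesis"),
    ("amino", "amino acids"),
    ("turquoise", "malachite"),
    ("nonrenewable", "non-renewable"),
    ("darwin", "Charles Darwin & Alfred Russel Wallace"),
    ("oogenesis", "meiosis"),
]
tally = Counter(grade(actual, predicted) for actual, predicted in rows)
print(dict(tally))
```

A token-based check like this is deliberately strict: a truncated label such as `iron-def` would be graded as a miss against `Iron-deficiency anemia`, so abbreviated reference answers need manual review.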
The unit test cases can be executed by running the following command:
python -m unittest qatest.py
The latest code has been tested against these test cases locally. Below is a screenshot showing the test results: