- Run `conda create -p med_venv python==3.10 -y` to create the conda environment
- Run `conda activate med_venv/` to activate it
- Run `python -m pip install --upgrade pip` to upgrade pip
- Run `pip install -r src/requirements.txt` to install the dependencies
- Run `conda install jupyter` to install Jupyter (to run the Jupyter notebook)
- Run `python src/engine.py` to train the model
- Run `streamlit run src/app.py` to run the Streamlit app
This project trains Skip-gram and FastText models on a COVID-19 clinical trials dataset and builds a search engine in which a user can type any COVID-19-related keyword and get the top n most similar results from the dataset.
When we search for a word on Google, the results include not only pages that contain that exact word but also pages about closely related terms. For example, a search for ‘medicine’ returns results that include not just the word ‘medicine’ but also terms such as "health", "pharmacy", "WHO", and so on. So, Google somehow understands that these terms are closely related to each other. This is where word embeddings come into the picture. Word embeddings are numerical vector representations of words, learned from the contexts in which the words appear, so that words used in similar contexts get similar vectors.
General-purpose word embeddings might not perform well enough in every domain. Hence, we need to build domain-specific embeddings to get better outcomes. In this project, we create medical word embeddings using Word2Vec and FastText in Python.
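As a rough illustration of how the two models could be trained with Gensim, here is a minimal sketch. It assumes Gensim 4.x (where the dimension parameter is called `vector_size`); the function name `train_models`, the hyperparameter values, and the toy corpus are hypothetical and are not taken from `src/engine.py`.

```python
def train_models(sentences, vector_size=100, window=5, min_count=1):
    """Train a Skip-gram Word2Vec model and a FastText model on tokenized text.

    `sentences` is a list of token lists, one per record.
    The default hyperparameters are illustrative assumptions only.
    """
    # Imported inside the function so the toy corpus below can be
    # inspected even on a machine where Gensim is not installed.
    from gensim.models import FastText, Word2Vec

    # sg=1 selects the Skip-gram architecture (sg=0 would select CBOW).
    skipgram = Word2Vec(
        sentences=sentences,
        vector_size=vector_size,
        window=window,
        min_count=min_count,
        sg=1,
    )
    # FastText additionally learns character n-gram vectors.
    fasttext = FastText(
        sentences=sentences,
        vector_size=vector_size,
        window=window,
        min_count=min_count,
        sg=1,
    )
    return skipgram, fasttext


# Hypothetical, already pre-processed titles (one token list per record).
toy_corpus = [
    ["covid", "vaccine", "trial", "adult"],
    ["covid", "vaccine", "efficacy", "study"],
    ["mask", "usage", "hospital", "staff"],
]
```

Calling `skipgram, fasttext = train_models(toy_corpus)` returns both models, and `skipgram.wv.most_similar("vaccine", topn=3)` then lists the nearest words in the learned space. On a corpus this small the neighbours are not meaningful; the real dataset is needed for useful embeddings.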
We use a COVID-19 clinical trials dataset for this project. Dataset-Link
The dataset has 10666 rows and 21 columns. The following two columns are essential for us:
- Title
- Abstract
This project aims to use the trained models (Word2Vec and FastText) to build a search engine and a Streamlit UI.
The goal is to develop a machine learning application that understands the relationships and patterns between words used together in the field of medical science, create a smart search engine for records containing those terms, and finally build a machine learning pipeline in Azure to deploy and scale the application.
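The search idea can be sketched in plain Python as follows. This is only an illustration under stated assumptions: each record and the query are represented by the average of their word vectors (the project may combine vectors differently), and the tiny 2-dimensional `toy_embeddings`, the `toy_documents`, and all function names are hypothetical stand-ins for the trained models and the dataset.

```python
import math


def average_vector(tokens, embeddings):
    """Average the vectors of known tokens; return None if no token is known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n_similar(query_tokens, documents, embeddings, n=2):
    """Return the n (title, score) pairs most similar to the query."""
    query_vec = average_vector(query_tokens, embeddings)
    if query_vec is None:
        return []
    scored = []
    for title, tokens in documents:
        doc_vec = average_vector(tokens, embeddings)
        if doc_vec is not None:
            scored.append((cosine_similarity(query_vec, doc_vec), title))
    scored.sort(reverse=True)  # highest similarity first
    return [(title, score) for score, title in scored[:n]]


# Hypothetical embeddings; a trained model would supply real vectors.
toy_embeddings = {
    "vaccine": [0.9, 0.1],
    "immunization": [0.8, 0.2],
    "mask": [0.1, 0.9],
    "ventilator": [0.2, 0.8],
}
# Hypothetical (title, pre-processed tokens) records.
toy_documents = [
    ("Immunization study", ["immunization", "study"]),
    ("Mask usage survey", ["mask", "survey"]),
    ("Ventilator outcomes", ["ventilator", "outcomes"]),
]

results = top_n_similar(["vaccine"], toy_documents, toy_embeddings, n=2)
```

Here the query "vaccine" ranks "Immunization study" first even though the word "vaccine" never appears in that title, which is the behaviour the search engine is after.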
- Language - Python
- Libraries and Packages - Pandas, NumPy, Matplotlib, Plotly, Gensim, Streamlit, NLTK
- Check my Jupyter notebooks:
- Importing the required libraries
- Reading the dataset
- Pre-processing
  - Remove URLs
  - Convert text to lower case
  - Remove numerical values
  - Remove punctuation
  - Perform tokenization
  - Remove stop words
  - Perform lemmatization
  - Remove ‘\n’ characters from the columns
- Exploratory Data Analysis (EDA)
- Data Visualization using word cloud
- Training the ‘Skip-gram’ model
- Training the ‘FastText’ model
- Model embeddings – Similarity
- PCA plots for Skip-gram and FastText models
- Convert abstract and title to vectors using the Skip-gram and FastText model
- Use the Cosine similarity function
- Perform input query pre-processing
- Define a function to return top ‘n’ similar results
- Result evaluation
- Run the Streamlit Application
  - Run `streamlit run medical.py` in the notebook
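The pre-processing steps listed above could look roughly like the sketch below. It is a simplified, standard-library-only illustration: the stop-word set is a tiny hypothetical sample (the project lists NLTK, whose full English list would normally be used), tokenization is a plain whitespace split, and lemmatization is left out because it needs NLTK's WordNet data. The function name `preprocess` is an assumption, not the notebook's actual code.

```python
import re
import string

# Tiny illustrative stop-word set (assumption); not the full NLTK list.
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "for", "to", "with", "on", "is", "are"}

# Translation table that turns every punctuation character into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Clean one title or abstract and return its list of tokens."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = text.lower()                                  # convert to lower case
    text = re.sub(r"\d+", " ", text)                     # remove numerical values
    text = text.translate(PUNCT_TO_SPACE)                # remove punctuation
    tokens = text.split()                                # tokenize; also drops '\n'
    return [t for t in tokens if t not in STOP_WORDS]    # remove stop words


sample = "Efficacy of the COVID-19 Vaccine in 300 Adults:\nsee https://example.org/trial"
tokens = preprocess(sample)
```

The same function can be applied to the user's input query, so that queries and records go through identical cleaning before they are converted to vectors.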
- Understanding the business problem
- Understanding the architecture to build the Streamlit application
- Learning the Word2Vec and FastText model
- Importing the dataset and required libraries
- Data Pre-processing
- Performing basic Exploratory Data Analysis (EDA)
- Training the Skip-gram model with varying parameters
- Training the FastText model with varying parameters
- Understanding and performing the model embeddings
- Plotting the PCA plots
- Getting vectors for each attribute
- Performing the Cosine similarity function
- Pre-processing the input query
- Evaluating the results
- Creating a function to return top ‘n’ similar results for a given query
- Understanding the code for executing the Streamlit application
- Running the Streamlit application
- Links to solve some errors