Refer to this Notion link for detailed documentation drafts.
This project is a Retrieval-Augmented Generation (RAG) model specifically designed to educate users on gynecological topics through a conversational chatbot. The chatbot leverages a local Large Language Model (LLM) and is optimized to run on CPU, making it suitable for devices without a GPU. The pipeline integrates various tools and techniques to ensure accuracy, contextual memory, and efficient resource usage.
To run the project with Docker, follow these steps:

1.1 Build the Docker Image

Open your CLI and execute the following command:

```bash
docker build -t medbot-image .
```

You can replace `medbot-image` with any other tag you prefer.

1.2 Run the Docker Container

To start the container, run:

```bash
docker run -p 8501:8501 medbot-image  # For Streamlit on port 8501
```

Note: To run the backend FastAPI app (main.py) instead, expose port 8000:

```bash
docker run -p 8000:8000 medbot-image
```
Since this project depends on an Ollama-served LLM loaded locally on the CPU, Docker might encounter issues during inference. To avoid these errors, it is recommended to run the project locally without Docker.
2.1 Update Paths: Locate any relative paths in the project code, uncomment them, and replace them with absolute paths as necessary.
2.2 Run the Streamlit App:

- Open a Bash terminal.
- Execute the following command:

```bash
streamlit run st_app.py
```

This will launch the Streamlit front end locally.
Contents:

- Project Overview
- Pipeline Overview
- Model Selection
- Dataset Preparation
- Text Chunking Strategy
- Vector Database Choice
- Retriever and Conversational Chain
- Evaluation
- Future Enhancements
This project focuses on developing an AI chatbot for gynecological education, aiming to answer common questions related to gynecology. The system is designed to be modular and scalable, with a retrieval-augmented generation approach that leverages LangChain for chaining responses and Pinecone as the vector database for efficient information retrieval.
The development pipeline consists of the following steps:
- Model Selection: Choosing a suitable medical LLM for CPU-only environments.
- Data Preparation: Gathering and processing relevant gynecological information from various sources.
- Text Chunking: Segmenting the data into manageable chunks for better model comprehension.
- Vector Database Selection: Storing embeddings and efficiently retrieving relevant chunks.
- Retriever and Conversational Chain Setup: Implementing contextual memory for seamless user interactions.
- Evaluation: Ensuring model accuracy using a mix of ground truth checks, LLM scoring, and retriever accuracy tests.
Since the system runs entirely locally on a CPU, model selection was a critical step. Initially, I explored different medical LLMs by examining the medical leaderboard on Hugging Face. Options included:
- Open-source medical LLMs with fine-tuning on specialized datasets.
- Generic open-source LLMs capable of handling medical knowledge.
- Commercial API models (not suitable for our local setup).
After comparing performance, I chose MedLLaMA, ranked 8th among open-source medical LLMs, due to its fine-tuning on medical data. However, as MedLLaMA requires a GPU, I adapted my setup by using Ollama to run a locally available LLaMA v3.2 (7B parameters) model, balancing performance with hardware limitations.
MedLLaMA was the initial choice because its fine-tuning on medical datasets is crucial for providing accurate gynecological information. Ultimately, LLaMA v3.2 was used in development to fit the hardware constraints and avoid overburdening the system.
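As a rough illustration, the model can be loaded through LangChain's Ollama wrapper. This is a minimal sketch, assuming Ollama is installed and the model has already been pulled locally; the model tag and temperature are illustrative, not the project's actual configuration.

```python
# Minimal sketch: loading a local Ollama model via LangChain.
# Assumes `ollama pull llama3.2` has been run beforehand; the tag
# and temperature below are illustrative assumptions.
from langchain_community.llms import Ollama

llm = Ollama(model="llama3.2", temperature=0.1)

print(llm.invoke("In one sentence, what is cervical screening?"))
```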
For version 1 of the chatbot, I prioritized data that would be comprehensive and authoritative:
- Gynecology-focused PDFs – These included textbooks and medical case studies.
- QA Dataset – Aggregated from various medical Q&A websites.
- Web-Scraped Articles – Supplementary information from reputable sources.
Starting with the PDF-based dataset ensured a solid foundation of structured, authoritative knowledge. Later versions can incorporate more diverse datasets to improve answer specificity.
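For illustration, PDFs can be loaded into LangChain documents with a loader such as PyPDFLoader; the file path below is a placeholder, and the actual project may use a different loader.

```python
# Hypothetical sketch of loading a gynecology PDF into LangChain documents.
# The file path is a placeholder, not a real project asset.
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("data/gynecology_textbook.pdf")
pages = loader.load()  # one Document per page, with page-number metadata
print(f"Loaded {len(pages)} pages")
```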
For efficient processing, I tested three chunking methods:
- Page-Based Split – Dividing content per page.
- Recursive Text Splitter – Ideal for books, splitting text based on natural language boundaries.
- Semantic Splitter – Used to break content into coherent, meaning-based sections.
After testing, the recursive text splitter proved optimal for PDF-based content, achieving a good balance between speed and relevance. The semantic splitter, although conceptually ideal, took over 30 minutes to process even 100 pages of an 800-page document, making it infeasible for this setup.
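Below is a sketch of the recursive approach using LangChain's RecursiveCharacterTextSplitter; the chunk size and overlap are assumed values, not the project's tuned settings.

```python
# Sketch of recursive chunking; sizes are illustrative assumptions.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # max characters per chunk (assumed)
    chunk_overlap=150,  # overlap to preserve context across boundaries (assumed)
)
chunks = splitter.split_documents(pages)  # `pages` from the PDF-loading sketch above
print(f"Produced {len(chunks)} chunks")
```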
Two vector databases were evaluated:

- ChromaDB – An open-source vector database with limitations in monitoring, storage, and high-dimensionality handling.
- Pinecone – A cloud-based solution offering efficient indexing, monitoring, and scalability.
I selected Pinecone for its robustness in handling large datasets and its monitoring capabilities, which streamline development and enhance performance. Pinecone also supports high-dimensionality vectors, which are essential for accurate retrieval in RAG.
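As a sketch of how the chunks might be indexed in Pinecone through LangChain (the index name, embedding model, and API-key handling are assumptions for illustration only):

```python
# Sketch: embedding chunks and indexing them in Pinecone via LangChain.
# Index name and embedding model are hypothetical choices.
import os
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_pinecone import PineconeVectorStore

os.environ["PINECONE_API_KEY"] = "<your-api-key>"  # placeholder

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = PineconeVectorStore.from_documents(
    documents=chunks,           # `chunks` from the splitting sketch above
    embedding=embeddings,
    index_name="medbot-index",  # hypothetical index name
)
```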
The system incorporates a Retriever Chain using LangChain, which enables conversation flow and contextual memory management. This setup allows the chatbot to recall previous interactions, enhancing the overall user experience.
Using LangChain's retriever chain allows seamless integration of tools and makes it easier to maintain a modular approach. The contextual memory feature is crucial for maintaining coherence across interactions, as users may ask follow-up questions.
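A minimal sketch of such a chain, assuming the LLM and vector store from the earlier sketches (the retrieval depth and memory settings are illustrative):

```python
# Sketch: conversational retrieval chain with contextual memory.
# Retrieval depth (k) and memory configuration are assumptions.
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

chain = ConversationalRetrievalChain.from_llm(
    llm=llm,  # the Ollama model from the model-selection sketch
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),  # top-4 chunks (assumed)
    memory=memory,
)

result = chain.invoke({"question": "What are common symptoms of endometriosis?"})
print(result["answer"])
```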
Evaluation is essential to gauge the chatbot's accuracy and reliability. I used three primary methods:
- QA Ground Truth Comparison – Comparing model responses with established QA pairs.
- LLM-Based Scoring – Using LLMs to rate response relevance and coherence.
- Retriever Accuracy Check – Ensuring the retriever fetches relevant information.
The evaluation framework was developed based on guidelines from Hugging Face’s Cookbook on RAG Evaluation.
These evaluation methods ensure the chatbot's responses are accurate, relevant, and contextually appropriate. Ground truth QA checks establish a baseline, while LLM scoring and retriever accuracy enhance precision.
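As an illustrative sketch of the LLM-based scoring step, reusing the local model as a judge (the QA pair, prompt, and 1–5 scale are assumptions, not the project's actual rubric):

```python
# Sketch: LLM-as-judge scoring of a generated answer against ground truth.
# The QA pair and rating prompt are illustrative assumptions.
qa_pair = {
    "question": "What is endometriosis?",
    "ground_truth": "A condition where tissue similar to the uterine lining "
                    "grows outside the uterus, often causing pain.",
}

generated = chain.invoke({"question": qa_pair["question"]})["answer"]

judge_prompt = (
    "Rate from 1 to 5 how well the candidate answer matches the reference.\n"
    f"Question: {qa_pair['question']}\n"
    f"Reference: {qa_pair['ground_truth']}\n"
    f"Candidate: {generated}\n"
    "Reply with the number only."
)
score = llm.invoke(judge_prompt)  # reuses the local Ollama model as the judge
print(f"Relevance score: {score.strip()}")
```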
This project is currently in its initial version, focusing on foundational setup and data accuracy. Planned improvements include:
- Expanding Dataset – Incorporating web-scraped articles and QA datasets for broader coverage.
- Real-Time Feedback Integration – Allowing users to rate responses to improve future interactions.
- Enhanced Chunking and Retrieval – Exploring faster and more efficient chunking methods for larger datasets.
This RAG-based Gynecological Education Chatbot was developed with a focus on adaptability and efficiency, considering CPU limitations and resource constraints. The project aims to provide accessible, accurate, and relevant medical information through a local setup, making it suitable for educational purposes in environments with limited GPU access.