Improve an LLM's MongoDB query generation ability with advanced retrieval-augmented generation. This project demonstrates an approach for generating MongoDB queries from natural-language questions using Large Language Models. It combines natural language processing, vector databases, and advanced retrieval-augmented generation into an efficient and accurate query generation pipeline, and it showcases Weaviate, an open-source vector database, for efficient retrieval of similar questions and their corresponding MongoDB queries.
Overview
The project consists of five main components:
- config.json: Configuration file containing API keys, file paths, and Weaviate schema.
- main.py: Main script orchestrating data processing, vector database operations, and query generation.
- pre_process.py: Data processing module for cleaning and embedding questions.
- query_generation.py: Module for generating MongoDB queries using the Gemini Pro model.
- weavite_vector_db.py: Weaviate client for vector database operations.
It also contains a Jupyter notebook that gives a full walkthrough of the code and how to use it.
Prerequisites
- Python 3.x.
- Google Gemini API key (or access to another LLM).
- Weaviate (embedded mode used in this project).
- sentence-transformers.
Major Components
- Data Preprocessing: Utilizes pandas and sentence-transformers to clean and vectorize question-query pairs and schema information.
- Vector Database: Implements weaviate-client for storing and retrieving semantically similar data objects.
- Query Generation: Uses Google's generativeai (Gemini Pro) for generating MongoDB queries, with an option for Retrieval-Augmented Generation (RAG).
- Similarity Re-ranking: Employs sentence-transformers for re-ranking retrieved examples to improve the relevance of context provided to the query generator.
Installation
- Clone the repository:
git clone https://github.com/Chirayu-Tripathi/MongoDB-Querifier.git
cd MongoDB-Querifier
- Install dependencies:
pip install pandas sentence-transformers google-generativeai weaviate-client
- Update config.json with your API keys:
{
  "api_keys": {
    "gemini_api_key": "your-gemini-api-key"
  }
}
Usage
- Prepare your data:
- mongodb_array_object.csv: Schema information
- mongodb_array_object.txt: Question-query pairs
- Run the main script:
python main.py
- Set rag=True in main() to enable RAG, or rag=False for non-RAG query generation.
- Update the parameters passed to the generator: query_gen.generate_query(class_name, schemas[schema], question, db_client, prompt, rag). A hypothetical invocation is sketched at the end of this section.
- The script will output the generated MongoDB query.
- You can also follow the procedure described in the workflow notebook, which is easy to modify at run time.
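For reference, here is a hypothetical invocation of the generator, mirroring the call above. The objects class_name, schemas, db_client, prompt, and rag are created earlier in main.py; the comments describe assumed roles, not documented behavior:

mongo_query = query_gen.generate_query(
    class_name,        # Weaviate class holding the example question-query pairs
    schemas[schema],   # schema text for the target collection
    question,          # natural-language question to translate
    db_client,         # Weaviate client used for retrieval
    prompt,            # base prompt template
    rag,               # True enables retrieval-augmented generation
)
print(mongo_query)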
How It Works
- Data Preprocessing: The DataProcessor class reads and cleans schema and query data, then generates embeddings for each question using a pre-trained sentence transformer model (see the embedding sketch after this list).
- Vector Database: The WeaviateClient class sets up an in-memory Weaviate database. It creates a class (analogous to a table) with specified properties and adds data objects along with their vector representations (see the Weaviate sketch after this list).
- Query Generation: The QueryGeneration class uses Google's Gemini Pro model to generate MongoDB queries (see the generation sketch after this list). When RAG is enabled:
- It encodes the input question and retrieves similar questions from the database.
- It re-ranks these questions using a cross-encoder for better similarity matching.
- It constructs a prompt with the top two similar questions, their schemas, and queries.
- It feeds this prompt to Gemini Pro to generate the MongoDB query.
- Re-ranking: The re-ranking step is crucial. It ensures that the examples provided to the model are not just superficially similar (based on word overlap) but semantically similar, which guides the model toward more accurate queries (see the re-ranking sketch after this list).
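For illustration, the embedding step might look like the following minimal sketch. The model name "all-MiniLM-L6-v2" is an assumption; the actual model used by DataProcessor may differ.

from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence transformer (assumed model choice).
model = SentenceTransformer("all-MiniLM-L6-v2")

questions = [
    "Find all posts written by Chirayu",
    "Count the posts tagged Sci-Fi",
]
# One dense vector per question, used later for similarity search.
embeddings = model.encode(questions)
print(embeddings.shape)  # e.g. (2, 384) for this model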
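Continuing from the embedding sketch, storing and retrieving those vectors with an embedded Weaviate instance could look like this (v3 weaviate-client API; the class name "Question" and its properties are assumptions, not necessarily the schema in config.json):

import weaviate
from weaviate.embedded import EmbeddedOptions

# Start Weaviate in embedded mode (no external server needed).
client = weaviate.Client(embedded_options=EmbeddedOptions())

# Store one question-query pair together with its pre-computed vector.
client.data_object.create(
    data_object={
        "question": "Find all posts written by Chirayu",
        "query": 'db.posts.find({"author": "Chirayu"})',
    },
    class_name="Question",
    vector=embeddings[0].tolist(),
)

# Retrieve the stored questions most similar to a new question vector.
result = (
    client.query.get("Question", ["question", "query"])
    .with_near_vector({"vector": embeddings[1].tolist()})
    .with_limit(5)
    .do()
)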
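The re-ranking step can be sketched with a sentence-transformers cross-encoder; the model name below is an assumed choice, not necessarily the one the project uses:

from sentence_transformers import CrossEncoder

# A cross-encoder scores a (question, candidate) pair jointly, which is
# more accurate than comparing independently computed embeddings.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

question = 'Find all "Sci-Fi" posts longer than 50 characters'
candidates = [
    "Find posts with more than 100 characters in the body",
    "List all posts tagged Sci-Fi",
    "Find posts written by Chirayu",
]
scores = reranker.predict([(question, c) for c in candidates])

# Keep the two highest-scoring examples for the prompt.
ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
top_two = [c for c, _ in ranked[:2]]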
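Finally, the generation step with google-generativeai might look like the sketch below. The prompt wording and the placeholder schema and examples are illustrative, not the project's exact template:

import google.generativeai as genai

genai.configure(api_key="your-gemini-api-key")
model = genai.GenerativeModel("gemini-pro")

# Illustrative placeholders; in the pipeline these come from retrieval.
schema_text = "posts(author: string, tags: array, body: string)"
examples_text = 'Q: Find posts by Chirayu\nA: db.posts.find({"author": "Chirayu"})'
question = 'Find all "Sci-Fi" posts longer than 50 characters'

prompt = (
    "Translate the question into a MongoDB query.\n\n"
    f"Similar solved examples:\n{examples_text}\n\n"
    f"Schema:\n{schema_text}\n\n"
    f"Question: {question}\nMongoDB query:"
)
response = model.generate_content(prompt)
print(response.text)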
Why RAG?
Retrieval-Augmented Generation significantly improves query generation:
- It provides context-specific examples, unlike static few-shot prompts.
- It handles complex or uncommon queries better by finding relevant past examples.
- It adapts to the nuances of each question, leading to more accurate and efficient queries.
Consider the following question: Find all the "Sci-Fi" related posts written by Chirayu with a post length longer than 50 characters.

An LLM with RAG enabled correctly generates:

db.posts.find({
  $and: [
    { tags: "Sci-Fi" },
    { author: "Chirayu" },
    { $expr: { $gt: [{ $strLenCP: "$body" }, 50] } }
  ]
})

It understands that $strLenCP is needed to measure string length; this comes directly from the relevant question-query-schema pairs fetched from the vector store.

Without RAG, the LLM incorrectly generates:

{ $and: [
  { tags: { $in: ["Sci-Fi"] } },
  { author: "Chirayu" },
  { body: { $gt: 50 } }
] }

treating body as a number instead of a string.
Future Scope
Evaluate how this method performs with LLMs fine-tuned on MongoDB question-answer pairs, for example by testing the pipeline on the fine-tuned Phi-2 model from nl2query.
License
This project is open-source and available under the MIT License.