Improve an LLM's MongoDB query generation ability with advanced retrieval-augmented generation. This project demonstrates an approach for generating MongoDB queries from natural-language questions using Large Language Models. It combines natural language processing, vector databases, and advanced retrieval-augmented generation into an efficient and accurate query generation pipeline, and it showcases Weaviate, an open-source vector database, for efficient retrieval of similar questions and their corresponding MongoDB queries.
Overview
The project consists of five main components:
- config.json: Configuration file containing API keys, file paths, and Weaviate schema.
- main.py: Main script orchestrating data processing, vector database operations, and query generation.
- pre_process.py: Data processing module for cleaning and embedding questions.
- query_generation.py: Module for generating MongoDB queries using the Gemini Pro model.
- weavite_vector_db.py: Weaviate client for vector database operations.
It also contains a Jupyter notebook that gives a full walkthrough of the code and how to use it.
Prerequisites
- Python 3.x.
- Google Gemini API key (or access to another LLM).
- Weaviate (embedded mode used in this project).
- sentence-transformers.
Major Components
- Data Preprocessing: Utilizes pandas and sentence-transformers to clean and vectorize question-query pairs and schema information.
- Vector Database: Implements weaviate-client for storing and retrieving semantically similar data objects.
- Query Generation: Uses Google's generativeai (Gemini Pro) for generating MongoDB queries, with an option for Retrieval-Augmented Generation (RAG).
- Similarity Re-ranking: Employs sentence-transformers for re-ranking retrieved examples to improve the relevance of context provided to the query generator.
Installation
- Clone the repository:
git clone https://github.com/Chirayu-Tripathi/MongoDB-Querifier.git
cd MongoDB-Querifier
- Install dependencies:
pip install pandas sentence-transformers google-generativeai weaviate-client
- Update config.json with your API keys:
{
  "api_keys": {
    "gemini_api_key": "your-gemini-api-key"
  }
}
Usage
- Prepare your data:
- mongodb_array_object.csv: Schema information
- mongodb_array_object.txt: Question-query pairs
- Run the main script:
python main.py
- Set rag=True in main() to enable RAG, or rag=False for non-RAG query generation.
- Update the parameters passed to the generator: query_gen.generate_query(class_name, schemas[schema], question, db_client, prompt, rag). A hypothetical invocation is sketched at the end of this section.
- The script will output the generated MongoDB query.
- You can also follow the procedure described in the workflow notebook, which is easy to modify at run time.
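For reference, here is a hypothetical invocation of the generator, mirroring the call above. The objects class_name, schemas, db_client, prompt, and rag are created earlier in main.py; the comments describe assumed roles, not documented behavior:

mongo_query = query_gen.generate_query(
    class_name,        # Weaviate class holding the example question-query pairs
    schemas[schema],   # schema text for the target collection
    question,          # natural-language question to translate
    db_client,         # Weaviate client used for retrieval
    prompt,            # base prompt template
    rag,               # True enables retrieval-augmented generation
)
print(mongo_query)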
How It Works
- Data Preprocessing: The DataProcessor class reads and cleans schema and query data, then generates embeddings for each question using a pre-trained sentence transformer model (see the embedding sketch after this list).
- Vector Database: The WeaviateClient class sets up an in-memory Weaviate database. It creates a class (analogous to a table) with specified properties and adds data objects along with their vector representations (see the Weaviate sketch after this list).
- Query Generation: The QueryGeneration class uses Google's Gemini Pro model to generate MongoDB queries (see the generation sketch after this list). When RAG is enabled:
- It encodes the input question and retrieves similar questions from the database.
- It re-ranks these questions using a cross-encoder for better similarity matching.
- It constructs a prompt with the top two similar questions, their schemas, and queries.
- It feeds this prompt to Gemini Pro to generate the MongoDB query.
- Re-ranking: The re-ranking step is crucial. It ensures that the examples provided to the model are not just superficially similar (based on word overlap) but semantically similar, which guides the model toward more accurate queries (see the re-ranking sketch after this list).
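For illustration, the embedding step might look like the following minimal sketch. The model name "all-MiniLM-L6-v2" is an assumption; the actual model used by DataProcessor may differ.

from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence transformer (assumed model choice).
model = SentenceTransformer("all-MiniLM-L6-v2")

questions = [
    "Find all posts written by Chirayu",
    "Count the posts tagged Sci-Fi",
]
# One dense vector per question, used later for similarity search.
embeddings = model.encode(questions)
print(embeddings.shape)  # e.g. (2, 384) for this model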
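Continuing from the embedding sketch, storing and retrieving those vectors with an embedded Weaviate instance could look like this (v3 weaviate-client API; the class name "Question" and its properties are assumptions, not necessarily the schema in config.json):

import weaviate
from weaviate.embedded import EmbeddedOptions

# Start Weaviate in embedded mode (no external server needed).
client = weaviate.Client(embedded_options=EmbeddedOptions())

# Store one question-query pair together with its pre-computed vector.
client.data_object.create(
    data_object={
        "question": "Find all posts written by Chirayu",
        "query": 'db.posts.find({"author": "Chirayu"})',
    },
    class_name="Question",
    vector=embeddings[0].tolist(),
)

# Retrieve the stored questions most similar to a new question vector.
result = (
    client.query.get("Question", ["question", "query"])
    .with_near_vector({"vector": embeddings[1].tolist()})
    .with_limit(5)
    .do()
)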
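The re-ranking step can be sketched with a sentence-transformers cross-encoder; the model name below is an assumed choice, not necessarily the one the project uses:

from sentence_transformers import CrossEncoder

# A cross-encoder scores a (question, candidate) pair jointly, which is
# more accurate than comparing independently computed embeddings.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

question = 'Find all "Sci-Fi" posts longer than 50 characters'
candidates = [
    "Find posts with more than 100 characters in the body",
    "List all posts tagged Sci-Fi",
    "Find posts written by Chirayu",
]
scores = reranker.predict([(question, c) for c in candidates])

# Keep the two highest-scoring examples for the prompt.
ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
top_two = [c for c, _ in ranked[:2]]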
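Finally, the generation step with google-generativeai might look like the sketch below. The prompt wording and the placeholder schema and examples are illustrative, not the project's exact template:

import google.generativeai as genai

genai.configure(api_key="your-gemini-api-key")
model = genai.GenerativeModel("gemini-pro")

# Illustrative placeholders; in the pipeline these come from retrieval.
schema_text = "posts(author: string, tags: array, body: string)"
examples_text = 'Q: Find posts by Chirayu\nA: db.posts.find({"author": "Chirayu"})'
question = 'Find all "Sci-Fi" posts longer than 50 characters'

prompt = (
    "Translate the question into a MongoDB query.\n\n"
    f"Similar solved examples:\n{examples_text}\n\n"
    f"Schema:\n{schema_text}\n\n"
    f"Question: {question}\nMongoDB query:"
)
response = model.generate_content(prompt)
print(response.text)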
Why RAG?
Retrieval-Augmented Generation significantly improves query generation:
- It provides context-specific examples, unlike static few-shot prompts.
- It handles complex or uncommon queries better by finding relevant past examples.
- It adapts to the nuances of each question, leading to more accurate and efficient queries.
Consider the following question: Find all the "Sci-Fi" related posts written by Chirayu with a post length longer than 50 characters.

An LLM with RAG enabled correctly generates:

db.posts.find({
  $and: [
    { tags: "Sci-Fi" },
    { author: "Chirayu" },
    { $expr: { $gt: [{ $strLenCP: "$body" }, 50] } }
  ]
})

It understands that $strLenCP is needed to measure string length; this comes directly from the relevant question-query-schema pairs fetched from the vector store.

Without RAG, the LLM incorrectly generates:

{ $and: [
  { tags: { $in: ["Sci-Fi"] } },
  { author: "Chirayu" },
  { body: { $gt: 50 } }
] }

treating body as a number instead of a string.
Future Scope
Evaluate how this method performs with LLMs fine-tuned on MongoDB question-answer pairs, for example by testing the pipeline on the fine-tuned Phi-2 model from nl2query.
License
This project is open-source and available under the MIT License.