End to End your Retrieval Augmented Generation (RAG) pipelines integrating LLM Models (SOTA)
The TriModal Retrieval-Augmented Generation (T-RAG) Project is an advanced AI system that combines the power of text, image, and audio data for multi-modal retrieval and generation tasks. This project leverages state-of-the-art deep learning models, and cutting-edge supportive frameworks such as Langchain, DVC, and ZenML. Consequently, a shared embedding space can be built more efficiently where data from all three modalities can be processed, retrieved, and used in a generative pipeline.
The primary goal of this system is to enhance traditional information retrieval by integrating cross-modal knowledge, a fusion mechanism enabling the model to retrieve and generate accurate, context-aware responses that span multiple data types. Whether the task involves answering questions based on text, recognizing patterns in images, or interpreting sounds, the TriModal RAG framework is designed to handle and fuse these distinct types of data into a unified response.
From release:
pip install trimodal-rag
Alternatively, from source:
git clone https://github.com/tph-kds/TriModalRAG_System.git
Or using docker container with our image, you can run:
docker run -p 8000:8000 trimrag/trimrag
This is a small example program you can run to see trim_rag
in action!
# You can setup inputs following yourselves:
# Let provide a query for chatbot response, as a below example.
query = "Does Typhoon Yagi have damages in Vietnam country and what were the consequences?"
# Create a folder which contains your data to run this model.
# example: Naming for data folder is ``data``
#
text = ROOT_PROJECT_DIR / ("data/file.pdf")
image = ROOT_PROJECT_DIR / ("data/image.jpg")
audio = ROOT_PROJECT_DIR / ("data/audio.mp3")
# | Where ROOT_PROJECT_DIR: main folder of this project on your local computer after downloading from my github. |
# Using make for running the quick start on "terminal"
make qt
or
[
# adjust in a folder name: "tests/integration/quick_start.py"
# and run:
python tests/integration/quick_start.py --query==query --text=text --image==image --audio=audio
]
# Ultimately, You would receive a result from chatbot's response
#
# Good Luck! And Thank you for your interesting.
Note
You could also check step by step of this project's workflow such as Data Ingestion, Data Processing, and more... in the tests/integration
folder .
(It is recommended that the dependencies be installed under the Conda environment.)
pip install -r requirements.txt
or run init_setup.sh
file in the project's folder:
<!-- Run this command to give the script execution rights: -->
chmod +x init_setup.sh
<!-- Right now, you can execute the script by typing: -->
bash init_setup.sh
To be detailed requirements on Pypi Website
The required supportive environment uses a hardware accelerator GPUs such as T4 of Colab, GPU A100, etc.
Name | #Text(PDF) | #Image | #Audio |
---|---|---|---|
Quantity | 100 | 100 | 100 |
Topic | "Machine Learning Weather Prediction" | "Weather" | "Weather" |
Type | API | API | API |
Supportive Website | Arxiv | Unsplash | FreeSound |
Feature | Text in research papers | Natural Object - (Non-human) | Natural Sound - (As Rain, Lighting) |
-
The
BERT
(Bidirectional Encoder Representations from Transformers) model is used in the TriModal Retrieval-Augmented Generation (RAG) Project to generate high-quality text embeddings. BERT is a transformer-based model pre-trained on vast amounts of text data, which allows it to capture contextual information from both directions (left-to-right and right-to-left) of a sentence. This makes BERT highly effective at understanding the semantic meaning of text, even in complex multi-sentence inputs. Available on this link -
The
CLIP
(Contrastive Language–Image Pretraining) model, specifically theopenai/clip-vit-base-patch32
variant, is utilized in the TriModal Retrieval-Augmented Generation (RAG) Project. CLIP is a powerful model trained on both images and their textual descriptions, allowing it to learn shared representations between visual and textual modalities. This capability is crucial for multi-modal tasks where text and image data need to be compared and fused effectively. Available on this link -
The
Wav2Vec 2.0
Model -(facebook/wav2vec2-base-960h)
, is a state-of-the-art speech representation learning framework developed by Facebook AI Research. Applying supportive Embedding to advanced models processes raw audio signals to produce rich, context-aware embeddings, having been pre-trained on a vast corpus of speech data. Its ability to seamlessly integrate with text and image modalities enhances the project's overall functionality and versatility in handling diverse data types. Available on this link
-
Use Langchain in multimodal chains connected to each other. Read to be more
-
Apply Rerank Methods by LangChain Cohere library. Github Here
-
Logo is generated by @tranphihung
- Overall and comprehensive assessment of project performance.
- Upgrade usually and integrate plenty of new positive models and technologies.
- Optimal Response capabilities result higher than currently available.
- Experiment with increasing and expanding larger dataset inputs.
Stay tuned for future releases as we are continuously working on improving the model, expanding the dataset, and adding new features.
Thank you for your interest in my project. We hope you find it useful. If you have any questions, please feel free and don't hesitate to contact me at tranphihung8383@gmail.com
-
Chen, Wenhu, et al. "Murag: Multimodal retrieval-augmented generator for open question answering over images and text." arXiv preprint arXiv:2210.02928 (2022). Available on this link.
-
VIDIVELLI, S.; RAMACHANDRAN, Manikandan; DHARUNBALAJI, A. Efficiency-Driven Custom Chatbot Development: Unleashing LangChain, RAG, and Performance-Optimized LLM Fusion. Computers, Materials & Continua, 2024, 80.2. Available on this link.
-
DE STEFANO, Gianluca; PELLEGRINO, Giancarlo; SCHÖNHERR, Lea. Rag and Roll: An End-to-End Evaluation of Indirect Prompt Manipulations in LLM-based Application Frameworks. arXiv preprint arXiv:2408.05025, 2024. Available on this link.