Billionaires RAG Query
is a Retrieval-Augmented Generation (RAG) framework designed to ingest and analyze the world's billionaires list, including details such as names, net worth, age, nationality, and primary sources of wealth. This project demonstrates how to use LLMs to interpret structured tabular data within textual documents, providing precise answers to queries about the wealthiest individuals.
- Ingest Billionaires Data: Extract data from the world's billionaires list, including key attributes like name, net worth, age, nationality, and primary sources of wealth.
- Enhanced Query Resolution: Use structured data as context for LLMs to answer complex questions about billionaires, such as "Who is the richest person in 2023?" or "What is the net worth of the sixth richest billionaire?".
- Multi-Format Support: Convert tabular data into multiple formats like JSON, CSV, XML, and Markdown for flexible LLM processing.
- Accurate Information Retrieval: Validate LLM responses against structured data to minimize errors and avoid misinformation.
- Integration with RAG Systems: Seamlessly integrate this tabular data ingestion approach with RAG frameworks to provide richer and more accurate insights.
Make sure asdf
is installed by following the instructions at asdf-vm.com.
-
Add the Python plugin:
asdf plugin-add python
-
Install the required Python version:
asdf install python 3.13.0
-
Set the installed version as the local version for the project:
asdf local python 3.13.0
-
Verify the Python version:
python --version
- Install
poetry
usingasdf
asdf plugin-add poetry https://github.com/asdf-community/asdf-poetry.git
asdf install
OR
Install Poetry by following the instructions at python-poetry.org.
-
Clone the repository:
git clone https://github.com/yourusername/billionaires-rag-query.git cd billionaires-rag-query
-
Install the dependencies:
poetry install
This will create a virtual environment and install all required packages.
To activate the virtual environment managed by Poetry, run:
poetry shell
Once the Poetry environment is active, run the program using:
poetry run python main.py
Set up libraries for table extraction and tabular display:
import pandas as pd
from beautifultable import BeautifulTable
import camelot
Use Camelot to extract the billionaires list from a PDF file:
df = get_tables("./World_Billionaires_Wikipedia.pdf", pages=[3])
Convert the extracted tables into various formats like JSON, CSV, Markdown, and more:
eval_df = prepare_data_formats(df)
Set up a connection to an OpenAI model and run queries using the tabular data as context:
query = "Who is the richest person in 2023?"
result_df = run_question_test(query, eval_df)
Display the LLM's response for each data format:
table = BeautifulTableformat(query, result_df, 150)
print(table)
- Query: "What is Elon Musk's net worth?"
- Output: A table displaying responses for each data format, showing the model's ability to interpret and respond accurately based on the billionaires list.
Contributions are welcome! Please fork the repository and create a pull request with your improvements or bug fixes.
This project is licensed under the MIT License.