An LLM-based application that communicates with a DB to extract information.
The application takes a user input (a question or inquiry about the DB) in natural language > [converts it to a SQL query > queries the DB > returns the result to the LLM] (iteratively, until the LLM is satisfied) > returns the final output to the user.
To view this Streamlit app on a browser, run it with the following command:
streamlit run PATH\main.py [ARGUMENTS]
- Note: This code connects to the DB using `postgres`. To connect to a different DB, replace the DB type and adapter to match your SQL engine of choice (in the `pg_uri` variable in the `helper_funcs.py` module). The rest of the code should work just fine 👍🏼.
Additional notes:
Make sure to create a `.env` file for storing your OpenAI API key and DB-connection variables, using the following format:

```
OPENAI_API_KEY = "your_openai_api_key"
DB_USER = "your_db_user"
DB_PASSWORD = "your_db_password"
DB_HOST = "your_db_host"
DB_PORT = your_db_port  # an integer, e.g. 5432
DB_NAME = "your_db_name"
```
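As a minimal sketch (assuming `python-dotenv` and the variable names above), these values can be loaded and used to build the `pg_uri` mentioned in the note; the actual construction in `helper_funcs.py` may differ:

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # read the .env file into environment variables

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# PostgreSQL connection URI used by the app (the `pg_uri` variable in helper_funcs.py);
# swap the scheme/driver here to match a different SQL engine
pg_uri = (
    f"postgresql+psycopg2://{os.getenv('DB_USER')}:{os.getenv('DB_PASSWORD')}"
    f"@{os.getenv('DB_HOST')}:{os.getenv('DB_PORT')}/{os.getenv('DB_NAME')}"
)
```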
LangChain 🦜🔗 (a framework for developing applications on top of LLMs):
- Actively developing SQL agent ability
- Seamless connection with OpenAI
- Open-source
- Under constant development
- Well-documented
- Extensive usage
- Extensive community support
- Reliable
- Text-to-SQL query VS Text-to-SQL query and execution VS SQL Agent
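For orientation, a hedged sketch of those three options (import paths vary across LangChain versions, `SQLDatabaseChain` lives in `langchain_experimental` in later releases, and the connection URI is a placeholder):

```python
from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.chains import create_sql_query_chain
from langchain.chat_models import ChatOpenAI
from langchain.sql_database import SQLDatabase
from langchain_experimental.sql import SQLDatabaseChain

pg_uri = "postgresql+psycopg2://user:password@host:5432/dbname"  # placeholder
llm = ChatOpenAI(model_name="gpt-3.5-turbo-1106", temperature=0)
db = SQLDatabase.from_uri(pg_uri)

# 1. Text-to-SQL query: generate a SQL statement, but do not run it
query_chain = create_sql_query_chain(llm, db)

# 2. Text-to-SQL query and execution: generate the statement and run it in one chain
db_chain = SQLDatabaseChain.from_llm(llm, db)

# 3. SQL Agent: iterates over schema inspection, query writing, execution, and error recovery
agent = create_sql_agent(llm=llm, toolkit=SQLDatabaseToolkit(db=db, llm=llm), verbose=True)
```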
Streamlit (the framework used for the app's UI):
- Open-source
- Great documentation
- Community support
- Customization
- Pure Python
- Seamless integrations
- Deployment
- LLM and ML support
- A good alternative that shares most of Streamlit's pros.
- A suitable substitute.
- I personally like Streamlit, and since it was recently acquired by Snowflake, I believe in its potential to grow much faster. Still, both are great options.
OpenAI GPT (pros & cons):
- Used: model='gpt-3.5-turbo-1106'
- Seamless connection with LangChain
- Among the best-performing models on the market
- Regularly updated, offering continuous improvements
- Hallucinations
Alternatives: Open-source LLMs 🦙 (pros & cons):
- A plethora of fine-tuned open-source models for SQL generation
- Much more cost-effective
- Docker
- Hallucinations
- Incorporate a combined approach of using an open-source LLM that was fine-tuned for SQL generation alongside LangChain's SQL query execution.
- This is a valid choice, yet it means waiving the main advantages of the SQL Agent, namely recovering from errors and answering questions based on the database's schema as well as on its content.
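Roughly, such a combined approach could look like the sketch below; the model name is a hypothetical placeholder for any SQL-tuned open-source model, and the prompt format and post-processing would need tuning:

```python
from langchain.sql_database import SQLDatabase
from transformers import pipeline

pg_uri = "postgresql+psycopg2://user:password@host:5432/dbname"  # placeholder
db = SQLDatabase.from_uri(pg_uri)

# Hypothetical SQL-tuned open-source model; substitute a real checkpoint of your choice
sql_generator = pipeline("text-generation", model="some-org/sql-tuned-model")

question = "How many titles were released in 1994?"
prompt = f"Schema:\n{db.get_table_info()}\n\nQuestion: {question}\nSQL:"

# The pipeline echoes the prompt, so the generated SQL must be extracted from the output
output = sql_generator(prompt, max_new_tokens=128)[0]["generated_text"]
generated_sql = output[len(prompt):].strip()

# Execute the generated SQL with LangChain's SQL database wrapper
result = db.run(generated_sql)
print(result)
```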
PostgreSQL (the chosen SQL engine):
- Open-source and free
- Good support and documentation
- Reliable
- Supports a wide range of extensions
- Seamlessly integrates with Python
- Download the non-commercial IMDB dataset from IMDB datasets
- See IMDB Non-commercial datasets for file description
- Unzip files (`tsv.gz` files)
- Preprocess files before uploading to the Database
  - Handling null values correctly
  - Handling mixed data types in the same column
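As an example, a minimal pandas sketch of this preprocessing for `title.basics.tsv` (the IMDB files mark missing values as `\N`); the project's actual cleaning steps may differ:

```python
import csv

import pandas as pd

# Read everything as strings first to avoid mixed-dtype surprises, treating '\N' as null
df = pd.read_csv(
    "title.basics.tsv",
    sep="\t",
    dtype=str,
    na_values="\\N",
    keep_default_na=False,
    quoting=csv.QUOTE_NONE,  # the files contain stray quote characters
)

# Cast numeric columns explicitly, coercing anything inconvertible to NaN (-> NULL in the DB)
for col in ("startYear", "endYear", "runtimeMinutes"):
    df[col] = pd.to_numeric(df[col], errors="coerce").astype("Int64")  # nullable integer dtype

df.to_csv("title.basics.clean.tsv", sep="\t", index=False, na_rep="")
```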
- Choose an appropriate SQL Engine
- Set up a new database
- Create tables and columns that correspond to the TSV files
- Column data types: when deciding on dtypes, take performance and storage considerations into account:
  - `str` - check `df['col'].str.len().max()` to decide on `n` for `VARCHAR(n)`
  - `int` - `INTEGER` vs `SMALLINT`
    - `SMALLINT` is relevant for: `title.basics.tsv` (`startYear` and `endYear` columns) and `name.basics.tsv` (`birthYear` and `deathYear` columns)
  - `float` - `DOUBLE PRECISION`
  - `bool` - `BOOLEAN`
  - `List[str]` - `VARCHAR(n)[]`, unless it behaves like `VARCHAR(n)`
    - Relevant for: `title.akas.tsv` (`types` and `attributes` columns) and `title.basics.tsv` (`genres` column)
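To illustrate the choices above, here is a sketch of a corresponding table for `title.basics.tsv` created via `psycopg2` (the `VARCHAR` lengths and table name are assumptions, not the project's exact schema):

```python
import psycopg2

conn = psycopg2.connect(
    dbname="your_db_name",
    user="your_db_user",
    password="your_db_password",
    host="your_db_host",
    port=5432,
)

create_title_basics = """
CREATE TABLE IF NOT EXISTS title_basics (
    tconst          VARCHAR(12) PRIMARY KEY,
    titleType       VARCHAR(20),
    primaryTitle    VARCHAR(500),
    originalTitle   VARCHAR(500),
    isAdult         BOOLEAN,
    startYear       SMALLINT,
    endYear         SMALLINT,
    runtimeMinutes  INTEGER,
    genres          VARCHAR(50)   -- or VARCHAR(20)[] if genres are stored as an array
);
"""

with conn, conn.cursor() as cur:
    cur.execute(create_title_basics)
conn.close()
```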
- Import the preprocessed TSV files into the newly created tables
  - Exceptionally large files should be broken into smaller files and uploaded sequentially in batches (see the sketch below).
  - Write errors to file and inspect `error_file.txt` to further handle any problematic records.
  - Insert inconvertible values as null (mark the tick-box in the 'Data Import' dialogue box).
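As an alternative to the pgAdmin 'Data Import' dialogue, the batched upload can be sketched in Python; the chunk size, table name, and error-handling policy are illustrative:

```python
import io

import pandas as pd
import psycopg2

conn = psycopg2.connect(
    dbname="your_db_name", user="your_db_user", password="your_db_password",
    host="your_db_host", port=5432,
)

# Stream the preprocessed TSV in chunks and COPY each chunk into the table
for i, chunk in enumerate(pd.read_csv("title.basics.clean.tsv", sep="\t",
                                      dtype=str, chunksize=500_000)):
    buf = io.StringIO()
    chunk.to_csv(buf, sep="\t", header=False, index=False, na_rep="")
    buf.seek(0)
    try:
        with conn, conn.cursor() as cur:
            cur.copy_expert(
                "COPY title_basics FROM STDIN WITH (FORMAT csv, DELIMITER E'\\t', NULL '')",
                buf,
            )
    except psycopg2.Error as exc:
        # Log failing batches for later inspection, mirroring error_file.txt
        with open("error_file.txt", "a") as err:
            err.write(f"batch {i}: {exc}\n")

conn.close()
```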
- Problem: LangChain-OpenAI incompatibility issues due to OpenAI's `ChatCompletion` endpoint change not yet being integrated into LangChain.
  - Solution: Downgrade to `openai==0.28` (versions 1.0.0+ will not work until LangChain is updated accordingly).
- Issue: LangChain's `return_intermediate_steps` param isn't yet applicable to LangChain SQL Agents.
  - Solution: Consider using a Text-to-SQL query (and execution) chain instead to access intermediate steps.
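A sketch of that workaround, assuming the `SQLDatabaseChain` interface (found in `langchain_experimental` in later releases) and a placeholder connection URI:

```python
from langchain.chat_models import ChatOpenAI
from langchain.sql_database import SQLDatabase
from langchain_experimental.sql import SQLDatabaseChain

pg_uri = "postgresql+psycopg2://user:password@host:5432/dbname"  # placeholder

llm = ChatOpenAI(model_name="gpt-3.5-turbo-1106", temperature=0)
db = SQLDatabase.from_uri(pg_uri)

# Text-to-SQL query and execution, exposing the generated SQL and raw DB results
db_chain = SQLDatabaseChain.from_llm(llm, db, return_intermediate_steps=True)

output = db_chain("How many movies were released in 1994?")
for step in output["intermediate_steps"]:
    print(step)
```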
- Considerations: LangChain's SQL Agent provides a more flexible and advanced way of interacting with SQL Databases (see Part I above - Frameworks)
- OpenAI `temperature` param - set to a value of `0` to block model creativity when translating free-text client questions to SQL queries.
User prompt > LangChain employs a SQL Agent to handle the back-and-forth deliberations between the LLM and the DB (using the SQL tool by proxy), querying the DB and returning responses > until the LLM is satisfied with the result > return the response to the client.
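Condensed into a sketch of how `main.py` might wire this up (widget labels and structure are assumptions, not the exact implementation):

```python
import streamlit as st
from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.chat_models import ChatOpenAI
from langchain.sql_database import SQLDatabase

pg_uri = "postgresql+psycopg2://user:password@host:5432/dbname"  # see helper_funcs.py

llm = ChatOpenAI(model_name="gpt-3.5-turbo-1106", temperature=0)  # temperature=0: no creativity
db = SQLDatabase.from_uri(pg_uri)
agent = create_sql_agent(llm=llm, toolkit=SQLDatabaseToolkit(db=db, llm=llm), verbose=True)

st.title("Chat with the IMDB database")
question = st.text_input("Ask a question about the data")
if question:
    # The agent iterates: inspect schema -> write SQL -> execute -> recover from errors -> answer
    answer = agent.run(question)
    st.write(answer)
```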
- Set primary and foreign keys (Database)
- Improve the prompt to the LLM with few-shot examples of natural language requests and corresponding SQL queries (LangChain)
- Improve LLM retrieval with vectorDB storage and similarity comparison to other relevant SQL queries (see the sketch below).
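One possible direction for those two items, combining few-shot examples with vector-store similarity selection (the example pairs, table names, and prompt wording are illustrative; FAISS requires `faiss-cpu`):

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import FewShotPromptTemplate, PromptTemplate
from langchain.prompts.example_selector import SemanticSimilarityExampleSelector
from langchain.vectorstores import FAISS

# Curated natural-language / SQL pairs (illustrative)
examples = [
    {"question": "How many titles were released in 1994?",
     "query": "SELECT COUNT(*) FROM title_basics WHERE startYear = 1994;"},
    {"question": "List the 5 longest movies.",
     "query": "SELECT primaryTitle FROM title_basics ORDER BY runtimeMinutes DESC LIMIT 5;"},
]

# Store the examples in a vector DB and select the ones most similar to the incoming question
example_selector = SemanticSimilarityExampleSelector.from_examples(
    examples, OpenAIEmbeddings(), FAISS, k=2
)

example_prompt = PromptTemplate(
    input_variables=["question", "query"],
    template="Question: {question}\nSQL: {query}",
)

few_shot_prompt = FewShotPromptTemplate(
    example_selector=example_selector,
    example_prompt=example_prompt,
    prefix="You translate questions about the IMDB database into SQL.",
    suffix="Question: {question}\nSQL:",
    input_variables=["question"],
)

print(few_shot_prompt.format(question="Which movies were released in 2000?"))
```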
- Export the DB to a compressed file
  - Bash command - `pg_dump -U [username] -h [hostname] -d [databasename] | gzip > [full_path_of_output_file.sql.gz]`
- Docker
- Set the params in `secret.py` as environment variables
- Integrate a chatbot to interact with the user, accept the secret params, and set them in the environment to establish a live connection to the database (see the sketch below).
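A small sketch of that last item: prompting the user for the connection secrets in the Streamlit UI and exporting them as environment variables (widget names are assumptions):

```python
import os

import streamlit as st

st.sidebar.header("Database connection")
db_user = st.sidebar.text_input("DB user")
db_password = st.sidebar.text_input("DB password", type="password")
db_host = st.sidebar.text_input("DB host")
db_port = st.sidebar.text_input("DB port", value="5432")
db_name = st.sidebar.text_input("DB name")

if st.sidebar.button("Connect"):
    # Export the secrets to the running environment so the connection URI can be built from them
    os.environ.update({
        "DB_USER": db_user,
        "DB_PASSWORD": db_password,
        "DB_HOST": db_host,
        "DB_PORT": db_port,
        "DB_NAME": db_name,
    })
    st.sidebar.success("Connection parameters set")
```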