This project aims to build a bot that uses LLMs such as Llama 2 to convert natural language queries into SQL queries. We aim to achieve state-of-the-art scores on recent text-to-SQL benchmarks such as Spider and BIRD. The main technique we will explore is in-context learning.
We assume we are working with Llama 2 13B (the 13-billion-parameter model). To achieve better performance, we have to provide the LLM with all the information it needs to process a particular query (schema linking information). The prompt needs to include the schema information. It also needs to specify what values can occur in each column and what they mean, so that the model can understand the natural language query and map it to the corresponding value in the column of the database table. For example, "Type 1 diabetes mellitus" might be abbreviated to T1D in the database. We also need to give the LLM access to external knowledge such as:
- Numerical reasoning
- Special domain knowledge. Example query: "Find the patients who have an abnormal level of blood pressure." Explanation: the LLM should know what level of BP is considered normal, and it should relate that level to the unit used in the corresponding column of the database table.
- Synonym knowledge
- Value illustration, as described above (for example, T1D for Type 1 diabetes mellitus)

We will explore the following in-context learning strategies:
- Zero-shot inference
- One-shot inference (predetermined, static examples)
- Few-shot inference (predetermined, static examples)
- One-shot inference (examples selected dynamically based on similarity, maximal marginal relevance (MMR), n-gram overlap, etc.)
- Few-shot inference (examples selected dynamically based on similarity, maximal marginal relevance (MMR), n-gram overlap, etc.)
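
The schema, value, and external-knowledge requirements described above could be assembled into a prompt roughly as follows. This is a minimal sketch, not the project's actual prompt format: the names `Column`, `Table`, `render_schema`, and `build_prompt`, the `patient` table, the section headers, and the blood pressure rule are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    sql_type: str
    description: str = ""
    # Maps a stored value to its meaning, e.g. "T1D" -> "Type 1 diabetes mellitus".
    value_meanings: dict[str, str] = field(default_factory=dict)


@dataclass
class Table:
    name: str
    columns: list[Column]


def render_schema(tables: list[Table]) -> str:
    """Render tables as CREATE TABLE text, with column notes as SQL comments."""
    blocks = []
    for table in tables:
        lines = []
        for i, col in enumerate(table.columns):
            # Put the comma before the comment so it is not swallowed by "--".
            sep = "," if i < len(table.columns) - 1 else ""
            line = f"  {col.name} {col.sql_type}{sep}"
            notes = []
            if col.description:
                notes.append(col.description)
            if col.value_meanings:
                pairs = ", ".join(f"'{v}' = {m}" for v, m in col.value_meanings.items())
                notes.append(f"values: {pairs}")
            if notes:
                line += "  -- " + "; ".join(notes)
            lines.append(line)
        blocks.append(f"CREATE TABLE {table.name} (\n" + "\n".join(lines) + "\n);")
    return "\n\n".join(blocks)


def build_prompt(question: str, tables: list[Table], knowledge: list[str] | None = None) -> str:
    """Combine schema, optional external knowledge, and the question into one prompt."""
    parts = ["### Database schema", render_schema(tables)]
    if knowledge:
        parts.append("### External knowledge")
        parts.extend(f"- {item}" for item in knowledge)
    parts += ["### Question", question, "### SQL"]
    return "\n".join(parts)


# Hypothetical schema and knowledge entry, for illustration only.
patient = Table("patient", [
    Column("id", "INTEGER", "primary key"),
    Column("diagnosis", "TEXT", "diagnosis code",
           {"T1D": "Type 1 diabetes mellitus"}),
    Column("systolic_bp", "INTEGER", "systolic blood pressure in mmHg"),
])
knowledge = ["Assumed rule for this example: systolic_bp >= 140 counts as abnormal."]

print(build_prompt(
    "Find the patients who have an abnormal level of blood pressure.",
    [patient],
    knowledge,
))
```

Rendering value meanings as comments next to each column keeps the mapping close to the column it applies to. Whether this layout works well with Llama 2 13B is something the project would have to measure.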
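
Dynamic example selection, as named in the list above, could be sketched as below. The sketch scores each candidate example by word-bigram Jaccard overlap with the incoming question and then applies maximal marginal relevance (MMR) so that the chosen examples are relevant but not near-duplicates of each other. The function names (`ngrams`, `jaccard`, `select_examples`, `format_prompt`), the weight `lambda_`, and the example pool are assumptions made for illustration. An embedding-based similarity could be swapped in for the n-gram overlap.

```python
from __future__ import annotations


def ngrams(text: str, n: int = 2) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def select_examples(question: str, pool: list[tuple[str, str]],
                    k: int = 3, lambda_: float = 0.7, n: int = 2) -> list[tuple[str, str]]:
    """Pick up to k (question, sql) pairs from pool using MMR over n-gram overlap."""
    q_grams = ngrams(question, n)
    grams = [ngrams(ex_question, n) for ex_question, _ in pool]
    relevance = [jaccard(q_grams, g) for g in grams]

    selected: list[int] = []
    candidates = set(range(len(pool)))
    while candidates and len(selected) < k:
        def mmr(i: int) -> float:
            # Penalize candidates that resemble an already selected example.
            redundancy = max((jaccard(grams[i], grams[j]) for j in selected), default=0.0)
            return lambda_ * relevance[i] - (1 - lambda_) * redundancy

        # Iterating in sorted order makes ties resolve to the lowest index.
        best = max(sorted(candidates), key=mmr)
        selected.append(best)
        candidates.remove(best)
    return [pool[i] for i in selected]


def format_prompt(question: str, examples: list[tuple[str, str]]) -> str:
    """Place the selected examples before the question; no examples means zero-shot."""
    parts = [f"Question: {q}\nSQL: {sql}\n" for q, sql in examples]
    parts.append(f"Question: {question}\nSQL:")
    return "\n".join(parts)


# Hypothetical example pool, for illustration only.
pool = [
    ("How many patients are there?", "SELECT COUNT(*) FROM patient;"),
    ("How many patients have diabetes?",
     "SELECT COUNT(*) FROM patient WHERE diagnosis = 'T1D';"),
    ("List all doctor names", "SELECT name FROM doctor;"),
]
question = "How many patients have asthma?"
print(format_prompt(question, select_examples(question, pool, k=2)))
```

With this shape, `k=0` gives zero-shot inference, `k=1` gives one-shot, and larger `k` gives few-shot. The static variants correspond to passing a fixed list of examples to `format_prompt` instead of calling `select_examples`.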