
LLaMa2 GPTQ

A chat AI that provides responses with reference documents via prompt engineering over a vector database. It also suggests related web pages through integration with my previous product, Texonom.

The goal is local, private, and personal AI that never calls an external API, achieved by optimizing inference performance with GPTQ 4-bit model quantization. This project was inspired by LangChain projects such as notion-qa and localGPT.

Demos

CLI Demo

cli.mp4

Chat Demo

chat.mp4

Install

This project uses Rye as its package manager. It is currently only available with CUDA.

rye sync

or using pip

CUDA_VERSION=cu118
TORCH_VERSION=2.0.1
pip install torch==$TORCH_VERSION --index-url https://download.pytorch.org/whl/$CUDA_VERSION
pip install .
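
A quick sanity check (a minimal sketch) confirms that the CUDA build of PyTorch is active before running anything else:

import torch

# Prints the installed version and True on a working CUDA setup
print(torch.__version__, torch.cuda.is_available())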

QA

1. Chat with Web UI

streamlit run chat.py

2. Chat with CLI

python main.py chat
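
Both interfaces answer a question by retrieving related chunks from the vector database and handing them to the quantized model, which is how responses arrive with reference documents. A minimal sketch of that flow with LangChain, assuming instructor-xl embeddings and the ./db Chroma store used elsewhere in this README; the checkpoint name is a hypothetical example:

from transformers import AutoTokenizer, pipeline
from auto_gptq import AutoGPTQForCausalLM
from langchain.llms import HuggingFacePipeline
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA

# Load a GPTQ-quantized LLaMa 2 (hypothetical checkpoint name)
model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoGPTQForCausalLM.from_quantized(model_id, device="cuda:0")
llm = HuggingFacePipeline(pipeline=pipeline(
    "text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256))

# Retrieve related chunks from the persisted Chroma store and answer with sources
embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl")
store = Chroma(persist_directory="db", embedding_function=embeddings)
qa = RetrievalQA.from_chain_type(
    llm=llm, retriever=store.as_retriever(), return_source_documents=True)
result = qa({"query": "What is GPTQ quantization?"})
print(result["result"], [doc.metadata for doc in result["source_documents"]])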

Ingest Documents

The code structure is currently focused mainly on CSV data exported from Notion.

Custom source documents

# Put document files into the ./knowledge folder
python main.py process
# Or use the provided Texonom DB
git clone https://huggingface.co/datasets/texonom/md-chroma-instructor-xl db
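
Processing boils down to loading the exported files, splitting them into chunks, embedding each chunk, and persisting the vectors to Chroma. A minimal sketch with LangChain, assuming instructor-xl embeddings (inferred from the dataset name above) and illustrative chunking parameters:

from langchain.document_loaders import CSVLoader, DirectoryLoader
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# Load Notion CSV exports from ./knowledge and split them into chunks
docs = DirectoryLoader("./knowledge", glob="**/*.csv", loader_cls=CSVLoader).load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# Embed the chunks and persist the vector store to ./db
embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl")
Chroma.from_documents(chunks, embeddings, persist_directory="db").persist()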

Quantize Model

The default model is Orca 3B for now.

python main.py quantize --source_model facebook/opt-125m --output opt-125m-4bit-gptq --push
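
Under the hood this roughly corresponds to the AutoGPTQ flow below; a minimal sketch, where the calibration sentence and group size are illustrative assumptions:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

source = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(source)
model = AutoGPTQForCausalLM.from_pretrained(source, BaseQuantizeConfig(bits=4, group_size=128))

# GPTQ estimates quantization error on a few calibration samples
examples = [tokenizer("GPTQ quantizes weights to 4 bits with minimal accuracy loss.", return_tensors="pt")]
model.quantize(examples)
model.save_quantized("opt-125m-4bit-gptq")  # --push would additionally upload to the Hub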

Future Plan

  • MPS support through dynamic model selection (see the backend-detection sketch after this list)
  • Stateful web app support, like chat-langchain
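
MPS support would presumably start with a runtime backend check; a minimal sketch, where pick_backend is a hypothetical helper:

import torch

def pick_backend() -> str:
    # Hypothetical helper: prefer CUDA, fall back to Apple Silicon MPS, else CPU
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"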

App Stack

LLM Stack

Python Stack

  • Rye for package management
  • Mypy for type checking
  • Fire for CLI implementation (see the sketch after this list)
  • Streamlit for Web UI implementation
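
For reference, Fire maps a class's methods to subcommands, which is presumably how the chat and quantize commands of main.py are wired; a hypothetical sketch of main.py's shape:

import fire

class Main:
    def chat(self) -> None:
        """Start the interactive chat loop (hypothetical body elided)."""

    def quantize(self, source_model: str, output: str, push: bool = False) -> None:
        """Quantize source_model; parameters surface as flags like --source_model."""

if __name__ == "__main__":
    fire.Fire(Main)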