Reginald Models

We currently have the following models available:

  • hello: a simple model which responds to a message with a greeting and an emoji
  • llama-index-llama-cpp: a model which uses the llama-index library to query a data index and then uses a quantised LLM (implemented using llama-cpp-python) to generate a response
  • llama-index-hf: a model which uses the llama-index library to query a data index and then uses an LLM from Huggingface to generate a response
  • llama-index-gpt-azure: a model which uses the llama-index library to query a data index and then uses the Azure OpenAI API to query an LLM to generate a response
  • llama-index-gpt-openai: a model which uses the llama-index library to query a data index and then uses the OpenAI API to query an LLM to generate a response
  • chat-completion-azure: a chat completion model which uses the Azure OpenAI API to query an LLM to generate a response (does not use llama-index)
  • chat-completion-openai: a chat completion model which uses the OpenAI API to query an LLM to generate a response (does not use llama-index)

llama-index Models

The library has several models which use the llama-index library, which allows us to easily augment an LLM with our own data. In particular, we use llama-index to ingest several data sources: some public (such as the public REG handbook and the public Turing website), and some private sources from our private GitHub repositories (using the repos' Wiki pages, issues and some selected files).

All of these (besides the public Turing website) are loaded using llama-hub GitHub readers. Hence, when first building the data index, you must set up GitHub access tokens (see the README for more details), and you will only be able to build the all_data index if you have access to our private repositories.
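
As a rough sketch, if you export the token in your shell before building the index, it might look like the following (GITHUB_TOKEN is only a placeholder name here; the README describes the exact variables and setup required):

export GITHUB_TOKEN="<your-github-personal-access-token>"  # placeholder; see the README for the exact variable name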

Data index options

When running the Reginald Slack bot, you can specify which data index to use via the LLAMA_INDEX_WHICH_INDEX environment variable (see the environment variables README for more details). The options are:

  • handbook: only builds an index with the public REG handbook
  • wikis: only builds an index with private REG repo Wiki pages
  • public: builds an index with all the public data listed above
  • all_data: builds an index with all the data listed above, including data from our private repositories

Once a data index has been built, it will be saved in the data directory specified in the reginald run_all (or reginald run_all_api_llm) CLI arguments or the LLAMA_INDEX_DATA_DIR environment variable. If you want to force a new index to be built, you can use the --force-new-index or -f flag, or you can set the LLAMA_INDEX_FORCE_NEW_INDEX environment variable to True.
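
For example, the same options can be provided as environment variables rather than CLI arguments (the values below are only illustrative):

export LLAMA_INDEX_WHICH_INDEX=handbook   # which data index to build/load
export LLAMA_INDEX_DATA_DIR=data/         # directory where built indices are saved
export LLAMA_INDEX_FORCE_NEW_INDEX=True   # force the index to be rebuilt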

There are several options for the LLM to use with the llama-index models; we have implemented some of these in this library and discuss them below.

llama-index models with self-hosted LLM

We have two models which involve hosting the LLM ourselves, using the llama-index library to query the data index and then generating a response with the LLM. These models are llama-index-llama-cpp and llama-index-hf.

llama-index-llama-cpp Model

This model uses the llama-cpp-python library to host a quantised LLM. In our case, we have been using quantised versions of Meta's Llama-2 model uploaded by TheBloke on Huggingface's model hub. An example of running this model locally is:

reginald run_all \
  --model llama-index-llama-cpp \
  --model-name https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf \
  --mode chat \
  --data-dir data/ \
  --which-index handbook \
  --max-input-size 4096 \
  --n-gpu-layers 2

Note that the --n-gpu-layers argument is optional and specifies the number of layers to offload to the GPU; if not specified, it defaults to 0. See the llama-cpp-python README for how to install the library with hardware acceleration.
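
As a sketch only (the exact build flags depend on your hardware and on the llama-cpp-python version, so check their README), installing with Metal acceleration on an Apple Silicon Mac has looked something like:

# assumption: Metal backend; other backends use different CMAKE_ARGS
CMAKE_ARGS="-DLLAMA_METAL=on" pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir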

Running this from the root of this repository will automatically pick up the data index for the handbook in the data/llama_index_indices/handbook directory.

Running this command requires about 7GB of RAM. We were able to run this on our M1 Pro (32GB) MacBook Pros with no issues, and were able to run the Llama-2-13B-chat model too.

If you wish to download the quantised model (as a .gguf file) and host it yourself, you can do so by passing the file path to the --model-name argument and using the --is-path flag (alternatively, you can re-run the above but first set the LLAMA_INDEX_IS_PATH environment variable to True):

reginald run_all \
  --model llama-index-llama-cpp \
  --model-name gguf_models/llama-2-7b-chat.Q4_K_M.gguf \
  --is-path \
  --mode chat \
  --data-dir data/ \
  --which-index handbook \
  --max-input-size 4096 \
  --n-gpu-layers 2

given that the llama-2-7b-chat.Q4_K_M.gguf file is in a gguf_models directory.
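
One way to get the file into place, for example, is to download it directly from the same Huggingface URL used above (the gguf_models directory name is just the one assumed in this example; any download method works):

mkdir -p gguf_models
curl -L -o gguf_models/llama-2-7b-chat.Q4_K_M.gguf \
  https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf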

llama-index-hf Model

This model uses an LLM from Huggingface to generate a response. An example of running this model locally is:

reginald run_all \
  --model llama-index-hf \
  --model-name microsoft/phi-1_5 \
  --mode chat \
  --data-dir data/ \
  --which-index handbook \
  --max-input-size 2048 \
  --device auto

Note that the microsoft/phi-1_5 model currently has a predefined maximum context length of 2048 tokens. Hence, we must set the --max-input-size argument to at most 2048; the default value for this argument is 4096, since we tend to use the llama-cpp-python models more. We also set the --device argument to auto so that the model will run with hardware acceleration if available.

llama-index models using an API

We have two models which use an API to query an LLM to generate a response. These models are llama-index-gpt-azure and llama-index-gpt-openai.

llama-index-gpt-azure Model

To use this model, you must set the following environment variables:

  • OPENAI_AZURE_API_BASE: API base for Azure OpenAI
  • OPENAI_AZURE_API_KEY: API key for Azure OpenAI
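
For example, these could be exported in your shell before starting the bot (both values below are placeholders for your own Azure OpenAI resource):

export OPENAI_AZURE_API_BASE="https://<your-resource-name>.openai.azure.com/"  # placeholder endpoint
export OPENAI_AZURE_API_KEY="<your-azure-openai-api-key>"                      # placeholder key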

An example of running this model locally is:

reginald run_all \
  --model llama-index-gpt-azure \
  --model-name "reginald-gpt35-turbo" \
  --mode chat \
  --data-dir data/ \
  --which-index handbook

Note that "reginald-gpt35-turbo" is the name of our deployment of the "gpt-3.5-turbo" model on Azure. This probably is different on your deployment and resource group on Azure.

llama-index-gpt-openai Model

To use this model, you must set the OPENAI_API_KEY environment variable to an API key for OpenAI.
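
For example (the value below is a placeholder for your own key):

export OPENAI_API_KEY="<your-openai-api-key>"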

An example of running this model locally is:

reginald run_all \
  --model llama-index-gpt-openai \
  --model-name "gpt-3.5-turbo" \
  --mode chat \
  --data-dir data/ \
  --which-index handbook

chat-completion Models

The library also has two models which use the OpenAI API (or the Azure OpenAI API) to query an LLM to generate a response. These models do not use the llama-index library and hence do not use a data index; they are purely chat completion models.

chat-completion-azure Model

To use this model, you must set the following environment variables:

  • OPENAI_AZURE_API_BASE: API base for Azure OpenAI
  • OPENAI_AZURE_API_KEY: API key for Azure OpenAI

An example of running this model locally is:

reginald run_all \
  --model chat-completion-azure \
  --model-name "reginald-curie"

Note that "reginald-curie" is the name of our deployment of a fine-tuned model on Azure. This probably is different on your deployment and resource group on Azure. With Azure's AI Studio, it is possible to fine-tune your own model with Q&A pairs (see here for more details).

chat-completion-openai Model

To use this model, you must set the OPENAI_API_KEY environment variable to an API key for OpenAI.

An example of running this model locally is:

reginald run_all \
  --model chat-completion-openai \
  --model-name "gpt-3.5-turbo"