Skip to content

Latest commit

 

History

History
 
 

254-llm-chatbot

Create LLM-powered Chatbot using OpenVINO

In the rapidly evolving world of artificial intelligence (AI), chatbots have emerged as powerful tools for businesses to enhance customer interactions and streamline operations. Large Language Models (LLMs) are artificial intelligence systems that can understand and generate human language. They use deep learning algorithms and massive amounts of data to learn the nuances of language and produce coherent and relevant responses. While a decent intent-based chatbot can answer basic, one-touch inquiries like order management, FAQs, and policy questions, LLM chatbots can tackle more complex, multi-touch questions. LLM enables chatbots to provide support in a conversational manner, similar to how humans do, through contextual memory. Leveraging the capabilities of Language Models, chatbots are becoming increasingly intelligent, capable of understanding and responding to human language with remarkable accuracy.

Previously, we already discussed how to build instruction-following pipeline using OpenVINO and Optimum Intel, please check out Dolly v2 example for reference. In this tutorial we consider how to use power of OpenVINO for running Large Language Models for both chat and QA over document. We will use a pre-trained model from the Hugging Face Transformers library. To simplify the user experience, the Hugging Face Optimum Intel library is used to convert the models to OpenVINO™ IR format. In addition, we will use LangChain to augmenting LLM knowledge with additional data, which allow you to build AI applications that can reason about private data or data introduced after a model’s cutoff date.

The tutorial supports different models, you can select one from provided options to compare quality of open source LLM solutions.

Note: conversion of some models can require additional actions from user side and at least 64GB RAM for conversion.

The available options are:

  • tiny-llama-1b-chat - This is the chat model finetuned on top of TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T. The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens with the adoption of the same architecture and tokenizer as Llama 2. This means TinyLlama can be plugged and played in many open-source projects built upon Llama. Besides, TinyLlama is compact with only 1.1B parameters. This compactness allows it to cater to a multitude of applications demanding a restricted computation and memory footprint. More details about model can be found in model card.
  • red-pajama-3b-chat - A 2.8B parameter pretrained language model based on GPT-NEOX architecture. It was developed by Together Computer and leaders from the open-source AI community. The model is fine-tuned on OASST1 and Dolly2 datasets to enhance chatting ability. More details about model can be found in HuggingFace model card.
  • llama-2-7b-chat - LLama 2 is the second generation of LLama models developed by Meta. Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. llama-2-7b-chat is 7 billions parameters version of LLama 2 finetuned and optimized for dialogue use case. More details about model can be found in the paper, repository and HuggingFace model card

Note: run model with demo, you will need to accept license agreement. You must be a registered user in 🤗 Hugging Face Hub. Please visit HuggingFace model card, carefully read terms of usage and click accept button. You will need to use an access token for downloading model. For more information on access tokens, refer to this section of the documentation.

  • mpt-7b-chat - MPT-7B is part of the family of MosaicPretrainedTransformer (MPT) models, which use a modified transformer architecture optimized for efficient training and inference. These architectural changes include performance-optimized layer implementations and the elimination of context length limits by replacing positional embeddings with Attention with Linear Biases (ALiBi). Thanks to these modifications, MPT models can be trained with high throughput efficiency and stable convergence. MPT-7B-chat is a chatbot-like model for dialogue generation. It was built by finetuning MPT-7B on the ShareGPT-Vicuna, HC3, Alpaca, HH-RLHF, and Evol-Instruct datasets. More details about model can be found in blog post, repository and HuggingFace model card.
  • qwen-7b-chat - Qwen-7B is the 7B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, codes, etc. For more details about Qwen, please refer to the GitHub code repository.
  • chatglm3-6b - ChatGLM3-6B is the latest open-source model in the ChatGLM series. While retaining many excellent features such as smooth dialogue and low deployment threshold from the previous two generations, ChatGLM3-6B employs a more diverse training dataset, more sufficient training steps, and a more reasonable training strategy. ChatGLM3-6B adopts a newly designed Prompt format, in addition to the normal multi-turn dialogue. You can find more details about model in the model card
  • mistral-7b - The Mistral-7B-v0.1 Large Language Model (LLM) is a pretrained generative text model with 7 billion parameters. You can find more details about model in the paper and release blog post.
  • zephyr-7b-beta - Zephyr is a series of language models that are trained to act as helpful assistants. Zephyr-7B-beta is the second model in the series, and is a fine-tuned version of mistralai/Mistral-7B-v0.1 that was trained on on a mix of publicly available, synthetic datasets using Direct Preference Optimization (DPO). You can find more details about model in technical report and HuggingFace model card.
  • neural-chat-7b-v3-1 - Mistral-7b model fine-tuned using Intel Gaudi. The model fine-tuned on the open source dataset Open-Orca/SlimOrca and aligned with Direct Preference Optimization (DPO) algorithm. More details can be found in model card and blog post.
  • notus-7b-v1 - Notus is a collection of fine-tuned models using Direct Preference Optimization (DPO). and related RLHF techniques. This model is the first version, fine-tuned with DPO over zephyr-7b-sft. Following a data-first approach, the only difference between Notus-7B-v1 and Zephyr-7B-beta is the preference dataset used for dDPO. Proposed approach for dataset creation helps to effectively fine-tune Notus-7b that surpasses Zephyr-7B-beta and Claude 2 on AlpacaEval. More details about model can be found in model card.
  • youri-7b-chat - Youri-7b-chat is a Llama2 based model. Rinna Co., Ltd. conducted further pre-training for the Llama2 model with a mixture of English and Japanese datasets to improve Japanese task capability. The model is publicly released on Hugging Face hub. You can find detailed information at the rinna/youri-7b-chat project page.

The image below illustrates the provided user instruction and model answer examples.

example

Notebook Contents

The tutorial consists of the following steps:

  • Install prerequisites
  • Download and convert the model from a public source using the OpenVINO integration with Hugging Face Optimum.
  • Compress model weights to INT4 or INT8 precision using NNCF
  • Create an inference pipeline
  • Run chatbot / QA over document

Installation Instructions

If you have not installed all required dependencies, follow the Installation Guide.