docs: update quickstart with OpenLLM #4295

Merged · 4 commits · Nov 20, 2023

Changes from all commits
@@ -4,14 +4,14 @@ Deploy a large language model with OpenLLM and BentoML

As an important component in the BentoML ecosystem, `OpenLLM <https://github.com/bentoml/OpenLLM>`_ is an open platform designed to facilitate the
operation and deployment of large language models (LLMs) in production. The platform provides functionalities that allow users to fine-tune, serve,
-deploy, and monitor LLMs with ease. OpenLLM supports a wide range of state-of-the-art LLMs and model runtimes, such as StableLM, Falcon, Dolly,
+deploy, and monitor LLMs with ease. OpenLLM supports a wide range of state-of-the-art LLMs and model runtimes, such as Llama 2, Mistral, StableLM, Falcon, Dolly,
Flan-T5, ChatGLM, StarCoder, and more.

With OpenLLM, you can deploy your models to the cloud or on-premises, and build powerful AI applications. It supports the integration of your LLMs
-with other models and services such as LangChain, BentoML, and Hugging Face, thereby allowing the creation of complex AI applications.
+with other models and services such as LangChain, LlamaIndex, BentoML, and Hugging Face, thereby allowing the creation of complex AI applications.

This quickstart demonstrates how to integrate OpenLLM with BentoML to deploy a large language model. To learn more about OpenLLM,
-you can also try the `OpenLLM tutorial in Google Colab: Serving Llama 2 with OpenLLM <https://colab.research.google.com/github/bentoml/OpenLLM/blob/main/examples/openllm-llama2-demo/openllm_llama2_demo.ipynb>`_.
+you can also try the `OpenLLM tutorial in Google Colab: Serving Llama 2 with OpenLLM <https://colab.research.google.com/github/bentoml/OpenLLM/blob/main/examples/llama2.ipynb>`_.

Prerequisites
-------------
@@ -32,6 +32,14 @@ Run the following command to install OpenLLM.

pip install openllm
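
To confirm the installation and see which models and runtimes your OpenLLM version supports, you can run the ``openllm models`` command that this guide references later:

.. code-block:: bash

   openllm models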

+.. note::
+
+   If you are running on GPUs, we recommend using OpenLLM with the vLLM runtime. Install it with:
+
+   .. code-block:: bash
+
+      pip install "openllm[vllm]"

Create a BentoML Service
------------------------

@@ -40,35 +48,62 @@ Create a ``service.py`` file to define a BentoML :doc:`Service </concepts/servic
.. code-block:: python
:caption: `service.py`

-   from __future__ import annotations
-
-   import bentoml
-   import openllm
-
-   model = "dolly-v2"
-
-   llm_runner = openllm.Runner(model)
-
-   svc = bentoml.Service(name="llm-dolly-service", runners=[llm_runner])
-
-
-   @svc.on_startup
-   def download(_: bentoml.Context):
-       llm_runner.download_model()
-
-
-   @svc.api(input=bentoml.io.Text(), output=bentoml.io.Text())
-   async def prompt(input_text: str) -> str:
-       answer = await llm_runner.generate.async_run(input_text)
-       return answer[0]["generated_text"]
+   from __future__ import annotations
+   import uuid
+   from typing import Any, AsyncGenerator, Dict, TypedDict, Union
+
+   from bentoml import Service
+   from bentoml.io import JSON, Text
+   from openllm import LLM
+
+   llm = LLM[Any, Any]('HuggingFaceH4/zephyr-7b-beta', backend='vllm')
+
+   svc = Service('tinyllm', runners=[llm.runner])
+
+
+   class GenerateInput(TypedDict):
+       prompt: str
+       stream: bool
+       sampling_params: Dict[str, Any]
+
+
+   @svc.api(
+       route='/v1/generate',
+       input=JSON.from_sample(
+           GenerateInput(prompt='What is time?', stream=False, sampling_params={'temperature': 0.73, 'logprobs': 1})
+       ),
+       output=Text(content_type='text/event-stream'),
+   )
+   async def generate(request: GenerateInput) -> Union[AsyncGenerator[str, None], str]:
+       n = request['sampling_params'].pop('n', 1)
+       request_id = f'tinyllm-{uuid.uuid4().hex}'
+       previous_texts = [''] * n
+
+       generator = llm.generate_iterator(request['prompt'], request_id=request_id, n=n, **request['sampling_params'])
+
+       async def streamer() -> AsyncGenerator[str, None]:
+           async for request_output in generator:
+               for output in request_output.outputs:
+                   i = output.index
+                   delta_text = output.text[len(previous_texts[i]):]
+                   previous_texts[i] = output.text
+                   yield delta_text
+
+       if request['stream']:
+           return streamer()
+
+       final_output = None
+       async for request_output in generator:
+           final_output = request_output
+       assert final_output is not None
+       return final_output.outputs[0].text

Here is a breakdown of this ``service.py`` file.

-- ``model``: The ``model`` variable is assigned the name of the model to be used (``dolly-v2`` in this example). Run ``openllm models`` to view all supported models and their corresponding model IDs. Note that certain models may only support running on GPUs.
-- ``openllm.Runner()``: Creates a :doc:`bentoml.Runner </concepts/runner>` instance for the model specified.
-- ``bentoml.Service()``: Creates a BentoML Service named ``llm-dolly-service`` and wraps the previously created Runner into the Service.
-- ``@svc.on_startup``: Different from the Transformer model quickstart, this tutorial creates an action to be performed when the Service starts using the ``on_startup`` hook in the ``service.py`` file. It calls the ``download_model()`` function to ensure the necessary model and weights are downloaded if they do not exist locally. This makes sure the Service is ready to serve requests when it starts.
-- ``@svc.api()``: Defines an API endpoint for the BentoML Service that takes a text input and outputs a text. The endpoint’s functionality is defined in the ``prompt()`` function: it takes in a string of text, runs it through the model to generate an answer, and returns the generated text.
+- ``openllm.LLM()``: Creates an LLM abstraction object that provides easy-to-use APIs for streaming text, with optimizations built in. It supports a variety of model architectures (run ``openllm models`` for more information). ``openllm.LLM`` is built on top of a :doc:`bentoml.Runner </concepts/runner>` for this LLM.
+- ``bentoml.Service()``: Creates a BentoML Service (named ``tinyllm`` in the code above) and wraps the aforementioned ``llm.runner`` into the Service.
+- ``@svc.api()``: Defines an API endpoint at ``/v1/generate`` for the BentoML Service that takes a JSON input and outputs text. The endpoint’s functionality is defined in the ``generate()`` function: it takes a prompt together with sampling parameters, runs them through the model to generate an answer, and returns the generated text. It supports both streaming and one-shot generation.

Use ``bentoml serve`` to start the Service.
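
Assuming the file above is saved as ``service.py`` and the Service object is named ``svc`` (as in the code listing), the command typically looks like the sketch below; you can add ``--reload`` during development for automatic reloads:

.. code-block:: bash

   bentoml serve service:svc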

@@ -85,30 +120,51 @@ The server is now active at `http://0.0.0.0:3000 <http://0.0.0.0:3000/>`_. You c

.. tab-item:: CURL

-   .. code-block:: bash
-
-      curl -X 'POST' \
-         'http://0.0.0.0:3000/prompt' \
-         -H 'accept: text/plain' \
-         -H 'Content-Type: text/plain' \
-         -d '$PROMPT' # Replace $PROMPT here with your prompt.
+   For one-shot generation:
+
+   .. code-block:: bash
+
+      curl -X 'POST' \
+         'http://0.0.0.0:3000/v1/generate' \
+         -H 'accept: application/json' \
+         -H 'Content-Type: application/json' \
+         -d '{"prompt": "What is the meaning of life?", "stream": false, "sampling_params": {"temperature": 0.73}}'
+
+   For streaming generation:
+
+   .. code-block:: bash
+
+      curl -X 'POST' -N \
+         'http://0.0.0.0:3000/v1/generate' \
+         -H 'accept: application/json' \
+         -H 'Content-Type: application/json' \
+         -d '{"prompt": "What is the meaning of life?", "stream": true, "sampling_params": {"temperature": 0.73}}'

.. tab-item:: Python

-   .. code-block:: python
-
-      import requests
-
-      response = requests.post(
-          "http://0.0.0.0:3000/prompt",
-          headers={
-              "accept": "text/plain",
-              "Content-Type": "text/plain",
-          },
-          data="$PROMPT",  # Replace $PROMPT here with your prompt.
-      )
-
-      print(response.text)
+   For one-shot generation:
+
+   .. code-block:: python
+
+      import openllm
+
+      client = openllm.HTTPClient('http://localhost:3000')
+
+      response = client.generate("What is the meaning of life?", max_new_tokens=256)
+
+      print(response.outputs[0].text)
+
+   For streaming generation:
+
+   .. code-block:: python
+
+      import openllm
+
+      client = openllm.HTTPClient('http://localhost:3000')
+
+      for it in client.generate_stream("What is the meaning of life?", max_new_tokens=256):
+          print(it.text, flush=True, end='')

.. tab-item:: Browser

@@ -136,8 +192,8 @@ The model should be downloaded automatically to the Model Store.

$ bentoml models list

-Tag                                                                            Module                               Size        Creation Time
-pt-databricks-dolly-v2-3b:f6c9be08f16fe4d3a719bee0a4a7c7415b5c65df             openllm.serialisation.transformers   5.30 GiB    2023-07-11 16:17:26
+Tag                                                                            Module                               Size        Creation Time
+vllm-huggingfaceh4--zephyr-7b-beta:8af01af3d4f9dc9b962447180d6d0f8c5315da86    openllm.serialisation.transformers   13.49 GiB   2023-11-16 06:32:45
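
To inspect a model that has been pulled into the Model Store, you can use ``bentoml models get`` with a tag from the listing (the tag below assumes the zephyr model used in this example; substitute your own):

.. code-block:: bash

   bentoml models get vllm-huggingfaceh4--zephyr-7b-beta:latest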

Build a Bento
-------------
@@ -154,16 +210,16 @@ After the Service is ready, you can package it into a :doc:`Bento </concepts/ben
packages:
- openllm
models:
-  - pt-databricks-dolly-v2-3b:latest
+  - vllm-huggingfaceh4--zephyr-7b-beta:latest
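
For reference, a complete ``bentofile.yaml`` for this quickstart might look like the sketch below; the ``service`` and ``include`` values are assumptions based on the ``service.py`` layout above, so adjust them to your project:

.. code-block:: bash

   # Write a minimal bentofile.yaml; service/include are assumed values.
   cat > bentofile.yaml <<'EOF'
   service: "service:svc"
   include:
     - "*.py"
   python:
     packages:
       - openllm
   models:
     - vllm-huggingfaceh4--zephyr-7b-beta:latest
   EOF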

Run ``bentoml build`` in your project directory to build the Bento.

.. code-block:: bash

$ bentoml build

-Building BentoML service "llm-dolly-service:oatecjraxktp6nry" from build context "/Users/demo/Documents/openllm-test".
-Packing model "pt-databricks-dolly-v2-3b:f6c9be08f16fe4d3a719bee0a4a7c7415b5c65df"
+Building BentoML service "llm-mistral-service:oatecjraxktp6nry" from build context "/Users/demo/Documents/openllm-test".
+Packing model "vllm-huggingfaceh4--zephyr-7b-beta:8af01af3d4f9dc9b962447180d6d0f8c5315da86"
Locking PyPI package versions.

██████╗░███████╗███╗░░██╗████████╗░█████╗░███╗░░░███╗██╗░░░░░
@@ -173,15 +229,15 @@ Run ``bentoml build`` in your project directory to build the Bento.
██████╦╝███████╗██║░╚███║░░░██║░░░╚█████╔╝██║░╚═╝░██║███████╗
╚═════╝░╚══════╝╚═╝░░╚══╝░░░╚═╝░░░░╚════╝░╚═╝░░░░░╚═╝╚══════╝

Successfully built Bento(tag="llm-dolly-service:oatecjraxktp6nry").
Successfully built Bento(tag="llm-mistral-service:oatecjraxktp6nry").

Possible next steps:

* Containerize your Bento with `bentoml containerize`:
-    $ bentoml containerize llm-dolly-service:oatecjraxktp6nry
+    $ bentoml containerize llm-mistral-service:oatecjraxktp6nry

* Push to BentoCloud with `bentoml push`:
-    $ bentoml push llm-dolly-service:oatecjraxktp6nry
+    $ bentoml push llm-mistral-service:oatecjraxktp6nry

Deploy a Bento
--------------
@@ -190,7 +246,7 @@ To containerize the Bento with Docker, run:

.. code-block:: bash

-   bentoml containerize llm-dolly-service:oatecjraxktp6nry
+   bentoml containerize llm-mistral-service:oatecjraxktp6nry
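
Once the image is built, one way to try it locally is with ``docker run`` (a sketch; the tag is the one generated in this example, and ``--gpus all`` assumes the NVIDIA Container Toolkit is available on the host):

.. code-block:: bash

   docker run -it --rm --gpus all -p 3000:3000 llm-mistral-service:oatecjraxktp6nry serve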

You can then deploy the Docker image in different environments like Kubernetes. Alternatively, push the Bento to `BentoCloud <https://bentoml.com/cloud>`_ for distributed deployments of your model.
For more information, see :doc:`/bentocloud/how-tos/deploy-bentos`.