code changes
KuuCi committed Aug 30, 2023
1 parent db50618 commit 7245cae
Showing 11 changed files with 104 additions and 536 deletions.
1 change: 1 addition & 0 deletions .github/workflows/lint.yaml
@@ -27,6 +27,7 @@ jobs:
- benchmarks/deeplab
- benchmarks/resnet_imagenet
- end-to-end-examples/sec_10k_qa
- end-to-end-examples/support_chatbot
- end-to-end-examples/stable_diffusion
- end-to-end-examples/stable_diffusion_dreambooth
- third-party/nemo
6 changes: 2 additions & 4 deletions examples/end-to-end-examples/support_chatbot/README.md
@@ -105,9 +105,9 @@ python scripts/conversion/convert_jsonl_to_stream.py \

## Step 3: Finetuning on our Repository

Next, we will finetune our pretrained base model on the train split of our data, whether that be PyPi documentation, the MosaicML code base, or dolly, in order to tune it on data that is in-domain for the end task of answering questions about the MosaicML codebase. This process is called "domain tuning," and can be useful for adapting a model that has already been trained on a huge amount of data (e.g. MPT-7b) to a new domain. For this example, we will use the train/validation(/test) splits provided with the dataset, which can be in a variety of different formats. We will use the validation split as validation data, and reserve the test split, if available, for the final testing of our application.
Next, we will finetune our pretrained model on the train split of our data, whether that be PyPi documentation, the MosaicML code base, or dolly, in order to tune it on data that is in-domain for the end task of answering questions about the MosaicML codebase. This process is called "domain tuning," and can be useful for adapting a model that has already been trained on a huge amount of data (e.g. MPT-7b) to a new domain. For this example, we will use the train/validation(/test) splits provided with the dataset, which can be in a variety of different formats. We will use the validation split as validation data, and reserve the test split, if available, for the final testing of our application.

Please check out the [training yaml](./mcli-yamls/03_finetune_on_10ks.yaml) for all of the details. This yaml will load the pretrained weights for `mpt-7b` available on the [HuggingFace Hub](https://huggingface.co/mosaicml/mpt-7b), and then train using the normal causal language modeling objective on the datasets that we processed in the previous step. The [training script](https://github.com/mosaicml/llm-foundry/blob/main/scripts/train/train.py) itself is from LLM-foundry.
Please check out the [training directory](./mcli-yamls/finetune) for all of the details. These yamls will load the pretrained weights for `mpt-7b` available on the [HuggingFace Hub](https://huggingface.co/mosaicml/mpt-7b), and then train using the normal causal language modeling objective on the datasets that we processed in the previous step. The [training script](https://github.com/mosaicml/llm-foundry/blob/main/scripts/train/train.py) itself is from LLM-foundry.

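As a rough illustration of that objective, here is a minimal sketch of causal language model finetuning using plain HuggingFace `transformers`. This is not the LLM-foundry script that the yamls launch, and the dataset path is a stand-in:

```python
# Minimal sketch of causal-LM domain tuning, assuming a JSONL train split
# with a "text" field. The real run uses LLM-foundry's train.py and the
# yamls in mcli-yamls/finetune; this block only illustrates the objective.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained('mosaicml/mpt-7b', trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # the collator needs a pad token
model = AutoModelForCausalLM.from_pretrained('mosaicml/mpt-7b', trust_remote_code=True)

train_ds = load_dataset('json', data_files='train.jsonl')['train']  # stand-in path
train_ds = train_ds.map(
    lambda ex: tokenizer(ex['text'], truncation=True, max_length=2048),
    remove_columns=train_ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='finetune-out', num_train_epochs=1),
    train_dataset=train_ds,
    # mlm=False selects the standard next-token (causal) objective
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```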
To run finetuning, run the following, where `composer_codebase` can be replaced with `PyPi` or `dolly_hh`:

@@ -124,8 +124,6 @@ mcli run -f mcli_yamls/finetune/finetune_composer_codebase.yaml --cluster REPLACE_WITH_YOUR_CLUSTER

Before we can deploy our model, we need to convert it into the standard HuggingFace checkpoint folder. We will use the [conversion script](https://github.com/mosaicml/llm-foundry/blob/main/scripts/inference/convert_composer_to_hf.py) from LLM-foundry to do this. This script will take the Composer checkpoint, and write out all the files that HuggingFace expects in a checkpoint folder. You can additionally add the `--hf_repo_for_upload` argument if you would like to upload directly to a private repo on the HuggingFace Hub (you will also need to [set the `HUGGING_FACE_HUB_TOKEN` environment variable](https://docs.mosaicml.com/projects/mcli/en/latest/resources/secrets/env.html) to do this).
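As a quick sanity check, the converted folder should load with plain `transformers`. A minimal sketch, where `HF_FOLDER_NAME` stands in for the local folder the script wrote:

```python
# Sketch: confirm the converted folder is a valid HuggingFace checkpoint.
# 'HF_FOLDER_NAME' stands in for the local output folder of the conversion.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('HF_FOLDER_NAME')
# MPT checkpoints ship custom modeling code, so trust_remote_code is required
model = AutoModelForCausalLM.from_pretrained('HF_FOLDER_NAME', trust_remote_code=True)
print(model.config.model_type)  # expect: 'mpt'
```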

Note: this conversion script is _specifically_ for MPT. If you have changed the model to a different HuggingFace model, you can use the `convert_composer_to_hf_transformers.py` script in _this_ repository instead.

**Fields to replace with your values:** `REPLACE_WITH_YOUR_CLUSTER` (in the command), `CLOUD` (in the yaml), `BUCKET_NAME` (in the yaml), `CHECKPOINT_FOLDER_NAME` (in the yaml), `HF_FOLDER_NAME` (in the yaml)

**Inputs:** the final checkpoint from step 4 inside `CHECKPOINT_FOLDER_NAME`, and the destination for the converted checkpoint in `HF_FOLDER_NAME`
66 changes: 62 additions & 4 deletions examples/end-to-end-examples/support_chatbot/app_demo.py
@@ -7,6 +7,20 @@

ROOT_DIR = os.path.dirname(os.path.abspath(__file__))

EVAL_7B_TEMPLATE = ('Answer the following question as one function, class, or object. If you do not know, just say "I do not know".'
'\n{context}'
'\nQuestion: {question}')

EVAL_30B_TEMPLATE = ("""<|im_start|>system
A conversation between a user and an LLM-based AI assistant about the codebase for the MosaicML library Composer.
Provide a helpful and simple answer given the following context to the question. If you do not know, just say "I
do not know".<|im_end|>
<|im_start|>context
{context}<|im_end|>
<|im_start|>user
{question}<|im_end|>
<|im_start|>assistant""")
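
# Usage sketch (illustrative): the {context} and {question} placeholders are
# filled in later with str.format, e.g.
#   EVAL_7B_TEMPLATE.format(context=retrieved_docs, question=user_query)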

def parse_args() -> Namespace:
"""Parse commandline arguments."""
parser = ArgumentParser(
@@ -17,9 +31,15 @@ def parse_args() -> Namespace:
'--endpoint_url',
type=str,
default='https://models.hosted-on.mosaicml.hosting/mpt-30b-chat/v1/predict',
#default='https://mpt-30b-composer-finetuned-q8mjj9.inf.hosted-on.mosaicml.hosting/predict',
#default='https://mpt-30b-composer-finetuned-dmhpmi.inf.hosted-on.mosaicml.hosting/predict',
required=False,
help='The endpoint of our MosaicML LLM Model')
parser.add_argument(
'--model_name',
type=str,
default='mpt-30b-chat',
required=False,
help='Name of the model to evaluate; the only evals offered as of now are mpt-30b-chat and mpt-7b')
parser.add_argument(
'--max_length',
type=int,
@@ -53,9 +73,25 @@
parser.add_argument(
'--repository_urls',
type=str,
default='https://github.com/mosaicml/composer,https://github.com/mosaicml/streaming,https://github.com/mosaicml/examples,https://github.com/mosaicml/diffusion,https://github.com/mosaicml/llm-foundry',
nargs='*',
default=['https://github.com/mosaicml/composer',
'https://github.com/mosaicml/streaming',
'https://github.com/mosaicml/examples',
'https://github.com/mosaicml/diffusion',
'https://github.com/mosaicml/llm-foundry'],
required=False,
help='The GitHub repository URLs to download'
)
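# Usage sketch (illustrative): with nargs='*', the URLs are passed
# space-separated on the command line, e.g.
#   python app_demo.py --repository_urls https://github.com/mosaicml/composer https://github.com/mosaicml/streaming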
parser.add_argument(
'--complex_data_dir',
type=str,
required=False,
help='complex eval data for human eval')
parser.add_argument(
'--simple_data_dir',
type=str,
required=False,
help='The GitHub repository URLs to download')
help='simple eval data for string comparison')
parser.add_argument(
'--complex_chat',
type=int,
@@ -72,15 +108,18 @@
return parsed

def main(endpoint_url: str,
model_name: str,
max_length: int,
chunk_size: int,
chunk_overlap: int,
retrieval_k: int,
model_k: int,
repository_urls: list[str],
complex_data_dir: str,
simple_data_dir: str,
chat_version: int) -> None:

retrieval_dir = os.path.join(ROOT_DIR, 'retrieval_data_demo')
retrieval_dir = os.path.join(ROOT_DIR, 'retrieval_data')

embeddings = MosaicMLInstructorEmbeddings()
llm = MosaicML(
@@ -115,6 +154,22 @@ def chat_wrapper(query: str) -> str:
Returns:
str: The response from chatbot"""
if query == '!eval_simple':
if simple_data_dir is None:
raise ValueError('No simple data directory provided. Please provide a directory with simple eval data')
if model_name == 'mpt-30b-chat':
return chatbot.evaluate_simple(simple_data_dir, EVAL_30B_TEMPLATE)
elif model_name == 'mpt-7b':
return chatbot.evaluate_simple(simple_data_dir, EVAL_7B_TEMPLATE)

elif query == '!eval_complex':
if complex_data_dir is None:
raise ValueError('No complex data directory provided. Please provide a directory with complex eval data')
if model_name == 'mpt-30b-chat':
return chatbot.evaluate_complex(complex_data_dir, EVAL_30B_TEMPLATE)
elif model_name == 'mpt-7b':
return chatbot.evaluate_complex(complex_data_dir, EVAL_7B_TEMPLATE)

if chat_version == 1:
return chatbot.sub_query_chat(query)
elif chat_version == 2:
@@ -141,11 +196,14 @@ def gradio_chat():
args = parse_args()
main(
endpoint_url=args.endpoint_url,
model_name=args.model_name,
max_length = args.max_length,
chunk_size = args.chunk_size,
chunk_overlap = args.chunk_overlap,
retrieval_k = args.retrieval_k,
model_k = args.model_k,
repository_urls = args.repository_urls,
complex_data_dir = args.complex_data_dir,
simple_data_dir = args.simple_data_dir,
chat_version = args.complex_chat
)
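# Launch sketch (illustrative; paths are stand-ins): start the demo with, e.g.
#   python app_demo.py --model_name mpt-7b --simple_data_dir eval_data/simple
# then type '!eval_simple' or '!eval_complex' in the chat box to run the evals.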