Add option to pass in multiple template text files for LLM-as-judge eval #99

Merged (15 commits) on Sep 11, 2024
6 changes: 6 additions & 0 deletions .pre-commit-config.yaml
@@ -18,3 +18,9 @@ repos:
- id: isort
name: isort (python)
args: ["--profile", "black", "--filter-files"]

- repo: https://github.com/codespell-project/codespell
rev: v2.3.0
hooks:
- id: codespell
args: ["--skip", "*.jsonl,*.json,examples/system-demo/alpaca_sample_generation.ipynb"]
8 changes: 6 additions & 2 deletions README.md
@@ -25,7 +25,7 @@

`prompto` is a Python library which facilitates processing of experiments of Large Language Models (LLMs) stored as jsonl files. It automates _asynchronous querying of LLM API endpoints_ and logs progress.

`prompto` derives from the Italian word "_pronto_" which means "_ready_". It could also mean "_I prompt_" in Italian (if "_promptare_" was a verb meaning "_to prompt_").
`prompto` derives from the Italian word "_pronto_" which means "_ready_" (or "hello" when answering the phone). It could also mean "_I prompt_" in Italian (if "_promptare_" was a verb meaning "_to prompt_").

A pre-print for this work is available on [arXiv](https://arxiv.org/abs/2408.11847). If you use this library, please see the [citation](#citation) below. For the experiments in the pre-print, see the [system demonstration examples](./examples/system-demo/README.md).

@@ -41,6 +41,10 @@ For more details on the library, see the [documentation](./docs/README.md) where

See below for [installation instructions](#installation) and [quickstarts for getting started](#getting-started) with `prompto`.

## `prompto` for Evaluation

`prompto` can also be used as an evaluation tool for LLMs. In particular, it has functionality to automatically conduct an LLM-as-judge evaluation on the outputs of models and/or apply a `scorer` function to outputs (e.g. string matching, regex, or any custom function applied to an output). For details on how to use `prompto` for evaluation, see the [evaluation docs](./docs/evaluation.md).
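
To give a flavour of the `scorer` idea, here is a minimal sketch of two scoring functions (exact string matching and regex matching). This is purely illustrative: the function names and signatures below are assumptions, not prompto's actual scorer interface, which is described in the [evaluation docs](./docs/evaluation.md).

```
import re

def exact_match(response: str, expected: str) -> bool:
    # Illustrative only: score a response by exact string comparison.
    return response.strip() == expected.strip()

def regex_match(response: str, pattern: str) -> bool:
    # Illustrative only: score a response by whether a regex pattern occurs in it.
    return re.search(pattern, response) is not None
```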

## Available APIs and Models

The library supports querying several APIs and models. The APIs currently supported are:
@@ -130,7 +134,7 @@ prompto_run_experiment --file data/input/openai.jsonl --max-queries 30
This will:

1. Create subfolders in the `data` folder (in particular, it will create `media` (`data/media`) and `output` (`data/output`) folders; see the folder sketch after this list)
2. Create a folder in the`output` folder with the name of the experiment (the file name without the `.jsonl` extention * in this case, `openai`)
2. Create a folder in the `output` folder with the name of the experiment (the file name without the `.jsonl` extension * in this case, `openai`)
3. Move the `openai.jsonl` file to the `output/openai` folder (and add a timestamp of when the run of the experiment started)
4. Start running the experiment, sending requests to the OpenAI API asynchronously at the rate specified in this command, 30 queries a minute (so requests are sent every 2 seconds) * the default is 10 queries per minute
5. Results will be stored in a "completed" jsonl file in the output folder (which is also timestamped)
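
For orientation, the resulting folder layout looks roughly like the sketch below (the exact file names depend on the experiment name and the timestamp, so treat this as an approximation):

```
data/
├── input/
├── media/
└── output/
    └── openai/
        ├── <timestamp>-openai.jsonl            # the input file, moved here and timestamped
        └── <timestamp>-completed-openai.jsonl  # the "completed" results file, also timestamped
```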
27 changes: 15 additions & 12 deletions docs/commands.md
@@ -31,14 +31,14 @@ Note that if the experiment file is already in the input folder, we will not mak

### Automatic evaluation using an LLM-as-judge

It is possible to automatically run a LLM-as-judge evaluation of the responses by using the `--judge-location` and `--judge` arguments of the CLI. See the [Create judge file](#create-judge-file) section for more details on these arguments.
It is possible to automatically run an LLM-as-judge evaluation of the responses by using the `--judge-folder` and `--judge` arguments of the CLI. See the [Create judge file](#create-judge-file) section for more details on these arguments.

For instance, to run an experiment file with automatic evaluation using a judge, you can use the following command:
```
prompto_run_experiment \
--file path/to/experiment.jsonl \
--data-folder data \
--judge-location judge \
--judge-folder judge \
--judge gemini-1.0-pro
```

@@ -75,28 +75,31 @@ prompto_check_experiment \

## Create judge file

Once an experiment has been ran and responses to prompts have been obtained, it is possible to use another LLM as a "judge" to score the responses. This is useful for evaluating the quality of the responses obtained from the model. To create a judge file, you can use the `prompto_create_judge` command passing in the file containing the completed experiment and to a folder (i.e. judge location) containing the judge template and settings to use. To see all arguments of this command, run `prompto_create_judge --help`.
Once an experiment has been ran and responses to prompts have been obtained, it is possible to use another LLM as a "judge" to score the responses. This is useful for evaluating the quality of the responses obtained from the model. To create a judge file, you can use the `prompto_create_judge_file` command passing in the file containing the completed experiment and to a folder (i.e. judge folder) containing the judge template and settings to use. To see all arguments of this command, run `prompto_create_judge_file --help`.

To create a judge file for a particular experiment file with a judge-location as `./judge` and using judge `gemini-1.0-pro` you can use the following command:
To create a judge file for a particular experiment file, with the judge folder as `./judge` and using judge `gemini-1.0-pro`, you can use the following command:
```
prompto_create_judge \
prompto_create_judge_file \
--experiment-file path/to/experiment.jsonl \
--judge-location judge \
--judge-folder judge \
--templates template.txt \
--judge gemini-1.0-pro
```

In `judge`, you must have two files:
In `judge`, you must have the following files (an illustrative sketch of these files is given after this list):

* `template.txt`: this is the template file which contains the prompts and the responses to be scored. The responses should be replaced with the placeholders `{INPUT_PROMPT}` and `{OUTPUT_RESPONSE}`.
* `settings.json`: this is the settings json file which contains the settings for the judge(s). The keys are judge identifiers and the values are the "api", "model_name", "parameters" to specify the LLM to use as a judge (see the [experiment file documentation](experiment_file.md) for more details on these keys).
* `settings.json`: this is the settings json file which contains the settings for the judge(s). The keys are judge identifiers and the values are dictionaries with "api", "model_name" and "parameters" keys to specify the LLM to use as a judge (see the [experiment file documentation](experiment_file.md) for more details on these keys).
* template `.txt` file(s) which specify the template(s) to use for the judge. The inputs and outputs of the completed experiment file are used to generate the prompts for the judge. Each template file should contain the placeholders `{INPUT_PROMPT}` and `{OUTPUT_RESPONSE}` which will be replaced with the inputs and outputs of the completed experiment file (i.e. the corresponding values of the `prompt` and `response` keys in the prompt dictionaries of the completed experiment file).
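
For illustration, a minimal judge folder could contain files along the following lines. The judge identifier, `api`, `model_name` and `parameters` values below are placeholders rather than recommendations; see the judge example linked below for real files.

A `settings.json`:

```
{
    "gemini-1.0-pro": {
        "api": "gemini",
        "model_name": "gemini-1.0-pro",
        "parameters": {"temperature": 0}
    }
}
```

and a `template.txt`:

```
Rate the following response to the given prompt on a scale of 1 to 5 and briefly justify your rating.

PROMPT: {INPUT_PROMPT}
RESPONSE: {OUTPUT_RESPONSE}
```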

See for example [this judge example](./../examples/evaluation/judge/) which contains example template and settings files.
For the template file(s), we allow for specifying multiple templates (for different evaluation prompts), in which case the `--templates` argument should be a comma-separated list of template files. By default, this is set to `template.txt` if not specified. In the above example, we explicitly pass in `template.txt` to the `--templates` argument, so the command will look for a `template.txt` file in the judge folder.

The judge specified with the `--judge` flag should be a key in the `settings.json` file in the judge location. You can create different judge files using different LLMs as judge by specifying a different judge identifier from the keys in the `settings.json` file.
See for example [this judge example](https://github.com/alan-turing-institute/prompto/tree/main/examples/evaluation/judge) which contains example template and settings files.

The judge specified with the `--judge` flag should be a key in the `settings.json` file in the judge folder. You can create different judge files using different LLMs as judge by specifying a different judge identifier from the keys in the `settings.json` file.
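
For example, assuming the judge folder also contained a second (hypothetical) template file `template2.txt`, and that `settings.json` defined another (hypothetical) judge identifier `judge-gpt`, you could create judge files for both templates using that judge with:

```
prompto_create_judge_file \
    --experiment-file path/to/experiment.jsonl \
    --judge-folder judge \
    --templates template.txt,template2.txt \
    --judge judge-gpt
```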

## Obtain missing results jsonl file

In some cases, you may have ran an experiment file and obtained responses for some prompts but not all. To obtain the missing results jsonl file, you can use the `prompto_obtain_missing_results` command passing in the input experiment file and the corresponding output experiment. You must also specify a path to a new jsonl file which will be created if any prompts are missing in the output file. The command looks at an ID key in the `prompt_dict`s of the input and output files to match the prompts, by default the name of this key is `id`. If the key is different, you can specify it using the `--id` flag. To see all arguments of this command, run `prompto_obtain_missing_results --help`.
In some cases, you may have run an experiment file and obtained responses for some prompts but not all (e.g. where an experiment was stopped part-way through). To obtain the missing results jsonl file, you can use the `prompto_obtain_missing_results` command, passing in the input experiment file and the corresponding output experiment file. You must also specify a path to a new jsonl file which will be created if any prompts are missing from the output file. The command looks at an ID key in the `prompt_dict`s of the input and output files to match the prompts; by default the name of this key is `id`. If the key is different, you can specify it using the `--id` flag. To see all arguments of this command, run `prompto_obtain_missing_results --help`.
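
As a rough illustration (the field values below are placeholders), suppose the input experiment file contained:

```
{"id": 0, "api": "openai", "model_name": "gpt-4o-mini", "prompt": "What is the capital of France?"}
{"id": 1, "api": "openai", "model_name": "gpt-4o-mini", "prompt": "Name a prime number greater than 10."}
```

If the output file only contained a completed entry (one with a `response` key) for `id` 0, the command would write the prompt dictionary with `id` 1 to the new jsonl file so that it can be re-run.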

To obtain the missing results jsonl file for a particular experiment file with the input experiment file as `path/to/experiment.jsonl`, the output experiment file as `path/to/experiment-output.jsonl`, and the new jsonl file as `path/to/missing-results.jsonl`, you can use the following command:
```