put placeholders for eval docs

rchan26 committed Aug 30, 2024
1 parent e77a87d commit 59d8f4b
Showing 9 changed files with 39 additions and 2 deletions.
13 changes: 13 additions & 0 deletions README.md
@@ -26,6 +26,8 @@

`prompto` derives from the Italian word "_pronto_" which means "_ready_". It could also mean "_I prompt_" in Italian (if "_promptare_" were a verb meaning "_to prompt_").

A pre-print for this work is available on [arXiv](https://arxiv.org/abs/2408.11847). If you use this library, please see the [citation](#citation) below. For the experiments in the pre-print, see the [system demonstration examples](./examples/system-demo/README.md).

## Why `prompto`?

The benefit of _asynchronous querying_ is that multiple requests can be sent to an API _without_ waiting for the LLM's response to each one, which makes it possible to fully utilise the rate limits of an API. This is especially useful when an experiment file contains a large number of prompts and/or queries several models. [_Asynchronous programming_](https://docs.python.org/3/library/asyncio.html) is simply a way for programs to avoid getting stuck on long tasks (like waiting for an LLM response from an API) and instead keep running other things at the same time (such as sending other queries).
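As a rough illustration of this idea (a minimal sketch using a hypothetical `query_model` stand-in, not `prompto`'s actual implementation), `asyncio.gather` can send several requests concurrently instead of waiting for each response in turn:

```python
import asyncio


async def query_model(prompt: str) -> str:
    # Stand-in for an API call: a real implementation would await an HTTP
    # request to an LLM endpoint and return its response.
    await asyncio.sleep(1)  # simulate waiting for the API
    return f"response to: {prompt}"


async def run_queries(prompts: list[str]) -> list[str]:
    # Send every request without waiting for the previous response to arrive.
    return await asyncio.gather(*(query_model(p) for p in prompts))


if __name__ == "__main__":
    responses = asyncio.run(run_queries(["prompt 1", "prompt 2", "prompt 3"]))
    print(responses)
```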
@@ -201,3 +203,14 @@ The library has a few key classes:
* [`AsyncAPI`](https://github.com/alan-turing-institute/prompto/blob/main/src/prompto/apis/base.py): this is the base class for querying all APIs. Each API/model should inherit from this class and implement the `query` method which will (asynchronously) query the model's API and return the response. When running an experiment, the `Experiment` class will call this method for each prompt in the experiment to send requests asynchronously.

When a new model is added, you must add it to the [`API`](https://github.com/alan-turing-institute/prompto/blob/main/src/prompto/apis/base.py) dictionary which is in the `apis` module. This dictionary should map the model name to the class of the model. For details on how to add a new model, see the [guide on adding new APIs and models](./docs/add_new_api.md).
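As an illustrative sketch only (the import path and the `query` signature below are assumptions rather than `prompto`'s confirmed interface; see the guide linked above for the actual steps), adding a new API might look roughly like this:

```python
# Illustrative sketch; the import path, method signature and registration
# shown here are assumptions (see the guide on adding new APIs and models
# for the actual interface).
from prompto.apis.base import AsyncAPI


class MyNewAPI(AsyncAPI):
    """Hypothetical class for querying a new model's API."""

    async def query(self, prompt_dict: dict) -> dict:
        # Asynchronously send the prompt to the new endpoint and return
        # the prompt dictionary with the model's response attached.
        prompt_dict["response"] = "..."  # replace with a real API call
        return prompt_dict


# The new class must then be added to the API dictionary in the `apis`
# module, mapping the API name to the class (hypothetical mapping):
API = {"my-new-api": MyNewAPI}
```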

## Citation

```
@article{chan2024prompto,
title={Prompto: An open source library for asynchronous querying of LLM endpoints},
author={Chan, Ryan Sze-Yin and Nanni, Federico and Brown, Edwin and Chapman, Ed and Williams, Angus R and Bright, Jonathan and Gabasova, Evelina},
journal={arXiv preprint arXiv:2408.11847},
year={2024}
}
```
1 change: 1 addition & 0 deletions docs/README.md
@@ -15,6 +15,7 @@ To view this documentation in a more readable format, visit the [prompto documen
* [Configuring environment variables](./environment_variables.md)
* [prompto commands](./commands.md)
* [Specifying rate limits](./rate_limits.md)
* [Using prompto for evaluation](./evaluation.md)

## Reference

14 changes: 14 additions & 0 deletions docs/about.md
@@ -3,3 +3,17 @@
`prompto` is a Python library written by the [Research Engineering Team (REG)](https://www.turing.ac.uk/work-turing/research/research-engineering-group) at the [Alan Turing Institute](https://www.turing.ac.uk/). It was originally written by [Ryan Chan](https://github.com/rchan26), [Federico Nanni](https://github.com/fedenanni) and [Evelina Gabasova](https://github.com/evelinag).

The library facilitates running language model experiments stored as jsonl files, automating the querying of API endpoints and logging progress asynchronously. It is designed to be extensible and can be used to query different models.

## Citation

A pre-print for this work is available on [arXiv](https://arxiv.org/abs/2408.11847).

Please cite the library as:
```
@article{chan2024prompto,
title={Prompto: An open source library for asynchronous querying of LLM endpoints},
author={Chan, Ryan Sze-Yin and Nanni, Federico and Brown, Edwin and Chapman, Ed and Williams, Angus R and Bright, Jonathan and Gabasova, Evelina},
journal={arXiv preprint arXiv:2408.11847},
year={2024}
}
```
2 changes: 1 addition & 1 deletion docs/commands.md
@@ -90,7 +90,7 @@ In `judge`, you must have two files:
* `template.txt`: this is the template file for the judge prompt. The prompt and response to be scored are inserted in place of the placeholders `{INPUT_PROMPT}` and `{OUTPUT_RESPONSE}`.
* `settings.json`: this is the settings json file which contains the settings for the judge(s). The keys are judge identifiers and the values are dictionaries specifying the "api", "model_name" and "parameters" of the LLM to use as a judge (see the [experiment file documentation](experiment_file.md) for more details on these keys). An illustrative sketch of both files is given below.

See for example [this judge example](./../examples/judge/) which contains example template and settings files.
See, for example, [this judge example](./../examples/evaluation/judge/), which contains example template and settings files.

The judge specified with the `--judge` flag should be a key in the `settings.json` file in the judge location. You can create different judge files using different LLMs as judge by specifying a different judge identifier from the keys in the `settings.json` file.
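As a rough sketch of what these two files might contain (the judge identifier, API name, model name and parameter values below are made up for illustration), they could be created as follows:

```python
# Illustrative sketch only; the judge identifier, api, model_name and
# parameter values below are made up for the example.
import json

# template.txt: the placeholders are filled with the prompt and response
# to be scored when the judge file is created.
template = (
    "Rate the quality of the following response to the given prompt "
    "on a scale of 1 to 10.\n"
    "Prompt: {INPUT_PROMPT}\n"
    "Response: {OUTPUT_RESPONSE}\n"
)
with open("template.txt", "w") as f:
    f.write(template)

# settings.json: keys are judge identifiers, values specify the LLM to use
# as a judge via "api", "model_name" and "parameters".
settings = {
    "judge-1": {
        "api": "openai",
        "model_name": "gpt-4o",
        "parameters": {"temperature": 0.0},
    }
}
with open("settings.json", "w") as f:
    json.dump(settings, f, indent=4)
```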

8 changes: 8 additions & 0 deletions docs/evaluation.md
@@ -0,0 +1,8 @@
# Evaluation

A common use case for `prompto` is to evaluate the performance of different models on a given task, which first requires obtaining a large number of responses.
In `prompto`, we provide functionality to automate the querying of different models and endpoints to obtain responses to a set of prompts and _then evaluate_ these responses.

## Automatic evaluation using an LLM-as-a-judge

## Automatic evaluation using a scoring function
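As a generic illustration only (not `prompto`'s built-in evaluation interface; the output file name and the "response"/"expected" keys are assumptions for this sketch), a simple scoring function applied to completed responses might look like:

```python
# Generic illustration of scoring completed responses; the file name and
# the "response"/"expected" keys are assumptions for this sketch.
import json


def exact_match(response: str, expected: str) -> int:
    """Return 1 if the response matches the expected answer, else 0."""
    return int(response.strip().lower() == expected.strip().lower())


with open("output-experiment.jsonl") as f:
    results = [json.loads(line) for line in f]

scores = [exact_match(r["response"], r["expected"]) for r in results]
print(f"Accuracy: {sum(scores) / len(scores):.2%}")
```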
File renamed without changes.
File renamed without changes.
2 changes: 1 addition & 1 deletion examples/system-demo/README.md
@@ -1,6 +1,6 @@
# System Demonstration examples

We provide some illustrative examples of how to use `prompto` and compare it against traditional a synchronous approach to querying LLM endpoints.
We provide some illustrative examples of how to use `prompto` and compare it against a traditional synchronous approach to querying LLM endpoints. These experiments are analysed in our system demonstration paper, currently available as a pre-print on [arXiv](https://arxiv.org/abs/2408.11847).

We sample prompts from instruction-following data generated using the Self-Instruct approach of [1] and [2]. We take a sample of 100 prompts from the [instruction-following data](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json) from [2] and apply the same prompt template. We then use these as prompt inputs to different models using `prompto`. See the [Generating the prompts for experiments](./alpaca_sample_generation.ipynb) notebook for more details.

1 change: 1 addition & 0 deletions mkdocs.yml
@@ -41,6 +41,7 @@ nav:
- Running experiments and the pipeline: docs/pipeline.md
- prompto commands: docs/commands.md
- Specifying rate limits: docs/rate_limits.md
- Using prompto for evaluation: docs/evaluation.md
- Implemented APIs:
- APIs overview: docs/models.md
- Azure OpenAI: docs/azure_openai.md
