Curated collection of prompts that break popular AI assistants leading to unexpected behaviour
Contents:
- Objective
- Test Categories
- Systems under investigation
- Results
To drive an open-source efforts to collect information on how large-scale generative models (hidden behind an API) can be exploited to drive their response away from the expected behavior. The systems being tested should be available publically to ensure equitable access.
The following categories for testing are currently supported, each can have a subsection based on the type of vulnerability, for example mathematics under hallucination.
Definition: Engineering prompts that allow systems to bypass the initial invisible prompt and do as directed by the user.
Criteria: Demonstrate bypass behavior with and without prompt.
Definition: Engineer prompts that leads the system to generate imaginary text based on real-world entities (more useful in cases of assistants that are tailored to work on real-world data).
Criteria: Demonstrate hallucinated information for real-world entities.
Definition: Engineer prompts that leads the system to generate hate-speech/bias against a particular group.
Criteria: Demonstrate the generation of toxic text for particular group.
Definition: Engineer prompts that lead the system to generate self-contradicting behavior.
Criteria: Demonstrate self-contradiction in a multi-turn setting.
Defintion: Engineer prompts that lead the system to generate texts that contain instructions to harm a certain entity. Criteria: Demonstrate the generation of harmful-prompts.
Please ensure that the submitted prompts are not a simple paraphrase or style modification of already existing exploits, as such contributions will be rejected. A template for making a submission can be found here, that should be included in addition to the PR submitted.