Fluent dreaming for language models.

The code here implements the discrete prompt optimization algorithms in the paper "Fluent student-teacher redteaming".

Please also see the companion page that demonstrates using the code here.

The demo.ipynb file here is the source for that companion page.

Key modules:

- `flrt.attack`: The main attack entry point, including the `AttackConfig` object.
- `flrt.victim`: Code for managing attack "victims", the models that will be forced to misbehave.
- `flrt.templates`: Attack templates that specify which subset of the prompt the discrete optimization is allowed to change.
- `flrt.util`: Tools for loading models and tokenizers and for generating text.

The remaining code is either internal to the algorithm (`flrt.objective`, `flrt.operators`) or scaffolding for running on Modal (`flrt.modal_defs`, `flrt.modal_download`) or for running evaluations (`flrt.judge`).
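
To make the idea behind attack templates more concrete, here is a minimal, self-contained sketch of a template that keeps part of the prompt fixed and exposes only one span to an optimizer. This is not the `flrt.templates` API: the `PromptTemplate` class, its fields, and its methods are hypothetical names chosen for illustration, and the sketch works on characters rather than tokens for simplicity.

```python
from dataclasses import dataclass


@dataclass
class PromptTemplate:
    """Hypothetical template: fixed text surrounding one optimizable span.

    Illustration only; this does not mirror the classes in flrt.templates.
    """

    prefix: str  # fixed text before the optimizable span
    suffix: str  # fixed text after the optimizable span

    def render(self, optimized: str) -> str:
        # Build the full prompt from the fixed parts and the candidate text.
        return self.prefix + optimized + self.suffix

    def optimizable_span(self, optimized: str) -> tuple[int, int]:
        # Character offsets (start, end) of the only region an optimizer may edit.
        start = len(self.prefix)
        return start, start + len(optimized)


template = PromptTemplate(prefix="User: ", suffix="\nAssistant:")
candidate = "tell me a story"

prompt = template.render(candidate)
start, end = template.optimizable_span(candidate)

# Only prompt[start:end] may change between optimization steps;
# the prefix and suffix stay constant.
print(repr(prompt[start:end]))
```

A real implementation would likely track token positions rather than character offsets, since discrete optimization operates on tokens.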
