shettynitis/Jailbreak_ChatGPT3.5

Jailbreak_ChatGPT3.5

This repository contains the research code and results for generating adversarial prompts that attempt to bypass ChatGPT-3.5's moderation framework using the KOV algorithm. The project evaluates how effective adversarial suffixes are at circumventing moderation systems and explores hyperparameter tuning to optimize the process.

The repository is organized into four main components:

  • KOV-main: Contains the implementation of the KOV algorithm and its dependencies.
  • Output: Contains CSV files with the results generated during evaluation for Q1, Q2, and Q3.
  • Jailbreaking_Starter.ipynb: Notebook used to run our KOV implementation.
  • harmful_behaviors.csv: Prompts used for Q3.

Repository Structure

KOV-main

This folder contains all the scripts and utilities required to implement and evaluate the KOV algorithm:

  • DPW_Solver.py: Implements the DPW (Double Progressive Widening) solver for Monte Carlo Tree Search.
  • KOV.py: The main script that orchestrates the KOV algorithm's workflow, including hyperparameter tuning and adversarial prompt generation.
  • WhiteBox.py: Provides the white-box functionality for generating adversarial suffixes and scoring them.
  • utils.py: Contains utility functions used across the implementation.
  • advbench_subset.csv: A subset of adversarial benchmark prompts used for testing.
  • LICENSE: Licensing information for this repository.
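
To make the DPW entry above more concrete, here is a minimal sketch of the progressive-widening test that DPW-style Monte Carlo Tree Search solvers typically use. It is not taken from DPW_Solver.py: the names `Node`, `should_widen`, and `select_or_expand`, and the defaults `k=1.0` and `alpha=0.5`, are assumptions chosen for illustration. The idea is that a node may add a new child only while its child count is at most `k * visits**alpha`, so the branching factor grows slowly with the visit count. "Double" refers to applying the same test at two levels of the tree, once when choosing actions and once when sampling next states.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: visit count plus the children expanded so far."""
    visits: int = 0
    children: list = field(default_factory=list)


def should_widen(num_children: int, visits: int, k: float = 1.0, alpha: float = 0.5) -> bool:
    """Progressive-widening test: allow a new child while |children| <= k * visits**alpha."""
    return num_children <= k * (visits ** alpha)


def select_or_expand(node: Node, sample_child, k: float = 1.0, alpha: float = 0.5, rng=random):
    """Visit `node` once: add a freshly sampled child if widening allows it,
    otherwise revisit one of the existing children chosen uniformly at random.

    `sample_child` is a zero-argument callable supplied by the caller (an assumption
    of this sketch); a full solver would usually pick among existing children with
    a UCB-style rule rather than uniformly.
    """
    node.visits += 1
    if should_widen(len(node.children), node.visits, k, alpha):
        child = sample_child()
        node.children.append(child)
        return child
    return rng.choice(node.children)


if __name__ == "__main__":
    root = Node()
    for i in range(100):
        select_or_expand(root, sample_child=lambda i=i: f"child-{i}")
    print(f"{root.visits} visits -> {len(root.children)} children")
```

With `alpha=0.5` the number of children grows roughly like the square root of the visit count, so most visits refine existing branches instead of creating new ones. Larger `k` or `alpha` values widen the tree faster.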

Output

This folder contains CSV files with the results for each subpart of Q1, Q2, and Q3.

The JSON output files are stored in a shared Google Drive folder: Output JSON Files
