Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jailbreak artifacts cannot be reproduce the attack effect #9

Open
ZHIXINXIE opened this issue Nov 3, 2024 · 0 comments
Open

Jailbreak artifacts cannot be reproduce the attack effect #9

ZHIXINXIE opened this issue Nov 3, 2024 · 0 comments

Comments

@ZHIXINXIE
Copy link

Hi, authors,

Thanks for your great work! I have starred your repo and your work really opens my mind. I love the detailed experiment results and the middle output. However, I meet a problem and don't know what's wrong. So I write this issue to kindly ask you to help me. Thanks in advance!

What do I do: I meet a problem when I want to reproduce the attack effect. My test is quiet straightforward: I directly use the prompt in the 26th line of jailbreak_artifacts/exps_llama2_7b.json and try to attack llama2-7b-chat.
image

As I don't know whether you use the recommended system prompt when attacking the llama2-7b-chat, I test the attack both with and without the system prompt. I test each for 50 times and count for the success times. I attach the test code and result (in txt format) in this issues. The code is quiet easy. The system prompt and the user prompt are as follow:
image

The test code are as follow:
image

If you want to reproduce my test, you can run the test_code.
log_file_withoutsystemprompt.txt
log_file_withsystemprompt.txt
test_code.txt

What's the issues: I find the attack success rate is very low (which is inconsistant with the result in your paper). In my 100 times tests, only 3 times the attack success. I don't know what's wrong about it.

What's my problem:
1/ I want to ask that how did you measure ASR in your paper? Is it “using one prompt to attack the model for the same target (e.g., ‘how to make a bomb’) 100 times and counting the number of successful attempts” or “optimizing the 100 prompts (and each prompt for one target) and count how many prompt can attack the model for the certain target at least success one times”?

2/ How do you test your prompt? Is my test way shown in the code correct?

What's my environment: transformers 4.46.1, fschat 0.2.23

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant