🤖 Code agents represent a powerful leap forward in software development: they can understand complex requirements and generate or execute functional code across multiple programming languages, and can even work from natural-language instructions.
In this work, we propose RedCode, a high-quality, large-scale (over 4,000 test cases) dataset for rigorously and comprehensively evaluating the safety of code agents. It covers diverse languages and formats (Python, Bash, natural language), supports real interaction with systems, and provides fine-grained evaluation of both code execution and code generation.
RedCode consists of RedCode-Exec and RedCode-Gen.
- RedCode-Exec provides 4,050 test cases with prompts that evaluate code agents' ability to recognize and handle unsafe code.
- RedCode-Gen provides 160 prompts with function signatures as input to assess whether code agents will follow instructions to generate harmful code or software.
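For a quick feel of the data, the snippet below peeks into one RedCode-Exec file with jq. The directory names follow the repository structure described later, but the placeholder filename and the assumption that each file is a JSON array of test cases are ours, not an official schema.

```bash
# Hedged quick look at the RedCode-Exec data (requires jq).
# Replace <file>.json with one of the actual files under the dataset directory;
# the array layout assumed below is an illustration, not the documented schema.
ls dataset/RedCode-Exec/                        # list the per-scenario dataset files
jq 'length' dataset/RedCode-Exec/<file>.json    # number of test cases in that file
jq '.[0]' dataset/RedCode-Exec/<file>.json      # pretty-print the first test case
```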
For the safety leaderboard and more visualized results, please consider visiting our RedCode webpage.
🚧 Note: We are working hard to wrap up all the code to provide an off-the-shelf deployment experience.
To stay updated, consider starring⭐️ and watching😎 this repository. Your support means a lot to us!
The dataset directory contains the datasets RedCode-Exec and RedCode-Gen, which are used as inputs for the agents.
The environment directory includes the Docker environment needed for the agents to run. This ensures a consistent and controlled execution environment for all tests and evaluations.
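As a minimal sketch of standing up that sandbox (assuming the environment directory ships a top-level Dockerfile, which is an assumption here, and using redcode-env purely as an illustrative tag):

```bash
# Build the evaluation sandbox image from the environment directory
# (assumes a top-level Dockerfile there; the tag name is illustrative).
docker build -t redcode-env ./environment
# Open an interactive shell inside the container to confirm it starts cleanly.
docker run --rm -it redcode-env /bin/bash
```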
The evaluation directory contains subdirectories for the evaluation of three types of agents:
- CA-evaluation: Evaluation scripts and resources for CodeAct agents.
- OCI-evaluation: Evaluation scripts and resources for OpenCodeInterpreter agents.
- RA-evaluation: Evaluation scripts and resources for ReAct agents.
Additionally, evaluation.py serves as the evaluation script for each risky scenario.
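To see where the per-scenario evaluation code lives before reaching for the wrapper scripts, a plain listing is enough (paths follow the directory names above):

```bash
# List the per-agent evaluation subdirectories and locate the evaluation.py scripts.
ls evaluation/
find evaluation -name "evaluation.py"
```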
The result directory stores the results of the evaluations.
The scripts directory contains the bash scripts to run the evaluations for the OCI, RA, and CA agents.
Follow these steps to set up the project locally.
Clone this GitHub repo:
```bash
git clone https://github.com/AI-secure/RedCode.git
```
The environment.yml file lists all dependencies required for the project. You can use the following commands to set up the redcode conda environment.
```bash
conda env create -f environment.yml
conda activate redcode
```
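A couple of optional sanity checks before running the evaluations; these assume the environment is named redcode as above and that Docker is available for the execution sandbox described earlier:

```bash
# Confirm the interpreter now comes from the redcode conda environment.
which python && python --version
# The execution sandbox is Docker-based, so the Docker CLI should be reachable.
docker --version
```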
Run the evaluation scripts for the OCI, RA, and CA agents:

```bash
./scripts/OCI_eval.sh
./scripts/RA_eval.sh
./scripts/CA_eval.sh
```
Currently, the scripts are run separately. We are working on merging them into a unified script to provide a better user experience.
If you find our work helpful, please consider citing it as follows:
```bibtex
@inproceedings{guo2024redcode,
  title={RedCode: Risky Code Execution and Generation Benchmark for Code Agents},
  author={Guo, Chengquan and Liu, Xun and Xie, Chulin and Zhou, Andy and Zeng, Yi and Lin, Zinan and Song, Dawn and Li, Bo},
  booktitle={Thirty-Eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024}
}
```
Please reach out to us if you have any suggestions or need any help in reproducing the results. You can submit an issue or pull request, or send an email to chengquanguo@zju.edu.cn, xunliu.aq@gmail.com, or chulinx2@illinois.edu. Thanks for your attention!