This repository contains the dataset used in the work "Enhancing AI-based Generation of Software Exploits with Contextual Information", accepted for publication at the 35th IEEE International Symposium on Software Reliability Engineering (ISSRE'24).
The dataset was used to fine-tune NMT models to generate offensive security code from natural language descriptions. It comprises real shellcodes and is designed to evaluate the models across various scenarios, including missing information, necessary context, and unnecessary context.
Each line represents an intent-snippet pair. The intent is a code description in the English language, while the snippet is a line of assembly code written in NASM syntax. The dataset is divided as follows:

- `[filename].in`, containing the NL code descriptions;
- `[filename].out`, containing the assembly code snippets.
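Since the `.in` and `.out` files are line-aligned, loading the intent-snippet pairs amounts to zipping the two files together. A minimal sketch (the `load_pairs` helper and the `train` stem are placeholder names, not part of this repository):

```python
def load_pairs(stem):
    """Load parallel intent-snippet pairs from <stem>.in / <stem>.out.

    Line i of the .in file (NL description) corresponds to line i
    of the .out file (NASM assembly snippet).
    """
    with open(stem + ".in", encoding="utf-8") as f_in, \
         open(stem + ".out", encoding="utf-8") as f_out:
        intents = [line.rstrip("\n") for line in f_in]
        snippets = [line.rstrip("\n") for line in f_out]
    # The files must be parallel: one snippet per description.
    assert len(intents) == len(snippets), "files are not parallel"
    return list(zip(intents, snippets))
```

Substitute the actual dataset filenames for the stem when calling `load_pairs`.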
To assess the models' resilience against incomplete descriptions and their ability to leverage contextual information for enhanced accuracy, we described the code snippets using the following notations:
- No Context: Instructions without added contextual information, forming the baseline for model performance assessment. This category includes 963 lines, approximately 44% of the dataset.
- 2to1 Context: Instructions incorporating the immediately preceding instruction to provide context, accounting for 360 lines, roughly 17% of the dataset.
- 3to1 Context: Instructions extending the context further by including the two instructions preceding the current one, comprising 238 lines, about 11% of the dataset.
- 2to1 Unnecessary Context: Samples incorporating a previous instruction that does not logically link to the current task; this scenario encompasses 303 lines, 14% of the dataset.
- 3to1 Unnecessary Context: Similar to 2to1 Unnecessary Context but with two preceding instructions, also making up 303 lines, 14% of the dataset.
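The stated counts and shares are consistent with each other, as a quick arithmetic check shows (counts taken from the list above; the variable names are illustrative):

```python
# Lines per context category, as listed above.
counts = {
    "No Context": 963,
    "2to1 Context": 360,
    "3to1 Context": 238,
    "2to1 Unnecessary Context": 303,
    "3to1 Unnecessary Context": 303,
}

total = sum(counts.values())  # 2167 lines overall

# Rounded percentage share of each category.
shares = {name: round(100 * n / total) for name, n in counts.items()}
# shares -> 44%, 17%, 11%, 14%, 14%, matching the figures above
```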
For further information, contact us via email: pietro.liguori@unina.it (Pietro) and cristina.improta@unina.it (Cristina).