VarBERT is a BERT-based model that predicts meaningful variable names and variable origins in decompiled code. By leveraging transfer learning, VarBERT can assist with software reverse engineering tasks. It is pre-trained on 5.2M human-written source code functions and then fine-tuned on decompiled code from IDA and Ghidra, spanning four compiler optimization levels (O0, O1, O2, O3). We built two data sets: (a) the Human Source Code data set (HSC) and (b) VarCorpus (for IDA and Ghidra). This work was developed for the IEEE S&P 2024 paper "Len or index or count, anything but v1": Predicting Variable Names in Decompilation Output with Transfer Learning.
Key Features
- Pre-trained on 5.2M human-written source code functions.
- Fine-tuned on decompiled code from IDA and Ghidra.
- Supports four compiler optimizations: O0, O1, O2, O3.
- Achieves an accuracy of 54.43% for IDA and 54.49% for Ghidra on O2 optimized binaries.
- A total of 16 models are available, covering two decompilers, four optimizations, and two splitting strategies.
- Overview
- VarBERT Model
- Using VarBERT
- Training and Inference
- Data sets
- Installation Instructions
- Cite
This repository describes how to generate a new data set, train new models, and run inference with the existing VarBERT models from the paper. To use VarBERT models in your day-to-day reverse engineering tasks, please refer to Use VarBERT.
VarBERT draws on transfer learning in general and on Bidirectional Encoder Representations from Transformers (BERT) in particular.
- Pre-training: VarBERT is pre-trained on HSC functions using Masked Language Modeling (MLM) and Constrained Masked Language Modeling (CMLM).
- Fine-tuning: The pre-trained model is then fine-tuned on VarCorpus (decompilation output of IDA and Ghidra). The approach can be extended to any other decompiler that generates C-style decompilation output.
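The masked-prediction setup behind both stages can be pictured with a small sketch. It only shows the shape of the input: every occurrence of a decompiler-generated variable is replaced by a mask token that a model would then fill in. The `[MASK]` string, the `mask_variables` helper, and the sample function are hypothetical illustrations; VarBERT's actual tokenizer, its mask token, and the details of the CMLM objective are not reproduced here.

```python
import re

# Placeholder mask token; the tokenizer used by a real model may differ.
MASK = "[MASK]"


def mask_variables(code, variables):
    """Replace whole-word occurrences of each variable name with MASK."""
    masked = code
    for name in variables:
        # \b keeps "v1" from matching inside longer identifiers such as "v10".
        masked = re.sub(rf"\b{re.escape(name)}\b", MASK, masked)
    return masked


# Hypothetical decompiler output with placeholder names a1 and v1.
decompiled = (
    "int sub_401000(char *a1) "
    "{ int v1 = 0; while (a1[v1]) ++v1; return v1; }"
)
masked = mask_variables(decompiled, ["a1", "v1"])
print(masked)
```

A model trained with masked language modeling receives text like `masked` and predicts a token sequence for each masked position, which is how a name such as `v1` can be replaced by a more meaningful one.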
The VarBERT API is a Python library for accessing and using the latest models. It can be used in three ways:
- From the CLI, directly on decompiled text (without an attached decompiler).
- As a scripting library.
- As a decompiler plugin with DAILA for an enhanced decompiling experience.
For a step-by-step guide and a demo on how to get started with the VarBERT API, please visit VarBERT API.
For training a new model or running inference on existing models, see our detailed guide at Training VarBERT.
Models available for download:
(A README containing all the necessary links for the model is also available.)
- HSC: Collected from C source files in the Debian APT repository, totaling 5.2M functions.
- VarCorpus: Decompiled functions from C and C++ binaries, built from the Gentoo package repository at four compiler optimization levels: O0, O1, O2, and O3.
Additionally, VarCorpus is split in two ways: (a) Function Split and (b) Binary Split.
- Function Split: Functions are randomly distributed between the test and train sets.
- Binary Split: All functions from a single binary are exclusively present in either the test set or the train set.
To create a new data set, follow the detailed instructions at Building VarCorpus.
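The difference between the two splitting strategies can be sketched as follows. This is a minimal illustration, not the code used to build VarCorpus: the `(binary, function)` record layout, the 20% test ratio, and the helper names are assumptions made for the example.

```python
import random


def function_split(functions, test_ratio=0.2, seed=0):
    """Shuffle individual functions; one binary may end up in both sets."""
    rng = random.Random(seed)
    shuffled = list(functions)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def binary_split(functions, test_ratio=0.2, seed=0):
    """Assign whole binaries to one side, so no binary is shared."""
    rng = random.Random(seed)
    binaries = sorted({binary for binary, _ in functions})
    rng.shuffle(binaries)
    n_test = max(1, int(len(binaries) * test_ratio))
    test_binaries = set(binaries[:n_test])
    train = [f for f in functions if f[0] not in test_binaries]
    test = [f for f in functions if f[0] in test_binaries]
    return train, test


# Toy corpus of (binary name, function name) pairs.
corpus = [(f"bin{b}", f"func{b}_{i}") for b in range(5) for i in range(4)]
fs_train, fs_test = function_split(corpus)
bs_train, bs_test = binary_split(corpus)
```

With a function split, functions from a binary seen during training can also appear in the test set. A binary split rules this out, so the test set contains only binaries that contributed nothing to training.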
Data sets available at:
The fine-tuned models and their corresponding data sets are named IDA-O0-Function and IDA-O0, respectively. This naming convention indicates that the models and data sets are based on functions decompiled from O0 binaries using the IDA decompiler.
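As a sketch of how this convention could cover all 16 models, the snippet below enumerates every combination of decompiler, optimization level, and split. Only `IDA-O0-Function` is confirmed above; the `Ghidra` and `Binary` spellings, and the assumption that every other model follows the same pattern, are extrapolations that should be checked against the published download links.

```python
from itertools import product

# Components of the name; spellings other than "IDA", "O0", and
# "Function" are assumed rather than taken from the published names.
DECOMPILERS = ["IDA", "Ghidra"]
OPT_LEVELS = ["O0", "O1", "O2", "O3"]
SPLITS = ["Function", "Binary"]


def model_names():
    """Build <decompiler>-<optimization>-<split> for every combination."""
    return [
        f"{decompiler}-{opt}-{split}"
        for decompiler, opt, split in product(DECOMPILERS, OPT_LEVELS, SPLITS)
    ]


names = model_names()
print(len(names))
print(names[0])
```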
Note
Our existing data sets have been generated using IDA Pro 7.6 and Ghidra 10.4.
Prerequisites for training a model or generating a data set
Linux with Python 3.8 or higher
torch ≥ 1.9.0
transformers ≥ 4.10.0
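One way to confirm these minimums before training is a short check script. This is a sketch, not part of the repository: the `check_prerequisites` and `parse_version` helpers are hypothetical, and the version parsing is deliberately simple (it ignores local suffixes such as `+cu118` and treats pre-releases like final releases).

```python
import re
import sys
from importlib import metadata

# Minimum versions taken from the prerequisites listed above.
MINIMUMS = {"torch": (1, 9, 0), "transformers": (4, 10, 0)}


def parse_version(text):
    """Turn a version string such as '2.1.0+cu118' into (2, 1, 0)."""
    numbers = re.findall(r"\d+", text.split("+")[0])
    parts = [int(n) for n in numbers[:3]]
    parts += [0] * (3 - len(parts))  # pad '1.9' to (1, 9, 0)
    return tuple(parts)


def check_prerequisites():
    """Return a dict mapping each requirement to True if it is satisfied."""
    report = {"python": sys.version_info[:2] >= (3, 8)}
    for package, minimum in MINIMUMS.items():
        try:
            installed = parse_version(metadata.version(package))
        except metadata.PackageNotFoundError:
            report[package] = False  # package is not installed
            continue
        report[package] = installed >= minimum
    return report


report = check_prerequisites()
for name, ok in report.items():
    print(f"{name}: {'ok' if ok else 'missing or too old'}")
```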
docker build -t varbert .
pip install -r requirements.txt
# joern requires Java 11
sudo apt-get install openjdk-11-jdk
# Ghidra 10.4 requires Java 17+
sudo apt-get install openjdk-17-jdk
git clone git@github.com:rhelmot/dwarfwrite.git
cd dwarfwrite
pip install .
Note: Ensure you install the correct Java version required by your specific Ghidra version.
TODO