These artifacts provide the following four datasets:
- syntia: 500 functions obfuscated with EncodeArithmetic and EncodeData
- custom_EA: 500 functions obfuscated with EncodeArithmetic
- custom_VR_EA: 500 functions obfuscated with Virtualize and EncodeArithmetic
- custom_EA_ED: 500 functions obfuscated with EncodeArithmetic and Virtualize (the database trace is truncated to 293 functions)
Each dataset directory contains the following files:
-
original.c: source file containing all the functions
-
obfuscated.c: obfuscated source file as generated by Tigress
-
run_tigress.sh: Shell script to regenerating the obfuscated.c file
-
ground_truth.json: a JSON file containing the ground truth. For each function it contains the original expression (in original.c) and its obfuscated counter-part in (obfuscated.c)
-
obfuscated: pre-compiled binary as x86_64 (no PIE)
-
trace.db: the execution trace as a SQLite database (likely the most interesting file)
-
results:
- qsynth.log: the raw output of the synthesis script
- qsynth.json: QSynth processable output (very likely the second most interesting file)
The qsynth.json contains for each function the the following entries such as:
{
"target_1": {
"fun_name": "target_1",
"start": 164,
"stop": 238,
"orig_ast": "~(b*d ^ d*d) - b",
"orig_node": 10,
"orig_depth": 5,
"obfu_ast": "(((b & d)*(b | d) + (b & ~d)*(~b & d) & (d & d)*(d | d) + (d & ~d)*(~d & d)) - ((b & d)*(b | d) + (b & ~d)*(~b & d) | (d & d)*(d | d) + (d & ~d)*(~d & d)) - 1 ^ b) - ((~(((b & d)*(b | d) + (b & ~d)*(~b & d) & (d & d)*(d | d) + (d & ~d)*(~d & d)) - ((b & d)*(b | d) + (b & ~d)*(~b & d) | (d & d)*(d | d) + (d & ~d)*(~d & d)) - 1) & b) + (~(((b & d)*(b | d) + (b & ~d)*(~b & d) & (d & d)*(d | d) + (d & ~d)*(~d & d)) - ((b & d)*(b | d) + (b & ~d)*(~b & d) | (d & d)*(d | d) + (d & ~d)*(~d & d)) - 1) & b))",
"obfu_node": 229,
"obfu_depth": 12,
"triton_ast": "(((b & d)*(b | d) + (~b & d)*(~d & b) & d*d) - (d*d | (b & d)*(b | d) + (~b & d)*(~d & b)) - 1 ^ b) - (((d*d ^ (b & d)*(b | d) + (~b & d)*(~d & b)) & b) + ((d*d ^ (b & d)*(b | d) + (~b & d)*(~d & b)) & b))",
"triton_node": 95,
"triton_depth": 10,
"synthesized_ast": "~((d*d ^ d*b) + b)",
"synth_node": 10,
"synth_depth": 5,
"dse_t": 0.10445356369018555,
"synthesis_t": 0.03491616249084473,
"sem_orig_obf": "UNK",
"sem_obf_trit": "UNK",
"sem_orig_synth": "OK",
"is_simplified": true,
"is_fully_synthesized": true
}
}
Start and stop are the offsets in the execution trace (thus id in the DB). Then for each
expression the expression itself and its node size and depth. Then dse_t
, synthesis_t
gives
respectively the symbolic execution time and the synthesis time. is_simplified
and is_fully_synthesized
indicates if the function was simplified and if so if it was fully. sem_orig_obf
, sem_obf_trit
and sem_orig_synth
indicates if the semantic is preserved between for instance original and synthesized
(sem_orig_synth
). "UNK" indicates that it has not been checked or that it yielded a timeout.
The tables used for the benchmarks are available here: https://ret2libc.com/static/various/lts_15/ They are Python pickle objects. Expressions are encoded with a similar Reverse-Polish-Notation (RPN) than Syntia.