Berkeley detection platform - data processing

Steps to reproduce the process on the Berkeley dataset

1. Download the data

Download the 3% subset (33,000 samples) from the "Malicious Content Detection Platform Project Homepage" at the following link: http://secml.cs.berkeley.edu/detection_platform/release_tarball.tar.gz

For download access to the reports of the full 1M samples, please contact me at: Nir.Rosen@post.idc.ac.il
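
As a minimal sketch of this step, the tarball can be fetched with Python's standard library. The function name `download_tarball` and the local file name are illustrative assumptions, not part of the repository's scripts.

```python
import urllib.request
from pathlib import Path

# URL taken from the link above.
TARBALL_URL = "http://secml.cs.berkeley.edu/detection_platform/release_tarball.tar.gz"


def download_tarball(url: str = TARBALL_URL, dest: str = "release_tarball.tar.gz") -> Path:
    """Download the tarball to `dest` unless a file with that name already exists."""
    dest_path = Path(dest)
    if not dest_path.exists():
        # urlretrieve streams the response body to the given local file.
        urllib.request.urlretrieve(url, str(dest_path))
    return dest_path
```

Calling `download_tarball()` with no arguments saves the file in the current directory; a second call returns immediately because the file is already present.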

2. Extract the cuckoo reports

Extract the cuckoo reports from the tarball (the files that are named behaviour_0000 and begin with the word "info").
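
The extraction can be sketched with the standard `tarfile` module. This is an illustration under assumptions: the helper name `extract_matching` is hypothetical, and it assumes the report files can be recognised by a fragment of their base name (for example `behaviour_`). Members are written under their base name only, so paths stored in the archive cannot escape the output directory.

```python
import shutil
import tarfile
from pathlib import Path


def extract_matching(tarball, out_dir, name_fragment):
    """Extract regular files whose base name contains `name_fragment`."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(tarball, "r:*") as tar:  # "r:*" auto-detects compression
        for member in tar:
            if not member.isfile():
                continue
            base = Path(member.name).name
            if name_fragment not in base:
                continue
            target = out / base
            # Copy the member's bytes to the target file.
            with tar.extractfile(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(target)
    return extracted
```

For example, `extract_matching("release_tarball.tar.gz", "cuckoo_raw", "behaviour_")` would collect the matching files into a `cuckoo_raw` directory (a directory name chosen here for illustration).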

3. Process the cuckoo reports

Each file contains multiple cuckoo reports. The following script extracts each report into its own file and applies a minimum syscall-sequence length; the hashes of the samples that pass this filter are listed in the file below.

Script: Scripts/script1_flatter_cuckoo_reports.py

Filtered results: Data/list_of_16K_hashes_with_seq_len_limit.txt
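
The idea of this step can be sketched as follows. This is not the repository's script: the layout of the raw files is not described here, so the sketch assumes one JSON object per line, with hypothetical keys `sha256` (the sample hash) and `calls` (the syscall sequence). The function name `flatten_reports` is also illustrative.

```python
import json
from pathlib import Path


def flatten_reports(lines, out_dir, min_calls):
    """Write each report to its own file; skip reports with fewer than `min_calls` calls.

    Returns the hashes of the reports that were kept.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        report = json.loads(line)
        calls = report.get("calls", [])  # hypothetical key
        if len(calls) < min_calls:
            continue  # sequence too short, filter it out
        sample_hash = report["sha256"]  # hypothetical key
        (out / f"{sample_hash}.json").write_text(json.dumps(report))
        kept.append(sample_hash)
    return kept
```

The returned list plays the role of the filtered hash list: writing it out one hash per line would give a file of the same kind as the one named above.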

4. Extract the VT reports

Extract the VirusTotal (VT) reports from the tarball (the files that are named reports_0000 and begin with the word "vhash").
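
One possible reading of "begin with the word" is that the file contents start with that word; this is an assumption. Under it, a small check like the one below could confirm that already-extracted files are the VT report files. The helper name `files_starting_with` is hypothetical.

```python
from pathlib import Path


def files_starting_with(directory, prefix: bytes):
    """Return the files in `directory` whose first bytes equal `prefix`."""
    matches = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        with open(path, "rb") as fh:
            head = fh.read(len(prefix))  # read only as many bytes as needed
        if head == prefix:
            matches.append(path)
    return matches
```

For instance, `files_starting_with("extracted", b"vhash")` would list the candidate VT report files in a directory named `extracted` (an illustrative name).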

5. Process the VT reports

Each file contains multiple VT reports. The following script extracts selected fields (sample-hash, malicious-score, sample-first-seen) from each report and stores them in a single file.

Script: Scripts/script2_vt_to_json.py

Output: Data/dict_hash_to_score_time.json
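
A rough sketch of this step, building a hash-keyed dictionary and saving it as JSON. The input layout is an assumption (one JSON object per line), and the key names `hash`, `score`, and `first_seen` are hypothetical stand-ins for the three fields named above; the real reports may use different names.

```python
import json


def build_hash_index(lines):
    """Map each sample hash to its malicious score and first-seen time."""
    index = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        report = json.loads(line)
        # Keep only the three fields of interest (hypothetical key names).
        index[report["hash"]] = {
            "score": report["score"],
            "first_seen": report["first_seen"],
        }
    return index


def save_index(index, path):
    """Store the whole dictionary in a single JSON file."""
    with open(path, "w") as fh:
        json.dump(index, fh, indent=2, sort_keys=True)
```

Keeping the result as one dictionary keyed by hash makes it easy to join the VT metadata with the per-sample cuckoo reports later on.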

6. Train-Test Split

The train-test split takes into account both the malicious-to-benign ratio within each dataset (for simplicity, 50% malware and 50% benign files) and time consistency between the datasets.

Notebook: Scripts/script3_notebook_split_by_score_time.ipynb
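
The split can be illustrated with a small sketch. It is not the notebook's code and rests on stated assumptions: time consistency is taken to mean that every training sample was first seen before a cutoff and every test sample at or after it; a sample is labelled malicious when its score reaches a hypothetical `threshold`; and the 50/50 ratio is obtained by downsampling the larger class in each set. First-seen times are assumed to be ISO-formatted strings, which sort chronologically.

```python
import random


def split_by_time(samples, cutoff, threshold=0.5, seed=0):
    """Split `samples` (hash -> (score, first_seen)) into balanced train/test sets.

    Returns two lists of (hash, label) pairs, where label 1 means malicious.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible

    def balance(items):
        malicious = [h for h, label in items if label == 1]
        benign = [h for h, label in items if label == 0]
        n = min(len(malicious), len(benign))
        # Downsample both classes to the size of the smaller one.
        chosen = [(h, 1) for h in rng.sample(malicious, n)]
        chosen += [(h, 0) for h in rng.sample(benign, n)]
        rng.shuffle(chosen)
        return chosen

    labeled = [(h, int(score >= threshold), seen) for h, (score, seen) in samples.items()]
    train = [(h, y) for h, y, seen in labeled if seen < cutoff]
    test = [(h, y) for h, y, seen in labeled if seen >= cutoff]
    return balance(train), balance(test)
```

Splitting on first-seen time rather than at random avoids training on samples that appeared later than those used for testing.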

nirosen/Malware-classification-of-the-Berkeley-detection-dataset