Download the 3% subset (33,000 samples) from the "Malicious Content Detection Platform Project Homepage" at the following link: http://secml.cs.berkeley.edu/detection_platform/release_tarball.tar.gz
For download access to the reports of the full 1M samples, please contact me: Nir.Rosen@post.idc.ac.il
Extract the Cuckoo reports from the tarball file (the files named behaviour_0000, which begin with the word "info").
Each file contains multiple Cuckoo reports; the following script extracts each report into its own report file. The script enforces a minimum syscall-sequence length, and the filtered results are listed in the data file that follows the script: Scripts/script1_flatter_cuckoo_reports.py Data/list_of_16K_hashes_with_seq_len_limit.txt
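The flattening and filtering step can be sketched roughly as follows. This is a minimal illustration, not the contents of script1_flatter_cuckoo_reports.py: it assumes each report arrives as one JSON object per line, and the field names "sha256" and "calls", the function name flatten_reports, and the threshold value are all hypothetical.

```python
import json
from pathlib import Path

# Hypothetical threshold; the value used by the actual script is not stated here.
MIN_SEQ_LEN = 100


def flatten_reports(lines, out_dir, min_seq_len=MIN_SEQ_LEN):
    """Write each sufficiently long report to its own file and return the kept hashes.

    Assumes (hypothetically) one JSON report per line, with the sample hash
    under "sha256" and the syscall sequence as a list under "calls".
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        report = json.loads(line)
        calls = report.get("calls", [])
        if len(calls) < min_seq_len:
            continue  # drop reports whose syscall sequence is too short
        sample_hash = report["sha256"]
        (out_dir / f"{sample_hash}.json").write_text(json.dumps(report))
        kept.append(sample_hash)
    return kept
```

The returned list plays the role of a hash list such as Data/list_of_16K_hashes_with_seq_len_limit.txt: it records which samples survived the length filter.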
Extract the VirusTotal (VT) reports from the tarball file (the files named reports_0000, which begin with the word "vhash").
Each file contains multiple VT reports; the following script extracts some fields (sample-hash, malicious-score, sample-first-seen) from each report and stores them in a single file: Scripts/script2_vt_to_json.py Data/dict_hash_to_score_time.json
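As a rough sketch of the field extraction, the snippet below builds a hash-to-(score, first-seen) dictionary from already-parsed VT reports. It is an illustration under assumptions, not the contents of script2_vt_to_json.py: the field names "sha256", "positives", "total" and "first_seen" are assumed, the function name summarize_vt_reports is hypothetical, and defining the malicious score as positives / total is a guess at what "malicious-score" means.

```python
import json


def summarize_vt_reports(reports):
    """Map each sample hash to its malicious score and first-seen time.

    Assumes (hypothetically) that every report is a dict with the keys
    "sha256", "positives", "total" and "first_seen".
    """
    summary = {}
    for report in reports:
        total = report.get("total") or 0
        if total == 0:
            continue  # no engine results, so no score can be computed
        summary[report["sha256"]] = {
            "score": report["positives"] / total,  # assumed score definition
            "first_seen": report["first_seen"],
        }
    return summary


def save_summary(summary, path):
    """Store the whole summary in a single JSON file."""
    with open(path, "w") as handle:
        json.dump(summary, handle, indent=2)
```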
The train-test split takes into account both the malicious-to-benign ratio within each dataset (for simplicity, 50% malware and 50% benign files) and time consistency between the datasets: Scripts/script3_notebook_split_by_score_time.ipynb
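One possible reading of this split is sketched below; it is not taken from the notebook. It assumes "time consistency" means every training sample was first seen before every test sample, that first-seen values are strings that sort chronologically (e.g. "YYYY-MM-DD HH:MM:SS"), and that a single score threshold separates malware from benign files. The function name, the cutoff parameter and the threshold value are hypothetical.

```python
import random


def split_by_time_and_balance(samples, cutoff, threshold=0.5, seed=0):
    """Return (train, test) hash lists, each balanced 50/50 by label.

    samples: dict mapping hash -> {"score": float, "first_seen": str}
    cutoff: samples first seen before it go to train, the rest to test
    threshold: hypothetical score at or above which a sample counts as malware
    """
    rng = random.Random(seed)

    def balance(hashes):
        malware = [h for h in hashes if samples[h]["score"] >= threshold]
        benign = [h for h in hashes if samples[h]["score"] < threshold]
        n = min(len(malware), len(benign))  # downsample the larger class
        return rng.sample(malware, n) + rng.sample(benign, n)

    train = [h for h, v in samples.items() if v["first_seen"] < cutoff]
    test = [h for h, v in samples.items() if v["first_seen"] >= cutoff]
    return balance(train), balance(test)
```

Splitting by time first and balancing afterwards keeps the two sets disjoint in time, at the cost of discarding samples from whichever class is larger on each side of the cutoff.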