This is the code repository for Malware Classification Research. All deep learning models are implemented in Python 3.6+ with PyTorch 1.9.
The source data consists of JSON reports generated by Cuckoo Sandbox, a dynamic malware analysis system. The reports were analyzed to extract the most informative attributes of each malicious sample, and 3698 features were selected as the basis for classification. Each malware instance is therefore represented by a binary feature vector of dimension 3698, and its label is the class assigned by Kaspersky anti-virus. The dataset contains about 10,000 labeled samples covering 8 malware types and about 14,000 unlabeled samples.
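As a rough illustration of the binary encoding described above, the sketch below turns a set of attributes observed for one sample into a 0/1 vector over a fixed feature vocabulary. The function name `to_binary_vector`, the toy vocabulary, and the attribute strings are hypothetical; the actual feature list and the way attributes are pulled out of the Cuckoo reports are not shown here.

```python
from typing import Iterable, List, Sequence


def to_binary_vector(observed: Iterable[str], vocabulary: Sequence[str]) -> List[int]:
    """Return a 0/1 vector with one entry per vocabulary feature."""
    observed_set = set(observed)
    # A position is 1 if the sample exhibits that feature, otherwise 0.
    return [1 if name in observed_set else 0 for name in vocabulary]


# Hypothetical toy vocabulary; the real one would contain 3698 selected features.
VOCABULARY = ["api:CreateFileW", "api:RegSetValueExW", "net:dns_query", "file:drops_exe"]

# Attributes outside the vocabulary (here "unknown") are simply ignored.
vector = to_binary_vector({"api:CreateFileW", "net:dns_query", "unknown"}, VOCABULARY)
print(vector)  # [1, 0, 1, 0]
```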
The normalized vector of dimension 3698 is represented as an RGB image of size 61 × 61 (61 = ⌈√3698⌉, so the image's 3721 pixels can hold all 3698 features), in which the color of each pixel is set by the value of the corresponding feature.
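One possible way to build such an image is sketched below with NumPy. Two details are assumptions, not taken from the repository: the 23 pixels left over after the 3698 features (61 × 61 = 3721) are filled with zeros, and each feature value is written as the same intensity into all three color channels. The repository may use a different padding or color mapping.

```python
import numpy as np

N_FEATURES = 3698
SIDE = 61  # 61 * 61 = 3721 >= 3698


def vector_to_image(features: np.ndarray) -> np.ndarray:
    """Map a vector of values in [0, 1] to a (61, 61, 3) uint8 image."""
    if features.shape != (N_FEATURES,):
        raise ValueError(f"expected shape ({N_FEATURES},), got {features.shape}")
    padded = np.zeros(SIDE * SIDE, dtype=np.float32)
    padded[:N_FEATURES] = features  # assumption: unused trailing pixels stay black
    gray = np.round(padded.reshape(SIDE, SIDE) * 255).astype(np.uint8)
    # Assumption: identical intensity in the R, G and B channels.
    return np.stack([gray, gray, gray], axis=-1)


# Random binary vector standing in for a real sample.
rng = np.random.default_rng(0)
sample = rng.integers(0, 2, size=N_FEATURES).astype(np.float32)
image = vector_to_image(sample)
print(image.shape, image.dtype)
```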
An autoencoder with a latent space of dimension 200 was trained on the unlabeled data so that its pretrained encoder could then be used for malware classification.
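The pretrain-then-classify idea can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the repository's architecture: the layers are fully connected, the input is the flattened 3 × 61 × 61 image, the hidden width of 1024 is arbitrary, the encoder is frozen while the classifier head is used, and random tensors stand in for real data. The class names `Autoencoder` and `Classifier` are hypothetical.

```python
import torch
from torch import nn

INPUT_DIM = 3 * 61 * 61  # flattened RGB image (assumption)
LATENT_DIM = 200         # latent size stated in the text
N_CLASSES = 8            # number of malware types stated in the text


class Autoencoder(nn.Module):
    def __init__(self, input_dim: int = INPUT_DIM, latent_dim: int = LATENT_DIM):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 1024), nn.ReLU(), nn.Linear(1024, latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 1024), nn.ReLU(), nn.Linear(1024, input_dim), nn.Sigmoid()
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))


class Classifier(nn.Module):
    """Linear head on top of an encoder taken from a trained autoencoder."""

    def __init__(self, encoder: nn.Module, latent_dim: int = LATENT_DIM, n_classes: int = N_CLASSES):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(latent_dim, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(x))


torch.manual_seed(0)
autoencoder = Autoencoder()
unlabeled = torch.rand(16, INPUT_DIM)  # random stand-in for unlabeled images

# One unsupervised reconstruction step on the unlabeled batch.
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
loss = nn.functional.mse_loss(autoencoder(unlabeled), unlabeled)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Reuse the pretrained encoder; freezing it is an assumption, fine-tuning is also possible.
for param in autoencoder.encoder.parameters():
    param.requires_grad = False
classifier = Classifier(autoencoder.encoder)
logits = classifier(unlabeled)  # shape: (16, 8), one score per malware type
```

In practice the reconstruction step would run for many epochs over the unlabeled set, and the head would then be trained with a cross-entropy loss on the labeled samples.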
AE performance: the first row is the input, the second row is the AE output.
An autoencoder with a two-dimensional latent space was also trained so that the latent space could be visualized on a plane.
Evolution of the latent space during training
Labeled malware samples displayed in latent space
Classifier results: