This project implements a Random Forest model for classifying malware images based on visual features. The dataset consists of images from various malware families, and the task is to classify each image into its corresponding malware category. The model processes image data, extracts features, and uses a Random Forest classifier to predict the class label. This approach allows for effective malware detection based on image patterns.
The Random Forest classifier reaches 80% overall accuracy on the validation set, although performance varies considerably between malware families.
Accuracy: 80% (calculated from the validation set)
The confusion matrix below compares predicted and true labels across classes, showing how well the model differentiates between the malware categories:
The classification report provides detailed performance metrics such as precision, recall, and F1-score for each class, offering insight into the model's ability to identify malware categories.
Class Name | Precision | Recall | F1-Score | Support |
---|---|---|---|---|
AgentTesla | 0.67 | 0.96 | 0.79 | 7797 |
Benign | 0.78 | 0.28 | 0.41 | 105 |
CoinMinerXMRig | 1.00 | 0.48 | 0.65 | 27 |
Danabot | 0.77 | 0.84 | 0.81 | 212 |
Dridex | 0.97 | 0.94 | 0.96 | 324 |
Formbook | 0.70 | 0.35 | 0.47 | 3588 |
Gh0stRAT | 1.00 | 0.11 | 0.20 | 37 |
Glupteba | 0.88 | 0.61 | 0.72 | 62 |
Gozi | 1.00 | 0.84 | 0.91 | 358 |
Heodo | 0.99 | 0.99 | 0.99 | 8392 |
NanoCore | 0.89 | 0.20 | 0.33 | 990 |
Quakbot | 1.00 | 0.99 | 0.99 | 734 |
RecordBreaker | 0.05 | 0.06 | 0.05 | 213 |
RedLineStealer | 0.07 | 0.06 | 0.06 | 241 |
Remcos | 0.83 | 0.34 | 0.48 | 980 |
Tinba | 1.00 | 0.96 | 0.98 | 27 |
Trickbot | 1.00 | 0.91 | 0.95 | 832 |
Zeus | 1.00 | 0.20 | 0.33 | 82 |
Accuracy | | | 0.80 | 25001 |
Macro avg | 0.81 | 0.56 | 0.62 | 25001 |
Weighted avg | 0.82 | 0.80 | 0.78 | 25001 |
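To make the difference between the two averaged rows concrete, here is a small worked example that recomputes the macro and weighted recall from the per-class values in the report above. It is only an illustration of the arithmetic: the variable names are made up for this sketch, and because the inputs are the rounded table values, the results are approximate.

```python
# Per-class (recall, support) pairs copied from the classification report above.
report = {
    "AgentTesla": (0.96, 7797),
    "Benign": (0.28, 105),
    "CoinMinerXMRig": (0.48, 27),
    "Danabot": (0.84, 212),
    "Dridex": (0.94, 324),
    "Formbook": (0.35, 3588),
    "Gh0stRAT": (0.11, 37),
    "Glupteba": (0.61, 62),
    "Gozi": (0.84, 358),
    "Heodo": (0.99, 8392),
    "NanoCore": (0.20, 990),
    "Quakbot": (0.99, 734),
    "RecordBreaker": (0.06, 213),
    "RedLineStealer": (0.06, 241),
    "Remcos": (0.34, 980),
    "Tinba": (0.96, 27),
    "Trickbot": (0.91, 832),
    "Zeus": (0.20, 82),
}

total_support = sum(support for _, support in report.values())

# Macro average: every class counts equally, regardless of its size.
macro_recall = sum(recall for recall, _ in report.values()) / len(report)

# Weighted average: each class contributes in proportion to its support.
weighted_recall = (
    sum(recall * support for recall, support in report.values()) / total_support
)

print(f"total support:   {total_support}")
print(f"macro recall:    {macro_recall:.2f}")
print(f"weighted recall: {weighted_recall:.2f}")
```

The weighted recall matches the overall accuracy (0.80), while the macro recall is much lower (0.56). AgentTesla and Heodo together account for 16189 of the 25001 validation samples (roughly 65%), so their high recall dominates the weighted figure and hides the weak results on smaller classes.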
- Image Preprocessing: The images are resized to a standard dimension of `64x64` pixels and normalized (values scaled between 0 and 1).
- Model: The Random Forest model is used for classification, with 100 estimators (trees).
- Feature Extraction: The image data is reshaped into a flat vector before being fed into the model.
- Performance Metrics: The model is evaluated using accuracy, a confusion matrix, and a classification report.
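The preprocessing and feature extraction steps listed above could look roughly like the sketch below. It is not the project's actual code: it assumes Pillow and NumPy are used, that images are converted to single-channel grayscale, and the `preprocess` helper is a hypothetical name. A random noise image stands in for a real malware image.

```python
import numpy as np
from PIL import Image

IMAGE_SIZE = (64, 64)  # target size stated in the project description


def preprocess(image: Image.Image) -> np.ndarray:
    """Resize, scale to [0, 1], and flatten a single image (hypothetical helper)."""
    # Assumption: images are reduced to one grayscale channel before resizing.
    resized = image.convert("L").resize(IMAGE_SIZE)
    pixels = np.asarray(resized, dtype=np.float32) / 255.0
    return pixels.reshape(-1)  # 64 * 64 = 4096 features per image


# Stand-in for a real malware image: 8-bit noise with an arbitrary input size.
rng = np.random.default_rng(0)
fake_image = Image.fromarray(rng.integers(0, 256, size=(120, 200), dtype=np.uint8))

features = preprocess(fake_image)
print(features.shape, features.dtype)
```

With grayscale input each image becomes a 4096-element vector. If the project keeps three color channels instead, the vector would be three times as long.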
- Sprint 1 - Data Preprocessing: Loaded and resized the images, normalized pixel values, and split the dataset into training and validation sets.
- Sprint 2 - Model Training: Trained the Random Forest classifier on the preprocessed data.
- Sprint 3 - Model Evaluation: Evaluated model performance using accuracy, confusion matrix, and classification report.
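The three sprints above can be sketched end to end with scikit-learn. This is a minimal illustration under stated assumptions, not the project's implementation: the data is synthetic (random vectors standing in for flattened 64x64 images), and the 80/20 stratified split and the `random_state` values are choices made for this example. Only the 100 estimators come from the project description.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for flattened, normalized images (hypothetical data):
# 300 samples, 64 * 64 = 4096 features, 3 classes.
rng = np.random.default_rng(42)
n_samples, n_features, n_classes = 300, 64 * 64, 3
y = rng.integers(0, n_classes, size=n_samples)
X = rng.random((n_samples, n_features))
# Brighten each class by a different amount so there is a pattern to learn.
X = np.clip(X + 0.15 * y[:, None], 0.0, 1.0)

# Sprint 1: split into training and validation sets (the 80/20 ratio is assumed).
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Sprint 2: train a Random Forest with 100 trees.
clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
clf.fit(X_train, y_train)

# Sprint 3: evaluate with accuracy, a confusion matrix, and a classification report.
y_pred = clf.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)
cm = confusion_matrix(y_val, y_pred)  # rows: true labels, columns: predictions

print(f"Accuracy: {accuracy:.2f}")
print(cm)
print(classification_report(y_val, y_pred, zero_division=0))
```

Stratifying the split keeps the class proportions similar in both sets, which matters for a dataset as imbalanced as this one, where some families have fewer than 40 validation samples.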
This project applies a Random Forest classifier to malware image classification and reaches 80% accuracy on the validation set. Performance is uneven across categories: families such as Heodo, Quakbot, and Tinba are classified almost perfectly, while RecordBreaker and RedLineStealer have F1-scores below 0.10, and the macro-averaged recall is only 0.56. Future work could involve experimenting with other models (e.g., Convolutional Neural Networks) to further improve classification performance.