This project focuses on developing and evaluating machine learning models to classify network traffic based on features extracted from datasets. It leverages multiple classification algorithms to predict traffic types, providing insights into model effectiveness. The final models are saved for future predictions and can be easily integrated into real-world applications.
- Introduction
- Project Structure
- Datasets
- Installation
- Usage
- Visualization and Analysis
- Model Saving and Deployment
- Features
- Results
- Conclusion
- License
Network traffic classification is essential in cybersecurity, network management, and Quality of Service (QoS) provisioning. This project uses supervised machine learning classifiers to assign network traffic to categories. The primary objective is to compare the performance of several models and identify the most effective one for this task.
- Data Loading and Preprocessing: Scripts for loading, cleaning, and preprocessing network datasets.
- Model Development: Implementations of Random Forest, Gaussian Naive Bayes, Logistic Regression, K-Nearest Neighbors (KNN), and Multi-Layer Perceptron (MLP).
- Evaluation: Model evaluation using accuracy, confusion matrices, ROC curves, and learning curves.
- Visualization: Graphical representations of performance metrics.
- User Input and Prediction: Interface for predicting traffic type based on user input.
- Model Saving: Trained models are saved for deployment.
The project uses several network traffic datasets, including:
- Ping Dataset
- Voice Dataset
- DNS Dataset
- Telnet Dataset
These datasets are loaded from Google Drive, concatenated, and preprocessed to form a unified dataset for training and evaluation.
# Clone this repo
git clone https://github.com/Zhaxstronaut/Machine-Learning.git
# Go to the project directory
cd Machine-Learning
Ensure that you have Python 3.x installed. Then, install the required packages using pip:
pip install -r requirements.txt
The code downloads the datasets from Google Drive automatically, using the URLs defined in the source. No manual download is required.
Data preprocessing is a crucial step that includes cleaning the datasets, handling missing values, and standardizing the features. The following steps are carried out:
- Concatenation: The datasets are merged into a single dataframe.
- Feature Selection: The columns 'Forward Packets', 'Forward Bytes', 'Reverse Packets', and 'Reverse Bytes' are treated as irrelevant and dropped.
- Label Encoding: The target variable 'Traffic Type' is encoded as integer category codes.
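
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual script: the tiny in-memory frames stand in for the datasets loaded from Google Drive, the feature columns `Packet Rate` and `Avg Packet Size` are hypothetical, and dropping rows with missing values is only one possible way to handle them.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Columns the README says are dropped.
DROPPED = ["Forward Packets", "Forward Bytes", "Reverse Packets", "Reverse Bytes"]


def make_frame(label, rate, size):
    """Build a tiny synthetic dataset (hypothetical stand-in for one real file)."""
    return pd.DataFrame({
        "Forward Packets": [1, 2, 3],
        "Forward Bytes": [100, 200, 300],
        "Reverse Packets": [1, 2, 3],
        "Reverse Bytes": [100, 200, 300],
        "Packet Rate": [rate, rate + 1.0, rate + 2.0],        # hypothetical feature
        "Avg Packet Size": [size, size + 5.0, size + 10.0],   # hypothetical feature
        "Traffic Type": [label] * 3,
    })


frames = [
    make_frame("ping", 1.0, 64.0),
    make_frame("voice", 50.0, 160.0),
    make_frame("dns", 5.0, 80.0),
    make_frame("telnet", 10.0, 70.0),
]

# Concatenation: merge the per-protocol frames into one dataframe.
data = pd.concat(frames, ignore_index=True)

# Missing values: drop incomplete rows (assumption; imputation is an alternative).
data = data.dropna()

# Feature selection: remove the four columns listed above.
data = data.drop(columns=DROPPED)

# Label encoding: map each traffic type to an integer category code.
data["Traffic Type"] = data["Traffic Type"].astype("category")
class_names = list(data["Traffic Type"].cat.categories)
y = data["Traffic Type"].cat.codes
X = data.drop(columns=["Traffic Type"])

# Standardization: zero mean and unit variance per feature.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(class_names)
print(X_scaled.shape)
```

Keeping `class_names` alongside the integer codes makes it possible to translate predictions back into readable traffic types later.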
The project trains and evaluates several machine learning models:
- Model Initialization:
- Random Forest Classifier
- Gaussian Naive Bayes
- Logistic Regression
- K-Nearest Neighbors (KNN)
- Multi-Layer Perceptron (MLP)
- Training:
- The dataset is split into training and testing sets.
- Each model is trained on the training set.
- Evaluation:
- Models are evaluated using metrics such as accuracy, precision, recall, F1-score, and ROC AUC.
- Confusion matrices and ROC curves are generated for a detailed analysis of model performance.
- Learning curves are plotted to observe each model's performance as the training data size increases.
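
As a rough sketch of the training and evaluation loop described above, the example below fits the five classifiers with scikit-learn and reports accuracy plus a per-class report. The synthetic data from `make_classification`, the 75/25 split, and all hyperparameters are assumptions chosen for illustration; the project's own settings may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic four-class data standing in for the traffic dataset (assumption).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Split into training and testing sets, keeping class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Model initialization (hyperparameters are illustrative, not the project's).
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gaussian Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "MLP": MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)          # training
    y_pred = model.predict(X_test_s)       # prediction on held-out data
    scores[name] = accuracy_score(y_test, y_pred)
    print(f"{name}: accuracy = {scores[name]:.3f}")
    # Precision, recall, and F1-score for each class.
    print(classification_report(y_test, y_pred, zero_division=0))
```

Collecting the accuracies in a dictionary such as `scores` makes the side-by-side model comparison straightforward.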
The project includes a user interface to input new data for traffic classification. The following steps are followed:
- User Input: Users are prompted to input values for specific features.
- Preprocessing: The user input is preprocessed using the same scaler applied during model training.
- Prediction: Each model provides a prediction for the traffic type based on the input data.
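
The input-to-prediction flow above might look something like the following sketch. The feature names, the synthetic training data, and the helper functions `parse_features` and `predict_traffic_type` are hypothetical, and only two of the five models are used to keep the example short. In interactive use, the raw strings would come from `input()` prompts.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

FEATURES = ["Packet Rate", "Avg Packet Size"]    # hypothetical feature names
CLASS_NAMES = ["dns", "ping", "telnet", "voice"]  # index = encoded label

# Synthetic, well-separated training data (assumption, for illustration only).
rng = np.random.default_rng(0)
centers = np.array([[5.0, 80.0], [1.0, 64.0], [10.0, 70.0], [50.0, 160.0]])
X_train = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])
y_train = np.repeat(np.arange(4), 30)

# The scaler fitted here is the same one reused on user input later.
scaler = StandardScaler().fit(X_train)
models = {
    "Gaussian Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}
for model in models.values():
    model.fit(scaler.transform(X_train), y_train)


def parse_features(raw_values):
    """Convert raw strings (e.g. collected via input()) to a 1 x n array."""
    if len(raw_values) != len(FEATURES):
        raise ValueError(f"expected {len(FEATURES)} values, got {len(raw_values)}")
    return np.array([[float(v) for v in raw_values]])


def predict_traffic_type(raw_values):
    """Scale the input with the training scaler and query every model."""
    row = scaler.transform(parse_features(raw_values))
    return {
        name: CLASS_NAMES[int(model.predict(row)[0])]
        for name, model in models.items()
    }


predictions = predict_traffic_type(["50.0", "160.0"])
for name, label in predictions.items():
    print(f"{name}: {label}")
```

Reusing the fitted `scaler` rather than fitting a new one on the input is what keeps the user's values on the same scale the models were trained on.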
The project provides various visual tools to understand model performance:
- Confusion Matrix: Visual representation of the true vs. predicted classifications.
- ROC Curve: Displays the trade-off between the true positive rate and false positive rate for each class.
- Learning Curve: Shows the performance of the model on the training and validation sets as the amount of training data varies.
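
The numbers behind these three plots can be computed with scikit-learn as sketched below. The synthetic data, the single Random Forest model, and the choice of five training sizes with 5-fold cross-validation are assumptions for illustration; the plotting step itself is omitted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, confusion_matrix, roc_curve
from sklearn.model_selection import learning_curve, train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic four-class data standing in for the traffic dataset (assumption).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, model.predict(X_test))

# ROC curve per class (one-vs-rest): binarize the labels, then compare each
# column against the predicted probability for that class.
classes = model.classes_
y_test_bin = label_binarize(y_test, classes=classes)
y_score = model.predict_proba(X_test)
roc_auc = {}
for i, cls in enumerate(classes):
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], y_score[:, i])
    roc_auc[int(cls)] = auc(fpr, tpr)

# Learning curve: cross-validated scores at increasing training set sizes.
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

print(cm)
print(roc_auc)
print(train_sizes, val_scores.mean(axis=1))
```

The arrays `fpr`/`tpr` and the mean of `train_scores` and `val_scores` per training size are what would typically be passed to a plotting library such as matplotlib.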
The trained models are saved using Python’s pickle module so they can be reloaded later without retraining. This makes it straightforward to integrate the classifiers into larger systems or applications that need network traffic classification.
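
A minimal sketch of saving and reloading a model with `pickle` is shown below. The synthetic data, the file name `gaussian_nb.pkl`, and the idea of bundling the scaler with the model in one dictionary are assumptions made for illustration, not necessarily how the project stores its files.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# Train a small model on synthetic two-class data (assumption).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (40, 3)), rng.normal(4, 1, (40, 3))])
y = np.repeat([0, 1], 40)
scaler = StandardScaler().fit(X)
model = GaussianNB().fit(scaler.transform(X), y)

# Save the scaler together with the model so inputs can be scaled
# identically at prediction time.
bundle = {"scaler": scaler, "model": model}
model_path = Path(tempfile.mkdtemp()) / "gaussian_nb.pkl"  # hypothetical name
with open(model_path, "wb") as f:
    pickle.dump(bundle, f)

# Reload. Only unpickle files from trusted sources, since loading a pickle
# can execute arbitrary code.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

sample = np.array([[4.0, 4.0, 4.0]])
original_pred = model.predict(scaler.transform(sample))
restored_pred = restored["model"].predict(restored["scaler"].transform(sample))
print(original_pred, restored_pred)
```

The reloaded objects should give the same predictions as the originals, provided the same scikit-learn version is used for saving and loading.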
- Comprehensive Preprocessing: Includes handling of missing data, feature selection, and scaling.
- Multiple Classifiers: Implements and compares five different machine learning models.
- Detailed Evaluation: Provides a thorough evaluation using multiple metrics and visualizations.
- User Interaction: Allows for dynamic predictions based on user input.
- Model Persistence: Saves models for future use, making deployment straightforward.
The project outputs detailed results for each model, including:
- Accuracy: Overall accuracy of each model on the test dataset.
- Classification Reports: Detailed performance metrics for each traffic type.
- Confusion Matrices: Provides insights into model misclassifications.
- ROC Curves: Evaluates the model’s ability to distinguish between different traffic types.
- Learning Curves: Shows how each model’s performance scales with the amount of training data.
This project successfully demonstrates the application of machine learning to network traffic classification. By comparing multiple models, we gain insights into the strengths and weaknesses of different approaches. The saved models are ready for deployment and can be used in real-time traffic classification systems.
This project is licensed under the MIT License. You are free to use, modify, and distribute the code as per the terms of the license. See the LICENSE file for details.