This code was created for a project for CMPE-789 Machine Learning for Cybersecurity Analytics. The code allows for testing of transfer learning on an intrusion detection dataset. For more details see the included presentation with an overview and results.
Download the CIC-IDS-2017 and/or the CIC-IDS-2018 dataset to use this code.
CIC-IDS-2018: link
CIC-IDS-2017: link
The code could be adapted for other datasets, but this requires further work.
You may use the provided Dockerfile to build a container with all of the necessary requirements required to run the provided code. However, you must have some version of CUDA, Docker and the NVIDIA container toolkit installed (see link).
Otherwise, feel free to set up the environment in whatever way you want.
- Copy Dockerfile template
$ cp Dockerfile Dockerfile.new
- Update lines 4 and 18 with desired author info and username
- Update .dockerignore with any added directories as necessary
- Build the Docker container
$ docker build -t <desired-tag> -f Dockerfile.new .
- Run Docker Container.
<tag>
must be the same tag used in step 4.docker run -it --gpus all --shm-size=25G -e HOME=$HOME -e USER=$USER -v $HOME:$HOME -w $HOME --user <created-user> <tag>
- Navigate to code (Home directories will be linked) and run
The random forest classifier is used as a baseline for the MLP model. The pkl-path argument can point to any empty directory where the preprocessed dataset will be saved to reduce preprocessing time on subsequent runs.
python3 classify_rf.py \
--max-depth=10 \
--data-path=/home/poppfd/data/CIC-IDS2018/Processed_Traffic_Data_for_ML_Algorithms/ \
--dset='cic-2018' \
--pkl-path=/home/poppfd/College/ML_Cyber/ml-project/data
There are two files for the MLP classifier. A training script mlp.py
and a
testing script eval_mlp.py
. Provided here are the sample commands to run
these scripts. These commands will have to be updated
to match your environment.
python3 mlp.py \
--dset=cic-2018 \
--data-root=/home/poppfd/data/CIC-IDS2018/Processed_Traffic_Data_for_ML_Algorithms/ \
--pkl-path=/home/poppfd/College/ML_Cyber/ml-project/data \
--batch-size=32 \
--eval-batch-size=1028 \
--num-epochs=10 \
--warmup-epochs=2 \
--learning-rate=1e-4 \
--min-lr=1e-6 \
--warmup-lr=1e-5 \
--name=train-2018-1
Note that the script is identical to freeze the feature extraction layer.
Update --transfer-learn=freeze-feature
python3 mlp.py \
--dset=cic-2017 \
--data-root=/home/poppfd/data/CIC-IDS2017/MachineLearningCVE \
--pkl-path=/home/poppfd/College/ML_Cyber/ml-project/data \
--batch-size=32 \
--eval-batch-size=1028 \
--num-epochs=10 \
--warmup-epochs=2 \
--learning-rate=1e-4 \
--min-lr=1e-6 \
--warmup-lr=1e-5 \
--transfer-learn=fine-tune \
--source-classes=10 \
--pretrained-path=/home/poppfd/College/ML_Cyber/ml-project/output/mlp-5-layer-3/model_eval_5.pth \
--name=transfer-2017-fine-tune
python3 eval_mlp.py \
--dset='cic-2018' \
--dataset-dir=/home/poppfd/data/CIC-IDS2018/Processed_Traffic_Data_for_ML_Algorithms/ \
--pkl-path=/home/poppfd/College/ML_Cyber/ml-project/data \
--model-path=/home/poppfd/College/ML_Cyber/ml-project/output/transfer-2018-freeze-1/model_eval_23.pth \
--batch-size=1028 \
--name=transfer-2018-freeze-1
The eval_mlp.py
script can also be used to generate t-SNE visualizations of
the MLP feature embedding. For this case since the t-SNE is really slow for a
high number of samples, only a small subset of the evaluation dataset is used
for the t-SNE plots. Therefore, all other output of this run should be ignored
as it does not include the full evaluation dataset.
python3 eval_mlp.py \
--dset='cic-2018' \
--dataset-dir=/home/poppfd/data/CIC-IDS2018/Processed_Traffic_Data_for_ML_Algorithms/ \
--pkl-path=/home/poppfd/College/ML_Cyber/ml-project/data \
--model-path=/home/poppfd/College/ML_Cyber/ml-project/output/transfer-2018-freeze-1/model_eval_23.pth \
--batch-size=1028 \
--tnse \
--tsne-percent=0.01 \
--name=transfer-2018-freeze-1