This repository contains a Python implementation of the methods described in the paper "Applying IoTDevID to a New Dataset: the CIC IoT Dataset 2022 Case Study".
In the era of rapid IoT device proliferation, recognizing, diagnosing, and securing these devices are crucial tasks. The IoTDevID method (IEEE Internet of Things ’22) proposes a machine learning approach for device identification using network packet features. In this article we present a validation study of the IoTDevID method by testing core components, namely its feature set and its aggregation algorithm, on a new dataset. The new dataset (CIC IoT Dataset 2022) offers several advantages over earlier datasets, including a larger number of devices, multiple instances of the same device, both IP and non-IP device data, normal (benign) usage data, and diverse usage profiles, such as active and idle states. Using this independent dataset, we explore the validity of IoTDevID’s core components, and also examine the impacts of the new data on model performance. Our results indicate that data diversity is important to model performance. For example, models trained with active usage data outperformed those trained with idle usage data, and multiple usage data similarly improved performance. Results for IoTDevID were strong with a 92.50 F1 score for 31 IP-only device classes, similar to our results on previous datasets. In all cases, the IoTDevID aggregation algorithm improved model performance. For non-IP devices we obtained a 78.80 F1 score for 40 device classes, though with much less data, confirming that data quantity is also important to model performance.
Fig 1 - A brief overview of the IoTDevID methodology.
The application files were created using Wireshark and Python 3.10. Before running them, make sure that Wireshark, Python 3.10+ and the following libraries are installed.
Library | Task |
---|---|
Scapy | Packet (pcap) crafting |
tshark | Packet (pcap) crafting |
Sklearn | Machine Learning & Data Preparation |
Numpy | Mathematical Operations |
Pandas | Data Analysis |
Scipy | Data Analysis, Mathematical Operations |
Matplotlib | Graphics and visualization |
Seaborn | Graphics and visualization |
tabulate | Pretty-print tabular data output |
tqdm | Progress meter |
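A quick way to confirm that the prerequisites listed above are available is sketched below. It is not part of the repository; the function name `missing_dependencies` is hypothetical, and the import names (for example `sklearn` for Sklearn) are the usual ones for these packages. It only checks that each module can be located and that the `tshark` executable is on the `PATH`.

```python
import importlib.util
import shutil

# Import names of the libraries listed in the table above.
REQUIRED_MODULES = [
    "scapy", "sklearn", "numpy", "pandas", "scipy",
    "matplotlib", "seaborn", "tabulate", "tqdm",
]


def missing_dependencies(modules=REQUIRED_MODULES, executables=("tshark",)):
    """Return the names of modules and executables that cannot be found."""
    # find_spec returns None when a top-level module is not installed.
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    # shutil.which returns None when an executable is not on the PATH.
    missing += [e for e in executables if shutil.which(e) is None]
    return missing


if __name__ == "__main__":
    print("Missing:", missing_dependencies() or "nothing")
```

An empty result means every listed module and `tshark` were found; it does not verify versions.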
The technical specifications of the computer used for experiments are given below.
Component | Specification |
---|---|
Central Processing Unit | 12th Gen Intel(R) Core(TM) i7-12700H 2.30 GHz |
Random Access Memory | 16.0 GB (15.7 GB usable) |
Operating System | Windows 11 Home |
Using the CIC IoT Dataset 2022 data, feature extraction was performed, and the feature sets obtained were used in different ways at different stages of the study as indicated in the following table.
Data | Description |
---|---|
PCAP Files | Raw Network data, Input of Feature Extraction - Used in Section 3 |
All Sessions [54 CSV] | Output of Feature Extraction, Used in Section 4.1 |
AA, AI, IA, II | Merged Sessions |
AA, AI, IA, II 10% sample | Size-reduced merged sessions - Used in Section 4.2/4.3 |
AA+non-IP Devices | Size reduced AA with Non-IP/Zigbee data - Used in Section 4.4 |
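The size reduction to a 10% sample can be illustrated with the minimal sketch below. How the repository actually draws the sample is not stated here, so sampling each device class separately, the column name `Label`, the fixed seed, and the function name `sample_per_label` are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_per_label(rows, label_key="Label", frac=0.10, seed=42):
    """Keep roughly `frac` of the rows of each class (at least one per class)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # Group the rows by their class label.
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    # Draw the same fraction from every class so small classes are not lost.
    sampled = []
    for group in by_label.values():
        k = max(1, round(len(group) * frac))
        sampled.extend(rng.sample(group, k))
    return sampled
```

Sampling per class rather than over the whole file keeps devices with few packets represented in the reduced dataset.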
The code is presented as Jupyter notebooks (ipynb files). A notebook stores the state and the screen output of its last run, so the output can be inspected without re-running the file. The notebooks can be run with Jupyter Notebook.
- 01.0 - Features_Extraction: Converts the pcap files into packet-based fingerprint files in CSV format and labels them.
- 01.1 - Unknown-MAC-cleaning: Removes fingerprints that do not belong to a known MAC address. These fingerprints are unlabelled because their MAC addresses are unknown.
- 01.2 - Creating_smaller_DF_with_Selected_features: Feature extraction creates about 100 features, most of which are not used. This file reduces the file size by removing the unused features.
- 01.3 - Creating Session_ID.ipynb: Assigns an identification number to each session to indicate which sessions contain the same devices. It also collects devices of the same brand and model under one label, for example: Teckin Plug 1 / Teckin Plug 2 --> Teckin Plug.
- 02.1 - CIC results with Session ID vs Session ID: Uses sessions with the same ID number as training and testing data and classifies them with the DT model.
- 02.2 - CIC results with Session ID vs Session ID_aggregated: Uses sessions with the same ID number as training and testing data, classifies them with the DT model, and improves the results using the aggregation algorithm.
- 02.3 - Heatmap of session results: Displays the classification results of the previous step as a heatmap of F1 scores.
- 02.4 - Statistics of class-based results - failed device classes: Gives statistics on the distribution of Idle-Active pairs and, based on the class-level results, on the devices that fail most often.
- 03.0 - Split_training_testing: Combines sessions for broader representation in the training and testing datasets. Each newly created dataset is then reduced to 10% of its original size.
- 03.1 - Hyperparameter Optimization: Applies hyperparameter optimization to the DT model using scikit-learn's randomized search.
- 03.2 - General evaluation of the all sessions: Obtains results for the Idle and Active datasets using the individual and aggregated methods. A group size of 13 was used in the aggregation operations.
- 04.0 - Preprocessing other data: Filters non-IP devices from the Power and Interactions sessions and adds them to the Active training and testing datasets.
- 04.1 - General evaluation with other data: Obtains results for the Idle and Active datasets, including non-IP devices, using the individual and aggregated methods. A group size of 13 was used in the aggregation operations.
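The aggregated results mentioned above rely on combining packet-level predictions in groups of 13. The sketch below illustrates the general idea as a majority vote over consecutive packets that share a source MAC address. It is a simplified illustration under that assumption, not the repository's implementation: the function name `aggregate_predictions` and its arguments are hypothetical, and any special handling the IoTDevID algorithm applies (for example to MAC addresses shared by several devices) is omitted.

```python
from collections import Counter, defaultdict


def aggregate_predictions(macs, predictions, group_size=13):
    """Replace each packet's predicted label with the majority label of its group.

    Packets are grouped per MAC address, in arrival order, into chunks of
    `group_size`; the last chunk of a MAC address may be smaller.
    """
    if len(macs) != len(predictions):
        raise ValueError("macs and predictions must have the same length")

    # Record the positions of each MAC address's packets, preserving order.
    positions = defaultdict(list)
    for idx, mac in enumerate(macs):
        positions[mac].append(idx)

    aggregated = list(predictions)  # copy so the input is left untouched
    for idxs in positions.values():
        for start in range(0, len(idxs), group_size):
            group = idxs[start:start + group_size]
            votes = Counter(predictions[i] for i in group)
            winner = votes.most_common(1)[0][0]  # most frequent label in the group
            for i in group:
                aggregated[i] = winner
    return aggregated
```

Because a single device usually produces many packets, occasional packet-level misclassifications are outvoted by the correct predictions in the same group, which is why aggregation tends to raise the F1 score.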
This project is licensed under the MIT License. See the LICENSE file for details.
If you use the source code please cite the following paper:
@misc{kostas2023CIC,
title={Externally validating the {IoTDevID} device identification methodology},
author={Kahraman Kostas and Mike Just and Michael A. Lones},
year={2023},
eprint={2307.08679},
url={https://arxiv.org/abs/2307.08679},
archivePrefix={arXiv},
primaryClass={cs.CR}
}
Contact: Kahraman Kostas kahramankostas@gmail.com