Android Malware Detection using Function Call Graphs and Graph Convolutional Networks
Research carried out by me (Vinayaka K V) during my MTech (Research) degree in the Department of IT, NITK.
The objectives of the research were:
- To evaluate whether GCNs are effective in detecting Android malware using FCGs, and which GCN algorithm is best suited for this task.
- To enhance the FCGs by incorporating the callback information obtained from the framework code, and to evaluate them against the normal FCGs.
The code achieving the first objective is on the master (current) branch, while the code achieving the second objective is on the experiment branch.
Stored in the /data folder. Currently, it contains the SHA256 hashes of the APKs contained in the training and testing splits.
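Because the splits are identified by SHA256, a downloaded APK can be checked against them by hashing the file. The following is a minimal sketch of that check, not code from the repository: the helper names `sha256_of_file` and `is_listed` are hypothetical, and the hashes are assumed to be available as a collection of hex strings.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the upper-case hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def is_listed(path, known_hashes):
    """Check whether the file's digest appears in a collection of hex hashes."""
    normalized = {value.strip().upper() for value in known_hashes}
    return sha256_of_file(path) in normalized
```

Reading in chunks keeps memory use flat even for large APKs, and normalizing the case avoids mismatches between upper-case and lower-case hash listings.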
Obtains the histogram of APK sizes, and adds APKs wherever there is a large imbalance in the number of APKs between the classes.
Note: The provided dataset is already APK Size balanced 🥳
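The size-balancing idea can be illustrated with a small histogram comparison. This is a sketch under assumptions, not the repository's implementation: the bin width, the tolerance, and the function names `size_histogram` and `imbalanced_bins` are all hypothetical.

```python
from collections import Counter


def size_histogram(sizes_bytes, bin_width):
    """Count APKs per size bin; bin i covers [i * bin_width, (i + 1) * bin_width)."""
    return Counter(size // bin_width for size in sizes_bytes)


def imbalanced_bins(benign_sizes, malware_sizes, bin_width, tolerance):
    """Return {bin: benign_count - malware_count} for bins whose gap exceeds tolerance."""
    benign = size_histogram(benign_sizes, bin_width)
    malware = size_histogram(malware_sizes, bin_width)
    gaps = {}
    for bin_index in sorted(set(benign) | set(malware)):
        gap = benign[bin_index] - malware[bin_index]
        if abs(gap) > tolerance:
            gaps[bin_index] = gap
    return gaps


MIB = 1024 * 1024
benign_sizes = [1 * MIB, 1 * MIB + 5, 1 * MIB + 9, 7 * MIB]
malware_sizes = [1 * MIB + 3, 7 * MIB + 1]
print(imbalanced_bins(benign_sizes, malware_sizes, bin_width=MIB, tolerance=1))
# {1: 2}
```

A positive gap marks a bin where the malware class is under-represented, so that is where additional malware APKs of that size would be added.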
Implemented in scripts/process_dataset.py.
The class FeatureExtractors provides two public methods:
- get_user_features() - Returns a 15-bit feature vector for internal methods
- get_api_features() - Returns a one-hot feature vector for external methods
The method process extracts the FCG and assigns node features.
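To make the notion of a one-hot feature vector for external methods concrete, here is a small sketch. It is an illustration only: the vocabulary `API_PACKAGES`, the extra slot for unknown classes, and the function `api_feature` are assumptions, and the categories actually used by get_api_features() may differ.

```python
# Hypothetical vocabulary; the categories actually used by the repository may differ.
API_PACKAGES = ["android.app", "android.net", "android.telephony", "java.io"]


def one_hot(index, length):
    """Return a list of `length` zeros with a single 1 at `index`."""
    if not 0 <= index < length:
        raise ValueError("index out of range")
    vector = [0] * length
    vector[index] = 1
    return vector


def api_feature(class_name, vocabulary=API_PACKAGES):
    """One-hot encode an external method by the package prefix of its class.

    The extra last slot is reserved for classes outside the vocabulary.
    """
    length = len(vocabulary) + 1
    for index, prefix in enumerate(vocabulary):
        if class_name.startswith(prefix + "."):
            return one_hot(index, length)
    return one_hot(length - 1, length)


print(api_feature("android.net.ConnectivityManager"))  # [0, 1, 0, 0, 0]
print(api_feature("com.example.Unknown"))              # [0, 0, 0, 0, 1]
```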
Balances the dataset so that the node count distribution of the APKs between the classes is exactly the same.
Implemented in scripts/split_dataset.py.
Note: The provided dataset is already node-count balanced to ensure reproducibility 🤩
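One way to obtain identical node-count distributions is to keep, for every node count, the same number of APKs from each class. The sketch below shows that idea; it is not taken from scripts/split_dataset.py. The function `balance_by_node_count`, the `{apk_id: node_count}` input shape, and matching on exact node counts (rather than on bins) are assumptions.

```python
import random
from collections import defaultdict


def balance_by_node_count(benign, malware, seed=0):
    """Subsample two {apk_id: node_count} mappings to identical node-count distributions.

    For every node count present in both classes, keep the same number of
    APKs from each class, namely the smaller of the two class counts.
    """
    rng = random.Random(seed)

    def group(apks):
        groups = defaultdict(list)
        for apk_id, node_count in sorted(apks.items()):
            groups[node_count].append(apk_id)
        return groups

    benign_groups, malware_groups = group(benign), group(malware)
    kept_benign, kept_malware = [], []
    for node_count in sorted(set(benign_groups) & set(malware_groups)):
        keep = min(len(benign_groups[node_count]), len(malware_groups[node_count]))
        kept_benign += rng.sample(benign_groups[node_count], keep)
        kept_malware += rng.sample(malware_groups[node_count], keep)
    return kept_benign, kept_malware


benign = {"b1": 10, "b2": 10, "b3": 20, "b4": 30}
malware = {"m1": 10, "m2": 20, "m3": 20}
kept_benign, kept_malware = balance_by_node_count(benign, malware)
print(sorted(benign[a] for a in kept_benign), sorted(malware[a] for a in kept_malware))
# [10, 20] [10, 20]
```

Fixing the random seed makes the subsample repeatable, which matters when the balanced split is meant to be reproducible.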
A multi-layer GCN with a dense layer at the end.
Implemented in core/model.py.
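The shape of such a model can be sketched in plain Python, without DGL or PyTorch. This is not the code in core/model.py: it assumes the symmetric-normalization graph convolution with self-loops, a symmetric adjacency matrix, ReLU activations, mean pooling over nodes, and a sigmoid output, none of which are confirmed by this README.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(adjacency):
    """Return D^-1/2 (A + I) D^-1/2 for a square, symmetric 0/1 adjacency matrix."""
    n = len(adjacency)
    a_hat = [[adjacency[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in a_hat]  # at least 1 because of the self-loop
    return [[a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]


def gcn_forward(adjacency, features, conv_weights, dense_weights):
    """Stacked graph convolutions with ReLU, mean pooling, then a dense layer with sigmoid."""
    a_norm = normalized_adjacency(adjacency)
    hidden = features
    for weight in conv_weights:
        hidden = matmul(matmul(a_norm, hidden), weight)
        hidden = [[max(0.0, value) for value in row] for row in hidden]
    # Mean-pool node embeddings into one graph-level vector (assumed readout).
    pooled = [sum(column) / len(hidden) for column in zip(*hidden)]
    logit = sum(p * w for p, w in zip(pooled, dense_weights))
    return 1.0 / (1.0 + math.exp(-logit))


adjacency = [[0, 1], [1, 0]]      # two methods that call each other
features = [[1.0], [0.0]]         # one feature per node
score = gcn_forward(adjacency, features, conv_weights=[[[1.0]]], dense_weights=[2.0])
print(round(score, 4))
```

An empty `conv_weights` list reduces the model to pooling plus the dense layer, which corresponds to running with zero convolution layers.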
- Obtain the APKs for the given SHA256 hashes from AndroZoo
- Build the container (either singularity or docker), and get into its shell
- Run scripts/process_dataset.py on the downloaded dataset
python process_dataset.py \
--source-dir <source_directory> \
--dest-dir <dest_directory> \
--override \
--dry
Pass --override only if you want to override existing processed files, and --dry only if you want to perform a dry run.
- Train the model! For configuration, refer to the section below.
python train_model.py
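For reference, the four flags of the dataset-processing command above could be parsed with argparse along the following lines. This is a hypothetical sketch, not the actual contents of scripts/process_dataset.py; only the flag names come from this README, while the help texts and the choice of argparse are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the flags shown for process_dataset.py."""
    parser = argparse.ArgumentParser(description="Process a directory of APKs (illustrative sketch)")
    parser.add_argument("--source-dir", required=True, help="Directory containing the downloaded APKs")
    parser.add_argument("--dest-dir", required=True, help="Directory for the processed output")
    parser.add_argument("--override", action="store_true", help="Override existing processed files")
    parser.add_argument("--dry", action="store_true", help="Perform a dry run")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is repeatable.
args = build_parser().parse_args(["--source-dir", "apks", "--dest-dir", "graphs", "--dry"])
print(args.source_dir, args.dest_dir, args.override, args.dry)
# apks graphs False True
```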
Configuration is handled by Hydra. See config/conf.yaml for the available configuration options.
Any configuration option can be overridden on the command line. As an example, to change the number of convolution layers to 2, invoke the program as
python train_model.py model.convolution_count=2
You can also perform a sweep, for example,
python train_model.py --multirun \
model.convolution_count=0,1,2,3 \
model.convolution_algorithm=GraphConv,SAGEConv,TAGConv,SGConv,DotGatConv \
features=degree,method_attributes,method_summary
to train the model in all possible configurations! 🥳
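Hydra's default sweeper runs one job per element of the Cartesian product of the comma-separated values. The sketch below enumerates that product for the sweep above, to show how many runs it launches; the `expand` helper is hypothetical and does not call Hydra.

```python
from itertools import product

sweep = {
    "model.convolution_count": [0, 1, 2, 3],
    "model.convolution_algorithm": ["GraphConv", "SAGEConv", "TAGConv", "SGConv", "DotGatConv"],
    "features": ["degree", "method_attributes", "method_summary"],
}


def expand(sweep):
    """Yield one list of key=value overrides per point of the Cartesian product."""
    keys = list(sweep)
    for values in product(*(sweep[key] for key in keys)):
        yield [f"{key}={value}" for key, value in zip(keys, values)]


jobs = list(expand(sweep))
print(len(jobs))  # 4 * 5 * 3 = 60
print(jobs[0])
# ['model.convolution_count=0', 'model.convolution_algorithm=GraphConv', 'features=degree']
```

With 60 training runs in this sweep, it is worth checking the count before launching it on shared hardware.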
- androguard - for FCG extraction and feature assignment
- pytorch - for neural networks
- dgl - for GCN modules
- pytorch-lightning - for organization and pipeline 💖
- hydra - for configuring experiments
- wandb - for tracking experiments 🔥
The research paper corresponding to this work is available at IEEE Xplore. If you find this work helpful and use it, please cite it as
@INPROCEEDINGS{9478141,
author={V, Vinayaka K and D, Jaidhar C},
booktitle={2021 2nd International Conference on Secure Cyber Computing and Communications (ICSCCC)},
title={Android Malware Detection using Function Call Graph with Graph Convolutional Networks},
year={2021},
volume={},
number={},
pages={279-287},
doi={10.1109/ICSCCC51823.2021.9478141}
}