Authors:
https://medium.com/@rrohith2001/url-feature-engineering-and-classification-66c0512fb34d
We would like to thank our professor Premjith B for the assistance and guidance.
Prerequisites: conda and git
Please note: all file system paths in the scripts are written in UNIX format. On Windows, replace '/' with "\\".
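As an alternative to editing separators by hand, paths can be built with `pathlib`, which picks the right separator for the OS the script runs on. The snippet below is only a sketch: the relative location `FinalDataset/features.csv` is an assumption used for illustration, and the repository scripts may build their paths differently.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath

# Hypothetical relative path, used only for illustration.
parts = ("FinalDataset", "features.csv")

# Path joins the components with the separator of the current OS.
features_path = Path(*parts)

# The Pure* classes show what each OS family would produce,
# regardless of where this snippet is executed.
posix_form = str(PurePosixPath(*parts))
windows_form = str(PureWindowsPath(*parts))

print(features_path)
print(posix_form)
print(windows_form)
```

Building paths this way means the same script should run unchanged on UNIX and Windows.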
git clone https://github.com/Rohith-2/url_classification_dl.git
cd url_classification_dl
conda create -n pyenv python=3.8.5
conda activate pyenv
pip install -r requirements.txt
Feature Extraction:
cd scripts/
python extract_Features.py
The extracted features are explained and visualised in this Notebook. The training data produced by feature extraction is saved as features.csv under FinalDataset. Feature extraction took 18-26 hours on average for each category of URLs, for a total of about 95 hours.
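To give a feel for what this step computes, here is a minimal sketch of the URL string features listed in the feature table of this README. It is not the code in `extract_Features.py`: the function names, the IPv4 regular expression, and the exact counting rules (for example, counting `#` characters for numFragments) are assumptions made for illustration.

```python
import math
import re
from collections import Counter
from urllib.parse import urlparse, parse_qsl

# Assumed pattern: four dot-separated groups of 1-3 digits (IPv4 only).
IPV4_PATTERN = re.compile(r"(?<!\d)(?:\d{1,3}\.){3}\d{1,3}(?!\d)")


def url_entropy(url: str) -> float:
    """Shannon entropy of the URL string, in bits per character."""
    if not url:
        return 0.0
    total = len(url)
    counts = Counter(url)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def url_string_features(url: str) -> dict:
    """Compute the URL string features for a single URL."""
    parsed = urlparse(url)
    return {
        "URL Entropy": url_entropy(url),
        "numDigits": sum(ch.isdigit() for ch in url),
        "URL Length": len(url),
        # One entry per key=value pair in the query string.
        "numParameters": len(parse_qsl(parsed.query, keep_blank_values=True)),
        "numFragments": url.count("#"),
        "num_%20": url.count("%20"),
        "num_@": url.count("@"),
        # Only the host part is checked for an IPv4 address.
        "has_ip": bool(IPV4_PATTERN.search(parsed.netloc)),
    }


example = url_string_features("http://192.168.0.1/a%20b?x=1&y=2#top")
print(example)
```

These features need only the URL string, so they are cheap to compute. The domain and page features require network access, which is likely where most of the extraction time goes.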
Training:
cd scripts/
python nn_Training.py
The trained model is exported to the models folder.
Testing:
cd scripts/
python predict_args.py -i <url>
If you only wish to use the pre-trained model, please check the releases.
Running the GUI locally:
cd GUI/
streamlit run predict.py
All of the above commands are run from the repository root folder (url_classification_dl).
Feature Name | Feature Group | Feature Description |
---|---|---|
URL Entropy | URL String Characteristics | Entropy of URL |
numDigits | URL String Characteristics | Total number of digits in URL string |
URL Length | URL String Characteristics | Total number of characters in URL string |
numParameters | URL String Characteristics | Total number of query parameters in URL |
numFragments | URL String Characteristics | Total number of fragments in URL |
domainExtension | URL String Characteristics | Domain extension |
num_%20 | URL String Characteristics | Number of '%20' in URL |
num_@ | URL String Characteristics | Number of '@' in URL |
has_ip | URL String Characteristics | Occurrence of an IP address in URL |
hasHTTP | URL domain features | Website domain has http protocol |
hasHTTPS | URL domain features | Website domain has https protocol |
urlIsLive | URL domain features | The page is online |
daysSinceRegistration | URL domain features | Number of days from today since domain was registered |
daysSinceExpired | URL domain features | Number of days from today since domain expired |
bodyLength | URL page features | Total number of characters in URL's HTML page |
numTitles | URL page features | Total number of H1-H6 titles in URL's HTML page |
numImages | URL page features | Total number of images embedded in URL's HTML page |
numLinks | URL page features | Total number of links embedded in URL's HTML page |
scriptLength | URL page features | Total number of characters in embedded scripts in URL's HTML page |
specialCharacters | URL page features | Total number of special characters in URL's HTML page |
scriptToSpecialCharacterRatio | URL page features | The ratio of total length of embedded scripts to special characters in HTML page |
scriptToBodyRatio | URL page features | The ratio of total length of embedded scripts to total number of characters in HTML page |
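The page features in the table can be approximated from a page's HTML with the standard library alone. The sketch below is illustrative, not the repository's implementation: `PageFeatureParser` and `page_features` are hypothetical names, "special character" is assumed to mean any character that is neither alphanumeric nor whitespace, and links are assumed to be `<a>` tags.

```python
from html.parser import HTMLParser


class PageFeatureParser(HTMLParser):
    """Counts headings, images, links and script characters in an HTML page."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.num_titles = 0
        self.num_images = 0
        self.num_links = 0
        self.script_length = 0
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self.num_titles += 1
        elif tag == "img":
            self.num_images += 1
        elif tag == "a":
            self.num_links += 1
        elif tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Only text inside <script> elements counts towards scriptLength.
        if self._in_script:
            self.script_length += len(data)


def page_features(html: str) -> dict:
    """Compute the page features for an HTML document given as a string."""
    parser = PageFeatureParser()
    parser.feed(html)
    parser.close()
    body_length = len(html)
    # Assumed definition: neither alphanumeric nor whitespace.
    special = sum(not ch.isalnum() and not ch.isspace() for ch in html)
    script = parser.script_length
    return {
        "bodyLength": body_length,
        "numTitles": parser.num_titles,
        "numImages": parser.num_images,
        "numLinks": parser.num_links,
        "scriptLength": script,
        "specialCharacters": special,
        # Ratios fall back to 0.0 to avoid division by zero.
        "scriptToSpecialCharacterRatio": script / special if special else 0.0,
        "scriptToBodyRatio": script / body_length if body_length else 0.0,
    }


sample = ('<html><body><h1>T</h1><h2>S</h2><img src="a.png">'
          '<a href="#">x</a><script>var a=1;</script></body></html>')
print(page_features(sample))
```

In practice the HTML would first be downloaded from the URL; that step is left out here so the sketch runs without network access.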
The feature_data.csv file is licensed under a Creative Commons Attribution 4.0 International License.