A study of the impact of extraction methods and SVM hyperparameters on classification performance for a 4-class COVID-19 and pneumonia dataset.
To preprocess the original dataset (on Google Colab), the kaggle.json authorization file must be in the repo directory:
!git clone https://github.com/Shandelier/Classification-Experiments
%cd Classification-Experiments
!mkdir /content/Classification-Experiments/curated-chest-xray-image-dataset-for-covid19 -p
!pip install kaggle
import json
import zipfile
import os
auth = {}
with open('/content/Classification-Experiments/kaggle.json', 'r') as file:
    auth = json.load(file)
os.environ['KAGGLE_USERNAME'] = auth["username"]
os.environ['KAGGLE_KEY'] = auth["key"]
# !kaggle config set -n path -v '/content/drive/MyDrive/Colab Notebooks/PWr9'
!kaggle datasets download -d unaissait/curated-chest-xray-image-dataset-for-covid19
os.chdir('/content/Classification-Experiments/curated-chest-xray-image-dataset-for-covid19')
for file in os.listdir():
    # Skip anything that is not a zip archive (e.g. already extracted folders)
    if zipfile.is_zipfile(file):
        with zipfile.ZipFile(file, 'r') as zip_ref:
            zip_ref.extractall()
! mv /content/Classification-Experiments/curated-chest-xray-image-dataset-for-covid19/curated-chest-xray-image-dataset-for-covid19.zip /content/Classification-Experiments/curated-chest-xray-image-dataset-for-covid19
%cd ..
# Unzip dataset
!unzip curated-chest-xray-image-dataset-for-covid19.zip -d /content/Classification-Experiments/curated-chest-xray-image-dataset-for-covid19
!python ./preprocess.py --dataset_dir "/content/Classification-Experiments/curated-chest-xray-image-dataset-for-covid19" --results_dir "/content/Classification-Experiments/results" --output_dir "/content/Classification-Experiments/output" --output_dataset_dir "/content/Classification-Experiments/datasets"
To analyze the prepared dataset, run:
python ./analyze_extraction.py
python ./post_extraction.py
python ./extract.py
python ./analyze_SVM.py
python ./post_SVM.py
The results are in the "results*" folders.
We compared the PCA and Chi2 extraction methods on the base dataset from the COVID-19 Kaggle dataset. The 48 features were extracted using the methods in the preprocess.py file. Then, in a loop, we compared the balanced accuracy score for different numbers of preserved components.
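The comparison loop can be sketched roughly as follows. This is a minimal illustration with scikit-learn, not the code from the repository: the synthetic data from make_classification, the SVC classifier, the 5-fold cross-validation, the MinMaxScaler step and the component grid are all assumptions made for the example. Only the 48 features, the 4 classes, PCA, Chi2 and the balanced accuracy metric come from the description above.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 48-feature, 4-class dataset (assumption).
X, y = make_classification(n_samples=400, n_features=48, n_informative=12,
                           n_classes=4, random_state=0)


def compare_extractors(X, y, component_grid=(5, 10, 20, 40), cv=5):
    """Return {(method, k): mean balanced accuracy} for PCA and Chi2."""
    scores = {}
    for k in component_grid:
        extractors = {
            "PCA": PCA(n_components=k, random_state=0),
            "Chi2": SelectKBest(chi2, k=k),
        }
        for name, extractor in extractors.items():
            # MinMaxScaler keeps the training features non-negative,
            # which the chi2 score function requires.
            model = make_pipeline(MinMaxScaler(), extractor, SVC())
            fold_scores = cross_val_score(model, X, y, cv=cv,
                                          scoring="balanced_accuracy")
            scores[(name, k)] = fold_scores.mean()
    return scores


results = compare_extractors(X, y)
for (name, k), score in sorted(results.items()):
    print(f"{name:4s} k={k:2d} balanced accuracy={score:.3f}")
```

Putting the extractor inside the pipeline means it is refitted on each training fold, so the held-out fold does not leak into the choice of components.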
From the extraction experiment we chose the best setting (PCA with n_components=10) and prepared the dataset for the SVM experiment. Here we compared three categories of parameters:
- Kernels – linear, sigmoid, RBF;
- Gammas – from 1 to 0.0001;
- C parameters – from 1000 to 0.01.
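A parameter sweep of this shape could look like the sketch below. The kernel names and the end points of the gamma and C ranges are taken from the list above; the power-of-ten steps in between, the synthetic data, the StandardScaler step and the 3-fold cross-validation are assumptions for illustration and may differ from what analyze_SVM.py does.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

KERNELS = ("linear", "sigmoid", "rbf")
GAMMAS = (1, 0.1, 0.01, 0.001, 0.0001)  # assumed step: powers of ten
CS = (1000, 100, 10, 1, 0.1, 0.01)      # assumed step: powers of ten

# Synthetic stand-in for the 48-feature, 4-class dataset (assumption).
X, y = make_classification(n_samples=150, n_features=48, n_informative=12,
                           n_classes=4, random_state=0)


def evaluate_grid(X, y, cv=3):
    """Return {(kernel, gamma, C): mean balanced accuracy}."""
    scores = {}
    for kernel, gamma, C in product(KERNELS, GAMMAS, CS):
        # PCA with 10 components mirrors the setting chosen above.
        # Note: the linear kernel ignores gamma, so its rows repeat.
        model = make_pipeline(
            StandardScaler(),
            PCA(n_components=10, random_state=0),
            SVC(kernel=kernel, gamma=gamma, C=C),
        )
        fold_scores = cross_val_score(model, X, y, cv=cv,
                                      scoring="balanced_accuracy")
        scores[(kernel, gamma, C)] = fold_scores.mean()
    return scores


scores = evaluate_grid(X, y)
best = max(scores, key=scores.get)
print("best (kernel, gamma, C):", best, "score:", round(scores[best], 3))
```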
The results were checked with the Wilcoxon test to reveal statistically different scores. Under every score in the table there is a list of the scores that were significantly worse than it.
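One way to build such "worse than" lists is a pairwise one-sided Wilcoxon signed-rank test on per-fold scores, sketched below with scipy.stats.wilcoxon. The configuration names, the fold scores (made-up numbers for illustration only, not results from this study), the significance level of 0.05 and the helper worse_than are all hypothetical; post_SVM.py may do this differently.

```python
from scipy.stats import wilcoxon

# Made-up per-fold balanced accuracy scores, for illustration only.
fold_scores = {
    "config_a": [0.90, 0.91, 0.89, 0.92, 0.88, 0.93, 0.90, 0.91, 0.87, 0.94],
    "config_b": [0.89, 0.89, 0.86, 0.88, 0.83, 0.87, 0.83, 0.83, 0.78, 0.84],
    "config_c": [0.78, 0.77, 0.73, 0.74, 0.68, 0.71, 0.66, 0.65, 0.59, 0.64],
}


def worse_than(scores, alpha=0.05):
    """Map each configuration to the ones it significantly outperforms."""
    worse = {name: [] for name in scores}
    for a in scores:
        for b in scores:
            if a == b:
                continue
            diffs = [x - y for x, y in zip(scores[a], scores[b])]
            if all(d == 0 for d in diffs):
                continue  # the test is undefined when all differences are zero
            # One-sided test: are the paired scores of a greater than those of b?
            _, p_value = wilcoxon(scores[a], scores[b], alternative="greater")
            if p_value < alpha:
                worse[a].append(b)
    return worse


worse = worse_than(fold_scores)
for name, losers in worse.items():
    print(name, "is significantly better than:", losers)
```

The test is paired, so the scores for every configuration have to come from the same folds in the same order.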