Machine Learning and Deep Learning courses project.
Download the zip file of the dataset you want to test and put it in the data folder. The following datasets are studied in this project:
- MotionSense: onedrive/share/motion-sense.zip
- ScooterTrajectories: I'm sorry, this dataset is private, I hope you understand.
  - onedrive/share/scooter_trajectories.zip: the original dataset. I don't recommend running your tests on this one, because the merge and filter operations take about 20 minutes.
  - onedrive/share/scooter_trajectories_generated.zip: the dataset already filtered and merged. It is smaller, and analysis and training can start immediately.
- Create the environment with all dependencies:
  - Conda: edit the name field of environment.yml to change the environment name
      conda env create -f environment.yml
  - Virtualenv: not recommended because it has not been tested
      # Create environment
      virtualenv <env-name>
      source <env-name>/bin/activate
      # Install requirements
      pip install -r requirements.txt
- Run the script with the default configuration
    python src/main.py
The project execution can be configured through a configuration file to perform different operations on the data. You can run only the data analysis or only the training with a selected technique, choose which dataset to test, choose how to filter and generate data, and so on.
usage: python src/main.py [-h] [--log LOG_LVL] [--config CONFIG_FILE]

optional arguments:
  -h, --help            show this help message and exit
  --log LOG_LVL, -l LOG_LVL
                        log level of the project: DEBUG, INFO, WARNING, ERROR, FATAL
  --config CONFIG_FILE, -c CONFIG_FILE
                        path to configuration file with all settings
- --log or -l argument: specifies the log level to print. Every message with a severity greater than or equal to the specified level is printed. The severity levels, in increasing order, are: DEBUG, INFO, WARNING, ERROR, FATAL. If this parameter is omitted, the default value is DEBUG.
- --config or -c argument: specifies the path of the configuration file (e.g. ./config.ini) to use for the test. The configuration file defines the behavior of the project script on the datasets. If this parameter is omitted, the default configuration file <proj-dir>/defconfig.ini is used. The syntax is the one of the Python configparser package (configparser doc). See the configuration file sub-section for more info.
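As a minimal sketch of how the two options could be parsed, the snippet below uses argparse and logging from the Python standard library. The names build_parser and configure_logging, the logger name "project", and the relative default path defconfig.ini are assumptions made for illustration; the actual src/main.py may be organized differently.

```python
import argparse
import logging

# Log levels listed in the usage text, in increasing severity.
LOG_LEVELS = ["DEBUG", "INFO", "WARNING", "ERROR", "FATAL"]


def build_parser():
    """Hypothetical parser mirroring the documented options."""
    parser = argparse.ArgumentParser(prog="python src/main.py")
    parser.add_argument(
        "--log", "-l", dest="log_lvl", metavar="LOG_LVL",
        choices=LOG_LEVELS, default="DEBUG",
        help="log level of the project: " + ", ".join(LOG_LEVELS),
    )
    parser.add_argument(
        "--config", "-c", dest="config_file", metavar="CONFIG_FILE",
        default="defconfig.ini",  # assumed location of the default file
        help="path to configuration file with all settings",
    )
    return parser


def configure_logging(level_name):
    """Return a logger that drops messages below the given level."""
    logger = logging.getLogger("project")  # hypothetical logger name
    logger.setLevel(getattr(logging, level_name))
    return logger
```

With this sketch, `build_parser().parse_args(["-l", "WARNING"])` yields a WARNING level, so a logger configured with it would print WARNING, ERROR and FATAL messages and drop DEBUG and INFO ones.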
The configuration file (e.g. defconfig.ini) is divided into sections, one for each dataset that this project can process.
- MotionSense section
  This section starts with the [MOTION-SENSE] line; the following lines specify the MotionSense test settings.
  - skip: bool
    Skip this section's tests.
  - save-file: bool
    Save generated results to file. Image results are saved in the <proj-dir>/image folder, HTML results are saved in the <proj-dir>/html folder.
  - perform-analysis: bool
    Run dataset analysis and results analysis.
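To make the section layout concrete, here is a small sketch that parses a [MOTION-SENSE] section with configparser. The sample values and the helper name read_motion_sense are assumptions for illustration and are not taken from the real defconfig.ini.

```python
import configparser

# Sample content (assumed values) using the MotionSense keys.
SAMPLE_CONFIG = """
[MOTION-SENSE]
skip = false
save-file = true
perform-analysis = true
"""


def read_motion_sense(text):
    """Return the MotionSense settings as a dict of booleans."""
    config = configparser.ConfigParser()
    config.read_string(text)
    section = config["MOTION-SENSE"]
    return {
        "skip": section.getboolean("skip"),
        "save-file": section.getboolean("save-file"),
        "perform-analysis": section.getboolean("perform-analysis"),
    }


settings = read_motion_sense(SAMPLE_CONFIG)
print(settings)
```

Optional integer keys could be read in the same way with `section.getint(key, fallback=None)`, which returns None when the key is omitted.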
- ScooterTrajectories section
  This section starts with the [SCOOTER-TRAJECTORIES] line; the following lines specify the ScooterTrajectories test settings.
  - skip: bool
    Skip this section's tests.
  - load-original-data: bool
    Generate a new dataset from the original dataset and save it in the <proj-dir>/data/scooter_trajectories_generated folder. This operation takes about 30 minutes.
  - load-generated-data: bool
    Load the already filtered, merged and processed dataset from the <proj-dir>/data/scooter_trajectories_generated folder or from the <proj-dir>/data/scooter_trajectories_generated.zip file.
  - chunk-size: int
    Chunk size used to load the positions of the original dataset, so that a huge amount of data can be handled.
  - max-chunk-num: int, optional
    Index of the last chunk of the original dataset to parse. This limit speeds up loading and filtering of the original dataset. If omitted, every chunk is parsed.
  - rental-num-to-analyze: int, optional
    Number of rentals to analyze. This limit speeds up the analysis by running it on a reduced amount of data. If omitted, all rentals are analyzed.
  - only-north: bool
    Perform the analysis only on the northern part of the dataset positions, where the most significant data is located.
  - perform-heuristic: bool
    Run the timedelta, spreaddelta, edgedelta and coorddelta heuristics on the generated dataset and overwrite the generated dataset with the computed heuristic columns.
  - group-on-timedelta: bool
    If true, group the trajectories by the timedelta heuristic division using timedelta_id; otherwise group them by rental using rental_id. This setting is used by the spreaddelta, edgedelta and coorddelta heuristics and by the analysis.
  - timedelta: int, optional
    Time threshold: when the difference in time between two positions is greater than this value, they are considered part of different trajectories.
  - spreaddelta: int, optional
    Spread threshold: when the difference in spread (occupied area) between trajectories is lower than this value, the trajectories are considered part of the same group.
  - edgedelta: int, optional
    Edge threshold: when the difference in edges (start and stop coordinates) between trajectories is lower than this value, the trajectories are considered part of the same group.
  - perform-clustering: string
    Run the following clustering algorithms on the generated dataset positions: k-means, mean-shift, Gaussian mixture, Ward hierarchical and full hierarchical.
  - n-clusters: int, optional
    Number of clusters given to the clustering algorithms that need it. If omitted, some WCSS clustering tests are run for the Elbow method.
  - with-pca: bool
    Enable PCA feature extraction before clustering.
  - with-standardization: bool
    Enable feature standardization before clustering.
  - with-normalization: bool
    Enable feature normalization before clustering.
  - perform-data-analysis: bool
    Analyze rentals and positions with scatter plots, line plots and distribution plots. The results are saved as images in the <proj-dir>/image folder.
  - perform-heuristic-analysis: bool
    Analyze the heuristic columns (if previously computed and saved in your generated data) with scatter plots, line plots and distribution plots. The results are saved as images in the <proj-dir>/image folder.
  - perform-map: bool
    Show in your browser geographical maps and 3D maps of the generated dataset positions in relation to the heuristic process (if performed and saved in your generated data) and to clustering (if perform-clustering is true). The results are saved as HTML files in the <proj-dir>/html folder.
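The timedelta setting can be illustrated with a small sketch. It assumes that positions are sorted by time and that a gap larger than timedelta starts a new trajectory, and it assigns a timedelta_id to each position accordingly. The function name, the use of plain seconds as the unit, and the pure-Python implementation are assumptions for illustration; the project's own heuristic may differ.

```python
def assign_timedelta_ids(timestamps, timedelta):
    """Assign a group id to each timestamp (assumed sorted, in seconds).

    A new id starts whenever the gap to the previous position
    exceeds `timedelta`.
    """
    ids = []
    current_id = 0
    previous = None
    for ts in timestamps:
        if previous is not None and ts - previous > timedelta:
            current_id += 1
        ids.append(current_id)
        previous = ts
    return ids


# Positions at 0s, 10s, 20s, then a long pause, then 500s and 505s.
print(assign_timedelta_ids([0, 10, 20, 500, 505], timedelta=60))
# → [0, 0, 0, 1, 1]
```

Grouping by such an id instead of rental_id is what the group-on-timedelta setting selects.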
- MDM 🐵