This repository contains the code for the paper "A Machine Learning-Aware Data Re-partitioning Framework for Spatial Datasets". The framework aims to reduce the training time and memory usage of a spatial machine learning model by reducing the number of partitions in a spatial grid dataset. Four datasets are used in the experiments:
- NYC Taxi Trip Multivariate Dataset
- NYC Taxi Trip Univariate Dataset
- Washington King County Home Sales Multivariate Dataset
- Chicago Abandoned Cars Univariate Dataset
To experiment with the NYC taxi trip datasets, download the 'Yellow Taxi Trip Records' for January 2009 from the site: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page. Put the downloaded CSV file inside the folder 'data/taxi_trip'. The file name should be 'yellow_tripdata_2009-01.csv'. The other datasets are available under the 'data' folder.
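Before running the scripts, it can help to confirm that the downloaded file sits where the instructions above say it should. The following is a minimal sketch, not part of the repository: the helper names `taxi_csv_path` and `taxi_csv_is_present` are hypothetical, and only the folder and file name are taken from the text above.

```python
from pathlib import Path

# Folder and file name as given in the instructions above.
TAXI_CSV_RELATIVE = Path("data") / "taxi_trip" / "yellow_tripdata_2009-01.csv"


def taxi_csv_path(repo_root="."):
    """Return the location where the NYC taxi CSV is expected (hypothetical helper)."""
    return Path(repo_root) / TAXI_CSV_RELATIVE


def taxi_csv_is_present(repo_root="."):
    """Return True if the downloaded CSV exists at the expected location."""
    return taxi_csv_path(repo_root).is_file()


if __name__ == "__main__":
    # Run from the repository root to check the default location.
    status = "found" if taxi_csv_is_present() else "missing"
    print(f"{taxi_csv_path()}: {status}")
```

If the check reports the file as missing, the usual causes are a different file name or a file left in the download folder.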
Run the data_preprocessing.py file. Its main method contains four method calls, one for each of the four datasets. To preprocess only one dataset, comment out the method calls for the other datasets.
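The pattern of commenting out calls in the main method might look roughly like the sketch below. The function names are hypothetical stand-ins chosen for illustration; the actual names in data_preprocessing.py may differ, and the bodies here only return a label instead of doing real preprocessing.

```python
# Hypothetical stand-ins for the four per-dataset preprocessing methods.
def preprocess_nyc_multivariate():
    return "NYC taxi trip multivariate"


def preprocess_nyc_univariate():
    return "NYC taxi trip univariate"


def preprocess_king_county_home_sales():
    return "King County home sales multivariate"


def preprocess_chicago_abandoned_cars():
    return "Chicago abandoned cars univariate"


def main():
    processed = []
    # Only the NYC multivariate dataset is preprocessed in this run.
    processed.append(preprocess_nyc_multivariate())
    # The calls for the other datasets are commented out, so they are skipped.
    # processed.append(preprocess_nyc_univariate())
    # processed.append(preprocess_king_county_home_sales())
    # processed.append(preprocess_chicago_abandoned_cars())
    return processed


if __name__ == "__main__":
    print(main())
```

Uncommenting a line re-enables preprocessing for that dataset on the next run.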
Run the repartitioning.py file. Its main method contains the re-partitioning steps for all four datasets. To re-partition only one dataset, comment out the re-partitioning steps for the other datasets. To experiment with a different information loss threshold, change the value of the variable infoLossThreshold_NYC_Multi.
Run the train_test_models.py file. Several methods perform training and testing on seven types of machine learning models. To train and test only one model, comment out the method calls for the other machine learning models. Regression models are tested with the NYC taxi trip multivariate dataset (paths are defined by the first set of global variables). Spatial kriging and clustering are performed on the NYC taxi trip univariate dataset (paths are defined by the second set of global variables). To train and test on a different dataset, change these paths to point to that dataset.
K. Chowdhury, V. V. Meduri and M. Sarwat, "A Machine Learning-Aware Data Re-partitioning Framework for Spatial Datasets," 2022 IEEE 38th International Conference on Data Engineering (ICDE), 2022, pp. 2426-2439, doi: 10.1109/ICDE53745.2022.00227.
@INPROCEEDINGS{9835487,
author={Chowdhury, Kanchan and Meduri, Venkata Vamsikrishna and Sarwat, Mohamed},
booktitle={2022 IEEE 38th International Conference on Data Engineering (ICDE)},
title={A Machine Learning-Aware Data Re-partitioning Framework for Spatial Datasets},
year={2022},
volume={},
number={},
pages={2426-2439},
doi={10.1109/ICDE53745.2022.00227}
}