Data Wrangling A3: Data Integration and Data Reshaping
- Assignment_Specifications.pdf: Assignment specifications
- Assignment_Solutions.ipynb/pdf: Assignment solutions. Python code that integrates several datasets into a single schema, then finds and fixes problems in the data.
- Input data: 7 datasets in various formats containing housing information for Victoria, Australia.
- Input files: GTFS_Melbourne_Train_Information.zip, vic_suburb_boundary.zip, 30945305.zip.
- Output files: 30945305_A3_solution.zip
Tasks completed:
- Task 1: Data Integration
  - Integrated the 7 input files into a single dataset following the schema given in the assignment specifications.
  - File types: .txt, .xlsx, .json, .xml, .html and .pdf
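
As a rough illustration of the integration step, the sketch below reads two small in-memory sources (JSON and XML) and aligns them to one schema. The column names in `SCHEMA`, the sample records, and the helper names `from_json`, `from_xml` and `integrate` are all hypothetical; the real schema is the one in the assignment specifications, and the notebook may take a different approach.

```python
import json
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical target schema; the real one is defined in the assignment specifications.
SCHEMA = ["property_id", "suburb", "price"]

# Invented sample records standing in for two of the input formats.
json_text = '[{"property_id": 1, "suburb": "Carlton", "price": 850000}]'
xml_text = (
    "<properties>"
    "<property><property_id>2</property_id>"
    "<suburb>Richmond</suburb><price>920000</price></property>"
    "</properties>"
)


def from_json(text):
    """Parse a JSON array of records into a DataFrame."""
    return pd.DataFrame(json.loads(text))


def from_xml(text):
    """Parse <property> elements into a DataFrame (all values arrive as strings)."""
    root = ET.fromstring(text)
    rows = [{child.tag: child.text for child in prop} for prop in root.iter("property")]
    return pd.DataFrame(rows)


def integrate(frames):
    """Stack the sources, enforce column order and dtypes, and drop duplicate ids."""
    combined = pd.concat(frames, ignore_index=True).reindex(columns=SCHEMA)
    combined["property_id"] = combined["property_id"].astype(int)
    combined["price"] = combined["price"].astype(float)
    return combined.drop_duplicates(subset="property_id").reset_index(drop=True)


integrated = integrate([from_json(json_text), from_xml(xml_text)])
print(integrated)
```

Casting to explicit dtypes after concatenation matters here because values parsed from XML are strings, while the same fields parsed from JSON are already numeric.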
- Task 2: Data Reshaping
  - Studied the effects of different normalization/transformation methods (standardization, min-max normalization, log, power and Box-Cox transformations) on various attributes.
  - Observed and explained their effects.
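
The transformations listed for Task 2 can be sketched on a small array as follows. The sample values are invented, and the exponent 0.5 for the power transform is only an assumption chosen for illustration; the notebook may use different attributes and parameters.

```python
import numpy as np
from scipy import stats

# Invented, right-skewed sample values standing in for an attribute such as price.
x = np.array([1.0, 2.0, 2.0, 3.0, 5.0, 8.0, 40.0])

standardized = (x - x.mean()) / x.std()        # zero mean, unit variance
min_max = (x - x.min()) / (x.max() - x.min())  # rescaled to the range [0, 1]
log_x = np.log(x)                              # requires strictly positive values
sqrt_x = np.power(x, 0.5)                      # power transform, exponent 0.5 (assumed)
boxcox_x, lam = stats.boxcox(x)                # lambda estimated by maximum likelihood

# Compare how each method changes the skewness of the distribution.
results = {
    "raw": x,
    "standardized": standardized,
    "min-max": min_max,
    "log": log_x,
    "sqrt": sqrt_x,
    "box-cox": boxcox_x,
}
for name, values in results.items():
    print(f"{name:>12}: skew = {stats.skew(values):.3f}")
```

Standardization and min-max normalization are linear rescalings, so they leave the shape of the distribution (and its skewness) unchanged. The log, power and Box-Cox transformations are nonlinear and can reduce skew.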
Libraries used: pandas, numpy, re, json, bs4, tabula, scipy, matplotlib, sklearn, sklearn.model_selection, sklearn.metrics, sklearn.linear_model