Using Python, we can integrate several datasets into a single schema and find and fix possible problems in the data. Here we work with 7 datasets in various formats, all about housing information in Victoria, Australia. The first task is to integrate all the datasets into one dataset:
- Hospitals (HTML Format)
- Supermarkets (Excel Format)
- Shopping centers (PDF Format)
- Real Estate (XML format)
- Real Estate (JSON format)
- Vic_suburb_boundary (Shapefile Format)
- GTFS_Melbourne_Train_Information (Text Format)
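As a rough illustration of what integrating sources into one schema can look like, the sketch below merges the two real estate sources (XML and JSON) into a single list of records. The field names (`property_id`, `suburb`, `price`), the XML layout, and the sample values are all hypothetical, since the actual files are not shown here; only the standard library is used so the example stays self-contained.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample data: the real files' structure and field names may differ.
JSON_TEXT = '[{"property_id": 1, "suburb": "CARLTON", "price": 850000}]'
XML_TEXT = """
<properties>
  <property>
    <property_id>2</property_id>
    <suburb>RICHMOND</suburb>
    <price>920000</price>
  </property>
</properties>
"""


def to_record(property_id, suburb, price):
    """Coerce one raw row into the shared schema with consistent types."""
    return {"property_id": int(property_id), "suburb": suburb, "price": float(price)}


def from_json(text):
    """Parse the JSON source into records in the shared schema."""
    return [to_record(r["property_id"], r["suburb"], r["price"]) for r in json.loads(text)]


def from_xml(text):
    """Parse the XML source into records in the shared schema."""
    root = ET.fromstring(text.strip())
    return [
        to_record(p.findtext("property_id"), p.findtext("suburb"), p.findtext("price"))
        for p in root.iter("property")
    ]


def integrate(*sources):
    """Combine sources, keeping the first record seen for each property_id."""
    merged = {}
    for records in sources:
        for rec in records:
            merged.setdefault(rec["property_id"], rec)
    return sorted(merged.values(), key=lambda r: r["property_id"])


combined = integrate(from_json(JSON_TEXT), from_xml(XML_TEXT))
```

Keeping the first record per `property_id` is just one possible duplicate-handling rule; the right choice depends on which source is considered more reliable.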
The second task is to study the effect of different normalization/transformation methods:
- Z-score Standardization
- Min-max normalization
We then observe and explain their effect, assuming we want to develop a linear model to predict the price of a property using the Distance_to_sc, travel_min_to_CBD, and Distance_to_hospital attributes.
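The two methods can be sketched as follows. This is a minimal, dependency-free illustration rather than the project's actual code: the sample values are made up, and the z-score uses the population standard deviation, which is an assumption (a sample standard deviation would give slightly different numbers).

```python
from statistics import mean, pstdev


def z_score(values):
    """Rescale values to mean 0 and population standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize a constant column")
    return [(v - lo) / (hi - lo) for v in values]


# Made-up sample values for the three predictor attributes.
features = {
    "Distance_to_sc": [0.5, 1.5, 2.5, 3.5],
    "travel_min_to_CBD": [10.0, 20.0, 30.0, 40.0],
    "Distance_to_hospital": [1.0, 2.0, 4.0, 9.0],
}

standardized = {name: z_score(col) for name, col in features.items()}
normalized = {name: min_max(col) for name, col in features.items()}
```

Both are linear rescalings of each column, so they change the units of the fitted coefficients but not the shape of each attribute's distribution. After either transformation the three attributes sit on comparable scales, which makes the coefficients of a linear model easier to compare.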