This data cleaning project cleans and transforms real-world data to preserve its quality, integrity and context for analysis, and to prepare it for building a linear regression model.
The project handles missing values and duplicates, transforms and converts data types, and carefully considers outliers and every other inconsistency encountered in the dataset.
The cleaning process prioritises statistical and analytical concepts so that the cleaned data supports accurate results. These processes are documented in detail.
- The dataset used in this project was sourced from Kaggle and can be accessed here.
- The dataset contained data quality issues such as NaN values, outliers, duplicates, skewed distributions, improper casing and illegal characters.
- The dataset contained date values stored with the object data type rather than as datetimes.
- The Python programming language was used, with pandas for data manipulation and NumPy for numerical computation.
- The Python visualisation libraries matplotlib and seaborn were also used.
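
As an illustration of two of the issues listed above (dates stored as objects, and improper casing with illegal characters), here is a minimal sketch using pandas. The column names and values are hypothetical and do not come from the project's dataset, and the rule for which characters count as illegal is an assumption.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "order_date": ["2021-01-05", "2021-02-17", "not a date"],
    "city": ["  lagos", "ABUJA ", "Ka#no"],
})

# Dates read from a file usually arrive as plain strings.
# errors="coerce" turns unparseable entries into NaT instead of raising.
df["order_date"] = pd.to_datetime(
    df["order_date"], format="%Y-%m-%d", errors="coerce"
)

# Assumed rule: keep only letters and spaces, trim whitespace, use title case.
df["city"] = (
    df["city"]
    .str.strip()
    .str.replace(r"[^A-Za-z ]", "", regex=True)
    .str.title()
)

print(df.dtypes)
print(df)
```

Coercing bad dates to NaT keeps them visible as missing values, so they can be handled alongside the other missing data instead of stopping the run.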
- Data Quality Standards and Requirements: These are defined as the accepted decimal places, data types, text casing and value consistency for computation that are most compatible with the project, along with other considerations.
- Data Profiling: The dataset was examined thoroughly to understand its structure and anomalies. Potential data quality issues, such as missing values, duplicates, outliers and inconsistencies, were visualised and identified.
- Data Cleaning and Transformation: The pandas and NumPy libraries were used to address the identified issues, including but not limited to removing duplicates, handling missing values, separating values of different data types that were recorded as a single value, and standardising formats and data types to ensure uniformity.
- Documentation: Comments and detailed explanations were included where necessary to communicate effectively, remain transparent and ensure that this exercise can be reproduced.
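
The profiling and cleaning steps above can be sketched roughly as follows. This is a minimal example on hypothetical toy data, not the project's actual code. The median imputation and the 1.5 × IQR outlier rule are assumptions chosen for illustration, since the text does not state which techniques were used.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
df = pd.DataFrame({
    "price": [10.0, 12.0, 12.0, np.nan, 11.0, 13.0, 250.0],
    "units": [1, 2, 2, 3, 4, 5, 6],
})

# Profiling: count missing values per column and fully duplicated rows.
missing_per_column = df.isna().sum()
duplicate_rows = int(df.duplicated().sum())

# Cleaning: drop exact duplicate rows, then fill missing prices
# with the column median (an assumed imputation strategy).
clean = df.drop_duplicates().copy()
clean["price"] = clean["price"].fillna(clean["price"].median())

# Outliers: flag (rather than delete) values outside 1.5 * IQR
# of the quartiles, so they can be reviewed before any removal.
q1, q3 = clean["price"].quantile([0.25, 0.75])
iqr = q3 - q1
clean["price_outlier"] = (
    (clean["price"] < q1 - 1.5 * iqr) | (clean["price"] > q3 + 1.5 * iqr)
)

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print(clean)
```

Flagging outliers in a separate column, instead of dropping them straight away, leaves room to decide case by case whether a value is an error or a genuine extreme before fitting the regression model.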