This repository encompasses a detailed analysis of various regression models and their ensemble for predicting prices based on a Craigslist Used Cars dataset. The dataset consists of approximately 200,000 rows with 26 features, including Price, Year, Manufacturer, Model, Odometer, and more.
- Null Check: No null values found.
- Duplicate Removal: Eliminated duplicates based on key columns.
- Type Conversion: Converted relevant columns to numeric types.
- Outlier Handling: Removed outliers in 'price' and 'odometer.'
- Feature Scaling: Explored scaling techniques (Robust Scaler).
- Log Transformation: Applied log transformation for skewed variables.
- Correlation Analysis: Explored correlations for post-/pre-1976 cars.
- Column Removal: Dropped irrelevant columns.
- Label Encoding: Converted categorical to numerical using Label Encoding.
- Missing Values: Removed rows with missing values.
This project aims to predict prices using various regression models, including linear regression, random forest regression, polynomial regression, support vector machines, and stochastic gradient descent. Additionally, Optuna, a hyperparameter optimization framework, is employed to fine-tune the models for enhanced performance.
In this project, we explore different regression models, including linear regression, random forest regression, polynomial regression, support vector machines, and stochastic gradient descent, to predict prices. Optuna is used for hyperparameter tuning to optimize model performance.
-
Linear Regression
- Fast and interpretable.
- Assumes a linear relationship between independent variables and the target variable.
-
Random Forest Regressor
- Suitable for predicting continuous numeric values.
- Ensemble of decision trees.
-
Polynomial Regression
- Introduces polynomial terms to capture nonlinear relationships.
-
Support Vector Machines (SVM)
- Effective in capturing nonlinear relationships.
- Utilizes different kernels (e.g., RBF, polynomial, linear).
-
Stochastic Gradient Descent
- Linear model trained using stochastic gradient descent.
This project employs Optuna to fine-tune hyperparameters for each model, including Random Forest, XGBoost, LightGBM, and AdaBoost. The ensemble model, combining Random Forest, XGBoost, and LightGBM, is further optimized by adjusting the weights assigned to each model.
-
Random Forest Regression
- The Random Forest model's hyperparameters, such as the number of estimators, maximum depth, and minimum samples split, are optimized using Optuna to achieve the best possible performance.
-
XGBoost Regression
- Optuna is utilized to fine-tune hyperparameters like the booster type, maximum depth, learning rate, and the number of estimators for the XGBoost regression model.
-
LightGBM Regression
- Similar to XGBoost, the LightGBM model's hyperparameters, including the number of leaves, learning rate, feature fraction, and bagging fraction, are optimized using Optuna.
-
AdaBoost Regression
- Similar to XGBoost, the LightGBM model's hyperparameters, including the number of leaves, learning rate, feature fraction, and bagging fraction, are optimized using Optuna.
-
Ensemble Weight Optimization
- The ensemble model, combining Random Forest, XGBoost, and LightGBM, is further optimized by adjusting the weights assigned to each model. Optuna is employed to find the optimal ensemble weight configuration.
The README presents the best hyperparameters obtained for each model and provides an overall R-squared score for the final ensemble. Additionally, performance metrics such as Mean Squared Error, R-squared, Mean Absolute Error, and Root Mean Squared Error are visualized for easy comparison between models.