This project aims to create an AI-powered size chart generator tailored for apparel sellers, with a focus on integration with e-commerce platforms like Flipkart. The goal is to improve the accuracy of size recommendations, which can lead to higher customer satisfaction and reduced return rates. The solution leverages Agglomerative Clustering to categorize customers based on their body measurements and subsequently uses a Random Forest classifier to provide personalized size suggestions.
Note: The `GRiD SDE -> size-chart-prediction` directory contains the code files; for convenience, the entire contents of `size-chart-prediction` have also been uploaded to the main branch.
- Project Overview
- Project Structure
- Setup and Installation
- Running the Project
- Pipeline Overview
- Challenges with Dataset
- Testing
- License
The project is organized into several directories:
- `data/`: Contains raw and processed data files.
  - `processed/`: Includes the processed dataset for clustering and model training.
  - `raw/`: Holds the original raw dataset.
- `models/`: Stores the trained Random Forest model.
- `scripts/`: Includes various scripts for clustering, training, evaluating, and making predictions based on the model.
- `tests/`: Contains unit tests for data preprocessing, clustering, model training, and prediction scripts.
- `README.md`: The project documentation.
- `requirements.txt`: Lists the Python dependencies needed for the project.
- Python 3.7 or higher
- Pip (Python package installer)
- Clone the Repository: Clone the project repository to your local machine and navigate into the project directory.

  ```bash
  git clone <repository-url>
  cd size-chart-prediction
  ```
- Install Dependencies: Install the necessary Python packages using pip.

  ```bash
  pip install -r requirements.txt
  ```
Ensure that the `processed_data.csv` file is present in the `data/processed/` directory. This file should contain cleaned and preprocessed data ready for clustering.

Note: The `data_preprocessing.py` script automatically handles NaN values by imputing them with the most frequent value (mode) or with random valid entries from the same column, ensuring no data is lost due to missing values. The script then saves the cleaned dataset in a CSV file for subsequent steps.

```bash
python3 scripts/data_preprocessing.py
```
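The imputation described in the note above can be sketched as follows. This is a minimal illustration with hypothetical column names, not the actual `data_preprocessing.py` logic:

```python
import numpy as np
import pandas as pd

def impute_missing(df: pd.DataFrame, strategy: str = "mode") -> pd.DataFrame:
    """Fill NaNs per column, either with the column's mode or with
    random draws from that column's existing valid values."""
    rng = np.random.default_rng(42)
    out = df.copy()
    for col in out.columns:
        mask = out[col].isna()
        if not mask.any():
            continue
        if strategy == "mode":
            out.loc[mask, col] = out[col].mode().iloc[0]
        else:
            valid = out[col].dropna().to_numpy()
            out.loc[mask, col] = rng.choice(valid, size=mask.sum())
    return out

# Toy example: two measurement columns with gaps.
df = pd.DataFrame({"height": [170.0, np.nan, 165.0, 170.0],
                   "chest":  [90.0, 95.0, np.nan, 90.0]})
clean = impute_missing(df)
```

Either way, no rows are dropped, which matches the no-data-loss behavior described above.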
Run the clustering script to group the data based on body measurements. This will generate the `clustered_data.csv` file in the `data/processed/` directory.

```bash
python3 scripts/clustering.py
```
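To show the shape of this step, here is a minimal sketch of Agglomerative Clustering on a few body measurements. The feature names, linkage, and cluster count are illustrative assumptions; the real `clustering.py` may use different settings:

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Hypothetical measurement columns with three well-separated body types.
df = pd.DataFrame({
    "height": [160, 162, 175, 178, 190, 192],
    "chest":  [85, 86, 95, 97, 108, 110],
    "waist":  [70, 71, 80, 82, 95, 97],
})

# Standardize so no single measurement dominates the distance metric.
X = StandardScaler().fit_transform(df[["height", "chest", "waist"]])

clusterer = AgglomerativeClustering(n_clusters=3, linkage="ward")
df["cluster"] = clusterer.fit_predict(X)
```

The resulting `cluster` column plays the role of the size label that the classifier is later trained on.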
Run the `cluster_analysis.py` script to analyze the clusters and generate visualizations.

```bash
python3 scripts/cluster_analysis.py
```
Train the Random Forest model on the clustered data. The trained model will be saved in the `models/` directory.

```bash
python3 scripts/train_model.py
```
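A compact sketch of the training step, using synthetic data in place of `clustered_data.csv`; the hyperparameters and the model filename in the comment are assumptions, not the project's actual choices:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic "measurements" in three well-separated groups,
# standing in for the clustered dataset and its cluster labels.
X = rng.normal(size=(300, 3)) + np.repeat([[0, 0, 0], [4, 4, 4], [8, 8, 8]], 100, axis=0)
y = np.repeat([0, 1, 2], 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)

# To persist the model one could use joblib, e.g.:
# joblib.dump(model, "models/random_forest.joblib")  # hypothetical filename
```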
Evaluate the model's performance by running the evaluation script. This script will output metrics such as accuracy, precision, recall, and F1-score.

```bash
python3 scripts/evaluate_model.py
```
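The metrics named above can be computed with `sklearn.metrics`. A small sketch on hand-made labels, purely for illustration (not output from this project's scripts):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hand-made true and predicted cluster labels.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

acc = accuracy_score(y_true, y_pred)
# Macro averaging treats every cluster equally, regardless of its size.
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

Macro averaging is a sensible default here given that some clusters are much smaller than others.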
To predict the size cluster for a new set of user measurements, run the prediction script. Follow the prompts to input the measurements and receive a size recommendation.

```bash
python3 scripts/predict.py
```
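Conceptually, prediction maps a vector of measurements to a size label with the trained model. A self-contained sketch with a toy model trained in place; the real script loads the saved model and reads measurements from `input()` prompts, and the size labels and features here are invented:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy training data: (height, chest, waist) -> size label.
X = np.array([[160, 85, 70],
              [162, 86, 71],
              [178, 97, 82],
              [180, 98, 83]])
y = np.array(["S", "S", "L", "L"])
model = RandomForestClassifier(random_state=0).fit(X, y)

# In predict.py these values would come from the interactive prompts.
measurements = np.array([[161, 85, 70]])
size = model.predict(measurements)[0]
print(f"Recommended size: {size}")
```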
The pipeline consists of the following stages:
- Data Preprocessing: Clean the raw data, impute missing values, and save the processed dataset.
- Clustering: Load the processed data, perform Agglomerative Clustering on selected features, and save the clustered data.
- Cluster Analysis: Perform statistical analysis on the clustered data and generate visualizations to explore feature distributions.
- Model Training: Train a Random Forest classifier on the clustered data, evaluate its performance, and save the trained model.
- Model Evaluation: Evaluate the trained model on a test set and generate a performance report.
- Prediction: Use the trained model to predict the size cluster based on user measurements.
- Testing: Validate each component of the project through unit tests.
The dataset contained missing values, particularly in the height feature. Initial attempts to drop rows with missing values led to a significant loss of data. To address this, missing values were imputed with the most frequent value or randomly selected from existing valid entries in the column.
Some clusters had a disproportionately low number of data points, which could potentially affect the model's performance. Addressing this involved careful parameter tuning and possibly oversampling underrepresented clusters during model training.
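One simple way to oversample small clusters, should that be needed, is to resample them with replacement up to the size of the largest cluster. A sketch with toy data, not part of the project's actual scripts:

```python
import pandas as pd
from sklearn.utils import resample

# Toy clustered data: cluster 1 is underrepresented.
df = pd.DataFrame({"height": range(10), "cluster": [0] * 7 + [1] * 3})

# Upsample each minority cluster to the size of the largest one.
target = df["cluster"].value_counts().max()
parts = []
for label, group in df.groupby("cluster"):
    if len(group) < target:
        group = resample(group, replace=True, n_samples=target, random_state=0)
    parts.append(group)
balanced = pd.concat(parts, ignore_index=True)
```

Oversampling should be applied to the training split only, so that duplicated rows never leak into the evaluation set.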
The project includes unit tests to ensure the reliability and accuracy of each component:
- Data Preprocessing: Tested through the data preprocessing unit test script.
- Clustering Process: Verified to ensure correct cluster assignment.
- Model Training: The training process and model accuracy are validated through unit tests.
- Prediction: The functionality of the prediction script is tested to ensure accurate predictions.
To execute the tests, use the provided testing framework.
This project is licensed under the MIT License. See the LICENSE file for more details.
The `data/` directory contains all the data files used in the project.
The `processed/` directory within `data/` includes the cleaned and preprocessed dataset that is used for clustering and model training.
The `raw/` directory within `data/` holds the original raw dataset before any processing.
The `models/` directory stores the trained Random Forest model that has been generated after training.
The `scripts/` directory includes various Python scripts for different stages of the project:
- Clustering the data
- Training the machine learning model
- Evaluating the model's performance
- Making predictions based on input features
The `tests/` directory contains unit tests that verify the accuracy and reliability of each component of the project, including:
- Data preprocessing
- Clustering process
- Model training
- Prediction scripts
The `README.md` file provides documentation for the project, including setup instructions, project structure, and usage guidelines.
The `requirements.txt` file lists all the Python dependencies required to run the project, ensuring that all necessary packages are installed.