This project aims to create an AI-powered size chart generator tailored for apparel sellers, with a focus on integration with e-commerce platforms like Flipkart. The goal is to improve the accuracy of size recommendations, which can lead to higher customer satisfaction and reduced return rates. The solution leverages Agglomerative Clustering to categorize customers based on their body measurements and subsequently uses a Random Forest classifier to provide personalized size suggestions.
Note: The `GRiD SDE -> size-chart-prediction` directory contains the code files; for convenience, the entire contents of `size-chart-prediction` have also been uploaded to the main branch.
- Project Overview
- Project Structure
- Setup and Installation
- Running the Project
- Pipeline Overview
- Challenges with Dataset
- Testing
- License
The project is organized into several directories:
- `data/`: Contains raw and processed data files.
  - `processed/`: Includes the processed dataset for clustering and model training.
  - `raw/`: Holds the original raw dataset.
- `models/`: Stores the trained Random Forest model.
- `scripts/`: Includes various scripts for clustering, training, evaluating, and making predictions based on the model.
- `tests/`: Contains unit tests for data preprocessing, clustering, model training, and prediction scripts.
- `README.md`: The project documentation.
- `requirements.txt`: Lists the Python dependencies needed for the project.
- Python 3.7 or higher
- Pip (Python package installer)
- Clone the Repository: Clone the project repository to your local machine and navigate into the project directory.

  ```bash
  git clone <repository-url>
  cd size-chart-prediction
  ```
- Install Dependencies: Install the necessary Python packages using pip.

  ```bash
  pip install -r requirements.txt
  ```
Ensure that the `processed_data.csv` file is present in the `data/processed/` directory. This file should contain cleaned and preprocessed data ready for clustering.

Note: The `data_preprocessing.py` script automatically handles NaN values by imputing them with the most frequent value (mode) or with random valid entries from the same column, ensuring no data is lost due to missing values. The script then saves the cleaned dataset in a CSV file for subsequent steps.

```bash
python3 scripts/data_preprocessing.py
```
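The imputation described in the note above can be sketched as follows. This is a minimal illustration with hypothetical column names, not the actual `data_preprocessing.py` logic:

```python
import numpy as np
import pandas as pd

def impute_missing(df: pd.DataFrame, strategy: str = "mode") -> pd.DataFrame:
    """Fill NaNs per column, either with the column's mode or with
    random draws from that column's existing valid values."""
    rng = np.random.default_rng(42)
    out = df.copy()
    for col in out.columns:
        mask = out[col].isna()
        if not mask.any():
            continue
        if strategy == "mode":
            out.loc[mask, col] = out[col].mode().iloc[0]
        else:
            valid = out[col].dropna().to_numpy()
            out.loc[mask, col] = rng.choice(valid, size=mask.sum())
    return out

# Toy example: two measurement columns with gaps.
df = pd.DataFrame({"height": [170.0, np.nan, 165.0, 170.0],
                   "chest":  [90.0, 95.0, np.nan, 90.0]})
clean = impute_missing(df)
```

Either way, no rows are dropped, which matches the no-data-loss behavior described above.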
Run the clustering script to group the data based on body measurements. This will generate the `clustered_data.csv` file in the `data/processed/` directory.

```bash
python3 scripts/clustering.py
```
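To show the shape of this step, here is a minimal sketch of Agglomerative Clustering on a few body measurements. The feature names, linkage, and cluster count are illustrative assumptions; the real `clustering.py` may use different settings:

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Hypothetical measurement columns with three well-separated body types.
df = pd.DataFrame({
    "height": [160, 162, 175, 178, 190, 192],
    "chest":  [85, 86, 95, 97, 108, 110],
    "waist":  [70, 71, 80, 82, 95, 97],
})

# Standardize so no single measurement dominates the distance metric.
X = StandardScaler().fit_transform(df[["height", "chest", "waist"]])

clusterer = AgglomerativeClustering(n_clusters=3, linkage="ward")
df["cluster"] = clusterer.fit_predict(X)
```

The resulting `cluster` column plays the role of the size label that the classifier is later trained on.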
Run the `cluster_analysis.py` script to analyze the clusters and generate visualizations.

```bash
python3 scripts/cluster_analysis.py
```
Train the Random Forest model on the clustered data. The trained model will be saved in the `models/` directory.

```bash
python3 scripts/train_model.py
```
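A compact sketch of the training step, using synthetic data in place of `clustered_data.csv`; the hyperparameters and the model filename in the comment are assumptions, not the project's actual choices:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic "measurements" in three well-separated groups,
# standing in for the clustered dataset and its cluster labels.
X = rng.normal(size=(300, 3)) + np.repeat([[0, 0, 0], [4, 4, 4], [8, 8, 8]], 100, axis=0)
y = np.repeat([0, 1, 2], 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)

# To persist the model one could use joblib, e.g.:
# joblib.dump(model, "models/random_forest.joblib")  # hypothetical filename
```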
Evaluate the model's performance by running the evaluation script. This script will output metrics such as accuracy, precision, recall, and F1-score.

```bash
python3 scripts/evaluate_model.py
```
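The metrics named above can be computed with `sklearn.metrics`. A small sketch on hand-made labels, purely for illustration (not output from this project's scripts):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hand-made true and predicted cluster labels.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

acc = accuracy_score(y_true, y_pred)
# Macro averaging treats every cluster equally, regardless of its size.
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

Macro averaging is a sensible default here given that some clusters are much smaller than others.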
To predict the size cluster for a new set of user measurements, run the prediction script. Follow the prompts to input the measurements and receive a size recommendation.

```bash
python3 scripts/predict.py
```
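Conceptually, prediction maps a vector of measurements to a size label with the trained model. A self-contained sketch with a toy model trained in place; the real script loads the saved model and reads measurements from `input()` prompts, and the size labels and features here are invented:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy training data: (height, chest, waist) -> size label.
X = np.array([[160, 85, 70],
              [162, 86, 71],
              [178, 97, 82],
              [180, 98, 83]])
y = np.array(["S", "S", "L", "L"])
model = RandomForestClassifier(random_state=0).fit(X, y)

# In predict.py these values would come from the interactive prompts.
measurements = np.array([[161, 85, 70]])
size = model.predict(measurements)[0]
print(f"Recommended size: {size}")
```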
The pipeline consists of the following stages:
- Data Preprocessing: Clean the raw data, impute missing values, and save the processed dataset.
- Clustering: Load the processed data, perform Agglomerative Clustering on selected features, and save the clustered data.
- Cluster Analysis: Perform statistical analysis on the clustered data and generate visualizations to explore feature distributions.
- Model Training: Train a Random Forest classifier on the clustered data, evaluate its performance, and save the trained model.
- Model Evaluation: Evaluate the trained model on a test set and generate a performance report.
- Prediction: Use the trained model to predict the size cluster based on user measurements.
- Testing: Validate each component of the project through unit tests.
The dataset contained missing values, particularly in the height feature. Initial attempts to drop rows with missing values led to a significant loss of data. To address this, missing values were imputed with the most frequent value or randomly selected from existing valid entries in the column.
Some clusters had a disproportionately low number of data points, which could potentially affect the model's performance. Addressing this involved careful parameter tuning and possibly oversampling underrepresented clusters during model training.
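One simple way to oversample small clusters, should that be needed, is to resample them with replacement up to the size of the largest cluster. A sketch with toy data, not part of the project's actual scripts:

```python
import pandas as pd
from sklearn.utils import resample

# Toy clustered data: cluster 1 is underrepresented.
df = pd.DataFrame({"height": range(10), "cluster": [0] * 7 + [1] * 3})

# Upsample each minority cluster to the size of the largest one.
target = df["cluster"].value_counts().max()
parts = []
for label, group in df.groupby("cluster"):
    if len(group) < target:
        group = resample(group, replace=True, n_samples=target, random_state=0)
    parts.append(group)
balanced = pd.concat(parts, ignore_index=True)
```

Oversampling should be applied to the training split only, so that duplicated rows never leak into the evaluation set.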
The project includes unit tests to ensure the reliability and accuracy of each component:
- Data Preprocessing: Tested through the data preprocessing unit test script.
- Clustering Process: Verified to ensure correct cluster assignment.
- Model Training: The training process and model accuracy are validated through unit tests.
- Prediction: The functionality of the prediction script is tested to ensure accurate predictions.
To execute the tests, use the provided testing framework.
This project is licensed under the MIT License. See the LICENSE file for more details.
The `data/` directory contains all the data files used in the project.
The `processed/` directory within `data/` includes the cleaned and preprocessed dataset that is used for clustering and model training.
The `raw/` directory within `data/` holds the original raw dataset before any processing.
The `models/` directory stores the trained Random Forest model that has been generated after training.
The `scripts/` directory includes various Python scripts for different stages of the project:
- Clustering the data
- Training the machine learning model
- Evaluating the model's performance
- Making predictions based on input features
The `tests/` directory contains unit tests that verify the accuracy and reliability of each component of the project, including:
- Data preprocessing
- Clustering process
- Model training
- Prediction scripts
The `README.md` file provides documentation for the project, including setup instructions, project structure, and usage guidelines.
The `requirements.txt` file lists all the Python dependencies required to run the project, ensuring that all necessary packages are installed.