This project analyzes the Pinterest Fashion Dataset. It covers data cleaning, exploratory data analysis, and a retrieval-augmented generation system that recommends products based on demographic data.
The analysis is divided into two main parts:
- Data Analysis: Involves cleaning, preprocessing, and exploratory data analysis (EDA) of the fashion dataset.
- Retrieval-Augmented Generation (RAG): Development of a system that utilizes user queries and demographic data to recommend fashion items.
Before running the notebooks, ensure you have the following libraries installed:
The initial phase involves loading the dataset into a pandas DataFrame, followed by cleaning and preprocessing:
- Handling missing values by imputation or removal.
- Converting data formats, e.g., changing the price column to numeric.
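The cleaning steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the small `raw` DataFrame, its column names (`brand`, `category`, `price`, `rating`), and the choice of median imputation are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample standing in for the real dataset; column names are assumed.
raw = pd.DataFrame({
    "brand": ["Converse", "Burberry", None, "Nike"],
    "category": ["Shoes", "Sunglasses", "Shoes", "Shoes"],
    "price": ["$49.50", "87.50", "120", "n/a"],
    "rating": [5.0, 5.0, None, 4.0],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Convert price to numeric, drop rows without a brand, impute numeric gaps."""
    out = df.copy()
    # Strip currency symbols and thousands separators, then coerce to float.
    # Values that cannot be parsed (e.g. "n/a") become NaN.
    out["price"] = pd.to_numeric(
        out["price"].astype(str).str.replace(r"[$,]", "", regex=True),
        errors="coerce",
    )
    # Removal: a product without a brand cannot be recommended by name.
    out = out.dropna(subset=["brand"])
    # Imputation: fill remaining numeric gaps with the column median (assumed strategy).
    for col in ["price", "rating"]:
        out[col] = out[col].fillna(out[col].median())
    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Whether to impute or remove depends on the column: here rows are dropped when an identifying field is missing, and numeric gaps are filled so the rows remain usable in later analysis.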
The EDA examines the relationships between price, click-through rate, and rating. Histograms show how each variable is distributed, and scatter plots show how pairs of variables relate to each other.
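A histogram and two scatter plots of this kind could be produced roughly as shown below. The values and the column names (`price`, `click_through_rate`, `rating`) are hypothetical placeholders, not figures from the dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values for illustration only; column names are assumed.
df = pd.DataFrame({
    "price": [49.5, 87.5, 120.0, 35.0, 60.0],
    "click_through_rate": [0.12, 0.08, 0.05, 0.15, 0.10],
    "rating": [5, 5, 4, 3, 4],
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram: distribution of a single variable.
axes[0].hist(df["price"], bins=5)
axes[0].set(title="Price distribution", xlabel="price", ylabel="count")

# Scatter plots: pairwise relationships with price.
axes[1].scatter(df["price"], df["click_through_rate"])
axes[1].set(title="Price vs. click-through rate",
            xlabel="price", ylabel="click_through_rate")

axes[2].scatter(df["price"], df["rating"])
axes[2].set(title="Price vs. rating", xlabel="price", ylabel="rating")

fig.tight_layout()

# Pairwise correlations give a numeric summary to read alongside the plots.
correlations = df.corr()
print(correlations)
```

In a notebook the figure renders inline after the cell runs; in a script, `fig.savefig(...)` writes it to disk instead.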
Here are some of the key visualizations generated during the analysis:
This section creates a function to retrieve products based on user queries:
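One way such a retrieval function could look is sketched here. It assumes a cleaned DataFrame with hypothetical columns `brand`, `category`, `price`, and `rating`, and it uses plain case-insensitive substring matching rather than an embedding model, so it may differ from the approach taken in the notebook. The name `retrieve_products` and the `top_k` parameter are illustrative.

```python
import pandas as pd

# Hypothetical catalog; the real data would come from the cleaned dataset.
products = pd.DataFrame({
    "brand": ["Converse", "Burberry", "Nike", "Ray-Ban"],
    "category": ["Shoes", "Sunglasses", "Shoes", "Sunglasses"],
    "price": [49.5, 87.5, 120.0, 150.0],
    "rating": [5, 5, 4, 4],
})


def retrieve_products(query: str, df: pd.DataFrame, top_k: int = 3) -> pd.DataFrame:
    """Return up to top_k products whose category or brand contains the query.

    Results are ordered by rating (highest first), then price (lowest first).
    """
    q = query.strip().lower()
    mask = (
        df["category"].str.lower().str.contains(q, regex=False)
        | df["brand"].str.lower().str.contains(q, regex=False)
    )
    return (
        df[mask]
        .sort_values(["rating", "price"], ascending=[False, True])
        .head(top_k)
        .reset_index(drop=True)
    )


hits = retrieve_products("shoes", products)
print(hits)
```

Sorting by rating and then by price is a design choice for this sketch; a production retriever would more likely rank by a relevance score.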
A simple RAG system takes a user’s age, gender, and location as input and recommends relevant products. Candidate products come either from the retrieval model built for this project or directly from the existing dataset.
For a 35-year-old female in California interested in Shoes:
Based on your interests in Shoes and considering your location in California, we recommend Converse because it has a high rating of 5 stars and is priced at just $49.50, fitting well within your budget.
For a 55-year-old male in California interested in Sunglasses:
Based on your interests in Sunglasses and considering your location in California, we recommend Burberry because it has a high rating of 5 stars and is priced at just $87.50, fitting well within your budget.
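The step that turns a user profile into messages like the two above can be sketched as follows. This is an illustration under stated assumptions: `UserProfile`, `CATALOG`, and `recommend` are hypothetical names, the catalog entries are placeholders, and a fixed text template stands in for the generation step that a language model would normally perform in a RAG pipeline. Age and gender are carried on the profile but not used for ranking in this sketch.

```python
from dataclasses import dataclass


@dataclass
class UserProfile:
    age: int
    gender: str
    location: str
    interest: str


# Hypothetical catalog; real entries would come from the retrieval step.
CATALOG = [
    {"brand": "Converse", "category": "Shoes", "price": 49.50, "rating": 5},
    {"brand": "Nike", "category": "Shoes", "price": 120.00, "rating": 4},
    {"brand": "Burberry", "category": "Sunglasses", "price": 87.50, "rating": 5},
    {"brand": "Ray-Ban", "category": "Sunglasses", "price": 150.00, "rating": 4},
]


def recommend(user: UserProfile, catalog: list) -> str:
    """Pick the best-rated, then cheapest, product in the user's category."""
    candidates = [
        p for p in catalog if p["category"].lower() == user.interest.lower()
    ]
    if not candidates:
        return f"No products found for {user.interest}."
    # Highest rating wins; ties are broken by the lower price.
    best = max(candidates, key=lambda p: (p["rating"], -p["price"]))
    # Template stand-in for the generation step (assumption, not an LLM call).
    return (
        f"Based on your interests in {user.interest} and considering your "
        f"location in {user.location}, we recommend {best['brand']} because "
        f"it has a high rating of {best['rating']} stars and is priced at "
        f"just ${best['price']:.2f}, fitting well within your budget."
    )


message = recommend(UserProfile(35, "female", "California", "Shoes"), CATALOG)
print(message)
```

Replacing the template with a model call would mean passing the retrieved product and the user profile as context in the prompt, while the retrieval logic stays the same.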
This repository contains the data analysis notebooks and a recommendation system that suggests products based on user preferences and demographics. Together they show how data cleaning, EDA, and retrieval-augmented generation can be combined in a product recommendation workflow.