Graildient Descent is a machine learning project focused on predicting the sold prices of items listed on Grailed, an online marketplace for high-end, pre-owned, and limited edition fashion. The project aims to build a comprehensive pipeline that includes data collection, preprocessing, model training, and deployment.
You can explore the project’s interactive results on the Streamlit app.
This project works with multimodal data:
- Tabular Features: Item attributes like brand, category, and size
- Text Features: Descriptions, titles of items, and hashtags
- Images: Collected cover images of items (modeling with images is planned future work)
The project is organized into the following main components (some of which are still under development):
- graildient_descent/: Contains the core machine learning pipeline, including:
  - experiment.py: Script to run machine learning experiments
  - model.py: Model definition and related logic
  - preprocessing.py: Data preprocessing steps and utilities
  - feature_extraction.py: Text feature extraction utilities
  - utils.py: Helper functions used across the project
- data_collection/: Contains the scraping scripts and utilities to collect and clean data from Grailed
- airflow/: Contains the ETL pipeline scripts for Apache Airflow
- sweeps/: Contains Weights & Biases (wandb) sweep configurations for ML experiments
- streamlit_app/: A Streamlit application to showcase the project with pages for EDA, data collection, and predictions
- api/: FastAPI application for real-time price predictions, including:
  - models.py: Pydantic models and enums for data validation
  - config.py: Configuration settings and constants
  - services.py: Business logic for predictions
  - utils.py: Helper functions
  - routes.py: API endpoint definitions
  - main.py: Application entry point
- tests/: Contains unit tests for various project components
This project implements automated testing using GitHub Actions. The CI pipeline runs on all pull requests to the main branch and includes a comprehensive test suite covering:
- Data collection and web scraping functionality
- Data preprocessing and feature engineering pipeline
- Model training and evaluation components
- FastAPI application endpoints and services
- Utility functions and helpers
Code quality is maintained through pre-commit hooks that run locally at commit time.
- Python 3.11: Ensure you have Python 3.11 installed.
- Poetry: The project uses Poetry for dependency management. If you haven't installed Poetry, you can do so by following the Poetry installation guide.
To set up the project, follow these steps:
- Clone the repository:

  ```bash
  git clone https://github.com/kirill-rubashevskiy/graildient-descent.git
  cd graildient-descent
  ```
- Install the dependencies (and, for contributors, set up the pre-commit hooks):

  For end users:

  ```bash
  poetry install --without dev
  ```

  For contributors:

  ```bash
  poetry install --with dev
  poetry run pre-commit install
  ```
The GrailedScraper is a Python class designed to scrape sold item listings from Grailed. It collects details such as item names, descriptions, details, sold prices, and images, which are then used for further processing and analysis.

Ensure your environment variables for Grailed credentials are set up, or pass them directly when initializing the GrailedScraper.
```python
from data_collection.scraper import GrailedScraper

scraper = GrailedScraper(email='grailed_email', password='grailed_password')
listings_data, cover_imgs, errors = scraper.scrape()
```
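scrape() returns the parsed listings, their cover images, and any per-listing errors. A typical next step is to assemble the listings into a DataFrame for cleaning and EDA (a sketch, assuming listings_data is a list of per-item dicts; the exact column set depends on the scraper's output schema):

```python
import pandas as pd

# Assemble scraped listings into a DataFrame for cleaning and EDA.
# The exact column set depends on GrailedScraper's output schema.
df = pd.DataFrame(listings_data)
print(f"Scraped {len(df)} listings; {len(errors)} listings failed")
```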
Ensure you comply with Grailed’s Terms of Service when scraping data.
The ETL (Extract, Transform, Load) pipeline is designed to collect, process, and manage data from the Grailed website. The pipeline is implemented using Apache Airflow and performs the following tasks:

- Extract: Collect data from Grailed using the GrailedScraper
- Transform: Process and clean the collected data
- Load: Load the cleaned data into the target data storage
Refer to the Airflow documentation for more details on managing and configuring DAGs.
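For orientation, the sketch below shows what a TaskFlow-style DAG wiring these three steps together could look like. This is illustrative only: the actual DAG, task names, cleaning logic, and storage target live in the airflow/ directory.

```python
# Illustrative sketch of the ETL DAG structure; see airflow/ for the
# project's actual implementation.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def grailed_etl_dag():
    @task
    def extract() -> list[dict]:
        # Scrape sold listings with the project's GrailedScraper.
        from data_collection.scraper import GrailedScraper

        scraper = GrailedScraper(email="...", password="...")
        listings_data, _cover_imgs, _errors = scraper.scrape()
        return listings_data

    @task
    def transform(listings: list[dict]) -> list[dict]:
        # Placeholder cleaning step; the real logic is project-specific.
        return [item for item in listings if item]

    @task
    def load(cleaned: list[dict]) -> None:
        # Placeholder load step; the target storage is project-specific.
        print(f"Loaded {len(cleaned)} cleaned listings")

    load(transform(extract()))


grailed_etl_dag()
```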
- Install Dependencies:

  Ensure that you have all necessary dependencies installed. Run the following command to install the required Python packages via Poetry:

  ```bash
  poetry install
  ```

- Install Apache Airflow:

  Apache Airflow must be installed using pip, because Airflow does not support installation via Poetry. Install it with:

  ```bash
  pip install apache-airflow
  ```

- Configure Airflow:

  Airflow requires proper configuration. Set up your Airflow environment by initializing the metadata database:

  ```bash
  airflow db init
  ```

- Set Up Airflow Variables:

  Define any required Airflow variables (e.g., connection strings, paths) using the Airflow UI or command line.

- Start Airflow:

  Start the web server and scheduler:

  ```bash
  airflow webserver
  airflow scheduler
  ```

- Trigger the DAG:

  ```bash
  airflow dags trigger grailed_etl_dag
  ```
You can run machine learning experiments either as single runs or as multiple runs using wandb sweeps.
To run a single experiment with custom arguments, use the following command:

```bash
python3 -m fire graildient_descent/experiment.py run_experiment --arg1 value1 --arg2 value2
```

Replace --arg1, --arg2, etc., with actual arguments and their values specific to the experiment configuration.
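Since the CLI is simply python-fire exposing the module's functions, the programmatic equivalent is a direct call (the argument names below are placeholders; run_experiment's real signature is defined in graildient_descent/experiment.py):

```python
# Placeholder arguments; see graildient_descent/experiment.py for the
# actual parameters run_experiment accepts.
from graildient_descent.experiment import run_experiment

run_experiment(arg1="value1", arg2="value2")
```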
For running multiple experiments using Weights & Biases (wandb) sweeps, configure your sweep in the wandb sweep configuration file, then initiate the sweep:

- Create a Sweep:

  First, define your sweep configuration (e.g., in sweeps/config.yaml).

- Run the Sweep:

  Start the sweep using:

  ```bash
  wandb sweep sweeps/config.yaml
  ```

- Launch Agents:

  After starting the sweep, you can launch multiple agents to run the experiments:

  ```bash
  PYTHONPATH=. wandb agent <sweep_id>
  ```
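If you prefer to stay in Python, the same workflow is available through wandb's programmatic API. The configuration below is illustrative only; the project's actual search space lives in sweeps/config.yaml:

```python
import wandb

# Illustrative sweep definition; the project's real search space lives in
# sweeps/config.yaml.
sweep_config = {
    "method": "bayes",
    "metric": {"name": "rmsle", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"min": 0.01, "max": 0.3},
        "depth": {"values": [4, 6, 8]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="graildient-descent")
# Equivalent to `PYTHONPATH=. wandb agent <sweep_id>` on the CLI:
# wandb.agent(sweep_id, function=train_fn)  # train_fn: your training entry point
```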
The ML experiments achieved significant improvements over the baseline model:
- Best-performing model: CatBoost with combined tabular and text features
- Final RMSLE: 0.64 (37.1% improvement over baseline)
- Key improvements came from:
  - Using a CatBoost model (31% improvement over baseline)
  - Combining tabular and text features (7% improvement over the tabular-only model)
For detailed experiment setup and results, visit the ML Experiments section in the Streamlit App.
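For reference, RMSLE (root mean squared logarithmic error) measures relative error on a log scale, which suits prices that span several orders of magnitude. A minimal NumPy implementation of the metric:

```python
import numpy as np

def rmsle(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared logarithmic error."""
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Example: predictions off by roughly 20-30% on each item
print(rmsle(np.array([100, 250, 900]), np.array([130, 200, 700])))
```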
The Streamlit App allows users to explore the project's data analysis and prediction results interactively. It includes the following pages:
- Intro: Overview of the project and personal insights into Grailed
- Data Collection: Describes the scraping process, collected data, and the building of the ETL pipeline (work in progress)
- EDA: Explores the data through visualizations and statistics. Currently, the EDA page includes:
  - Numerical Features: Sold price and photo count analysis
  - Categorical Features: Department, category, designer, size, and more
  - Text Features: Item name, description, and hashtags
  - Images: Planned for a future stage
- ML Experiments: Details the machine learning experiment setup, methodology, and results
- Price Predictor: Interactive form to get price predictions for Grailed listings
The app is deployed on Streamlit Community Cloud, and you can explore it here.
The FastAPI Service provides real-time price predictions for Grailed listings through a RESTful API. The service is integrated with the Streamlit frontend.
- /api/v1/predictions/url: Predict price for an existing Grailed listing using its URL
- /api/v1/predictions/form: Predict price for a new listing based on provided features
- /api/v1/docs/options: Get valid options for all categorical fields
- /api/v1/models/info: Get information about the currently deployed model
- /api/health: Health check endpoint
- Set up environment variables:

  ```bash
  export PYTHONPATH=. \
    AWS_ACCESS_KEY_ID=your_aws_access_key_id \
    AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key \
    S3_MODEL_PATH=your_model_path
  ```

- Start the service:

  ```bash
  uvicorn api.main:app --host 0.0.0.0 --port 8000
  ```
- Access the API documentation:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
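Once the service is running, you can smoke-test it from Python with requests. The endpoint paths come from the list above, while the request payload shape is an assumption; check the Swagger UI for the exact schema:

```python
import requests

BASE_URL = "http://localhost:8000"

# Health check
print(requests.get(f"{BASE_URL}/api/health").json())

# Price prediction for an existing listing by URL. The payload field
# name ("url") is an assumption; consult /docs for the actual schema.
resp = requests.post(
    f"{BASE_URL}/api/v1/predictions/url",
    json={"url": "https://www.grailed.com/listings/..."},
)
print(resp.status_code, resp.json())
```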
The project has made significant progress:

- Data Collection Pipeline:
  - Implemented web scraper for Grailed sold listings
  - Developed and integrated ETL pipeline using Apache Airflow
  - Built data cleaning and processing workflow
- Data Analysis & Modeling:
  - Completed extensive EDA for tabular and text features
  - Conducted ML experiments, achieving 37.1% improvement over baseline
  - Documented experiment methodology and results
- Deployment:
  - Deployed Streamlit app showcasing:
    - Data collection process
    - EDA visualizations and insights
    - ML experiment results and methodology
    - Interactive price prediction interface
  - Implemented FastAPI service for real-time predictions
  - Integrated frontend and backend for seamless price predictions
  - Added prediction history tracking
Planned next steps:

- Enhance prediction interface with:
  - Price range estimates
  - More detailed prediction explanations
- Complete image feature analysis and integration
- Explore deep learning approaches for potential improvements
Contributions are welcome! Please feel free to submit a pull request or open an issue if you have any suggestions or improvements.
This project is licensed under the MIT License - see the LICENSE file for details.