Graildient Descent is a machine learning project focused on predicting the sold prices of items listed on Grailed, an online marketplace for high-end, pre-owned, and limited edition fashion. The project aims to build a comprehensive pipeline that includes data collection, preprocessing, model training, and deployment.
You can explore the project’s interactive results on the Streamlit app.
This project works with multimodal data:
- Tabular Features: Item attributes like brand, category, and size
- Text Features: Descriptions, titles of items, and hashtags
- Images: Collected cover images of items (modeling with images is planned future work)
The project is organized into the following main components (some of which are still under development):
- graildient_descent/: Contains the core machine learning pipeline, including:
  - experiment.py: Script to run machine learning experiments
  - model.py: Model definition and related logic
  - preprocessing.py: Data preprocessing steps and utilities
  - feature_extraction.py: Text feature extraction utilities
  - utils.py: Helper functions used across the project
- data_collection/: Contains the scraping scripts and utilities to collect and clean data from Grailed
- airflow/: Contains the ETL pipeline scripts for Apache Airflow
- sweeps/: Contains Weights & Biases (wandb) sweep configurations for ML experiments
- streamlit_app/: A Streamlit application to showcase the project with pages for EDA, data collection, and predictions
- api/: FastAPI application for real-time price predictions, including:
  - models.py: Pydantic models and enums for data validation
  - config.py: Configuration settings and constants
  - services.py: Business logic for predictions
  - utils.py: Helper functions
  - routes.py: API endpoint definitions
  - main.py: Application entry point
- tests/: Contains unit tests for various project components
This project implements automated testing using GitHub Actions. The CI pipeline runs on all pull requests to the main branch and includes a comprehensive test suite covering:
- Data collection and web scraping functionality
- Data preprocessing and feature engineering pipeline
- Model training and evaluation components
- FastAPI application endpoints and services
- Utility functions and helpers
Code quality is maintained through pre-commit hooks that run locally at commit time.
- Python 3.11: Ensure you have Python 3.11 installed.
- Poetry: The project uses Poetry for dependency management. If you haven't installed Poetry, you can do so by following the Poetry installation guide.
To set up the project, follow these steps:
- Clone the repository:

  ```bash
  git clone https://github.com/kirill-rubashevskiy/graildient-descent.git
  cd graildient-descent
  ```
- Install the dependencies (and, for contributors, set up the pre-commit hooks):

  For end users:

  ```bash
  poetry install --without dev
  ```

  For contributors:

  ```bash
  poetry install --with dev
  poetry run pre-commit install
  ```
The GrailedScraper is a Python class designed to scrape sold item listings from Grailed. It collects details such as item names, descriptions, details, sold prices, and images, which are then used for further processing and analysis.

Ensure your environment variables for Grailed credentials are set up, or pass them directly when initializing the GrailedScraper.
```python
from data_collection.scraper import GrailedScraper

scraper = GrailedScraper(email='grailed_email', password='grailed_password')
listings_data, cover_imgs, errors = scraper.scrape()
```
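scrape() returns the parsed listings, their cover images, and any per-listing errors. A typical next step is to assemble the listings into a DataFrame for cleaning and EDA (a sketch, assuming listings_data is a list of per-item dicts; the exact column set depends on the scraper's output schema):

```python
import pandas as pd

# Assemble scraped listings into a DataFrame for cleaning and EDA.
# The exact column set depends on GrailedScraper's output schema.
df = pd.DataFrame(listings_data)
print(f"Scraped {len(df)} listings; {len(errors)} listings failed")
```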
Ensure you comply with Grailed’s Terms of Service when scraping data.
The ETL (Extract, Transform, Load) pipeline is designed to collect, process, and manage data from the Grailed website. The pipeline is implemented using Apache Airflow and performs the following tasks:

- Extract: Collect data from Grailed using the GrailedScraper
- Transform: Process and clean the collected data
- Load: Load the cleaned data into the target data storage
Refer to the Airflow documentation for more details on managing and configuring DAGs.
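For orientation, the sketch below shows what a TaskFlow-style DAG wiring these three steps together could look like. This is illustrative only: the actual DAG, task names, cleaning logic, and storage target live in the airflow/ directory.

```python
# Illustrative sketch of the ETL DAG structure; see airflow/ for the
# project's actual implementation.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def grailed_etl_dag():
    @task
    def extract() -> list[dict]:
        # Scrape sold listings with the project's GrailedScraper.
        from data_collection.scraper import GrailedScraper

        scraper = GrailedScraper(email="...", password="...")
        listings_data, _cover_imgs, _errors = scraper.scrape()
        return listings_data

    @task
    def transform(listings: list[dict]) -> list[dict]:
        # Placeholder cleaning step; the real logic is project-specific.
        return [item for item in listings if item]

    @task
    def load(cleaned: list[dict]) -> None:
        # Placeholder load step; the target storage is project-specific.
        print(f"Loaded {len(cleaned)} cleaned listings")

    load(transform(extract()))


grailed_etl_dag()
```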
- Install Dependencies:

  Ensure that you have all necessary dependencies installed. Run the following command to install the required Python packages via Poetry:

  ```bash
  poetry install
  ```

- Install Apache Airflow:

  Apache Airflow must be installed using pip, because Airflow does not support installation via Poetry. Install it with:

  ```bash
  pip install apache-airflow
  ```

- Configure Airflow:

  Airflow requires proper configuration. Set up your Airflow environment by initializing the metadata database:

  ```bash
  airflow db init
  ```

- Set Up Airflow Variables:

  Define any required Airflow variables (e.g., connection strings, paths) using the Airflow UI or command line.

- Start Airflow:

  Start the web server and scheduler:

  ```bash
  airflow webserver
  airflow scheduler
  ```

- Trigger the DAG:

  ```bash
  airflow dags trigger grailed_etl_dag
  ```
You can run machine learning experiments either as single runs or as multiple runs using wandb sweeps.
To run a single experiment with custom arguments, use the following command:

```bash
python3 -m fire graildient_descent/experiment.py run_experiment --arg1 value1 --arg2 value2
```

Replace --arg1, --arg2, etc., with actual arguments and their values specific to the experiment configuration.
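Since the CLI is simply python-fire exposing the module's functions, the programmatic equivalent is a direct call (the argument names below are placeholders; run_experiment's real signature is defined in graildient_descent/experiment.py):

```python
# Placeholder arguments; see graildient_descent/experiment.py for the
# actual parameters run_experiment accepts.
from graildient_descent.experiment import run_experiment

run_experiment(arg1="value1", arg2="value2")
```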
For running multiple experiments using Weights & Biases (wandb) sweeps, configure your sweep in the wandb sweep configuration file, then initiate the sweep:

- Create a Sweep:

  First, define your sweep configuration (e.g., in sweeps/config.yaml).

- Run the Sweep:

  Start the sweep using:

  ```bash
  wandb sweep sweeps/config.yaml
  ```

- Launch Agents:

  After starting the sweep, you can launch multiple agents to run the experiments:

  ```bash
  PYTHONPATH=. wandb agent <sweep_id>
  ```
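If you prefer to stay in Python, the same workflow is available through wandb's programmatic API. The configuration below is illustrative only; the project's actual search space lives in sweeps/config.yaml:

```python
import wandb

# Illustrative sweep definition; the project's real search space lives in
# sweeps/config.yaml.
sweep_config = {
    "method": "bayes",
    "metric": {"name": "rmsle", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"min": 0.01, "max": 0.3},
        "depth": {"values": [4, 6, 8]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="graildient-descent")
# Equivalent to `PYTHONPATH=. wandb agent <sweep_id>` on the CLI:
# wandb.agent(sweep_id, function=train_fn)  # train_fn: your training entry point
```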
The ML experiments achieved significant improvements over the baseline model:
- Best-performing model: CatBoost with combined tabular and text features
- Final RMSLE: 0.64 (37.1% improvement over baseline)
- Key improvements came from:
  - Using a CatBoost model (31% improvement over baseline)
  - Combining tabular and text features (7% improvement over the tabular-only model)
For detailed experiment setup and results, visit the ML Experiments section in the Streamlit App.
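For reference, RMSLE (root mean squared logarithmic error) measures relative error on a log scale, which suits prices that span several orders of magnitude. A minimal NumPy implementation of the metric:

```python
import numpy as np

def rmsle(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared logarithmic error."""
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Example: predictions off by roughly 20-30% on each item
print(rmsle(np.array([100, 250, 900]), np.array([130, 200, 700])))
```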
The Streamlit App allows users to explore the project's data analysis and prediction results interactively. It includes the following pages:
- Intro: Overview of the project and personal insights into Grailed
- Data Collection: Describes the scraping process, collected data, and the building of the ETL pipeline (work in progress)
- EDA: Explores the data through visualizations and statistics. Currently, the EDA page includes:
  - Numerical Features: Sold price and photo count analysis
  - Categorical Features: Department, category, designer, size, and more
  - Text Features: Item name, description, and hashtags
  - Images: Planned for a future stage
- ML Experiments: Details the machine learning experiment setup, methodology, and results
- Price Predictor: Interactive form to get price predictions for Grailed listings
The app is deployed on Streamlit Community Cloud, and you can explore it here.
The FastAPI Service provides real-time price predictions for Grailed listings through a RESTful API. The service is integrated with the Streamlit frontend.
- /api/v1/predictions/url: Predict price for an existing Grailed listing using its URL
- /api/v1/predictions/form: Predict price for a new listing based on provided features
- /api/v1/docs/options: Get valid options for all categorical fields
- /api/v1/models/info: Get information about the currently deployed model
- /api/health: Health check endpoint
- Set up environment variables:

  ```bash
  export PYTHONPATH=. \
    AWS_ACCESS_KEY_ID=your_aws_access_key_id \
    AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key \
    S3_MODEL_PATH=your_model_path
  ```

- Start the service:

  ```bash
  uvicorn api.main:app --host 0.0.0.0 --port 8000
  ```
- Access the API documentation:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
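Once the service is running, you can smoke-test it from Python with requests. The endpoint paths come from the list above, while the request payload shape is an assumption; check the Swagger UI for the exact schema:

```python
import requests

BASE_URL = "http://localhost:8000"

# Health check
print(requests.get(f"{BASE_URL}/api/health").json())

# Price prediction for an existing listing by URL. The payload field
# name ("url") is an assumption; consult /docs for the actual schema.
resp = requests.post(
    f"{BASE_URL}/api/v1/predictions/url",
    json={"url": "https://www.grailed.com/listings/..."},
)
print(resp.status_code, resp.json())
```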
The project has made significant progress:

- Data Collection Pipeline:
  - Implemented web scraper for Grailed sold listings
  - Developed and integrated ETL pipeline using Apache Airflow
  - Built data cleaning and processing workflow
- Data Analysis & Modeling:
  - Completed extensive EDA for tabular and text features
  - Conducted ML experiments, achieving 37.1% improvement over baseline
  - Documented experiment methodology and results
- Deployment:
  - Deployed Streamlit app showcasing:
    - Data collection process
    - EDA visualizations and insights
    - ML experiment results and methodology
    - Interactive price prediction interface
  - Implemented FastAPI service for real-time predictions
  - Integrated frontend and backend for seamless price predictions
  - Added prediction history tracking
Planned next steps:

- Enhance prediction interface with:
  - Price range estimates
  - More detailed prediction explanations
- Complete image feature analysis and integration
- Explore deep learning approaches for potential improvements
Contributions are welcome! Please feel free to submit a pull request or open an issue if you have any suggestions or improvements.
This project is licensed under the MIT License - see the LICENSE file for details.