Binary classification of Sber Avtopodpiska website visitors' interactions for predefined target actions. This project contains a full pipeline for data preparation and model training, as well as model deployment as an API endpoint. It also deploys supporting services: a database that stores the initial data and training results (the initial data is not included in the repository, but it is part of the db image), a scalable API endpoint, a dashboard for visualizing training results, and services for collecting and visualizing database and endpoint performance metrics.
```
Sber-Avtopodpiska
├─ .env
├─ .gitignore
├─ .pre-commit-config.yaml
├─ assets
│  └─ services.svg
├─ data
│  ├─ grafana-storage
│  └─ ru_cities.csv
├─ db-init
│  ├─ 00-postgres-init.sh
│  ├─ 01-init.sql
│  └─ 02-init.sh
├─ dev
│  ├─ dashboard
│  │  ├─ app.py
│  │  ├─ AppData.py
│  │  ├─ assets
│  │  │  └─ style.css
│  │  ├─ callbacks.py
│  │  ├─ Config.py
│  │  ├─ IdHolder.py
│  │  ├─ layout.py
│  │  ├─ utils.py
│  │  └─ wsgi.py
│  └─ train
│     ├─ config.py
│     ├─ db.py
│     ├─ main.py
│     ├─ metrics.py
│     ├─ ModelWrapper.py
│     ├─ model_config.json
│     ├─ Objectives.py
│     ├─ query.sql
│     └─ train.py
├─ docker-compose.yaml
├─ Dockerfile.api
├─ Dockerfile.base-python
├─ Dockerfile.dashboard
├─ Dockerfile.db
├─ Dockerfile.ml
├─ local
│  ├─ api.py
│  ├─ main.py
│  ├─ ModelWrapper.py
│  ├─ model.pkl
│  ├─ model_config.json
│  └─ train.py
├─ notebooks
│  ├─ EDA.ipynb
│  ├─ Model Selection.ipynb
│  └─ Preprocessing.ipynb
├─ prod
│  └─ endpoint
│     ├─ api.py
│     ├─ Config.py
│     ├─ ModelWrapper.py
│     └─ train.py
├─ prometheus.yaml
├─ README.md
└─ wait-for-it.sh
```
The Docker setup requires approximately 15 GB of RAM to run all services simultaneously, or 5.8 GB of RAM to run db + dev-train (the most memory-consuming pair; the exact value depends heavily on the training settings: model, model parameters, and resampler).
Run the following command in the root directory of the project at least once to train the model and save it to the database:

```
docker-compose up db dev-train
```

Additionally, you can include the following services in the command:
- traefik
- adminer
- grafana
- postgres-exporter
- prometheus
After that, you can start the following services:
- dev-dashboard
- endpoint
Alternatively, go to the `local` directory and run the following command to initiate the training process:

```
python main.py
```

Consider putting the respective data files (`ga_hits.csv`, `ga_sessions.csv`) under the `data` directory beforehand.
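Before kicking off a long local run, it can help to confirm those files are actually in place. A minimal pre-flight sketch (the `missing_inputs` helper and `REQUIRED_INPUTS` names are hypothetical, not part of the repository):

```python
from pathlib import Path

# Hypothetical pre-flight check: the local training script expects the raw
# Sber Avtopodpiska exports under the data/ directory.
REQUIRED_INPUTS = ("ga_hits.csv", "ga_sessions.csv")

def missing_inputs(data_dir="data"):
    """Return the names of required CSV files absent from data_dir."""
    root = Path(data_dir)
    return [name for name in REQUIRED_INPUTS if not (root / name).is_file()]
```

If `missing_inputs()` returns an empty list, `python main.py` can proceed.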
You can run the API locally with the following command:

```
python -m uvicorn api:app --proxy-headers --host 127.0.0.1 --port 80
```
The API accepts the following requests:

- `GET /`, `/status` - return the endpoint status. If running in a container, additionally return the container name.
- `GET /score` - for the local endpoint only. Returns the ROC AUC score of the model.
- `POST /predict` - returns predictions for one or more items. Each item should contain `utm_*`, `device_*` and `geo_*` data. The accepted format is a dict with an `items` key that contains an array of dicts, where each dict represents one item. Example:
```json
{
  "items": [
    {
      "utm_source": false,
      "utm_medium": false,
      "utm_campaign": "isYoUwVPnRHJ",
      "utm_adcontent": "JNHcPlZPxEM",
      "utm_keyword": null,
      "device_category": "mobile",
      "device_os": null,
      "device_brand": "Nokia",
      "device_model": null,
      "device_screen_resolution": "412x823",
      "device_browser": "Chrome",
      "geo_country": "Russia",
      "geo_city": "Stavropol"
    }
  ]
}
```
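The same payload can be sent from Python; a minimal client sketch (the `predict` helper is an illustration, the `api.localhost` host comes from the Docker setup above, and the request only succeeds once the endpoint service is running):

```python
import json
import urllib.request

# Request body matching the schema shown above.
payload = {
    "items": [
        {
            "utm_source": False,
            "utm_medium": False,
            "utm_campaign": "isYoUwVPnRHJ",
            "utm_adcontent": "JNHcPlZPxEM",
            "utm_keyword": None,
            "device_category": "mobile",
            "device_os": None,
            "device_brand": "Nokia",
            "device_model": None,
            "device_screen_resolution": "412x823",
            "device_browser": "Chrome",
            "geo_country": "Russia",
            "geo_city": "Stavropol",
        }
    ]
}

def predict(body, base_url="http://api.localhost:80"):
    """POST the body to /predict and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{base_url}/predict",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

For the local uvicorn setup, swap `base_url` for `http://127.0.0.1:80`.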
- ML - service for training models, making predictions on test data and saving models and metrics to a database.
- Dev Dashboard - service for visualizing train results. Available at http://dev-dashboard.localhost:8050.
(Dashboard example)
- Endpoint - service for making predictions on new data. Available at http://api.localhost:80.
- Prometheus - service for collecting metrics from services. Available at http://prometheus.localhost:9090.
- Grafana - service for visualizing metrics from database and API. Available at http://grafana.localhost:3000.
- DB - service for storing data. Available at http://db.localhost:5432.
- Adminer - service for database management. Available at http://adminer.localhost:8090.
- Postgres-exporter - service for collecting metrics from a database.
- Traefik - service for routing requests to services and load balancing the API. In addition, it allows collecting metrics from the API. Available at http://traefik.localhost:8080.