SVOE is a low-code declarative framework providing scalable and highly configurable pipelines for streaming and batch feature engineering, predictive model training, real-time inference and backtesting. Built on top of Ray, the framework lets you build and scale custom pipelines from a multi-core laptop to a cluster of thousands of nodes.
SVOE was originally built to accommodate a typical financial data research workflow (i.e. for Quant Researchers) with specific data models in mind (trades, quotes, order book updates, etc.), hence some examples are provided in this domain. However, the framework itself is domain-agnostic, and its components can easily be generalised and used in other fields that rely on real-time, time-series based data processing and simulation (anomaly detection, sales forecasting, etc.).
SVOE consists of three main components, each providing a set of tools for a typical Quant/ML engineer workflow:
- Featurizer helps define, calculate and store real-time/offline (batch) features. It uses a custom stream processing engine (Ray Actors + ZeroMQ) and a Kappa architecture to calculate offline features using online pipelines
- Trainer allows training predictive models in a distributed setting using popular ML libraries (XGBoost, PyTorch)
- Backtester is used to validate and test predictive models along with user-defined logic (e.g. trading strategies when used in the financial domain)
You can read more in the docs.
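The idea of calculating offline features using online pipelines can be illustrated with a minimal, framework-free sketch. It does not use SVOE, Ray or ZeroMQ, and `rolling_mean`, `live_source` and `historical_events` are hypothetical names invented for this illustration. The only point is that a single streaming function serves both a live feed and a replay of stored history, so both paths produce identical features.

```python
from collections import deque
from typing import Iterable, Iterator


def rolling_mean(events: Iterable[float], window: int = 3) -> Iterator[float]:
    """Streaming feature: mean of the last `window` values, emitted per event."""
    buf: deque = deque(maxlen=window)
    for value in events:
        buf.append(value)
        yield sum(buf) / len(buf)


def live_source() -> Iterator[float]:
    """Hypothetical stand-in for a real-time feed (e.g. a socket or a queue)."""
    yield from [10.0, 11.0, 12.0, 13.0]


# Hypothetical stored history holding the same events as the live feed.
historical_events = [10.0, 11.0, 12.0, 13.0]

# Online path: consume events one by one as they arrive.
online_features = list(rolling_mean(live_source()))

# Offline path: replay stored events through the same pipeline code.
offline_features = list(rolling_mean(historical_events))

print(online_features)  # [10.0, 10.5, 11.0, 12.0]
print(online_features == offline_features)  # True
```

Because there is only one implementation of the feature, a model trained on the replayed output sees the same values it would receive at inference time.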
- Easy to use standardized and flexible data and computation models - seamlessly switch between real-time and historical data for feature engineering, ML training and backtesting
- Low code, modularity and configurability - define reusable components such as FeatureDefinition, DataSourceDefinition, FeaturizerConfig, TrainerConfig, BacktesterConfig, etc. to easily run your experiments
- Avoid train-predict inconsistency - Featurizer uses the same feature definition for real-time inference and batch training
- No need for external data infra/DWH - Featurizer Storage lets you store and catalog computed features in any object storage while keeping an index in any SQL backend, and provides a Data Exploration API
- Ray integration - SVOE runs wherever Ray runs (everywhere!)
- MLFlow integration - store, retrieve and analyze your ML models with MLFlow API
- Cloud / Kubernetes ready - use KubeRay or native Ray on AWS to scale out your workloads in the cloud
- Easily integrates with orchestrators (Airflow, Luigi, Prefect) - SVOE provides basic Airflow Operators for each component to easily orchestrate your workflows
- Real-time inference without MLOps burden - no need to maintain model containerization pipelines, FastAPI services and model registries. Deploy with a simple Python API or YAML using InferenceLoop
- Designed for high volume low granularity data - as an example, when used in financial domain, unlike existing financial ML frameworks which use only OHLCV as a base data model, SVOE's Featurizer provides flexible tools to use and customize any data source (ticks, trades, book updates, etc.) and build streaming and historical features
- Minimized number of external dependencies - SVOE is built using Ray Core primitives and has no heavyweight external dependencies (stream processor, distributed computing engines, storages, etc.) which allows for easy deployment, maintenance and minimizes costly data transfers. The only dependency is an SQL database of user's choice. And it's all Python!
Install from PyPI. Be aware that SVOE requires Python 3.10+.
pip install svoe
For a local environment, launch the standalone setup on your laptop. This will start a local Ray cluster, create and populate an SQLite database, spin up an MLFlow tracking server and load sample data from a remote store (S3). Make sure you have all necessary dependencies present.
svoe standalone
For a distributed setting, please refer to Running on remote clusters.
For this example, we consider a scenario that often occurs in financial markets simulation. Note, however, that the framework is not limited to financial data and can be used with any scenario the user provides. Here is a simple three-step tutorial that builds a mid-price prediction model based on past price and volatility.
- Run Featurizer to construct mid-price and volatility features from partial order book updates, with a 5-second lookahead label as the prediction target, using data at 1-second granularity
  - Define featurizer-config.yaml
    See MidPriceFD and VolatilityStddevFD for implementation details.

        start_date: '2023-02-01 10:00:00'
        end_date: '2023-02-01 11:00:00'
        label_feature_index: 0
        label_lookahead: '5s'
        features_to_store: [0, 1]
        feature_configs:
          - feature_definition: price.mid_price_fd.MidPriceFD
            name: mid_price
            params:
              data_source: &id001
                - exchange: BINANCE
                  instrument_type: spot
                  symbol: BTC-USDT
              feature:
                sampling: 1s
          - feature_definition: volatility.volatility_stddev_fd.VolatilityStddevFD
            params:
              data_source: *id001
              feature:
                sampling: 1s
  - Run Featurizer
    - CLI: svoe featurizer run <path_to_config> --ray-address <addr> --parallelism <num-workers>
    - Python API: Featurizer.run(path=<path_to_config>, ray_address=<addr>, parallelism=<num_workers>)
  - Once the calculation is finished, load the sampled FeatureLabelSet dataframe to your local client
    - CLI: svoe featurizer get-data --every-n <every_nth_row>
    - Python API: Featurizer.get_materialized_data(pick_every_nth_row=<every_nth_row>)

                 timestamp  receipt_timestamp  label_mid_price-mid_price  mid_price-mid_price  feature_VolatilityStddevFD_62271b09-volatility
        0     1.675234e+09       1.675234e+09                  23084.800            23084.435                                        0.000547
        1     1.675234e+09       1.675234e+09                  23083.760            23084.355                                        0.040003
        2     1.675234e+09       1.675234e+09                  23083.505            23084.635                                        0.117757
        3     1.675234e+09       1.675234e+09                  23084.610            23085.020                                        0.257091
        4     1.675234e+09       1.675234e+09                  23084.725            23084.800                                        0.242034
        ...            ...                ...                        ...                  ...                                             ...
  - We can also visualize the results
    - CLI: svoe featurizer plot --every-n <every_nth_row>
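To make the first step more concrete, here is a small standalone sketch of what the resulting columns represent: a mid-price feature, a rolling standard-deviation volatility feature, and a label that is the mid-price 5 samples (5 seconds at 1s sampling) ahead. This is not the MidPriceFD or VolatilityStddevFD implementation. The quote values, the `WINDOW` size and all function names are assumptions made purely for illustration.

```python
from statistics import pstdev

# Hypothetical 1-second samples of (best_bid, best_ask); values are made up.
quotes = [(100.0 + i, 102.0 + i) for i in range(8)]

LOOKAHEAD = 5  # samples ahead, i.e. 5 seconds at 1s sampling
WINDOW = 3     # assumed rolling window for volatility, in samples


def mid_price(bid: float, ask: float) -> float:
    """Mid-price is the average of the best bid and the best ask."""
    return (bid + ask) / 2.0


def rolling_stddev(values: list, window: int) -> list:
    """Population standard deviation over the trailing `window` values."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(pstdev(chunk))
    return out


mids = [mid_price(bid, ask) for bid, ask in quotes]
vols = rolling_stddev(mids, WINDOW)

# The label for row i is the mid-price LOOKAHEAD samples later.
# Trailing rows that have no future value are dropped.
rows = [
    {"label_mid_price": mids[i + LOOKAHEAD], "mid_price": mids[i], "volatility": vols[i]}
    for i in range(len(mids) - LOOKAHEAD)
]

for row in rows:
    print(row)
```

With 8 samples and a lookahead of 5, only the first 3 rows have a label, which is why a feature-label set is shorter than the raw feature series.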
- Once we have our FeatureLabelSet calculated and loaded in cluster memory, let's use Trainer to train an XGBoost model to predict the mid-price 5 seconds ahead, validate the model, tune hyperparameters and pick the best model
  - Define config
        xgboost:
          params:
            tree_method: 'approx'
            objective: 'reg:linear'
            eval_metric: [ 'logloss', 'error' ]
          num_boost_rounds: 10
          train_valid_test_split: [0.5, 0.3]
        num_workers: 3
        tuner_config:
          param_space:
            params:
              max_depth:
                randint:
                  lower: 2
                  upper: 8
              min_child_weight:
                randint:
                  lower: 1
                  upper: 10
          num_samples: 8
          metric: 'train-logloss'
          mode: 'min'
          max_concurrent_trials: 3
  - Run Trainer
    - CLI: svoe trainer run --config-path <config-path> --ray-address <addr>
    - Python API:

          config = TrainerConfig.load_config(config_path)
          trainer_manager = TrainerManager(config=config, ray_address=ray_address)
          trainer_manager.run(trainer_run_id='sample-run-id', tags={})
  - Visualize predictions
    - CLI: svoe trainer predictions --model-uri <model-uri>
  - Select best model
    - CLI: svoe trainer best-model --metric-name valid-logloss --mode min
    - Python API:

          mlflow_client = SvoeMLFlowClient()
          best_model_uri = mlflow_client.get_best_checkpoint_uri(metric_name=metric_name, experiment_name=experiment_name, mode=mode)
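The `train_valid_test_split: [0.5, 0.3]` setting in the Trainer config is easier to reason about with a worked example. The sketch below assumes the two numbers are the train and validation fractions, with the remainder used for testing, and that the split is chronological (no shuffling), which is the usual choice for time series. Both points are assumptions rather than confirmed SVOE behaviour, and `chronological_split` is a hypothetical helper.

```python
def chronological_split(rows: list, fractions: list) -> tuple:
    """Split time-ordered rows into train/valid/test without shuffling.

    `fractions` is [train_fraction, valid_fraction]; the rest goes to test.
    """
    train_frac, valid_frac = fractions
    if train_frac <= 0 or valid_frac < 0 or train_frac + valid_frac > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    n = len(rows)
    train_end = round(n * train_frac)
    valid_end = train_end + round(n * valid_frac)
    return rows[:train_end], rows[train_end:valid_end], rows[valid_end:]


# Ten time-ordered samples, oldest first.
samples = list(range(10))
train, valid, test = chronological_split(samples, [0.5, 0.3])

print(len(train), len(valid), len(test))  # 5 3 2
print(test)  # [8, 9] - the most recent samples are held out for testing
```

Keeping the most recent rows for testing avoids leaking future information into training, which matters when the label itself is a lookahead value.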
- In this example, we use Backtester in the context of financial markets, hence our user-defined logic is based on the notion of a trading strategy. This can be extended to any other scenario the user wants to emulate. Once we have our best model, we can plug it into our BaseStrategy-derived class and run Backtester to simulate our scenario
  - Define config (see MLStrategy for an example implementation)
        featurizer_config_path: featurizer-config.yaml
        inference_config:
          model_uri: <your-best-model-uri>
          predictor_class_name: 'XGBoostPredictor'
          num_replicas: <number-of-predictor-replicas>
        simulation_class_name: 'backtester.strategy.ml_strategy.MLStrategy'
        simulation_params:
          buy_delta: 0
          sell_delta: 0
        user_defined_params:
          portfolio_config: <portfolio_config>
          tradable_instruments_params:
            - exchange: 'BINANCE'
              instrument_type: 'spot'
              symbol: 'BTC-USDT'
  - Run Backtester
    - CLI: svoe backtester run --config-path <config-path> --ray-address <addr> --num-workers <num-workers>
    - Python API:

          config = BacktesterConfig.load_config(config_path)
          backtester = Backtester.from_config(config)
          backtester.run_remotely(ray_address=ray_address, num_workers=num_workers)
  - Get stats with backtester.get_stats()
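As a rough illustration of how a prediction-driven strategy could use thresholds such as `buy_delta` and `sell_delta`, here is a toy, framework-free simulation. It is not the MLStrategy implementation: the decision rule, the all-in position sizing, the absence of fees and slippage, and the names `decide` and `simulate` are all assumptions made for this sketch.

```python
def decide(current_mid: float, predicted_mid: float,
           buy_delta: float = 0.0, sell_delta: float = 0.0) -> str:
    """Assumed rule: act only when the predicted move exceeds the threshold."""
    if predicted_mid - current_mid > buy_delta:
        return "BUY"
    if current_mid - predicted_mid > sell_delta:
        return "SELL"
    return "HOLD"


def simulate(mids: list, predictions: list, buy_delta: float = 0.0,
             sell_delta: float = 0.0, cash: float = 1000.0) -> float:
    """Replay prices and predictions; return the final portfolio value."""
    position = 0.0  # units of the instrument currently held
    for mid, predicted in zip(mids, predictions):
        signal = decide(mid, predicted, buy_delta, sell_delta)
        if signal == "BUY" and position == 0.0:
            position = cash / mid  # go all-in at the current mid-price
            cash = 0.0
        elif signal == "SELL" and position > 0.0:
            cash = position * mid  # close the whole position
            position = 0.0
    return cash + position * mids[-1]


# Made-up prices and model predictions for three consecutive steps.
mids = [100.0, 110.0, 105.0]
predictions = [110.0, 105.0, 105.0]

print(simulate(mids, predictions))  # 1100.0
```

Raising `buy_delta` or `sell_delta` makes the rule more conservative: with `buy_delta=20.0` the first prediction no longer triggers a trade and the portfolio stays at its initial value.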
We try to keep the docs as fresh and detailed as possible. Please leave your feedback if you have any questions.
SVOE is an open-source first project and we would love to get feedback and contributions from the community! The project is in a very early stage and is still a work in progress, so any help would be greatly appreciated! Please feel free to open GitHub issues with questions/bugs or PRs with contributions!