STAD-FEBTE: Supervised Time Series Anomaly Detection by Feature Engineering, Balancing, and Tree-based Ensembles
- STAD-FEBTE is a supervised framework for time series anomaly detection (AD) that combines automatic feature engineering with tree-based ensembles.
- Converting the time series dataset into its tabular counterpart allows generating synthetic anomalies and tackling class imbalance, which is common in AD datasets (see the sketch below).
- The framework can handle multivariate time series data collected at different sampling frequencies.
- The framework allows augmenting the time series data with categorical features within a single data structure.
- The framework is process-independent, but it is benchmarked on two robotized screwing datasets.
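As an illustration of the balancing step: once the series are tabular, rare anomaly classes can be oversampled with off-the-shelf tools. The snippet below is only a minimal sketch using imbalanced-learn's SMOTE on dummy features; the balancing and synthetic-anomaly strategy actually implemented in this repository may differ.

```python
# Minimal sketch: oversampling rare anomaly classes on a tabular dataset.
# SMOTE from imbalanced-learn is used purely for illustration; the
# repository's own balancing / synthetic-anomaly generation may differ.
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))              # stand-in for extracted tabular features
y = np.array([0] * 180 + [1] * 20)          # heavily imbalanced labels (1 = anomaly)

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(np.bincount(y_res))                   # both classes now have 180 samples
```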
- We publish the AAUWSD dataset here, which is a labeled anomaly detection dataset for robotic screwing into wood profiles with 4 classes of anomalies.
- The five classes of the dataset are:
- normal screwing
- under-tightening: occurs when termination torque is less than fastening torque.
- over-tightening: occurs when termination torque is higher than fastening torque.
- pose anomaly: occurs when misalignment between screwdriver spindle and workpiece results in slippage.
- missing screw: occurs when the feeder fails to send a screw to the screwdriver.
- This is a subset of the AURSAD dataset (paper, dataset).
- To build this dataset:
- each screw tightening process is sliced from the beginning of its engagement phase to the termination of its clamping phase.
- Insertion torque is measured as the only process attribute.
- TCP Pose, spatial velocity, and spatial acceleration are measured as task attributes.
- Collect your time series dataset in the form of a list of dictionary objects saved as a `.dat` file with the following keys:
  - `ftrs_tag`: keys of the time series measurements in each sample; pass in a list even if single
  - `label_tag`: key of the label in each sample
  - `time_tag`: key of the time vector(s) in each sample
  - `catg_tag`: key of the categorical features in each sample
- For your convenience, we have created two synthetic datasets here showcasing how to save your dataset:
  - `synthetic_fixed.dat`: a sample time series dataset with a fixed time vector
  - `synthetic_varying.dat`: a sample time series dataset with a varying time vector
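For reference, the snippet below sketches how a dataset in the format described above could be written. It assumes the `.dat` file is simply a pickled Python list of dictionaries, and the key names (`torque`, `velocity`, `time`, `label`, `material`) are placeholders; check the synthetic datasets above for the exact structure expected by the loaders.

```python
# Sketch of writing a raw dataset in the list-of-dictionaries format.
# Assumptions: the .dat file is a pickled Python list, and the key names
# below are placeholders that must match the *_tag entries in config_data.yaml.
import pickle
import numpy as np

n_samples, n_points = 10, 500
dataset = []
for i in range(n_samples):
    t = np.linspace(0.0, 5.0, n_points)  # time vector ("fixed": identical across samples)
    dataset.append({
        "torque":   np.sin(t) + 0.1 * np.random.randn(n_points),   # a time series channel
        "velocity": np.cos(t) + 0.1 * np.random.randn(n_points),   # another channel
        "time":     t,                # referenced by time_tag
        "label":    int(i % 5 == 0),  # referenced by label_tag
        "material": "wood",           # referenced by catg_tag
    })

with open("my_dataset.dat", "wb") as f:  # referenced by data_path
    pickle.dump(dataset, f)
```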
- Update the `./config/config_data.yaml` file with the following key-value pairs (a sketch of an example file follows this list):
  - `preprocess`:
    - `data_path`: path of the raw time series data in [dict1, dict2, ...] format, saved as binary (.dat file)
    - `data_name`: name of the dataset
    - `tab_path`: target path of the tabular dataset
    - `ftrs_tag`: keys of the time series measurements in each sample; pass in a list even if single
    - `label_tag`: key of the label in each sample
    - `time_tag`: key of the time vector(s) in each sample; set to null if not available
    - `time_type`: type of the time vector of the dataset; should be one of {"fixed", "varying"}
    - `depth`: depth of feature extraction; should be one of {"minimal", "efficient", "comprehensive"}
    - `n_jobs`: number of CPUs involved in data preprocessing
    - `catg_incl`: whether to include categorical features
    - `catg_tag`: key of the categorical features in each sample; set to null if not available
    - `random_state`: random state for reproducing results
  - `train`:
    - `tab_path_`: dated child directory of `tab_path` to read the target tabular dataset from
    - `model_path`: path to save trained models
    - `model_names`: list of tree-based ensembles to train; should be in ["bagging", "rf", "extra_trees", "ada_boost", "grad_boost"]; pass in a list even if single
    - `train_on_FE`: Boolean; whether to train the model on the output of the FE module
    - `train_on_FS`: Boolean; whether to train the model on the output of the FS module
    - `n_estimators`: number of estimators in the ensemble trees
    - `n_jobs`: number of CPUs involved in training
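A sketch of what a filled-in `config_data.yaml` might look like is given below. The key names follow the descriptions above; all paths, tag names, and values are placeholders to be replaced with your own.

```yaml
# Hypothetical example of ./config/config_data.yaml; all values are placeholders.
preprocess:
  data_path: ./data/raw/my_dataset.dat   # raw [dict1, dict2, ...] data saved as .dat
  data_name: my_dataset
  tab_path: ./data/tab                   # target directory for the tabular dataset
  ftrs_tag: ["torque", "velocity"]       # pass a list even for a single channel
  label_tag: label
  time_tag: time                         # or null if not available
  time_type: fixed                       # "fixed" or "varying"
  depth: efficient                       # "minimal", "efficient", or "comprehensive"
  n_jobs: 4
  catg_incl: true
  catg_tag: material                     # or null if not available
  random_state: 42

train:
  tab_path_: <dated subdirectory of tab_path, set after preprocessing>
  model_path: ./models
  model_names: ["rf", "grad_boost"]      # pass a list even for a single model
  train_on_FE: true
  train_on_FS: true
  n_estimators: 200
  n_jobs: 4
```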
- Run `./src/preprocess_data.py`, passing the path of `config_data.yaml` with the `--config_path` command line argument. This will convert the raw time series dataset located in `config_data["preprocess"]["data_path"]` to its tabular counterpart by feature extraction, feature selection, and anomaly generation. The result is saved in a dated subdirectory of `config_data["preprocess"]["tab_path"]`.
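For example (assuming the config file sits at its default location in the repository):

```bash
python ./src/preprocess_data.py --config_path ./config/config_data.yaml
```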
- Update `config_data["train"]["tab_path_"]` to the dated subdirectory created by preprocessing and run `./src/train.py`, passing the path of `config_data.yaml` with the `--config_path` command line argument (see the example invocation after this list). This will train and validate the ensemble trees specified in `config_data["train"]["model_names"]` on the created tabular dataset. The trained models, together with their performance metrics, are saved in a dated subdirectory of `config_data["train"]["model_path"]`.
- To apply STAD-FEBTE on any of the AURSAD, AAUWSD, synthetic_fixed, or synthetic_varying datasets, simply uncomment their corresponding part in the `./config/config_data.yaml` file.
- To reload the tabular datasets for which the paper results are reported, look into `./data/tab/STAD-FEBTE` here.
- To reload the trained ensemble trees for which the paper results are reported, look into `./models/STAD-FEBTE` here.
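For the training step above, an example invocation (again assuming the config file at its default repository location) might be:

```bash
python ./src/train.py --config_path ./config/config_data.yaml
```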
The presented framework was able to outperform common deep learning models applied to the raw time series data and to detect anomalies with high accuracy across several evaluation metrics.
Result figures: AAUWSD - STAD-FEBTE vs DL; AAUWSD - Confusion matrices; AURSAD - STAD-FEBTE vs DL; AURSAD - Performance metrics.
- Python3
- Numpy
- Pandas
- tsfresh
- imbalanced-learn
- scikit-learn
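Among these dependencies, tsfresh handles the automatic feature extraction; the three `depth` options in the preprocessing config mirror the names of tsfresh's built-in extraction settings. The sketch below shows how such a mapping is typically written with tsfresh; it is an illustrative assumption, not the repository's actual implementation.

```python
# Illustrative sketch (not the repository's code): mapping the "depth"
# option to tsfresh's built-in feature-extraction settings.
import pandas as pd
from tsfresh import extract_features
from tsfresh.feature_extraction import (
    MinimalFCParameters,
    EfficientFCParameters,
    ComprehensiveFCParameters,
)

DEPTH_TO_SETTINGS = {
    "minimal": MinimalFCParameters,
    "efficient": EfficientFCParameters,
    "comprehensive": ComprehensiveFCParameters,
}

def extract_tabular_features(flat_df: pd.DataFrame, depth: str = "efficient",
                             n_jobs: int = 4) -> pd.DataFrame:
    """flat_df: tsfresh flat format with an 'id' column, a 'time' column,
    and one column per measured channel."""
    settings = DEPTH_TO_SETTINGS[depth]()
    return extract_features(
        flat_df,
        column_id="id",                  # sample identifier
        column_sort="time",              # time vector
        default_fc_parameters=settings,  # controls how many features are computed
        n_jobs=n_jobs,
    )
```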