Paper: A Time Series Classification Pipeline for Detecting Interaction Ruptures in HRI Based on User Reactions
Abstract: To be able to react to interaction ruptures such as errors, a robot needs a way of recognizing that such a rupture has occurred. We test whether it is possible to detect interaction ruptures from the user's anonymized speech, posture, and facial features. We showcase how to approach this task, presenting a time series classification pipeline that works well with various machine learning models. A sliding window is applied to the data, and the continuously updated predictions make the pipeline suitable for detecting ruptures in real time. Our best model, an ensemble of MiniRocket classifiers, is the winning approach to the ICMI ERR@HRI challenge. A feature importance analysis shows that the model relies heavily on speaker diarization data indicating who spoke when. Posture data, on the other hand, impedes performance.
TL;DR: Careful feature selection/preparation and models that utilize convolutions are key to successful interaction rupture predictions!
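To make the sliding-window idea concrete, here is a minimal sketch (not the repository's actual code; all names are illustrative): a fixed-length window slides over a multivariate recording, and each window is classified, yielding continuously updated rupture predictions.

```python
# Minimal sketch of the sliding-window classification idea (illustrative,
# not the repository's implementation).
import numpy as np

def sliding_windows(series: np.ndarray, window: int, stride: int):
    """Yield (start_index, window) pairs over a (channels, time) array."""
    for start in range(0, series.shape[1] - window + 1, stride):
        yield start, series[:, start:start + window]

# `clf` would be any fitted window classifier (e.g., a MiniRocket pipeline);
# `recording` is a (channels, time) array sampled at 25 fps. Window and
# stride values here mirror the best submitted config (1500 / 225).
# predictions = [(t, clf.predict(w[None])[0])
#                for t, w in sliding_windows(recording, window=1500, stride=225)]
```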
If you use our research in your work, please consider citing our paper:
```bibtex
@inproceedings{wachowiak_hri2024,
  author    = {Wachowiak, Lennart and Tisnikar, Peter and Coles, Andrew and Canal, Gerard and Celiktutan, Oya},
  title     = {A Time Series Classification Pipeline for Detecting Interaction Ruptures in HRI Based on User Reactions},
  year      = {2024},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  doi       = {10.1145/3678957.3688386},
  booktitle = {Proceedings of the 2024 International Conference on Multimodal Interaction},
  numpages  = {9},
  series    = {ICMI '24}
}
```
*Figure: Change in model accuracy compared to baseline for different feature combinations.*
We used Python 3.11.9. All required Python packages can be installed from requirements.txt. The easiest way is to create a virtual environment like this:

```bash
python3.11 -m venv venv_hri_err
source venv_hri_err/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```
Alternatively, if you have Apptainer installed, we provide our container definition file here. To build the container, run the following command in your terminal:

```bash
sudo apptainer build <CONTAINER_NAME>.sif hri_cont.def
```

Once built, you can run the container, which will start a new terminal shell from which you can then run the scripts. To train the deep learning models, make sure to run the container with the `--nv` flag:

```bash
apptainer run --nv <CONTAINER_NAME>.sif
```
As we are not allowed to re-publish the dataset, you need to request it from the competition organizers in case you want to reproduce our work. To facilitate reproduction, this text file indicates the necessary data folder structure to run the code as is.
We offer several scripts which perform different steps of our pipeline:
- time_series_classifiers.py: run Optuna search / train single model / evaluate on competition test set / create learning curve
- data_loader.py: preprocessing & loading of datasets
- get_metrics.py: official competition metrics script
- visualizations.py: plotting script used to generate all plots contained in the main paper and the appendix
- evaluate_best_model.py: helper script to evaluate one of the saved best model configs
Model searches can be easily specified via JSON files. We provide many examples for the genetic searches and grid searches. Both genetic and grid searches use the same config structure, but the grid search configs must contain `grid` in their name; a hypothetical example is sketched below.
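For illustration only, a grid search config could look roughly like the following. The field names and values below are assumptions for exposition (values taken from the result table further down), not the repository's exact schema; consult the provided example configs for the real keys.

```json
{
  "model_type": "MiniRocket",
  "task": 2,
  "interval_length": [1500, 2000, 2500],
  "stride_train": [400, 600],
  "stride_eval": [225, 300],
  "columns_to_remove": ["openpose", "c_openface"],
  "n_estimators": [10, 20, 25],
  "max_dilations_per_kernel": [32, 64]
}
```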
To run a (grid) search, use the following command:

```bash
python HRI-Error-Detection-STAI/code/time_series_classifiers.py --config grid_search_configs/config_minirocket_grid.json --njobs -1 --type search
```
To train the best MiniRocket model we found (as specified in the table below), set the `--type` argument to `train_single`:

```bash
python HRI-Error-Detection-STAI/code/time_series_classifiers.py --njobs -1 --type train_single
```
To get the test predictions for the hidden competition test sets, use time_series_classifiers.py and set the `--type` flag to `competition_eval`. This will generate predictions from the MiniRocket models that were submitted to the competition:

```bash
python HRI-Error-Detection-STAI/code/time_series_classifiers.py --njobs -1 --type competition_eval
```
To run the learning curve experiment, use time_series_classifiers.py and set the `--type` flag to `learning_curve`. This will produce a learning curve for each of the four models considered in the paper:

```bash
python HRI-Error-Detection-STAI/code/time_series_classifiers.py --njobs -1 --type learning_curve
```
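Conceptually, a learning curve is produced by training on growing fractions of the training windows and tracking validation macro F1. The sketch below illustrates this idea; the function names are ours, not the repository's API (the actual implementation lives in time_series_classifiers.py).

```python
# Illustrative learning-curve computation (assumes numpy and scikit-learn);
# not the repository's exact implementation.
import numpy as np
from sklearn.metrics import f1_score

def learning_curve_points(make_model, X, y, X_val, y_val,
                          fractions=(0.2, 0.4, 0.6, 0.8, 1.0), seed=42):
    """Train on growing random subsets and return (fraction, macro F1) pairs."""
    rng = np.random.default_rng(seed)
    points = []
    for frac in fractions:
        idx = rng.choice(len(X), size=max(1, int(frac * len(X))), replace=False)
        model = make_model().fit(X[idx], y[idx])
        points.append((frac, f1_score(y_val, model.predict(X_val), average="macro")))
    return points
```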
If you wish to evaluate one of the models you trained and saved in the best_model_configs/ folder, use the evaluate_best_model.py script. Specify the name of the config using the `--file` flag:

```bash
python HRI-Error-Detection-STAI/code/evaluate_best_model.py --file <YOUR_MODEL_CONFIG>.json
```
If you would like to reproduce all plots in the paper and the appendix, simply run the visualizations.py script, which will create the plots in approximate order of appearance and store PDFs in the plots/ folder. All data used to generate the plots is available in plots/run_histories/:

```bash
python HRI-Error-Detection-STAI/code/visualizations.py
```
The submitted models were still trained without zero padding. The last column on the right shows our final best model, trained after the competition ended:
| Category | Parameter | Interaction Rupture (submitted) | Robot Error (submitted) | User Awkwardness (submitted) | Interaction Rupture (best MiniRocket) |
|---|---|---|---|---|---|
| Task | Task | 2 | 1 | 0 | 2 |
| Model | Model Type | MiniRocket | MiniRocket | MiniRocket | MiniRocket |
| Data Param. | Interval Length | 1500 | 1600 | 1500 | 2500 |
| Data Param. | Stride Train | 400 | 400 | 400 | 600 |
| Data Param. | Stride Eval | 225 | 225 | 225 | 300 |
| Data Param. | FPS | 25 | 25 | 25 | 25 |
| Data Param. | Columns to Remove | vel_dist, c_openface | openpose, c_openface | vel_dist, c_openface | openpose, c_openface |
| Data Param. | Label Creation | stride_eval | stride_eval | stride_eval | stride_eval |
| Data Param. | NaN Handling | avg | avg | avg | avg |
| Data Param. | Oversampling Rate | 0.15 | 0.2 | 0.1 | 0.1 |
| Data Param. | Undersampling Rate | 0.05 | 0.0 | 0.05 | 0.1 |
| Data Param. | Rescaling | normalization | none | none | none |
| Data Param. | Zero Padding | False | False | False | True |
| Model Param. | Number of Estimators | 25 | 20 | 20 | 10 |
| Model Param. | Max Dilations per Kernel | 64 | 32 | 64 | 32 |
| Model Param. | Class Weight | None | None | None | None |
| Model Param. | Random State | 42 | 42 | 42 | 42 |
| Performance | Accuracy (Cross-Val.) | 0.82 | 0.89 | 0.84 | 0.84 |
| Performance | Macro F1 (Cross-Val.) | 0.74 | 0.77 | 0.55 | 0.76 |
| Performance | Accuracy (Test) | 0.80 | 0.87 | 0.76 | N/A |
| Performance | Macro F1 (Test) | 0.75 | 0.73 | 0.55 | N/A |
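For orientation, the following sketch shows how an ensemble of MiniRocket classifiers along the lines of the best configuration above could be assembled. It assumes sktime and scikit-learn and uses the hyperparameters from the last table column (10 estimators, 32 max dilations per kernel); it is an illustration, not the repository's exact implementation, which lives in time_series_classifiers.py.

```python
# Hedged sketch of a MiniRocket ensemble (assumes sktime and scikit-learn);
# hyperparameters mirror the "best MiniRocket" column above.
import numpy as np
from sklearn.linear_model import RidgeClassifierCV
from sklearn.pipeline import make_pipeline
from sktime.transformations.panel.rocket import MiniRocketMultivariate

def fit_minirocket_ensemble(X, y, n_estimators=10, max_dilations=32, seed=42):
    """Fit MiniRocket+ridge pipelines that differ only in their random seed."""
    models = []
    for i in range(n_estimators):
        pipe = make_pipeline(
            MiniRocketMultivariate(max_dilations_per_kernel=max_dilations,
                                   random_state=seed + i),
            RidgeClassifierCV(alphas=np.logspace(-3, 3, 10)),
        )
        models.append(pipe.fit(X, y))  # X: (n_windows, n_channels, n_timepoints)
    return models

def predict_majority(models, X):
    """Majority vote over the per-model window predictions."""
    votes = np.stack([m.predict(X) for m in models]).astype(int)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```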
Best configs of other models: