RAPID: Training-free Retrieval-based Log Anomaly Detection with PLM considering Token-level information
Gunho No*, Yukyung Lee*, Hyeongwon Kang and Pilsung Kang
(*equal contribution)
This repository is the official implementation of "RAPID".
```
RAPID/
│
├── split_data.py          # Dataset splitting and preprocessing
├── preprocess_rep.py      # Log representation generation via language model
├── ad_test_coreSet.py     # Anomaly detection algorithm
├── utils.py
│
├── scripts/
│   └── all_at_once.sh     # End-to-end experiment runner
│
└── processed_data/        # Directory for processed datasets
    ├── bgl/
    ├── tbird/
    └── hdfs/
```
RAPID is evaluated on three public datasets:
- BGL (Blue Gene/L)
- Thunderbird
- HDFS
Place the raw datasets in the dataset/ directory before running the preprocessing scripts.
To reproduce all experiments from the paper:
```bash
bash scripts/all_at_once.sh
```
This script runs the entire pipeline, including data preprocessing, representation generation, and anomaly detection across multiple configurations as described in our paper.
- Data splitting and preprocessing:

  ```bash
  python split_data.py --dataset [bgl/tbird/hdfs] --test_size 0.2
  ```

- Representation generation:

  ```bash
  python preprocess_rep.py --dataset [bgl/tbird/hdfs] --plm bert-base-uncased --batch_size 8192 --max_token_len [128/512]
  ```

- Anomaly detection:

  ```bash
  python ad_test_coreSet.py --dataset [bgl/tbird/hdfs] --train_ratio 1 --coreSet 0.01 --only_cls False
  ```
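Conceptually, the detection stage retrieves the nearest neighbors of each test log among (a coreset of) the normal training representations and uses the distance as the anomaly score. The sketch below illustrates that idea with NumPy; it is not the repository's exact implementation, and the random coreset sampling and function name are illustrative stand-ins for the coreset selection used in `ad_test_coreSet.py`:

```python
import numpy as np

def coreset_knn_scores(train_reps, test_reps, coreset_ratio=0.01, seed=0):
    """Score each test log by its distance to the nearest normal representation.

    train_reps:    [n_train, dim] representations of normal logs
    test_reps:     [n_test, dim] representations to score
    coreset_ratio: fraction of the training set retained (cf. --coreSet);
                   random subsampling here is a simplification.
    """
    rng = np.random.default_rng(seed)
    if 0 < coreset_ratio < 1:
        k = max(1, int(len(train_reps) * coreset_ratio))
        train_reps = train_reps[rng.choice(len(train_reps), k, replace=False)]
    # Pairwise Euclidean distances, then nearest-neighbor distance per test log
    diffs = test_reps[:, None, :] - train_reps[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    return dists.min(axis=1)  # higher score => more anomalous

# Toy usage: points far from the normal cluster receive larger scores
normal = np.random.default_rng(1).normal(0.0, 0.1, size=(500, 8))
queries = np.concatenate([normal[:5], normal[:5] + 5.0])
scores = coreset_knn_scores(normal, queries, coreset_ratio=0.1)
```

Because no model is trained in this stage, changing `--coreSet` or `--train_ratio` only changes the retrieval pool, which is what makes the method training-free.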
Key arguments:

- `--dataset`: Choose the dataset (`bgl`, `tbird`, `hdfs`)
- `--sample`: Sample size for large datasets (e.g., 5000000 for Thunderbird)
- `--plm`: Pre-trained language model (`bert-base-uncased`, `roberta-base`, `google/electra-base-discriminator`)
- `--coreSet`: Core set size or ratio (0, 0.01, 0.1, etc.)
- `--train_ratio`: Ratio of training data to use (1, 0.1, 0.01, etc.)
- `--only_cls`: Whether to use only the CLS token representation (True/False)
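The `--only_cls` flag switches between using only the CLS vector and using token-level information. The toy NumPy illustration below contrasts the two pooling choices; the shapes and the mean-pooling stand-in are illustrative only (RAPID's token-level retrieval is richer than simple averaging, and the real representations come from `preprocess_rep.py`):

```python
import numpy as np

def pool_representation(hidden_states, only_cls=False):
    """Reduce per-token PLM outputs [seq_len, dim] to one log vector.

    only_cls=True keeps just the first (CLS) token's vector;
    only_cls=False averages over all tokens as a simple stand-in for
    representations that retain token-level information.
    """
    if only_cls:
        return hidden_states[0]
    return hidden_states.mean(axis=0)

# Toy hidden states for one log message: 4 tokens, 3-dim embeddings
hidden = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 1.0, 1.0]])
cls_vec = pool_representation(hidden, only_cls=True)    # CLS token only
mean_vec = pool_representation(hidden, only_cls=False)  # all tokens
```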
After running the experiments, results are saved in the `processed_data/[dataset]/[test_size]/[plm]/results/` directory. Each experiment produces a CSV file and a JSON file with detailed performance metrics.
If you find this code useful for your research, please cite our paper:
@article{NO2024108613,
title = {Training-free retrieval-based log anomaly detection with pre-trained language model considering token-level information},
journal = {Engineering Applications of Artificial Intelligence},
volume = {133},
pages = {108613},
year = {2024},
issn = {0952-1976},
doi = {10.1016/j.engappai.2024.108613},
url = {https://www.sciencedirect.com/science/article/pii/S0952197624007711},
author = {Gunho No and Yukyung Lee and Hyeongwon Kang and Pilsung Kang}
}