This repository contains the replication package for the paper "Detect Cell-level Data Anomalies With LUCARIO".
The workflow of our tool is shown in the following graph ("Str" stands for string type columns, "Num" stands for numerical type columns, and "Mix" stands for mixed type columns):
LUCARIO follows the guidance of the coverage rate
- Python >= 3.8
- Pandas == 1.5.3
- Numpy == 1.24.3
- Scikit-learn == 1.3.2
- Matplotlib == 3.7.5
Run constraint_inference.py to infer the constraints in each dataset. The coverage rate (results/LUCARIO/constraints folder; the constraints can be explicitly validated and maintained by users. When modifying them, set any undesired constraint to "null" instead of deleting its entry. Here's an example:
    {
        "id": {
            "type_constraint": "String",
            "categorical_constraint": null,
            "numerical_constraint": null,
            "pattern_constraint": [
                "tt[0-9]{7}"
            ]
        },
        ......
    }
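The "set to null, do not delete" rule can be sketched in Python as follows. This is only an illustration: the `disable_constraint` helper and the in-memory `constraints` dictionary are hypothetical and not part of the package; the dictionary simply mirrors the example entry above.

```python
import json

# Hypothetical in-memory copy of one constraint entry (mirrors the example above).
constraints = {
    "id": {
        "type_constraint": "String",
        "categorical_constraint": None,
        "numerical_constraint": None,
        "pattern_constraint": ["tt[0-9]{7}"],
    }
}


def disable_constraint(constraints, column, name):
    """Set one constraint to None (serialized as JSON null) while keeping its key."""
    constraints[column][name] = None
    return constraints


# Disable the pattern constraint for the "id" column without removing the entry.
disable_constraint(constraints, "id", "pattern_constraint")

# json.dumps writes Python None as JSON null.
serialized = json.dumps(constraints, indent=4)
print(serialized)
```

The key stays in the entry, so every column keeps the same four fields and only the value changes to null.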
After obtaining the constraints, one can detect data anomalies using the generated rules. Run anomaly_detection.py to detect the anomalies. The detection results are stored as CSV files under the results/LUCARIO/anomalies folder.
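To give a rough idea of what a cell-level check against a pattern constraint could look like, here is a minimal sketch. It is not the logic of anomaly_detection.py: the `find_pattern_violations` function is hypothetical, and it assumes that a cell is valid when it fully matches at least one pattern and that a null constraint means no check is applied.

```python
import re


def find_pattern_violations(values, patterns):
    """Return the indices of cells that match none of the given patterns.

    Assumption: a null (None) or empty constraint disables the check.
    """
    if not patterns:
        return []
    compiled = [re.compile(p) for p in patterns]
    return [
        i
        for i, value in enumerate(values)
        # Assumption: the whole cell must match, not just a substring.
        if not any(c.fullmatch(str(value)) for c in compiled)
    ]


# Hypothetical column values checked against the pattern from the example entry.
cells = ["tt0111161", "tt12", "0111161"]
violations = find_pattern_violations(cells, ["tt[0-9]{7}"])
print(violations)
```

Here the second and third cells are flagged because neither consists of "tt" followed by exactly seven digits.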