Anomaly Detection with the Cloud-based UEBA Log Data Set

This repository contains scripts to carry out anomaly detection in the CLUE log data set.

The publicly available CLUE-LDS (CLoud-based User Entity behavior analytics Log Data Set) contains system log events generated by real users of a customized Nextcloud instance operated by Huemer Group. The data set spans more than five years and contains events generated by more than 5000 anonymized users. Anonymization is carried out without loss of information, i.e., identical user and file names are consistently replaced with the same pseudonym. There are 49 event types, covering for example user logins, file edits, document sharing, and configuration changes. Some events also include the geolocation of the user. For more information on the data set, see the publication referenced at the bottom of this page.

As the CLUE-LDS contains only normal user and entity behavior, attacks have to be injected manually to enable the evaluation of anomaly detection systems. We simulate account hijacking by switching pairs of user identifiers (uid) at specific points in time. The first step is to clone this repository and unpack the CLUE-LDS archive into the same directory:

user@ubuntu:~$ git clone https://github.com/ait-aecid/clue-lds.git
user@ubuntu:~$ cd clue-lds/
user@ubuntu:~/clue-lds$ wget https://zenodo.org/record/7119953/files/clue.zip
user@ubuntu:~/clue-lds$ unzip clue.zip

The next step is to compute similarities between pairs of users so that behavior patterns of users that are selected for switching are not too similar (which would make detection very difficult) and not too different (which would make detection trivial). Running the following script generates a similarity matrix sim.txt as well as the file user_info.txt containing user information that is necessary for the subsequent steps.

user@ubuntu:~/clue-lds$ python3 get_user_similarity.py

Running the script takes several minutes. Note that there are several parameters for this and all following scripts, for example, for specifying the weights of the similarity computation. Check out the parameters with python3 get_user_similarity.py -h.
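
To illustrate the idea of selecting user pairs that are neither too similar nor too different, here is a minimal sketch. It is not the computation performed by get_user_similarity.py: it assumes that each user is summarized by a count of event types and that similarity is the cosine of two such count vectors. The function names, the bounds, and the toy event types are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def candidate_pairs(profiles, low, high):
    """Yield (user, user, similarity) for pairs with low <= similarity <= high."""
    users = sorted(profiles)
    for i, u in enumerate(users):
        for v in users[i + 1:]:
            s = cosine_similarity(profiles[u], profiles[v])
            if low <= s <= high:
                yield u, v, s


# Toy profiles with made-up event types (assumption, not taken from the data set).
profiles = {
    "user_a": Counter(login=10, file_edit=5),
    "user_b": Counter(login=9, file_edit=6),
    "user_c": Counter(login=1, share=7),
}

# Hypothetical bounds: skip near-identical and completely unrelated users.
for u, v, s in candidate_pairs(profiles, low=0.1, high=0.5):
    print(f"{u} <-> {v}: similarity {s:.2f}")
```

In this toy example, user_a and user_b are almost identical and are therefore excluded, while both remain candidates for a switch with user_c.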

The next step is to generate the log data set with injected anomalies. The following script randomly selects pairs of users based on the previously computed similarities and switches their identifiers at arbitrary points in time. This changes the behavior patterns of the affected users and thus enables the evaluation of anomaly detection systems based on user and entity behavior analysis. Running the script as follows generates two files: clue_anomaly.json, which is identical to the original clue.json except that some user identifiers are switched, and labels.txt, which contains the ground truth. The output of the script also shows which users were selected and when each injected anomaly occurs.

user@ubuntu:~/clue-lds$ python3 generate_test_file.py
Changing user pair with similarity 0.24:
 * User competent-aqua-hare-buildinginspector changed at 2022-09-28 00:00:00+00:00 (user originally carried out 1521 total events and 16 unique events during 59 active days).
 * User shared-fuchsia-cardinal-buildingadvisor changed at 2022-09-28 00:00:00+00:00 (user originally carried out 6356739 total events and 24 unique events during 1910 active days).
Changing user pair with similarity 0.15:
 * User dull-amethyst-buzzard-ledgerclerk changed at 2018-06-10 00:00:00+00:00 (user originally carried out 5328776 total events and 5 unique events during 291 active days).
 * User japanese-yellow-pike-thermalengineer changed at 2018-06-11 00:00:00+00:00 (user originally carried out 96841 total events and 23 unique events during 669 active days).
...
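
The switching step can be sketched as follows. This is not the code of generate_test_file.py: it assumes that every log line is a JSON object with a `uid` field and an ISO 8601 timestamp in a `time` field (the field name `time` is an assumption), and it uses a single switch time per pair, whereas the sample output above shows that the two users of a pair may be switched on different days. The function name and the toy events are hypothetical.

```python
import json
from datetime import datetime, timezone


def switch_uids(lines, user_a, user_b, switch_time):
    """Yield JSON lines in which user_a and user_b trade uids from switch_time on."""
    swap = {user_a: user_b, user_b: user_a}
    for line in lines:
        event = json.loads(line)
        when = datetime.fromisoformat(event["time"])  # assumed field name
        if when >= switch_time and event["uid"] in swap:
            event["uid"] = swap[event["uid"]]
        yield json.dumps(event)


# Toy log lines (made up for illustration, not taken from clue.json).
sample = [
    '{"uid": "alice", "time": "2022-09-27T12:00:00+00:00", "type": "login"}',
    '{"uid": "bob", "time": "2022-09-28T08:00:00+00:00", "type": "login"}',
    '{"uid": "alice", "time": "2022-09-29T09:30:00+00:00", "type": "login"}',
    '{"uid": "carol", "time": "2022-09-29T10:00:00+00:00", "type": "login"}',
]

switch_time = datetime(2022, 9, 28, tzinfo=timezone.utc)
for line in switch_uids(sample, "alice", "bob", switch_time):
    print(line)
```

Events before the switch time and events of uninvolved users pass through unchanged, so the output differs from the input only in the swapped identifiers.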

This repository contains an example anomaly detection method based on event count vectors that are generated for each day on which a user is active. Running the following script prints all relevant evaluation metrics. Again, several parameters can be specified when running the program, for example, the similarity threshold, which is set to 0.6 in the following example.

user@ubuntu:~/clue-lds$ python3 detect.py -t 0.6
Ground truth:
 * shared-fuchsia-cardinal-buildingadvisor switched at 2022-09-28 00:00:00+00:00
 * competent-aqua-hare-buildinginspector switched at 2022-09-28 00:00:00+00:00
 * japanese-yellow-pike-thermalengineer switched at 2018-06-10 00:00:00+00:00
 * dull-amethyst-buzzard-ledgerclerk switched at 2018-06-11 00:00:00+00:00
...
 5389 users with 83147 days considered, including days spent on training and incomplete days.
Results with threshold = 0.6:
  Total = 72469
  Train = 5289
  Detect = 72469
  Detected users = ['shared-fuchsia-cardinal-buildingadvisor', 'competent-aqua-hare-buildinginspector', 'japanese-yellow-pike-thermalengineer', 'dull-amethyst-buzzard-ledgerclerk', 'graceful-olive-spoonbill-careersofficer', 'high-chocolate-emu-liftengineer', 'careful-coffee-fowl-trafficwarden', 'southern-brown-gerbil-medicalsecretary', 'ethnic-lavender-gerbil-gamingclubmanager', 'modern-coral-crocodile-lampshademaker', 'extraordinary-plum-clownfish-sawmiller', 'hurt-aqua-roundworm-fuelmerchant', 'proud-copper-marmoset-accountsclerk']
  Missed users = ['chosen-bronze-egret-ticketagent', 'apparent-apricot-lamprey-artexer', 'horrible-moccasin-mole-licensing', 'famous-lavender-sailfish-partitionerector', 'meaningful-blue-viper-tankerdriver', 'labour-crimson-donkey-golfcaddy', 'ambitious-gold-bonobo-repairman']
  TP_adj = 13
  TP = 13
  FP = 3337
  TN = 69112
  FN_adj = 7
  FN = 7
  TPR_adj = Rec_adj = 0.65
  TPR = Rec = 0.65
  FPR = 0.04605998702535577
  TNR = 0.9539400129746443
  Prec = 0.003880597014925373
  F1 = 0.00771513353115727
  ACC = 0.9538561315872994
  R = 0.06801872476143933
  Runtime = 655.9123327732086
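
The rates in this output follow directly from the four counts TP, FP, TN, and FN. The sketch below recomputes them with the standard definitions. The function name is hypothetical, and the sketch leaves out R and the `_adj` variants because their definitions are not given here.

```python
def evaluation_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics; assumes no denominator is zero."""
    tpr = tp / (tp + fn)               # recall
    fpr = fp / (fp + tn)
    tnr = tn / (tn + fp)
    prec = tp / (tp + fp)
    f1 = 2 * prec * tpr / (prec + tpr)
    acc = (tp + tn) / (tp + fp + tn + fn)
    return {"TPR": tpr, "FPR": fpr, "TNR": tnr, "Prec": prec, "F1": f1, "ACC": acc}


# Counts taken from the sample output above.
metrics = evaluation_metrics(tp=13, fp=3337, tn=69112, fn=7)
for name, value in metrics.items():
    print(f"{name} = {value:.4f}")
```

With these counts, accuracy is high because almost all samples are normal, while precision is low because the 3337 false positives far outnumber the 13 true positives.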

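The detection approach described above, daily event count vectors compared against a similarity threshold, can be sketched as follows. This is not the implementation in detect.py: it assumes that a user's first active days serve as training data, that similarity is the cosine of two count vectors, and that a day is anomalous when its best similarity to all previously accepted days is below the threshold. All function names, parameters, and toy events are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def daily_count_vectors(events):
    """Group (day, event_type) pairs of one user into one Counter per active day."""
    days = {}
    for day, event_type in events:
        days.setdefault(day, Counter())[event_type] += 1
    return days


def detect_anomalous_days(days, train_days, threshold):
    """Return the days whose best similarity to known normal days is below threshold."""
    ordered = sorted(days)  # ISO dates sort chronologically as strings
    known = [days[d] for d in ordered[:train_days]]
    anomalies = []
    for d in ordered[train_days:]:
        best = max((cosine_similarity(days[d], k) for k in known), default=0.0)
        if best < threshold:
            anomalies.append(d)
        else:
            known.append(days[d])  # accept the day as normal behavior
    return anomalies


# Toy events of a single user with made-up event types.
events = [
    ("2022-09-26", "login"), ("2022-09-26", "file_edit"),
    ("2022-09-27", "login"), ("2022-09-27", "file_edit"), ("2022-09-27", "file_edit"),
    ("2022-09-28", "share"), ("2022-09-28", "config_change"),
]

days = daily_count_vectors(events)
print(detect_anomalous_days(days, train_days=2, threshold=0.6))
```

In this sketch, raising the threshold flags more days as anomalous.
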
If you use the CLUE data set or any of the scripts provided in this repository, please cite the following publication: