Baseline model
This walk-through illustrates the main steps of the data workflow, training and evaluating one of the simplest baseline models for exploit prediction: a linear classifier using CVSS features.
The input is a file of JSON records with raw fields like the one produced in data acquisition. In keeping with the steps there, it is assumed that the file `$HOME/cve-data/nvd-edb-merged.jsonl` exists, and that each line is a JSON record containing the keys `"cveid"`, `"cvssV2"`, and `"exploitdb"`.
Evaluating how well an estimator generalizes involves measuring its performance on a set of test data that is disjoint from the training data. What constitutes a faithful training/test split depends on the problem domain and is not always obvious. Vulnerability data is intrinsically temporal, so a principled approach is to split the training and test sets along a time boundary; a random training/test split may introduce subtle leakage of future information into the past.
This approach is followed here, but using a method that is intentionally simplistic for the sake of illustration: the data is sorted into coarse bins based on the year component of the CVE identifier. Records with a CVE issued from 2011 to 2015 form the training data, and those with a CVE issued in 2016 or 2017 form the test data.
$ cat ~/cve-data/nvd-edb-merged.jsonl \
| jq -c 'select(.cveid |test("CVE-201[1-5]"))' \
> ~/cve-data/nvd-edb-train-raw.jsonl
$ cat ~/cve-data/nvd-edb-merged.jsonl \
| jq -c 'select(.cveid |test("CVE-201[67]"))' \
> ~/cve-data/nvd-edb-test-raw.jsonl
$ wc -l ~/cve-data/nvd-edb*.jsonl
...
22352 /Users/.../nvd-edb-test-raw.jsonl
30663 /Users/.../nvd-edb-train-raw.jsonl
So about 30.7k training examples and 22.4k test examples.
The repository includes a configuration object, `config.json`, for baseline models. This defines the preprocessing steps for converting raw fields into a uniform format:
$ ./preprocess.py config/baseline/config.json \
~/cve-data/nvd-edb-train-raw.jsonl \
~/cve-data/nvd-edb-train-prep.jsonl \
--vocabulary ~/cve-data/nvd-edb-vocabulary.json
$ ./preprocess.py config/baseline/config.json \
~/cve-data/nvd-edb-test-raw.jsonl \
~/cve-data/nvd-edb-test-prep.jsonl
Note that the `--vocabulary` artifact need only be created from the training set, as that is taken as the source of truth for the token space.
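The train-as-source-of-truth convention can be sketched as follows. This is an assumption about the general idea, not the actual vocabulary format of `preprocess.py`: tokens are mapped to indices learned from the training set, and any token first seen at test time falls back to a shared out-of-vocabulary index.

```python
# Sketch of a training-set vocabulary with an out-of-vocabulary fallback.
# The real preprocess.py vocabulary format may differ.
def build_vocabulary(tokens):
    # Index 0 is reserved for out-of-vocabulary tokens.
    return {tok: i + 1 for i, tok in enumerate(sorted(set(tokens)))}

def encode_token(vocab, token):
    # Tokens unseen in the training data all share index 0.
    return vocab.get(token, 0)

vocab = build_vocabulary(["NETWORK", "LOCAL", "ADJACENT_NETWORK"])
assert encode_token(vocab, "NETWORK") == vocab["NETWORK"]
assert encode_token(vocab, "PHYSICAL") == 0  # not in the training vocabulary
```

Building the vocabulary from the test set as well would be a mild form of leakage: the encoder would reflect information unavailable at training time.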
The same configuration object used for preprocessing also defines the steps for transforming the JSON records into columnar numpy data:
$ ./encode.py config/baseline/config.json \
~/cve-data/nvd-edb-vocabulary.json \
~/cve-data/nvd-edb-train-prep.jsonl \
~/cve-data/nvd-edb-train-numpy.pkl
$ ./encode.py config/baseline/config.json \
~/cve-data/nvd-edb-vocabulary.json \
~/cve-data/nvd-edb-test-prep.jsonl \
~/cve-data/nvd-edb-test-numpy.pkl
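The kind of transformation involved can be illustrated with a one-hot encoding of a single categorical CVSS v2 field. This is a sketch under assumptions: the actual `encode.py` is driven by the configuration object and may encode the fields differently.

```python
import numpy as np

# Hypothetical one-hot encoding of the CVSS v2 accessVector field;
# the real encode.py is configuration-driven and may differ.
ACCESS_VECTOR = ["LOCAL", "ADJACENT_NETWORK", "NETWORK"]

def one_hot(value, categories):
    vec = np.zeros(len(categories), dtype=np.float32)
    vec[categories.index(value)] = 1.0
    return vec

# One row per record, stacked into a columnar feature matrix.
rows = [one_hot(v, ACCESS_VECTOR) for v in ["NETWORK", "LOCAL", "NETWORK"]]
X = np.stack(rows)
assert X.shape == (3, 3)
assert X[0].tolist() == [0.0, 0.0, 1.0]
```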
The repository includes a script, `linear.py`, for training and evaluating a linear classifier. Its performance is far from state-of-the-art for virtually any task around vulnerability scoring, but it is very quick to train and is useful for smoke testing the data pipeline.
The script has two commands, `train` and `eval`, each of which takes two positional arguments specifying a dataset and an estimator, respectively. The dataset argument must be an artifact emitted by `encode.py`.
The command line interface is somewhat verbose. The following command produces an estimator from the training set created above:
$ ./linear.py train ~/cve-data/nvd-edb-train-numpy.pkl \
~/cve-data/nvd-edb-linear.pkl \
--feature-key cvssV2 --label-key exploitdb
This emits an estimator to the file `nvd-edb-linear.pkl`, using the `cvssV2` data as the features and `exploitdb` as the labels.
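Internally, a train step like this amounts to fitting a linear model on the feature matrix and serializing the result. The following is a rough sketch assuming a scikit-learn `LogisticRegression` on synthetic stand-in data; the actual `linear.py` may use a different estimator or serialization scheme.

```python
import io
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for the cvssV2 features and exploitdb labels.
rng = np.random.default_rng(0)
X = rng.random((200, 6))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Fit a linear classifier and round-trip it through a pickle buffer,
# standing in for the emitted .pkl estimator artifact.
clf = LogisticRegression(max_iter=1000).fit(X, y)
buf = io.BytesIO()
pickle.dump(clf, buf)
buf.seek(0)
restored = pickle.load(buf)

# The restored estimator behaves identically to the original.
assert (restored.predict(X) == clf.predict(X)).all()
```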
Evaluating the estimator on the test set uses the same arguments, with only the command and the input dataset changed:
$ ./linear.py eval ~/cve-data/nvd-edb-test-numpy.pkl \
~/cve-data/nvd-edb-linear.pkl \
--feature-key cvssV2 --label-key exploitdb
... INFO metrics => {"AUC": 0.615513597301432}
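The reported AUC can be read as the probability that a randomly chosen exploited CVE receives a higher score from the classifier than a randomly chosen non-exploited one, so 0.616 is only modestly better than the 0.5 of random guessing, as expected for this baseline. A minimal self-contained check of that rank-based interpretation (not the metric code used by `linear.py`):

```python
# Rank-based (Mann-Whitney) computation of ROC AUC: the fraction of
# positive/negative pairs where the positive example scores higher,
# counting ties as half a win.
def auc(labels, scores):
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
assert abs(auc(labels, scores) - 0.75) < 1e-9  # 3 of 4 pairs ranked correctly
```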