
Baseline model


Exploit prediction from CVSS

This walk-through illustrates the main steps of the data workflow, training and evaluating one of the simplest baseline models for exploit prediction: a linear classifier using CVSS features.

The input is a file of JSON records with raw fields, like the one produced in data acquisition. In keeping with the steps there, it is assumed that the file $HOME/cve-data/nvd-edb-merged.jsonl exists and that each line is a JSON record containing the keys "cveid", "cvssV2", and "exploitdb".
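
For orientation, a record might look like the following. The exact shapes of the cvssV2 and exploitdb fields depend on the acquisition step; the record below is purely illustrative (the cvssV2 field names follow NVD's CVSS v2 schema, and the values are made up):

{
  "cveid": "CVE-2015-1234",
  "cvssV2": {"accessVector": "NETWORK", "accessComplexity": "LOW",
             "authentication": "NONE", "baseScore": 7.5},
  "exploitdb": {"id": "12345"}
}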

Step 1. Training/test split

Evaluating how well an estimator generalizes involves measuring its performance on a set of test data that is disjoint from the training data. What constitutes a faithful training/test split depends on the problem domain and is not always obvious. Vulnerability data is intrinsically temporal, so a principled approach is to split the training and test sets along a time boundary; a random split may subtly leak future information into the past.

This approach is followed here, using a method that is intentionally simplistic for the sake of illustration: the data is sorted into coarse bins based on the YEAR component of the CVE identifier. Records with a CVE issued from 2011 through 2015 form the training data, and those with a CVE issued in 2016 or 2017 are the test data.

$ cat ~/cve-data/nvd-edb-merged.jsonl \
    | jq -c 'select(.cveid |test("CVE-201[1-5]"))' \
    > ~/cve-data/nvd-edb-train-raw.jsonl

$ cat ~/cve-data/nvd-edb-merged.jsonl \
    | jq -c 'select(.cveid |test("CVE-201[67]"))' \
    > ~/cve-data/nvd-edb-test-raw.jsonl

$ wc -l ~/cve-data/nvd-edb*.jsonl
  ...
  22352 /Users/.../nvd-edb-test-raw.jsonl
  30663 /Users/.../nvd-edb-train-raw.jsonl

So about 30.7k training examples and 22.4k test examples.
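
Exploited CVEs are typically a small minority of all records, which is useful context for interpreting the AUC metric reported later. A quick way to check the balance, assuming a record counts as positive whenever its exploitdb field is non-empty (a hypothetical reading of the label; the real labeling is defined by the pipeline configuration):

import json
import os

# count training records carrying an ExploitDB reference
# (assumed labeling rule: non-empty "exploitdb" field => positive)
total = positives = 0
with open(os.path.expanduser('~/cve-data/nvd-edb-train-raw.jsonl')) as f:
    for line in f:
        record = json.loads(line)
        total += 1
        positives += bool(record.get('exploitdb'))
print(f'{positives}/{total} positive ({positives / total:.1%})')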

Step 2. Preprocessing

The repository includes a configuration object for baseline models, config/baseline/config.json. It defines the preprocessing steps for converting raw fields into a uniform format:

$ ./preprocess.py config/baseline/config.json \
    ~/cve-data/nvd-edb-train-raw.jsonl \
    ~/cve-data/nvd-edb-train-prep.jsonl \
    --vocabulary ~/cve-data/nvd-edb-vocabulary.json

$ ./preprocess.py config/baseline/config.json \
    ~/cve-data/nvd-edb-test-raw.jsonl \
    ~/cve-data/nvd-edb-test-prep.jsonl

Note that the --vocabulary artifact need only be created from the training set, as that will be taken as the source of truth for the token space.
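
preprocess.py itself is driven by the configuration file and its internals are not reproduced here. Conceptually, though, building a vocabulary amounts to collecting every token seen in the training data and assigning each a stable integer index. A minimal sketch, under the assumption that preprocessed records carry their features as lists of string tokens (the build_vocabulary helper below is hypothetical):

import json
import os

def build_vocabulary(path, key='cvssV2'):
    """Map every token seen under `key` in the file to a stable integer index."""
    tokens = set()
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            tokens.update(record.get(key, []))
    return {token: index for index, token in enumerate(sorted(tokens))}

vocab = build_vocabulary(os.path.expanduser('~/cve-data/nvd-edb-train-prep.jsonl'))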

Step 3. Encoding

The same configuration object used for preprocessing also defines the steps for transforming the JSON records into columnar numpy data:

$ ./encode.py config/baseline/config.json \
    ~/cve-data/nvd-edb-vocabulary.json \
    ~/cve-data/nvd-edb-train-prep.jsonl \
    ~/cve-data/nvd-edb-train-numpy.pkl

$ ./encode.py config/baseline/config.json \
    ~/cve-data/nvd-edb-vocabulary.json \
    ~/cve-data/nvd-edb-test-prep.jsonl \
    ~/cve-data/nvd-edb-test-numpy.pkl
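
As with preprocessing, the real transformation is defined by the configuration object. As a rough mental model, each record becomes a multi-hot feature row over the training vocabulary plus a binary label; a sketch under the same assumptions as the vocabulary example above:

import json

import numpy as np

def encode(path, vocab, feature_key='cvssV2', label_key='exploitdb'):
    """Multi-hot encode each record's tokens against the training vocabulary."""
    rows, labels = [], []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            row = np.zeros(len(vocab), dtype=np.float32)
            for token in record.get(feature_key, []):
                if token in vocab:  # tokens unseen at training time are dropped
                    row[vocab[token]] = 1.0
            rows.append(row)
            labels.append(1.0 if record.get(label_key) else 0.0)
    return np.stack(rows), np.asarray(labels, dtype=np.float32)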

Training a model

The repository includes a script, linear.py, for training and evaluating a linear classifier. Its performance is generally far from state-of-the-art for virtually any task around vulnerability scoring; however, it trains very quickly and is useful for smoke-testing the data pipeline.

The script has two commands, train and eval, and each takes two positional arguments specifying a dataset and an estimator, respectively. The dataset argument must be an artifact emitted by encode.py.

The command-line interface is somewhat verbose. The following command produces an estimator from the training set created above:

$ ./linear.py train ~/cve-data/nvd-edb-train-numpy.pkl \
    ~/cve-data/nvd-edb-linear.pkl \
    --feature-key cvssV2 --label-key exploitdb

This emits an estimator to the file nvd-edb-linear.pkl, using the cvssV2 data as the features and exploitdb as the labels.

Evaluating the estimator on the test set uses the same arguments, with only the command and the input dataset changed:

$ ./linear.py eval ~/cve-data/nvd-edb-test-numpy.pkl \
    ~/cve-data/nvd-edb-linear.pkl \
    --feature-key cvssV2 --label-key exploitdb

... INFO metrics => {"AUC": 0.615513597301432}
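
The internals of linear.py and its choice of estimator are not shown on this page. A roughly equivalent baseline, reusing the hypothetical build_vocabulary() and encode() helpers sketched above, could be written with scikit-learn:

import os
import pickle

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

data_dir = os.path.expanduser('~/cve-data')
vocab = build_vocabulary(os.path.join(data_dir, 'nvd-edb-train-prep.jsonl'))
X_train, y_train = encode(os.path.join(data_dir, 'nvd-edb-train-prep.jsonl'), vocab)
X_test, y_test = encode(os.path.join(data_dir, 'nvd-edb-test-prep.jsonl'), vocab)

# fit a plain logistic-regression classifier on the CVSS features
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# persist the estimator, mirroring the nvd-edb-linear.pkl artifact
with open(os.path.join(data_dir, 'nvd-edb-linear.pkl'), 'wb') as f:
    pickle.dump(model, f)

# rank the held-out 2016-2017 records by predicted exploit probability
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print({'AUC': auc})

For reference, an AUC of 0.5 corresponds to random ranking, so the 0.6155 reported above reflects a modest but real signal in the CVSS features alone.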