NEVIS’22 is a benchmark for measuring the performance of algorithms in the field of continual learning. Please see the accompanying paper for more details.
Within this Python package, we provide three components,
- Library code to download and post-process datasets that are not available within tfds, so that the stream used in the paper can be replicated.
- A package to combine the NEVIS’22 datasets into a stream, and robustly evaluate learners using the evaluation protocol proposed in the NEVIS’22 paper.
- Baseline learners implemented in JAX and PyTorch. The JAX learners are identical to the learners used for the figures in the paper, the PyTorch learners are provided for example purposes.
NEVIS’22 is composed of 106 tasks chronologically sorted and extracted from publications randomly sampled from online proceedings of major computer vision conferences over the past three decades. Each task is a supervised classification task, which is the most well understood setting in machine learning. The challenge is how to automatically transfer knowledge across related tasks in order to achieve a higher performance or be more efficient on the next task.
By construction, NEVIS’22 is reproducible, diverse and at a scale sufficiently large to test state of the art learning algorithms. The task selection process does not favor any particular approach, but merely tracks what the computer vision community has deemed interesting over time. NEVIS’22 is not just about data, it is also about the methodology used to train and evaluate learners. We evaluate learners in terms of their ability to learn future tasks, as measured by their trade-off between error rate and compute measured in the number of floating-point operations. In NEVIS’22, achieving lower error rate is by itself not sufficient, if this comes at an unreasonable computational cost. Instead, we incentivise both accurate and efficient models.
You can read more about NEVIS'22 in our paper and our blog post.
Please follow these steps and read in details section 1. and 2. before launching anything.
-
Our datasets use the Tensorflow(-datasets) API. Our JAX learners use TensorFlow and JAX, and our PyTorch Learners use PyTorch. Each (datasets, jax learners, and pytorch learners) have their own
requirements.txt
that you are welcome to install withpip
and a Python version above 3.8. -
It is also possible to run the code directly using the provided Dockerfiles. see here for installing Docker.
-
Some datasets are downloaded from Kaggle. See the Kaggle website for configuring your credentials, and place them in the folder ~/.kaggle.
In NEVIS'22, we train and evaluate on streams. Each stream is a sequence of datasets. Some streams have a large number of datasets, up to 106, allowing us to evaluate Large-Scale Continual Learning.
There are three different sources for datasets in NEVIS'22:
-
Datasets on Tensorflow-Datasets (TFDS): they will be downloaded automatically when needed
-
Custom dataset downloaders: you need the
./build_dataset.sh
script -
Manual dataset download: you need to download data yourself
Note that we do not host or distribute these datasets, instead we provide URLS to their original source to help you download them at your own risk. We do not vouch for their quality or fairness, or claim that you have license to use the dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license. If you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset URL to be included in this library, please get in touch through a GitHub issue. Thanks for your contribution to the ML community!
We do our best to keep datasets URLs up-to-date. If a dataset doesn't download, please contact the dataset owners and open an issue to let us know. If a dataset doesn't get fixed by the owners, we will remove it from our benchmark.
Different streams are available, each is made of a sequence of datasets. When iterating the datasets of a stream, TFDS datasets are automatically downloaded on-the-fly if they don't exist.
Many datasets implemented in Nevis can be automatically downloaded. This has to
be done in advance of training with the script ./build_dataset.sh
:
$ ./build_dataset.sh -h
Usage:
-d <DATASET_NAME> | Dataset name
-s <STREAM_NAME> | Stream variant name (FULL|SHORT|TINY|DEBUG|...)
-b | Build docker before running
-e | Develop mode where code is mounted
-h | Help message
If running for the first time, pass the option -b
alongside other commands to
build the docker (nevis-data
). The develop mode is useful if you need to
change the codebase (e.g. for adding a new dataset) and need to debug quickly
without having to re-building the docker everytime (you still need to build the
docker in develop mode! -b -e
).
See in dm_nevis/streams/nevis_stream.py
the enum NEVISStreamVariant
for the
full list of downloadable streams.
Some datasets are downloaded from Kaggle. See on the
Kaggle website how to configure your
credentials and place them in the folder ~/.kaggle
.
ImageNet is a TFDS Dataset but it needs to be downloaded manually. Please check the instructions.
For info, TFDS will look for datasets in the directory defined by the
environment variable TFDS_DATA_DIR
.
Each experiment consists of training a model on a stream of multiple datasets. Thus, this command will train a model on each dataset. We provide two main paradigms of learners: independent and finetune from previous. In the former, we create a new randomly initialized model for each dataset. In the latter, a model is initialized for the first dataset of the stream, and tuned sequentially for all datasets.
To launch an experiment, run:
./launch_local.sh <X> example
With <X>
being the framework to use (jax
or torch
), second argument is the
config to use.
Note that for the torch version, if you want to run on gpu instead of cpu, you
need to provide the gpu id with --device <GPU_ID>
. By default, the code is
using the id -1
to symbolize cpu.
By default the metrics computed by experiments_<X>/metrics/nevis_metrics.py
will be written in ./nevis_output_dir
.
You can specify a different path by overriding the environment variable
NEVIS_DEFAULT_OUTPUT_DIR
.
The TensorBoard events file will be saved to ~/nevis/tensorboard
. Each run
will create a folder below this directory named with the date and time when the
run was launched.
The tensorboard can be launched with the following command.
tensorboard --lodir=~/nevis/tensorboard
You will need to have tensorboard
installed outside the docker using
pip install tensorboard
Regarding the different groups of plots on tensorboard dashboard: -
benchmark_metrics
contains metrics from prediction events across the stream,
where the x-axis is the index (0-based) of the most training event. -
train_event_<i>
contains training and validation metrics on the training index
with index i
.
Let's take an example learner (returns always zeros) that we will "train" on the DEBUG stream made of Pascal VOC 2007 and Coil100 datasets.
Pascal VOC 2007 is a TFDS dataset so it will be automatically downloaded when needed.
First we download Coil100 dataset:
./build_dataset.sh -e -b -s debug
Note that since the DEBUG stream only downloads Coil100, we could also have used
-d coil100
instead of -s debug
. As you can see in the script
build_dataset.sh
, we download data in ~/nevis
. You can change this directory
by overriding the env variable LOCAL_DIR
in the script.
Then, we launch the example learner:
./launch_local.sh jax example
Note that the stream DEBUG
is already specified in the config
./experiments_jax/config/example.py
.
We provide several baselines, defined in the learners/
directory with configurations
in the configs/
directory. Note that the same approach might have multiple configurations.
Reminder, to run configuration configs/X.py
, do ./launch_local.sh jax X.py
.
We provide the following baselines:
- Independent, in
configs/finetuning_ind.py
where each dataset is learned by an independent model - Previous, in
configs/finetuning_prev.py
where we learn sequentially each dataset and initialize its parameters from the parameter vector learned on the previous task. - Dynamic, in
configs/finetuning_dknn.py
. where the initialization of task T is chosen among the models which have been trained on a dataset most similar to the current dataset. This baseline performs hyperparameter tuning while learning the task, following the protocol described in our tech report.
Variants are also proposed, such as cheaper configurations in configs/cheap_finetuning_dknn.py
which use a smaller net and fewer trials of hyper-parameter search. These are the best entry point for people who have access to only one or few GPUs.
It is also possible to run a pretrained model on the Nevis stream. First train
your own pretrained model. For example on ImageNet, run the configuration configs/pretrain_imagenet.py
. Collect the resulting checkpoint, see configuration file to see where it's saved.
Then, use this checkpoint for configs/finetuning_ind_pretrained.py
.
The code is structured as follows:
|--- dm_nevis/
| |--- benchmarker/
| |--- datasets_storage/
| |--- streams/
|--- experiments_jax/
| |--- launch.py
| |--- experiment.py
| |--- configs/
| |--- learners/
| |--- metrics/
| |--- environment/
| |--- training/
|--- experiments_torch/
| |--- launch.py
| |--- experiment.py
| |--- configs/
| |--- learners/
| |--- metrics/
| |--- environment/
| |--- training/
dm_nevis/
is the library of the benchmark, containing the benchmarker/
library, which implements the evaluation protocol used in the paper.
datasets_storage/
is a package to support the downloading and preparation of
datasets, and streams/
is a package defining different streams.
There are two directories containing baseline model implementations, one for jax
(experiments_jax
), and one for pytorch (experiments_torch
). In each,
launch.py
is the Docker entrypoint, experiment.py
is the module where all
the execution happens, configs/
provides the hyperparameters for each learner,
learners/
implements the learners (note: in some cases, there are different
configs for the same learner), metrics/
implements the metrics used in
NEVIS'22, environment/
provides the logger and checkpointer, and training/
provides learner-agnostic utilities such as the heads, the backbone, but also a
flops counter for example.
If you wish to contact us, please raise a GitHub issue.
If you are using the NEVIS'22 benchmark, please cite the following paper,
@article{bornschein2022nevis,
author={Bornschein, J\"org and Galashov, Alexandre and Hemsley, Ross and Rannen-Triki, Amal and Chen, Yutian and Chaudhry, Arslan and He, Xu Owen and Douillard, Arthur and Caccia, Massimo and Feng, Qixuang and Shen, Jiajun and Rebuffi, Sylvestre-Alvise and Stacpoole, Kitty and de las Casas, Diego and Hawkins, Will and Lazaridou, Angeliki and Teh, Yee Whye and Rusu, Andrei A. and Pascanu, Razvan and Ranzato, Marc'Aurelio},
title={Nevis\'22: A Stream of 100 Tasks Sampled from 30 Years of Computer Vision Research},
journal={CoRR},
volume={abs/2211.11747},
year={2022},
url={https://arxiv.org/abs/2211.11747},
eprinttype={arXiv}
}