
Commit

updated project page for neurips24
enjeeneer committed Dec 3, 2024
1 parent c8ffcb2 commit 9b77a07
Showing 4 changed files with 43 additions and 62 deletions.
5 changes: 2 additions & 3 deletions index.xml
@@ -54,11 +54,10 @@ David Deutsch
<pubDate>Tue, 26 Sep 2023 17:27:21 +0100</pubDate>

<guid>https://enjeeneer.io/projects/zero-shot-rl/</guid>
<description>Scott Jeen\(^{1}\), Tom Bewley\(^{2}\), &amp;amp; Jonathan M. Cullen\(^{1}\) \(^{1}\) University of Cambridge
<description>NeurIPS 2024 Scott Jeen\(^{1}\), Tom Bewley\(^{2}\), &amp;amp; Jonathan M. Cullen\(^{1}\) \(^{1}\) University of Cambridge
\(^{2}\) University of Bristol
[Paper] [Code] [Poster] [Slides]
Summary Figure 1: Conservative zero-shot RL methods.
Zero-shot reinforcement learning (RL) concerns itself with learning general policies that can solve any unseen task in an environment. Recently, methods leveraging successor features and successor measures have emerged as viable zero-shot RL candidates, returning near-optimal policies for many unseen tasks. However, to enable this, they have assumed access to unrealistically large and heterogeneous datasets of transitions for pre-training.</description>
Summary Zero-shot reinforcement learning (RL) methods learn general policies that can, in principle, solve any unseen task in an environment. Recently, methods leveraging successor features and successor measures have emerged as viable zero-shot RL candidates, returning near-optimal policies for many unseen tasks. However, to enable this, they have assumed access to unrealistically large and heterogeneous datasets of transitions for pre-training.</description>
</item>

<item>
Binary file added posters/neurips-zero-shot-rl.pdf
Binary file not shown.
5 changes: 2 additions & 3 deletions projects/index.xml
@@ -14,11 +14,10 @@
<pubDate>Tue, 26 Sep 2023 17:27:21 +0100</pubDate>

<guid>https://enjeeneer.io/projects/zero-shot-rl/</guid>
<description>Scott Jeen\(^{1}\), Tom Bewley\(^{2}\), &amp;amp; Jonathan M. Cullen\(^{1}\) \(^{1}\) University of Cambridge
<description>NeurIPS 2024 Scott Jeen\(^{1}\), Tom Bewley\(^{2}\), &amp;amp; Jonathan M. Cullen\(^{1}\) \(^{1}\) University of Cambridge
\(^{2}\) University of Bristol
[Paper] [Code] [Poster] [Slides]
Summary Figure 1: Conservative zero-shot RL methods.
Zero-shot reinforcement learning (RL) concerns itself with learning general policies that can solve any unseen task in an environment. Recently, methods leveraging successor features and successor measures have emerged as viable zero-shot RL candidates, returning near-optimal policies for many unseen tasks. However, to enable this, they have assumed access to unrealistically large and heterogeneous datasets of transitions for pre-training.</description>
Summary Zero-shot reinforcement learning (RL) methods learn general policies that can, in principle, solve any unseen task in an environment. Recently, methods leveraging successor features and successor measures have emerged as viable zero-shot RL candidates, returning near-optimal policies for many unseen tasks. However, to enable this, they have assumed access to unrealistically large and heterogeneous datasets of transitions for pre-training.</description>
</item>

<item>
95 changes: 39 additions & 56 deletions projects/zero-shot-rl/index.html
@@ -5,11 +5,10 @@
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta name="author" content="Scott Jeen">
<meta name="description" content="Scott Jeen\(^{1}\), Tom Bewley\(^{2}\), &amp;amp; Jonathan M. Cullen\(^{1}\) \(^{1}\) University of Cambridge
<meta name="description" content="NeurIPS 2024 Scott Jeen\(^{1}\), Tom Bewley\(^{2}\), &amp;amp; Jonathan M. Cullen\(^{1}\) \(^{1}\) University of Cambridge
\(^{2}\) University of Bristol
[Paper] [Code] [Poster] [Slides]
Summary Figure 1: Conservative zero-shot RL methods.
Zero-shot reinforcement learning (RL) concerns itself with learning general policies that can solve any unseen task in an environment. Recently, methods leveraging successor features and successor measures have emerged as viable zero-shot RL candidates, returning near-optimal policies for many unseen tasks. However, to enable this, they have assumed access to unrealistically large and heterogeneous datasets of transitions for pre-training." />
Summary Zero-shot reinforcement learning (RL) methods learn general policies that can, in principle, solve any unseen task in an environment. Recently, methods leveraging successor features and successor measures have emerged as viable zero-shot RL candidates, returning near-optimal policies for many unseen tasks. However, to enable this, they have assumed access to unrealistically large and heterogeneous datasets of transitions for pre-training." />
<meta name="keywords" content="Scott Jeen" />
<meta name="robots" content="noodp" />
<meta name="theme-color" content="#6B8E23" />
@@ -60,32 +59,29 @@


<meta itemprop="name" content="Zero-Shot Reinforcement Learning from Low Quality Data">
<meta itemprop="description" content="Scott Jeen\(^{1}\), Tom Bewley\(^{2}\), &amp; Jonathan M. Cullen\(^{1}\) \(^{1}\) University of Cambridge
<meta itemprop="description" content="NeurIPS 2024 Scott Jeen\(^{1}\), Tom Bewley\(^{2}\), &amp; Jonathan M. Cullen\(^{1}\) \(^{1}\) University of Cambridge
\(^{2}\) University of Bristol
[Paper] [Code] [Poster] [Slides]
Summary Figure 1: Conservative zero-shot RL methods.
Zero-shot reinforcement learning (RL) concerns itself with learning general policies that can solve any unseen task in an environment. Recently, methods leveraging successor features and successor measures have emerged as viable zero-shot RL candidates, returning near-optimal policies for many unseen tasks. However, to enable this, they have assumed access to unrealistically large and heterogeneous datasets of transitions for pre-training."><meta itemprop="datePublished" content="2023-09-26T17:27:21+01:00" />
Summary Zero-shot reinforcement learning (RL) methods learn general policies that can, in principle, solve any unseen task in an environment. Recently, methods leveraging successor features and successor measures have emerged as viable zero-shot RL candidates, returning near-optimal policies for many unseen tasks. However, to enable this, they have assumed access to unrealistically large and heterogeneous datasets of transitions for pre-training."><meta itemprop="datePublished" content="2023-09-26T17:27:21+01:00" />
<meta itemprop="dateModified" content="2024-06-14T09:50:27+01:00" />
<meta itemprop="wordCount" content="526"><meta itemprop="image" content="https://enjeeneer.io"/>
<meta itemprop="wordCount" content="410"><meta itemprop="image" content="https://enjeeneer.io"/>
<meta itemprop="keywords" content="" />
<meta name="twitter:card" content="summary_large_image"/>
<meta name="twitter:image" content="https://enjeeneer.io"/>

<meta name="twitter:title" content="Zero-Shot Reinforcement Learning from Low Quality Data"/>
<meta name="twitter:description" content="Scott Jeen\(^{1}\), Tom Bewley\(^{2}\), &amp; Jonathan M. Cullen\(^{1}\) \(^{1}\) University of Cambridge
<meta name="twitter:description" content="NeurIPS 2024 Scott Jeen\(^{1}\), Tom Bewley\(^{2}\), &amp; Jonathan M. Cullen\(^{1}\) \(^{1}\) University of Cambridge
\(^{2}\) University of Bristol
[Paper] [Code] [Poster] [Slides]
Summary Figure 1: Conservative zero-shot RL methods.
Zero-shot reinforcement learning (RL) concerns itself with learning general policies that can solve any unseen task in an environment. Recently, methods leveraging successor features and successor measures have emerged as viable zero-shot RL candidates, returning near-optimal policies for many unseen tasks. However, to enable this, they have assumed access to unrealistically large and heterogeneous datasets of transitions for pre-training."/>
Summary Zero-shot reinforcement learning (RL) methods learn general policies that can, in principle, solve any unseen task in an environment. Recently, methods leveraging successor features and successor measures have emerged as viable zero-shot RL candidates, returning near-optimal policies for many unseen tasks. However, to enable this, they have assumed access to unrealistically large and heterogeneous datasets of transitions for pre-training."/>



<meta property="og:title" content="Zero-Shot Reinforcement Learning from Low Quality Data" />
<meta property="og:description" content="Scott Jeen\(^{1}\), Tom Bewley\(^{2}\), &amp; Jonathan M. Cullen\(^{1}\) \(^{1}\) University of Cambridge
<meta property="og:description" content="NeurIPS 2024 Scott Jeen\(^{1}\), Tom Bewley\(^{2}\), &amp; Jonathan M. Cullen\(^{1}\) \(^{1}\) University of Cambridge
\(^{2}\) University of Bristol
[Paper] [Code] [Poster] [Slides]
Summary Figure 1: Conservative zero-shot RL methods.
Zero-shot reinforcement learning (RL) concerns itself with learning general policies that can solve any unseen task in an environment. Recently, methods leveraging successor features and successor measures have emerged as viable zero-shot RL candidates, returning near-optimal policies for many unseen tasks. However, to enable this, they have assumed access to unrealistically large and heterogeneous datasets of transitions for pre-training." />
Summary Zero-shot reinforcement learning (RL) methods learn general policies that can, in principle, solve any unseen task in an environment. Recently, methods leveraging successor features and successor measures have emerged as viable zero-shot RL candidates, returning near-optimal policies for many unseen tasks. However, to enable this, they have assumed access to unrealistically large and heterogeneous datasets of transitions for pre-training." />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://enjeeneer.io/projects/zero-shot-rl/" /><meta property="og:image" content="https://enjeeneer.io"/><meta property="article:section" content="projects" />
<meta property="article:published_time" content="2023-09-26T17:27:21+01:00" />
@@ -189,77 +185,64 @@ <h3 class="time"><a href="https://enjeeneer.io/projects/zero-shot-rl/"></a></h3>


<div class="post-content">
<h3 id="scott-jeen1-tom-bewley2--jonathan-m-cullen1">Scott Jeen\(^{1}\), Tom Bewley\(^{2}\), &amp; Jonathan M. Cullen\(^{1}\)</h3>
<h2 id="neurips-2024">NeurIPS 2024</h2>
<h3 id="scott-jeen1-tom-bewley2--jonathan-m-cullen1">Scott Jeen\(^{1}\), Tom Bewley\(^{2}\), &amp; Jonathan M. Cullen\(^{1}\)</h3>
<p>\(^{1}\) <em>University of Cambridge</em></p>
<p>\(^{2}\) <em>University of Bristol</em></p>
<p><strong><a href="https://arxiv.org/abs/2309.15178">[Paper]</a></strong>
<strong><a href="https://github.com/enjeeneer/zero-shot-rl">[Code]</a></strong>
<strong><a href="https://enjeeneer.io/posters/conservative-world-models.pdf">[Poster]</a></strong>
<strong><a href="https://enjeeneer.io/posters/neurips-zero-shot-rl.pdf">[Poster]</a></strong>
<strong><a href="https://enjeeneer.io/slides/neurips24/slides.html">[Slides]</a></strong></p>
<hr>
<h2 id="summary">Summary</h2>
<figure><img src="https://github.com/enjeeneer/conservative-world-models/blob/main/media/vcfb-intuition-final.png?raw=true"/>
</figure>

<p><em>Figure 1: <strong>Conservative zero-shot RL methods.</strong></em></p>
<p>Zero-shot reinforcement learning (RL) concerns itself with learning general policies that can solve any unseen task in an environment. Recently,
<p>Zero-shot reinforcement learning (RL) methods learn general policies that can, in principle, solve any unseen task in an environment. Recently,
methods leveraging successor features and successor measures have emerged as viable
zero-shot RL candidates, returning near-optimal policies
for many unseen tasks. However, to enable this, they have assumed access to unrealistically large and heterogeneous datasets of transitions for pre-training. Most real datasets,
like historical logs created by existing control systems, are smaller and less diverse than these current methods expect. As a result, this paper asks:</p>
<p><strong>Can we still perform zero-shot RL with small, homogeneous datasets?</strong></p>
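<p>As a rough sketch of what this zero-shot procedure looks like at test time, the snippet below infers a task embedding \(z\) from a small reward-labelled sample of states and then acts greedily against it. The network and tensor names are illustrative assumptions, not the repository's actual API.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import torch

# Minimal sketch of zero-shot task inference with a pre-trained
# forward-backward (FB) model. `forward_net` (F(s, a, z) -> R^d) and
# `backward_net` (B(s) -> R^d) are assumed to be trained networks;
# every name here is illustrative rather than an actual API.

@torch.no_grad()
def infer_task_embedding(backward_net, states, rewards):
    """Estimate z ~ E[r(s) B(s)] from a reward-labelled sample of states."""
    b = backward_net(states)                     # (n, d) backward embeddings
    z = (rewards.unsqueeze(-1) * b).mean(dim=0)  # (d,) Monte-Carlo estimate
    # A common FB convention is to rescale z so that ||z|| = sqrt(d).
    return torch.nn.functional.normalize(z, dim=0) * z.numel() ** 0.5

@torch.no_grad()
def act_greedily(forward_net, state, z, candidate_actions):
    """Pick the candidate action maximising Q(s, a) = F(s, a, z)^T z."""
    s = state.expand(candidate_actions.shape[0], -1)            # one copy of s per candidate
    q = (forward_net(s, candidate_actions, z) * z).sum(dim=-1)  # (k,) Q-values
    return candidate_actions[q.argmax()]
</code></pre>
<p>No gradient steps happen at test time, so the quality of the resulting policy rests entirely on how well \(F\) and \(B\) were pre-trained, which is exactly where dataset size and diversity enter.</p>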
<hr>
<h2 id="intuition">Intuition</h2>
<figure><img src="https://github.com/enjeeneer/conservative-world-models/blob/main/media/overestimates.png?raw=true"/>
</figure>

<p><em>Figure 2: <strong>FB value overestimation with respect to dataset size \(n\) and quality</strong>.</em></p>
<p>When the dataset is inexhaustive, existing methods like FB representations overestimate the value of actions not in the dataset. The above
<p>When the dataset is inexhaustive, existing methods like FB representations overestimate the value of actions not in the dataset (Figure 2). The above
shows this overestimation as dataset size and quality is varied. The smaller and less diverse the dataset, the more
\(Q\) values tend to be overestimated.</p>
<p>We fix this by suppressing the predicted values (or measures) for actions not in the dataset, and show how this resolves overestimation in Figure 3&ndash;a modified version of Point-mass Maze from the ExORL benchmark.
Episodes begin with a point-mass initialised in the upper left of the maze (⊚), and the agent is tasked with selecting
\(x\) and \(y\) tilt directions such that the mass is moved towards one of two goal locations (⊛ and ⊛). The action space is two-dimensional
and bounded in \([−1,1]\). We take the RND dataset and remove all “left” actions such that \(a_x \in [0, 1]\) and
\(a_y \in [−1, 1]\), creating a dataset that has the necessary information for solving the tasks, but is inexhaustive (below (a)).
We train FB and VC-FB on this dataset and plot the highest-reward trajectories–below (b) and (c).
FB overestimates the value of OOD actions and cannot complete either task. Conversely, VC-FB synthesises the requisite
information from the dataset and completes both tasks.</p>
<figure><img src="https://github.com/enjeeneer/conservative-world-models/blob/main/media/didactic.png?raw=true"/>
<figure><img src="https://github.com/enjeeneer/conservative-world-models/blob/main/media/overestimates.png?raw=true"/>
</figure>

<p><em>Figure 3: <strong>Ignoring out-of-distribution actions</strong>.</em></p>
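<p>Concretely, the dataset filtering described above amounts to something like the helper below, which assumes the replay buffer is a dict of aligned numpy arrays with a two-dimensional actions entry; this is a hypothetical layout for illustration, not the actual ExORL loading code.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import numpy as np

def drop_left_actions(buffer):
    """Keep only transitions whose x-tilt action is non-negative, mimicking the
    'no left actions' Point-mass Maze dataset described above. `buffer` is assumed
    to be a dict of aligned numpy arrays with an 'actions' entry of shape (n, 2)
    holding (a_x, a_y) in [-1, 1]^2."""
    keep = buffer["actions"][:, 0] >= 0.0
    return {key: value[keep] for key, value in buffer.items()}

# Tiny usage example with random transitions (hypothetical buffer layout).
buffer = {
    "observations": np.random.randn(1000, 4),
    "actions": np.random.uniform(-1.0, 1.0, size=(1000, 2)),
    "rewards": np.zeros(1000),
}
filtered = drop_left_actions(buffer)  # roughly half the transitions survive
</code></pre>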
<hr>
<h2 id="aggregate-performance-on-suboptimal-datasets">Aggregate Performance on Suboptimal Datasets</h2>
<figure><img src="https://github.com/enjeeneer/zero-shot-rl/blob/main/media/performance-profiles-subplot1.png?raw=true"/>
<p><small><em>Figure 1: <strong>FB value overestimation with respect to dataset size \(n\) and quality</strong>. \(Q\) values predicted during training
increase as both the size and “quality” of the dataset decrease. This contradicts the low return of all resultant
policies (note: a return of 1000 is the maximum achievable for this task).</em> </small></p>
<p>We fix this by suppressing the predicted values (or measures) for state-actions not in the dataset.</p>
<figure><img src="https://github.com/enjeeneer/conservative-world-models/blob/main/media/vcfb-intuition-final.png?raw=true"/>
</figure>

<p><em>Figure 4: <strong>Aggregate zero-shot RL performance on ExORL benchmark.</strong></em></p>
<p>Both MC-FB and VC-FB outperform FB and outperform our single-task baseline in expectation, reaching 111% and 120% of
CQL performance respectively <strong>despite not having access to task-specific reward labels and needing
to fit policies for all tasks</strong>. This is a surprising result, and to the best of our knowledge, the first time
a multi-task offline agent has been shown to outperform a single-task analogue.</p>
<p><small><em>Figure 2: <strong>Conservative zero-shot RL methods</strong>. VC-FB (right) suppresses the predicted values for OOD state-action pairs.</em></small></p>
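<p>A minimal sketch of how such suppression could be implemented is given below, written as a CQL-style penalty on the FB Q-values for actions sampled away from the dataset. The sampling scheme, coefficient and names are assumptions for illustration, not the exact VC-FB implementation.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import torch

def value_conservative_penalty(forward_net, states, z, num_samples=8, action_dim=2):
    """Push down Q(s, a) = F(s, a, z)^T z for actions the dataset is unlikely to
    contain, approximated here by uniform samples in [-1, 1]^action_dim.
    Illustrative only: the actual VC-FB sampling and weighting may differ."""
    n = states.shape[0]
    random_actions = torch.empty(
        n, num_samples, action_dim, device=states.device
    ).uniform_(-1.0, 1.0)
    s = states.unsqueeze(1).expand(-1, num_samples, -1)          # (n, k, state_dim)
    q_ood = (forward_net(s, random_actions, z) * z).sum(dim=-1)  # (n, k) Q-values
    return torch.logsumexp(q_ood, dim=1).mean()

# Hypothetical training step: penalise OOD values relative to dataset values.
# loss = fb_td_loss + alpha * (value_conservative_penalty(F, s_batch, z) - q_data.mean())
</code></pre>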
<hr>
<h2 id="performance-with-respect-to-dataset-size">Performance with respect to dataset size</h2>
<figure><img src="https://enjeeneer.io/img/zero-shot-rl-dataset-size.png" width="600"/>
<h2 id="results">Results</h2>
<p>We demonstrate that our methods improve performance w.r.t. standard zero-shot RL / GCRL baselines on low-quality datasets from ExORL (Figure 3) and D4RL (Figure 4).
<figure><img src="https://github.com/enjeeneer/zero-shot-rl/blob/main/media/performance-profiles-subplot2.png?raw=true"/>
</figure>
</p>
<p><small><em>Figure 3: <strong>Aggregate zero-shot performance on ExORL.</strong> (Left) IQM of task scores across datasets and domains,
normalised against the performance of CQL, our baseline. (Right) Performance profiles showing the distribution
of scores across all tasks and domains. Both conservative FB variants stochastically dominate vanilla FB.</em></small></p>
<figure><img src="https://github.com/enjeeneer/zero-shot-rl/blob/main/media/d4rl-performance.png?raw=true" width="50%"/>
</figure>

<p><em>Figure 5: <strong>Performance by RND dataset size.</strong></em></p>
<p>The performance gap between conservative FB variants and vanilla FB increases as dataset size decreases. On the full datasets, conservative FB variants maintain (and slightly exceed) the performance
of vanilla FB.</p>
<p><em>Table 1: <strong>Performance on full datasets.</strong></em></p>
<figure><img src="https://enjeeneer.io/img/zero-shot-rl-full-dataset-results-table.png" width="600"/>
<p><small><em>Figure 4: <strong>Aggregate zero-shot RL performance on D4RL.</strong> Aggregate IQM scores across all domains and datasets,
normalised against the performance of CQL.</em></small></p>
<figure><img src="https://enjeeneer.io/img/zero-shot-rl-dataset-size.png" width="450"/>
</figure>

<p><small><em>Figure 5: <strong>Performance by RND dataset size.</strong> The performance gap between conservative FB variants and vanilla FB increases as dataset size decreases.</em></small></p>
<hr>
<h2 id="citation">Citation</h2>
<p>Read the full paper for more details: <strong><a href="https://arxiv.org/abs/2309.15178">[link]</a></strong>, and if you find this work informative please consider citing it:</p>
<pre tabindex="0"><code class="language-commandline" data-lang="commandline">@article{jeen2023,
<p>Read the <a href="https://arxiv.org/abs/2309.15178">full paper</a> for more details!</p>
<pre tabindex="0"><code class="language-commandline" data-lang="commandline">@article{jeen2024,
url = {https://arxiv.org/abs/2309.15178},
author = {Jeen, Scott and Bewley, Tom and Cullen, Jonathan M.},
title = {Conservative World Models},
publisher = {arXiv},
year = {2023},
title = {Zero-Shot Reinforcement Learning from Low Quality Data},
journal = {Advances in Neural Information Processing Systems 37},
year = {2024},
}
</code></pre><hr>

Expand Down
