Inverse Reinforcement Learning Toolkit
Sergey Levine, 2011

Contents:
1. Introduction
2. Installation
3. Usage
   3.1 Running a single test
   3.2 Running a transfer test
   3.3 Running a series of tests
   3.4 Using human generated examples
4. Directory overview
   4.1 IRL algorithms
   4.2 MDP solvers
   4.3 Example domains
   4.4 General
5. Calling conventions
   5.1 IRL algorithms
   5.2 MDP solvers
   5.3 Example domains
6. License

1. Introduction

This MATLAB package contains a collection of inverse reinforcement learning
algorithms and a framework for evaluating these algorithms on a variety of
simple Markov decision processes. The package is distributed primarily as the
reference implementation for the Gaussian Process Inverse Reinforcement
Learning algorithm described in the paper "Nonlinear Inverse Reinforcement
Learning with Gaussian Processes" (Levine, Popovic, Koltun, NIPS 2011), but it
is also intended to provide a general-purpose framework for researchers
interested in inverse reinforcement learning. This file describes the contents
of the package, provides instructions for its use and for extending it with
new algorithms, solvers, and example domains, and contains the license under
which the package may be used and distributed.

2. Installation

The IRL toolkit was built with MATLAB R2009b. Earlier versions of MATLAB may
be sufficient, but were not tested. In addition to MATLAB R2009b and various
toolboxes (which may include the Statistics and Parallelism toolboxes), the
package requires the following external MATLAB scripts:
- minFunc, by Mark Schmidt, included with this package (Creative Commons
  by-nc, see <http://www.cs.ubc.ca/~schmidtm/Software/minFunc.html> for
  details)
- CVX, by Michael C. Grant and Stephen P. Boyd, included with this package
  (GPL)
- plot2svg, by Juerg Schwizer, included with this package (FreeBSD)

With the exception of CVX, none of the scripts require installation.
Installation consists of two easy steps:
1. Extract all files in irl_toolkit.zip into a desired directory.
2. Run Utilities/cvx_setup to install CVX.

3. Usage

This section briefly describes how to use the IRL toolkit. The toolkit is
designed to be modular and extendable. Most sessions will begin by running
"addpaths" in the main directory to add all necessary subdirectories to the
path. Optionally, you may save your MATLAB path to avoid needing to do this,
but this is left up to you, since some users may prefer not to clutter up
their path.

3.1 Running a single test

An example script for running a test is provided under the name singletest.m
in the main directory. This script calls "addpaths" to add the necessary
paths, and then runs a single test using GPIRL on a 32x32 objectworld
environment. The function "runtest" creates a random environment of a
specified type, creates synthetic examples (or, optionally, uses provided
human generated examples), and calls the specified IRL algorithm to determine
the reward function. The result of the IRL algorithm is evaluated according to
a set of provided metrics, and the results are returned in a single structure.
The arguments of runtest are the following:
1. Algorithm name. This always corresponds to the name of the directory
   containing the algorithm, which is also the prefix of every function for
   that algorithm.
2. Algorithm parameters. This is a struct containing the desired parameters
   for this algorithm. For more information, see
   <algorithm name>defaultparams.m in the desired algorithm's directory.
3. MDP type. This is the MDP solver used to generate synthetic examples. By
   default, two options are available: standardmdp, which uses the standard
   Bellman equations, and linearmdp, which uses the LMDP framework.
4. Example name. This is the directory name/prefix of the environment to be
   tested.
5. Example parameters. These are the parameters of the test environment. See
   <example name>defaultparams.m in the desired example's directory.
6. Test parameters. These are parameters of the test itself, including the
   number of examples. See General/setdefaultparams.m for details.
A quick way to view the results of the test is provided by the printresults
function. The visualize function renders the results of the test to the
screen, with a side by side comparison of the true and learned reward
functions.
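As a quick-start illustration, a minimal session might look like the sketch
below, modeled on the description of singletest.m above. The specific
parameter fields (such as "n" and "training_samples") and the exact argument
conventions of printresults and visualize are illustrative assumptions; check
singletest.m and the relevant defaultparams files for the authoritative names
and defaults.

    % Add all toolkit subdirectories to the MATLAB path.
    addpaths;

    % Run GPIRL on an objectworld, using the linear MDP model to generate
    % synthetic examples. Empty structs fall back to default parameters;
    % the fields shown here are illustrative.
    test_result = runtest('gpirl', struct(), ...       % algorithm name and parameters
                          'linearmdp', ...             % MDP type for synthetic examples
                          'objectworld', struct('n', 32), ...   % example name and parameters
                          struct('training_samples', 16));      % test parameters

    % Print a summary of the evaluation metrics and draw the true and
    % learned reward functions side by side.
    printresults(test_result);
    visualize(test_result);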
3.2 Running a transfer test

To run a transfer test -- that is, to test how well a learned reward function
generalizes to another state space -- use the runtransfertest function. The
arguments of this function are the following:
1. IRL results returned by runtest. If runtest returns test_result, this is
   test_result.irl_result.
2. Algorithm name, as before.
3. MDP type.
4. Example name. This should be the same as the example in the original test.
5. Example parameters. These may be different from the original test (in fact,
   they should be, to test any interesting transfer). However, some parameters
   may be incompatible -- for example, in the objectworld, the number of
   features depends on "n", so testing transfer with a different value of "n"
   will not work when using discrete features.
6. Transfer test parameters. See General/settransferdefaultparams.m for
   details.
The results of the transfer test can similarly be printed and viewed using
printresults and visualize.
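Continuing the sketch above, a transfer test might look like the following
(again, the parameter fields are illustrative assumptions rather than the
actual defaults):

    % Transfer the reward function learned in the previous test to a freshly
    % generated objectworld. The example parameters reuse the same "n" here,
    % since the discrete objectworld features depend on it.
    transfer_result = runtransfertest(test_result.irl_result, 'gpirl', ...
                                      'linearmdp', 'objectworld', ...
                                      struct('n', 32), struct());
    printresults(transfer_result);
    visualize(transfer_result);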
3.3 Running a series of tests

The functions runtestseries and runtransferseries can be used to run a large
series of tests, using a variety of algorithms and parameter settings. The
package contains scripts for graphing and saving the results of such tests.
Examples of how to use these functions can be found in the NIPS11_Tests
directory, which contains the scripts that were used to produce the graphs in
the 2011 NIPS paper (note that, due to code changes since publication and
randomized initialization, the results produced by these scripts may differ
slightly from those presented in the paper). All of these scripts use the
Parallelism toolbox to run multiple tests in parallel for efficiency, and call
the saveresults function at the end to save the results from the test into a
directory with a name corresponding to the test and the current time and date.
Once this directory is created, the results can be graphed using the
renderresults function, which takes the directory name as the first argument.
The second argument specifies whether to only graph the results (1), or
whether to also render the reward function of each test (0), which takes
significantly longer.

3.4 Using human generated examples

The directory Human_Demos contains some demonstrations of highway policies by
a human user. To pass these examples to a test, set the "true_examples" field
of the test parameters struct (the last argument to runtest) to
human_result.example_samples after loading the desired demo. Note that the
test should be done using the same example parameters, which can be obtained
from human_result.mdp_params. To generate your own human demonstrations, use
the tools in HumanControl. Specifically, the runhumantrial function will bring
up an interface for generating a human demonstration. This interface is
currently "turn based", although a real time interface may be added in the
future. The arguments of this function are the following:
1. Example name.
2. Example parameters.
3. Human trial parameters specifying the number and length of examples. See
   HumanControl/humantrialdefaultparams.m for details.
The function returns a human_result struct, which can be saved and used for
future tests.
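The workflow described above might look roughly like the sketch below. The
demo file name is hypothetical (use one of the .mat files actually present in
Human_Demos), and the test and human trial parameter fields are illustrative:

    % Load a saved human demonstration; the file name below is hypothetical.
    load('Human_Demos/highway_demo.mat');   % provides the human_result struct

    % Reuse the example parameters stored with the demo and pass the recorded
    % trajectories in as the "true" examples. The cell array is wrapped in {}
    % so that struct() stores it as a single field value.
    test_params = struct('true_examples', {human_result.example_samples});
    test_result = runtest('gpirl', struct(), 'linearmdp', ...
                          'highway', human_result.mdp_params, test_params);

    % Record a new demonstration with the turn-based interface. The third
    % argument holds the human trial parameters (number and length of
    % examples); see HumanControl/humantrialdefaultparams.m for the fields.
    new_result = runhumantrial('highway', human_result.mdp_params, struct());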
4. Directory overview

This section presents an overview of the directories in the package, briefly
describing each included IRL algorithm, MDP solver, and example domain.

4.1 IRL algorithms

Note that, with the exception of FIRL and GPIRL, I cannot vouch for the
accuracy of my implementations of prior algorithms. While I made a best faith
effort to implement these methods correctly based on the authors'
descriptions, these are my own implementations and should not be viewed as the
definitive reference implementations of the prior algorithms.
- AN - Abbeel & Ng's projection algorithm for inverse reinforcement learning.
  Note that, since Abbeel & Ng's algorithm does not return a single reward,
  this implementation is somewhat less accurate than the original algorithm,
  because a single reward function must be selected to return for evaluation.
  For details, see "Apprenticeship Learning via Inverse Reinforcement
  Learning" (Abbeel and Ng, ICML 2004).
- FIRL - The Feature Construction for Inverse Reinforcement Learning
  algorithm. Only compatible with discrete features (the "continuous"
  parameter must be set to 0 for the current example environment). For
  details, see "Feature Construction for Inverse Reinforcement Learning"
  (Levine, Popovic, Koltun, NIPS 2010).
- GPIRL - The Gaussian Process Inverse Reinforcement Learning algorithm. For
  continuous features, it is recommended to use the warped kernel by setting
  the "warp_x" parameter to 1. For details, see "Nonlinear Inverse
  Reinforcement Learning with Gaussian Processes" (Levine, Popovic, Koltun,
  NIPS 2011).
- LEARCH - An implementation of the "Learning to Search" algorithm. This is a
  "best effort" implementation, since the algorithm is quite complicated and
  has many free parameters. The current version can run in either log-linear
  or nonlinear mode, using decision trees or logistic regression to create
  nonlinear features. For details, see "Learning to Search: Functional
  Gradient Techniques for Imitation Learning" (Ratliff, Silver, Bagnell,
  Autonomous Robots 27 (1) 2009).
- MaxEnt - The maximum entropy IRL algorithm. For details, see "Maximum
  Entropy Inverse Reinforcement Learning" (Ziebart, Maas, Bagnell, Dey, AAAI
  2008).
- MMP - The maximum margin planning algorithm. Note that this implementation
  uses the QP formulation of MMP rather than subgradient methods, since the QP
  version is already very fast on the examples in this package. For details,
  see "Maximum Margin Planning" (Ratliff, Bagnell, Zinkevich, ICML 2006).
- MMPBoost - MMP with feature boosting, using decision trees of a configurable
  depth. As with LEARCH, this is a "best effort" implementation, since the
  algorithm has many free parameters. For details, see "Boosting Structured
  Prediction for Imitation Learning" (Ratliff, Bradley, Bagnell, Chestnutt,
  NIPS 2007).
- MWAL - The game-theoretic MWAL algorithm. For details, see "A Game-Theoretic
  Approach to Apprenticeship Learning" (Syed, Schapire, NIPS 2008).
- OptV - A provisional implementation of the OptV algorithm. Since OptV learns
  value functions rather than reward functions, this algorithm cannot use the
  provided features, and instead simply learns a value for each state. This
  makes it unable to perform transfer, and produces (unfairly) poor results
  compared to the other methods. It should be noted that this is a symptom of
  the framework rather than the algorithm, since the toolkit is designed for
  testing algorithms that learn rewards rather than value functions. For
  details, see "Inverse Optimal Control with Linearly-Solvable MDPs"
  (Dvijotham, Todorov, ICML 2010).

4.2 MDP solvers

The toolkit comes with two MDP solvers. The standard solver (StandardMDP) uses
value iteration to recover a value function, which in turn specifies a
deterministic optimal policy for the MDP. The linear MDP solver (LinearMDP)
uses "soft" value iteration as described in Ziebart's PhD thesis, and
corresponds to linearly-solvable MDPs (see Dvijotham & Todorov 2010). This
model produces stochastic examples, which may be viewed as "suboptimal" under
the standard MDP model.
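For intuition, "soft" value iteration replaces the max in the Bellman backup
with a log-sum-exp, which yields a stochastic policy. The sketch below is a
schematic illustration of that backup, not the toolkit's LinearMDP
implementation; it assumes the transition representation (sa_s, sa_p,
discount) described in section 5.3.

    % Schematic "soft" value iteration (softmax Bellman backup).
    % mdp_data follows the toolkit's conventions: sa_s(s,a,k) and sa_p(s,a,k)
    % give transition destinations and their probabilities (see section 5.3).
    function [v, q, p] = soft_value_iteration_sketch(mdp_data, r)
    v = zeros(mdp_data.states, 1);
    for itr = 1:1000
        % Expected next-state value for each state-action pair.
        ev = sum(mdp_data.sa_p .* v(mdp_data.sa_s), 3);
        q = r + mdp_data.discount * ev;
        % Softmax backup instead of a hard max.
        vn = log(sum(exp(q), 2));
        if max(abs(vn - v)) < 1e-8, v = vn; break; end
        v = vn;
    end
    % Stochastic policy: probability of each action in each state.
    p = exp(q - repmat(v, 1, mdp_data.actions));
    end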
4.3 Example domains

Three example domains are included in this toolkit. They are described below:
- Gridworld - This is an NxN gridworld, with 5 actions per state corresponding
  to moving in each direction and staying in place. The "determinism"
  parameter specifies the probability that each action will "succeed." If the
  action "fails," a different random action is taken instead. The reward
  function of the gridworld consists of BxB blocks of cells (forming a "super
  grid"). The features are indicators for x and y values being below various
  integer values. Note that this environment does not support transfer
  experiments.
- Objectworld - This is a gridworld populated with objects. Each object has
  one of C inner and outer colors, and the objects are placed at random. There
  are 2C continuous features, each giving the Euclidean distance to the
  nearest object with a specific inner or outer color. In the discrete feature
  case, there are 2CN binary features, each one an indicator for a
  corresponding continuous feature being less than some value D. The true
  reward is positive in states that are both within 3 cells of outer color 1
  and 2 cells of outer color 2, negative within 3 cells of outer color 1, and
  zero otherwise. Inner colors and all other outer colors are distractors.
- Highway - This is the highway environment described in "Nonlinear Inverse
  Reinforcement Learning with Gaussian Processes." Note that the number of
  cars on the highway must be specified manually. There is a somewhat hacky
  algorithm for placing cars that avoids creating roadblocks. This algorithm
  will stall if it cannot place the specified number of cars, so don't specify
  too many. This may be fixed in a future release. It is also possible to
  specify more than 2 categories or classes of cars, though this functionality
  is currently untested.

4.4 General

The following are miscellaneous directories in the toolkit:
- Evaluation - This directory contains implementations of the various metrics
  that can be used to evaluate the IRL result. Each function is a particular
  metric.
- General - These are general testing scripts, including runtest,
  runtransfertest, etc.
- HumanControl - Interface for generating human demonstrations.
- NIPS11_Tests - Scripts for reproducing test results from "Nonlinear Inverse
  Reinforcement Learning with Gaussian Processes."
- Testing - Scripts for graphing, saving, and visualizing test results from
  series tests.
- Utilities - External utilities, including minFunc and CVX.

5. Calling conventions

The IRL toolkit is intended to be extendable and modular. This section
describes the calling convention for the various modules, in order to explain
how additional components can be added. So long as a new component implements
the specified functions, it will be usable with the toolkit.

5.1 IRL algorithms

IRL algorithms must implement two functions: <name>run and <name>transfer,
where <name> is the name of the algorithm that will be used to identify it
when calling runtest. Additionally, all current algorithms implement
<name>defaultparams, but this is only a convenience. The run function must
accept the following arguments:
1. algorithm_params - The parameters of the algorithm. Some parameters may be
   missing, so it is advisable to implement a <name>defaultparams function to
   fill them in (for example, see gpirldefaultparams.m).
2. mdp_data - A specification of the example domain. mdp_data includes the
   number of states, the number of actions, the transition function (specified
   by sa_s and sa_p), and so forth. For details, see (for example)
   gridworldbuild.m.
3. mdp_model - The name of the current MDP model (standardmdp or linearmdp).
   The final returned policy must be computed from the reward function using
   this model, but most algorithms will not use this parameter anywhere else.
4. feature_data - Information about the features. The main parameter of this
   struct is feature_data.splittable (forgive the "legacy" name), which is a
   matrix containing the value of each feature at each state.
5. example_samples - A cell array of the examples, where example_samples{i,t}
   is time step t along trajectory i. Each entry in the cell array has two
   numbers: example_samples{i,t}(1) is the state, and example_samples{i,t}(2)
   is the action.
6. true_features - The true features that form a linear basis for the reward.
   Most algorithms (that are not cheating) will ignore this, but it may be
   useful for establishing a baseline comparison.
7. verbosity - The desired amount of output. 0 means no output, 1 means
   "medium" output, and 2 means "verbose" output.
Additionally, IRL algorithms must return the struct irl_result containing the
result of the computation. This struct has the following fields:
1. r - the learned reward function, with mdp_data.states rows and
   mdp_data.actions columns
2. v - the resulting value function, returned by <mdp_model>solve
3. p - the corresponding policy (as above)
4. q - the corresponding Q function (as above)
5. r_itr - the reward at each iteration of the algorithm, useful for
   debugging; if unused, set to {{r}}
6. model_itr - the model at each iteration of the algorithm, to be used for
   transfer
7. model_r_itr - a second optional reward at each iteration of the algorithm,
   useful for debugging; if unused, set to r_itr
8. p_itr - the policy at each iteration; if unused, set to {{p}}
9. model_p_itr - a second optional policy at each iteration of the algorithm,
   useful for debugging; if unused, set to p_itr
10. time - the time the algorithm spent computing the reward function
    (optional, set to 0 if unused)
The transfer function must use the learned model to transfer the reward
function to a different state space. The output of this function is identical
to that of <name>run. The arguments are the following:
1. prev_result - The irl_result structure returned by a <name>run call.
2. mdp_data - MDP definition for the new state space.
3. mdp_model - MDP model, as before.
4. feature_data - Feature data for the new state space.
5. true_feature_map - True features for the new state space (again, not used
   except when "cheating").
6. verbosity - As before.
If both <name>run and <name>transfer are implemented, the algorithm should be
usable in all tests without further modification.
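To make the convention concrete, the skeleton below sketches the two required
functions for a hypothetical algorithm named "myirl" in a hypothetical MyIRL
directory. The internals are placeholders; only the argument lists and the
fields of irl_result follow the conventions above.

    % File: MyIRL/myirlrun.m (hypothetical algorithm named "myirl")
    function irl_result = myirlrun(algorithm_params, mdp_data, mdp_model, ...
        feature_data, example_samples, true_features, verbosity)

    tic;
    % Placeholder "learning": count state visitations in the examples and use
    % them as a (meaningless) stand-in for a learned state reward.
    counts = zeros(mdp_data.states, 1);
    [N, T] = size(example_samples);
    for i = 1:N
        for t = 1:T
            s = example_samples{i, t}(1);
            counts(s) = counts(s) + 1;
        end
    end
    r = repmat(counts, 1, mdp_data.actions);

    % Solve the MDP under the chosen model to obtain v, q, and p.
    soln = feval(strcat(mdp_model, 'solve'), mdp_data, r);

    % Package the result with the required fields.
    irl_result = struct('r', r, 'v', soln.v, 'p', soln.p, 'q', soln.q, ...
        'r_itr', {{r}}, 'model_itr', {{r}}, 'model_r_itr', {{r}}, ...
        'p_itr', {{soln.p}}, 'model_p_itr', {{soln.p}}, 'time', toc);
    end

    % File: MyIRL/myirltransfer.m
    function irl_result = myirltransfer(prev_result, mdp_data, mdp_model, ...
        feature_data, true_feature_map, verbosity)
    % A reward defined directly on states cannot be transferred meaningfully;
    % a real algorithm would re-evaluate its learned model on the new features.
    r = zeros(mdp_data.states, mdp_data.actions);
    soln = feval(strcat(mdp_model, 'solve'), mdp_data, r);
    irl_result = struct('r', r, 'v', soln.v, 'p', soln.p, 'q', soln.q, ...
        'r_itr', {{r}}, 'model_itr', {{r}}, 'model_r_itr', {{r}}, ...
        'p_itr', {{soln.p}}, 'model_p_itr', {{soln.p}}, 'time', 0);
    end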
5.2 MDP solvers

Additional MDP solvers can also be added to the framework, for example for
solving problems with continuous state spaces. The MDP solver must implement 5
functions: <name>solve, <name>frequency, <name>action, <name>step, and
<name>compare.
The function <name>solve must find the solution to the specified MDP. It has
two arguments: mdp_data, the MDP definition, and the states x actions reward
matrix r. The definition mdp_data is constructed by the example building
function (see 5.3). The output is a struct "mdp_solution", which has fields
"v", "q", and "p", corresponding to the value function, Q function, and
policy. Note that these fields will only be used by other functions relating
to your MDP solver, so they can contain whatever information you choose.
The function <name>frequency finds the visitation frequency of all states in
the MDP. This function is used by many metrics. The inputs are mdp_data and
mdp_solution (from a previous call to <name>solve), and the output is a column
vector with 1 entry for each state, giving that state's expected visitation
count under the policy in mdp_solution.
The functions <name>action and <name>step sample an action at a particular
state and execute that action, respectively. <name>action takes as input
mdp_data, mdp_solution, and a state s. The output is an action a, which is
either the optimal action under a deterministic policy or a sampled action
under a stochastic policy. The function <name>step takes mdp_data,
mdp_solution, a state s, and an action a, and returns the resulting state
after taking action a in state s, which may be sampled if the MDP is
nondeterministic.
Finally, the function <name>compare compares two policies, given as arguments
p1 and p2, and returns the amount of discrepancy between them. In the case of
a standard MDP, this is currently the number of states in which the policies
disagree. In the case of the linear MDP, this is the sum over all states of
the probability that p1 and p2 will take different actions. This function is
only used when evaluating some metrics.
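As an illustration of the required signatures, the fragment below sketches
<name>action and <name>step for a hypothetical deterministic solver named
"mymdp". It assumes mdp_solution.p stores the chosen action for each state and
that transitions are given by sa_s and sa_p as described in section 5.3.

    % File: MyMDP/mymdpaction.m (hypothetical solver named "mymdp")
    function a = mymdpaction(mdp_data, mdp_solution, s)
    % Deterministic policy: mdp_solution.p(s) stores the chosen action.
    a = mdp_solution.p(s);
    end

    % File: MyMDP/mymdpstep.m
    function s_next = mymdpstep(mdp_data, mdp_solution, s, a)
    % Sample a successor state according to the transition probabilities.
    probs = squeeze(mdp_data.sa_p(s, a, :));
    k = find(cumsum(probs) >= rand(), 1, 'first');
    s_next = mdp_data.sa_s(s, a, k);
    end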
5.3 Example domains

Example domains are only required to implement two functions: <name>build and
<name>draw. The building function has a single argument - the mdp_params
structure passed to runtest that specifies the parameters of the desired
example. The outputs are the following:
1. mdp_data - A structure containing the definition of the MDP. It must
   specify the following fields:
   - states - the number of states
   - actions - the number of actions in each state
   - discount - the discount factor of the MDP (currently all examples are
     infinite horizon discounted reward)
   - sa_s - 3D matrix of size states x actions x K, where K is any number.
     The entry sa_s(s,a,k) specifies the state to transition to when taking
     action a in state s and sampling destination k (see below)
   - sa_p - 3D matrix as above, where sa_p(s,a,k) gives the probability of
     transitioning to sa_s(s,a,k) on action a in state s
2. r - A states x actions matrix specifying the true reward function of the
   example environment.
3. feature_data - A struct containing two fields regarding the features of the
   example:
   - splittable - a states x features matrix containing the value of each
     feature at each state
   - stateadjacency - a sparse states x states matrix with "1" for each pair
     of states that are "adjacent" (have an action to transition from one to
     the other); this is currently only used by FIRL
4. true_feature_map - A (sparse) states x features matrix that gives the
   values of the "true" features that can be linearly combined to obtain r.
The drawing function has no output, and takes the following input arguments:
1. r - The reward function to draw.
2. p - The policy to draw (may be deterministic or stochastic).
3. mdp_params - The mdp_params struct passed to <name>build.
4. mdp_data - The mdp_data struct returned by <name>build.
5. feature_data - The feature_data struct returned by <name>build.
6. model - The MDP model (standardmdp or linearmdp).
The drawing function must draw some visualization of the specified example to
the current axes.
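For reference, a minimal building function for a hypothetical two-state domain
might look like the sketch below; it only illustrates the shapes and fields
described above, and the output ordering follows that list.

    % File: ToyWorld/toyworldbuild.m (hypothetical two-state example domain)
    function [mdp_data, r, feature_data, true_feature_map] = toyworldbuild(mdp_params)
    states = 2;
    actions = 2;          % action 1: stay, action 2: move to the other state
    K = 1;                % deterministic transitions, one destination each

    sa_s = zeros(states, actions, K);
    sa_p = ones(states, actions, K);
    sa_s(:, 1, 1) = [1; 2];   % "stay" keeps the current state
    sa_s(:, 2, 1) = [2; 1];   % "move" switches states

    mdp_data = struct('states', states, 'actions', actions, ...
                      'discount', 0.9, 'sa_s', sa_s, 'sa_p', sa_p);

    % True reward: state 2 is rewarding regardless of action.
    r = [0, 0; 1, 1];

    % One feature: an indicator for being in state 2.
    feature_data = struct('splittable', [0; 1], ...
                          'stateadjacency', sparse(ones(states)));
    true_feature_map = sparse([0; 1]);
    end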
6. License

The license included below governs the terms of use for this software. Please
direct any correspondence regarding the software to svlevine@cs.stanford.edu.
The IRL toolkit is created by Sergey Levine, copyright 2011. If you are using
the software in an academic publication, you may cite it as "IRL Toolkit" with
a reference to the webpage from which it was obtained.

Copyright (c) 2011, Sergey Levine
All rights reserved.

This software is made available under the Creative Commons
Attribution-Noncommercial License, viewable at
http://creativecommons.org/licenses/by-nc/3.0/. You are free to use, copy,
modify, and re-distribute the work. However, you must attribute any
re-distribution or adaptation in the manner specified below, and you may not
use this work for commercial purposes without the permission of the author.
Any re-distribution or adaptation of this work must contain the author's name
(Sergey Levine) and a link to the software's original webpage.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.