Abstract:
Optimal control methods have been around for quite some time, though some of the most prominent methods (iLQR, DDP, SAC) were developed only recently. What is impressive about these methods is their use in robotics, which has led to profound, almost human-like robot motion. In this article, I will formulate the optimal control problem and present one of the most general methods for optimal control, commonly referred to as the shooting method. I will also present a simple code-base using Python and JAX for automatic differentiation.
Before we get into the optimal control part, we need to set up the robot dynamics and define how control inputs enter the system.
Let's consider a general continuous-time dynamical system (a.k.a. robot) with input of the form
$$ \dot{x}(t) = f(x(t), u(t)) $$
where $x(t) \in \mathbb{R}^n$ is the state of the robot, $u(t) \in \mathbb{R}^m$ is the control input, and $f(x, u)$ describes the dynamics. The state evolves by integrating $f$ forward in time, where the integration starts at some initial condition $x(t_0) = x_0$.
Our first step is to import the libraries to create the dynamics that we will be working with.
import jax.numpy as np  # jax's implementation of the numpy API for vector-matrix math
from jax import grad, jit, vmap, jacfwd  # all the derivative and gradient tools
from tqdm import tqdm, trange
import matplotlib.pyplot as plt
For our first example, we are going to use the following nonlinear dynamical system:
$$ f(x, u) = \begin{bmatrix} -0.1 & -1.0 \\ 1.0 & -0.1 \end{bmatrix} \begin{bmatrix} x_1^3 \\ x_2 \end{bmatrix} + \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} u $$
where we use the `jacfwd` function to get the forward Jacobians of $f$ and then speed up the computation with `jit`.
@jit
def f(x, u):
    A = np.array([[-0.1, -1.0],
                  [1.0, -0.1]])
    B = np.array([[1.0, 0.],
                  [0., 1.0]])
    non_lin_x = np.array([x[0]*x[0]*x[0], x[1]])
    return np.dot(A, non_lin_x) + np.dot(B, u)

# get the Jacobians of f with respect to x and u
dfdx = jit(jacfwd(f, 0))
dfdu = jit(jacfwd(f, 1))
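As a quick sanity check (purely illustrative), we can evaluate these Jacobians at a sample point; since the control enters linearly, `dfdu` should simply return the matrix $B$:

x_test = np.array([1.0, 2.0])
u_test = np.array([0.0, 0.0])
print(dfdx(x_test, u_test))  # columns of A scaled by d/dx [x1^3, x2] = [3*x1^2, 1]
print(dfdu(x_test, u_test))  # equals B since the control enters linearly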
Now that we have defined this dynamical system, let's define the objective (a.k.a. the task)! For those familiar with reinforcement learning, this objective is equivalent to a reward function, which assigns a value to each state and control. Here, we use a cost function (a negative reward) that we have full knowledge of to define the task for the robot.
Let $\ell(x, u)$ be a running cost on the state and control and $m(x)$ be a terminal cost on the final state. The optimal control problem is then
$$ \min_{u(t)} \; J = \int_{t_0}^{t_f} \ell(x(t), u(t)) \, dt + m(x(t_f)) $$
where we add the dynamics and the initial condition as a constraint to the objective function:
$$ \text{subject to } \; \dot{x} = f(x, u), \quad x(t_0) = x_0. $$
For the previously defined dynamical system, the following running cost and terminal cost are used:
$$ \ell(x, u) = x^\top x + 10^{-3} \, u^\top u, \qquad m(x) = x^\top x. $$
The control sequence $u(t)$ for $t \in [t_0, t_f]$ is the quantity we will be solving for.
To write this objective down in code, we will use the `grad` function with `jit` to get the first-order derivatives of $\ell$ and $m$:
@jit
def ell(x, u):
    return np.dot(x, x) + 1e-3 * np.dot(u, u)

@jit
def m(x):
    return np.dot(x, x)

# first-order derivatives of the running and terminal costs
dldx, dldu = jit(grad(ell, 0)), jit(grad(ell, 1))
dmdx = jit(grad(m))
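As before, a quick illustrative check: for these quadratic costs, `grad` returns exactly what we would compute by hand, $\ell_x(x, u) = 2x$ and $m_x(x) = 2x$:

print(dldx(np.array([1., 2.]), np.array([0., 0.])))  # [2., 4.]
print(dmdx(np.array([1., 2.])))                      # [2., 4.]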
While it is possible to solve the optimal control problem directly, discretizing in time and imposing the dynamics as constraints will ultimately increase computational cost as the discretization becomes finer and the dynamics more complex. Therefore, we will solve this optimal control problem indirectly, using the Maximum Principle to derive a set of conditions which are met when there is a minimum of the objective function. The way we are going to acquire these conditions is through the variational derivative.
Assume that $u(t)$ minimizes the objective and consider perturbing it by a small variation $\delta u(t)$. Since the state is generated by the dynamics, perturbing the control induces a state variation $\delta x(t)$. Appending the dynamics constraint with a Lagrange multiplier $\rho(t)$, the objective becomes
$$ J = \int_{t_0}^{t_f} \ell(x, u) + \rho^\top \left( f(x, u) - \dot{x} \right) dt + m(x(t_f)) $$
and its first-order variation is
$$ \delta J = \int_{t_0}^{t_f} \left( \ell_x + f_x^\top \rho \right)^\top \delta x + \left( \ell_u + f_u^\top \rho \right)^\top \delta u - \rho^\top \delta \dot{x} \, dt + m_x(x(t_f))^\top \delta x(t_f) $$
where the subscripts denote partial derivatives (e.g., $\ell_x = \partial \ell / \partial x$).
Note that the term on the right can be evaluated using integration by parts $$ -\int_{t_0}^{t_f} \rho^\top \delta\dot{x} dt = - \rho(t_f)^\top \delta x(t_f) + \rho(t_0)^\top \delta x(t_0) + \int_{t_0}^{t_f} \dot{\rho}^\top \delta x dt $$
therefore the variation of the objective becomes
$$ \delta J = \int_{t_0}^{t_f} \left( \ell_x + f_x^\top \rho + \dot{\rho} \right)^\top \delta x + \left( \ell_u + f_u^\top \rho \right)^\top \delta u \, dt + \left( m_x(x(t_f)) - \rho(t_f) \right)^\top \delta x(t_f) $$
where $\delta x(t_0) = 0$ since the initial condition is fixed. What is fascinating about this expression is that we are free to choose the multiplier $\rho(t)$ however we like. Choosing it so that the $\delta x$ terms vanish gives the following conditions for a minimum:
$$ \ell_u + f_u^\top \rho = 0, \qquad \dot{\rho} + \ell_x + f_x^\top \rho = 0, \qquad \rho(t_f) = m_x(x(t_f)). $$
Note that the last two conditions define the following differential equation
$$ \dot{\rho} = - \ell_x - f_x^\top \rho $$
which is solved backwards in time (starting from the boundary condition $\rho(t_f) = m_x(x(t_f))$).
Now if you think deeply about this set of equations, you can see that these conditions make a pretty powerful statement! We can propose a sequence of controls $u(t)$, forward simulate the state, solve for $\rho(t)$ backwards in time, and then check how far $\ell_u + f_u^\top \rho$ is from zero; this residual tells us exactly how to improve the controls.
Here is where we get to the interesting part (and where all the derivative terms we initially constructed come into play). Let's try to solve the system of equations given an initial guess of the control sequence. To solve this kind of problem, you first need to see what trajectory $x(t)$ results from the candidate controls, and then iteratively update the controls using the gradient direction $\ell_u + f_u^\top \rho$.
The iterative algorithm goes as follows:
Given $x_0$ and an initial guess for the control sequence $u(t)$:

- forward simulate $\dot{x} = f(x, u)$ from $x_0$ for time $t_0 \to t_f$ (forward pass)
- evaluate $m_x(x(t_f))$ and simulate the Lagrange multiplier $\rho(t)$ backwards in time starting from $\rho(t_f) = m_x(x(t_f))$ (backward pass)
- at each time step, let $\delta u = - \gamma ( \ell_u + f_u^\top \rho)$ where $\gamma$ is the step size for the control gradient
- let $u = u + \delta u$
- repeat until $\delta u$ is near zero
We will code this algorithm in two parts: the forward pass and the backward pass, which take in the derivative terms that we calculated and the candidate set of controls.
def forward_pass(x_0, u, dt=0.01):
    x = []
    x_t = x_0.copy()
    cost = 0.0
    for t, u_t in enumerate(u):
        x.append(x_t)
        cost += ell(x_t, u_t) * dt
        x_t = x_t + f(x_t, u_t) * dt  # Euler integration step
    return x, cost, x_t
def backward_pass(x, u, x_f, dt=0.01, step_size=1e-1):
    u_star = []
    rho = dmdx(x_f)  # boundary condition rho(t_f) = m_x(x(t_f))
    for x_t, u_t in zip(x[::-1], u[::-1]):
        rho_dot = - dldx(x_t, u_t) - np.dot(dfdx(x_t, u_t).T, rho)
        rho = rho - rho_dot * dt  # integrate rho backwards in time
        du = dldu(x_t, u_t) + np.dot(dfdu(x_t, u_t).T, rho)
        u_star.append(u_t - step_size * du)  # gradient step on the control
    return u_star[::-1]
We are first going to initialize some parameters and discretize time (we are going to use Euler integration, but any integration scheme will work).
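As an aside, if you wanted a higher-order scheme, here is a minimal sketch of a classic fourth-order Runge-Kutta (RK4) step that could stand in for the Euler update `x_t + f(x_t, u_t) * dt` in `forward_pass` (we stick with Euler for the rest of the post):

def rk4_step(x, u, dt):
    # classic RK4 step, holding the control constant over the interval
    k1 = f(x, u)
    k2 = f(x + 0.5 * dt * k1, u)
    k3 = f(x + 0.5 * dt * k2, u)
    k4 = f(x + dt * k3, u)
    return x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)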
# some parameters
max_iter = 100
tf = 2
dt = 0.1 # time step
N = int(tf/dt)
threshold = 0.001
step_size = 1e-2
# initialize the control with a constant (non-optimal) sequence of actions
u = [np.array([-0.1, -0.1]) for t in range(N)]
x0 = np.array([1.,2.]) # initial condition
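Before running the optimization, it is worth a quick (purely illustrative) sanity check that $\ell_u + f_u^\top \rho$ really is the gradient of the objective: scaled by $dt$, it should roughly match a finite-difference gradient of the discretized cost, with agreement only up to the $O(dt)$ error of the Euler discretization.

# illustrative check: adjoint gradient vs. a finite difference of the cost
def total_cost(u_seq, x_0):
    _, cost, x_f = forward_pass(x_0, u_seq, dt)
    return cost + m(x_f)

x, _, x_f = forward_pass(x0, u, dt)
rho = dmdx(x_f)
# run the adjoint recursion (without control updates) to recover rho(t_0)
for x_t, u_t in zip(x[::-1], u[::-1]):
    rho = rho + (dldx(x_t, u_t) + np.dot(dfdx(x_t, u_t).T, rho)) * dt
adjoint_grad = (dldu(x[0], u[0]) + np.dot(dfdu(x[0], u[0]).T, rho)) * dt

h = 1e-2  # finite-difference step on the first control dimension
u_pert = list(u)
u_pert[0] = u[0] + np.array([h, 0.])
fd_grad = (total_cost(u_pert, x0) - total_cost(u, x0)) / h
print(adjoint_grad[0], fd_grad)  # should roughly agree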
trajectories = []
costs = []
with tqdm(total=max_iter) as pbar:
    for k in range(max_iter):
        x, cost, x_f = forward_pass(x0, u, dt)
        u = backward_pass(x, u, x_f, dt, step_size)
        cost += m(x_f)
        pbar.set_description('Current Cost: {}'.format(cost))
        pbar.update()
        if k % 20 == 0:  # snapshot every 20 iterations
            trajectories.append(x)
            costs.append(cost)
Current Cost: 1.5446460247039795: 100%|██████████| 100/100 [00:06<00:00, 14.97it/s]
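As an aside, we defined a `threshold` parameter above but never used it. A minimal sketch of how it could be used for early stopping is to monitor the size of the control update and terminate once it falls below the threshold:

def optimize_with_early_stopping(u, x0, dt, step_size, max_iter, threshold):
    # same loop as above, but stop once the control update du is small
    for k in range(max_iter):
        x, cost, x_f = forward_pass(x0, u, dt)
        u_new = backward_pass(x, u, x_f, dt, step_size)
        # total magnitude of the control update over the horizon
        du_norm = sum(float(np.linalg.norm(un - uo)) for un, uo in zip(u_new, u))
        u = u_new
        if du_norm < threshold:
            break
    return u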
Let's visualize what this looks like!
fig, axs = plt.subplots(1, 2, figsize=(10,4))
for i, x in enumerate(trajectories):
    x = np.stack(x)
    axs[0].plot(x[:,0], x[:,1], label='iter {}'.format(i*20))
axs[1].plot(np.stack(costs))  # plot the cost curve once, outside the loop
axs[0].set_xlabel('x')
axs[0].set_ylabel('y')
axs[0].set_title('x-y trajectory')
axs[0].legend()
axs[1].set_xlabel('iteration x 20')
axs[1].set_ylabel('cost')
axs[1].set_title('cost over iteration')
plt.show()
We can see that each step works towards improving the cost function and completing the task. You can use this formulation for almost any problem, so long as the task is not impossible and the system has continuous derivatives. Many of the most common optimal controllers and trajectory planners (like iLQR) work in a similar manner, with a forward and a backward pass.
Let's update the system with a different task. Here, we will use the planar vertical takeoff-and-landing (vtol) vehicle with the task of doing a backflip. The dynamics of the vtol are given by
$$ \ddot{x} = - u_1 \sin\theta + \epsilon u_2 \cos\theta, \qquad \ddot{y} = u_1 \cos\theta + \epsilon u_2 \sin\theta - g, \qquad \ddot{\theta} = u_2 $$
where the state $s = [x, y, \theta, \dot{x}, \dot{y}, \dot{\theta}]$ consists of the planar position, orientation, and their rates, $u_1$ is the total thrust, $u_2$ is the rolling moment, $\epsilon = 0.5$ couples the rolling moment to the lateral force, and $g = 9.81$ is gravity.
Let's get this vtol to flip, which we will define through the following objective:
$$ \ell(x, u) = 10^{-4} \, u^\top u, \qquad m(x) = (x - x_d)^\top W (x - x_d) $$
where $x_d = [4, 0, 2\pi, 0, 0, 0]$ is the target state (note the full $2\pi$ rotation in $\theta$) and $W$ is a diagonal weight matrix.
@jit
def f(s, a):
    eps = 0.5
    x, y, th, xt, yt, tht = s
    u1, u2 = a
    xtt = - u1 * np.sin(th) + eps * u2 * np.cos(th)
    ytt = u1 * np.cos(th) + eps * u2 * np.sin(th) - 9.81
    thtt = u2
    return np.array([xt, yt, tht, xtt, ytt, thtt])

@jit
def ell(x, u):
    return 1e-4 * np.dot(u, u)

@jit
def m(x):
    target = np.array([4., 0., 2.0 * np.pi, 0., 0., 0.])
    weights = np.array([1.0, 1.0, 20., 0.1, 0.1, 20.])
    dx = x - target
    return np.sum(np.square(dx) * weights)

dfdx, dfdu = jit(jacfwd(f, 0)), jit(jacfwd(f, 1))
dldx, dldu = jit(grad(ell, 0)), jit(grad(ell, 1))
dmdx = jit(grad(m))
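As a quick illustrative check of these dynamics, a thrust that exactly cancels gravity with zero rolling moment should make the vtol hover:

s_hover = np.array([0., 0., 0., 0., 0., 0.])
a_hover = np.array([9.81, 0.])
print(f(s_hover, a_hover))  # expect a vector of zeros (hover)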
We reinitialize the environment and update the parameters and the initial control seed.
# some parameters
max_iter = 1000
dt = 0.1
tf = 3
N = int(tf/dt)
threshold = 0.01
step_size = 3e-4
# initialize the control with a constant sequence of actions (an initial guess)
u = [np.array([15., .2]) for t in range(N)]
x0 = np.array([0., 0., 0., 0., 0., 0.])
trajectories = []
costs = []
with tqdm(total=max_iter) as pbar:
    for k in range(max_iter):
        x, cost, x_f = forward_pass(x0, u, dt)
        u = backward_pass(x, u, x_f, dt, step_size)
        cost += m(x_f)
        if k % 20 == 0:  # snapshot every 20 iterations
            trajectories.append(x)
            costs.append(cost)
            pbar.set_description('Current Cost: {:.4}'.format(cost))
            pbar.update(20)
Current Cost: 4.869: 100%|██████████| 1000/1000 [01:25<00:00, 11.65it/s]
This part is optional, but if you want to see a visualization of the vtol backflip, check out the jupyter notebook and run the following two cells.
from vtol import VTOLEnv
import time

env = VTOLEnv()
env.reset()
for u_t in u:
    env.step(u_t)
    env.render()
    time.sleep(1/10)
Here, we are going to visualize the trajectories from the simulation and plot the cost over each iteration.
fig, axs = plt.subplots(1, 2, figsize=(10,4))
for i, x in enumerate(trajectories[1::10]):
    x = np.stack(x)
    axs[0].plot(x[:,0], x[:,1], label='iter {}'.format((i*10 + 1) * 20))
axs[1].plot(np.stack(costs))  # plot the cost curve once, outside the loop
axs[0].set_xlabel('x')
axs[0].set_ylabel('y')
axs[0].set_title('x-y trajectory')
axs[0].legend()
axs[1].set_xlabel('iteration x 20')
axs[1].set_ylabel('cost')
axs[1].set_title('cost over iteration')
plt.show()
With the popularity of sim-to-real methods, I thought it might be interesting to show how model mismatch affects sim-to-sim transfer of this optimal control technique, as studying this was common practice in earlier robotics research. Since we have the vtol environment, we can update the `eps` property in the environment class, changing the $\epsilon$ parameter that couples the rolling moment to the lateral force.
env.eps = 0.5  # matches the model used for planning
env.reset()
exact_trajectory = []
for u_t in u:
    state = env.step(u_t)
    exact_trajectory.append(state)
    env.render()
    time.sleep(1/10)
env.eps = 0.8  # perturbed parameter to create model mismatch
env.reset()
mm_trajectory = []  # model mismatch
for u_t in u:
    state = env.step(u_t)
    mm_trajectory.append(state)
    env.render()
    time.sleep(1/10)
mm_trajectory = np.stack(mm_trajectory)
exact_trajectory = np.stack(exact_trajectory)
plt.plot(mm_trajectory[:,0], mm_trajectory[:,1], label='model mismatch')
plt.plot(exact_trajectory[:,0], exact_trajectory[:,1], label='exact model')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
You can see that the slight deviation from the exact model causes the generated trajectory to deviate, which results in the robot ending up at a different target point. In earlier robotics research, this model mismatch was often solved using system identification to improve the quality of the model with real-world data. However, it is often not possible to get such data without having the robot execute dangerous movements that may break it. This will be the talking point for the next post on receding horizon control! Hope you enjoyed the post, and if you found this helpful in any way, please cite me and follow my research!
Please cite this if you've found it useful and interesting!
@article{abraham2020indirectoptimalcontrol,
    title   = "Introduction to Indirect Optimal Control",
    author  = "Abraham, Ian",
    journal = "i-abr.github.io/bloggy-blog",
    year    = "2020",
    url     = "https://i-abr.github.io/bloggy-blog/"
}