Kalman Folding 3: Derivations (Review Draft)

1 Abstract

In Kalman Folding, Part 1 [klf1], we present basic, static Kalman filtering as a functional fold, highlighting the unique advantages of this form for deploying test-hardened code verbatim in harsh, mission-critical environments. The examples in that paper are all static, meaning that the states of the model do not depend on the independent variable, often physical time.

Here, we present mathematical derivations of the basic, static filter. These are semi-formal sketches that leave many details to the reader, but highlight all important points that must be rigorously proved. These derivations have several novel arguments and we strive for much higher clarity and simplicity than is found in most treatments of the topic.

2 Kalman Folding

In Kalman Folding, Part 1 [klf1], we found the following small formulation for the accumulator function of a fold that implements the static Kalman filter:

\begin{equation}
\label{eqn:kalman-cume-definition}
\text{kalmanStatic}\left(\mathbold{Z}\right)\left(\left\{\mathbold{x},\mathbold{P}\right\},\left\{\mathbold{A},\mathbold{z}\right\}\right)=\left\{\mathbold{x}+\mathbold{K}\,\left(\mathbold{z}-\mathbold{A}\,\mathbold{x}\right),\;\mathbold{P}-\mathbold{K}\,\mathbold{D}\,\mathbold{K}^\intercal\right\}
\end{equation}

\noindent where

\begin{align}
\mathbold{K}&=\mathbold{P}\,\mathbold{A}^\intercal\,\mathbold{D}^{-1}\label{eqn:kalman-gain-definition}\\
\mathbold{D}&=\mathbold{Z}+\mathbold{A}\,\mathbold{P}\,\mathbold{A}^\intercal\label{eqn:kalman-denominator-definition}
\end{align}

\noindent and all quantities are matrices:

  • $\mathbold{z}$ is a ${b}×{1}$ column vector containing one multidimensional observation
  • $\mathbold{x}$ is an ${n}×{1}$ column vector of model states
  • $\mathbold{Z}$ is a ${b}×{b}$ matrix, the covariance of observation noise
  • $\mathbold{P}$ is an ${n}×{n}$ matrix, the theoretical covariance of $\mathbold{x}$
  • $\mathbold{A}$ is a ${b}×{n}$ matrix, the observation partials
  • $\mathbold{D}$ is a ${b}×{b}$ matrix, the Kalman denominator
  • $\mathbold{K}$ is an ${n}×{b}$ matrix, the Kalman gain
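
As a concrete companion to the formulas above, here is a minimal Python/NumPy sketch of that accumulator folded over a stream of observations. The function name kalman_static, the tuple layout of the folded state, and the toy data are illustrative assumptions, not the code of Part 1; only the updates for $\mathbold{D}$, $\mathbold{K}$, $\mathbold{x}$, and $\mathbold{P}$ come from the equations above.

#+BEGIN_SRC python
import numpy as np
from functools import reduce

def kalman_static(Z):
    """Return a fold accumulator for the static Kalman filter.

    Z is the b-by-b covariance of the observation noise; the folded
    state is the pair (x, P)."""
    def step(state, observation):
        x, P = state                          # n-by-1 estimate, n-by-n covariance
        A, z = observation                    # b-by-n partials, b-by-1 observation
        D = Z + A @ P @ A.T                   # Kalman denominator, b-by-b
        K = P @ A.T @ np.linalg.inv(D)        # Kalman gain, n-by-b
        return (x + K @ (z - A @ x),          # posterior estimate
                P - K @ D @ K.T)              # posterior covariance
    return step

# Fold the accumulator over a made-up stream of scalar observations (b = 1)
# of a constant two-state model.
rng = np.random.default_rng(0)
x_true = np.array([[1.0], [-2.0]])            # n = 2 hidden constant states
observations = []
for _ in range(100):
    A = rng.normal(size=(1, 2))               # 1-by-2 partials
    observations.append((A, A @ x_true + rng.normal(size=(1, 1))))

x0, P0 = np.zeros((2, 1)), 1000.0 * np.eye(2) # diffuse initial estimate
x_est, P_est = reduce(kalman_static(np.eye(1)), observations, (x0, P0))
print(x_est.ravel())                          # approaches x_true
#+END_SRC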

In physical or engineering applications, these quantities carry physical dimensions of units of measure in addition to their matrix dimensions as numbers of rows and columns. If the physical and matrix dimensions of $\mathbold{x}$ are $\left[\left[\mathbold{x}\right]\right] \stackrel{\text{\tiny def}}{=} (\mathcal{X}, n×{1})$ and of $\mathbold{z}$ are $\left[\left[\mathbold{z}\right]\right] \stackrel{\text{\tiny def}}{=} (\mathcal{Z}, b×{1})$, then

\begin{equation}
\label{eqn:dimensional-breakdown}
\begin{aligned}
\left[\left[\mathbold{Z}\right]\right] &= (\mathcal{Z}^2,\;b×{b}) &
\left[\left[\mathbold{A}\right]\right] &= (\mathcal{Z}/\mathcal{X},\;b×{n}) &
\left[\left[\mathbold{P}\right]\right] &= (\mathcal{X}^2,\;n×{n}) \\
\left[\left[\mathbold{D}\right]\right] &= (\mathcal{Z}^2,\;b×{b}) &
\left[\left[\mathbold{K}\right]\right] &= (\mathcal{X}/\mathcal{Z},\;n×{b}) &&
\end{aligned}
\end{equation}

Dimensional arguments, regarding both matrix dimensions and physical dimensions, are invaluable for checking the derivations that follow.

3 Derivations

Below, we derive equations \ref{eqn:kalman-cume-definition}, \ref{eqn:kalman-gain-definition}, and \ref{eqn:kalman-denominator-definition}. Again, these derivations are just sketches designed for clarity as opposed to rigorous proofs. These derivations only cover the static Kalman filter, where the states $\mathbold{x}$ of the model are fixed constants. See Bar-Shalom [bars] for derivations of the Kalman filter with time-dependent states and part 2 of this series [klf2] for an example.

The plan is first to develop expressions for the prior estimate $˜{\mathbold{x}}$ and prior covariance $˜{\mathbold{P}}$, and then expressions for the posterior versions $\hat{\mathbold{x}}$ and $\hat{\mathbold{P}}$, defining the Kalman gain $\mathbold{K}$ matrix and the denominator matrix $\mathbold{D}$ along the way. Finally, we derive the particular, convenient expressions for $\mathbold{K}$ and $\mathbold{D}$ that appear in equations \ref{eqn:kalman-cume-definition}, \ref{eqn:kalman-gain-definition}, and \ref{eqn:kalman-denominator-definition}. Bierman laid out this strategy for the derivation in his classic book Factorization Methods for Discrete Sequential Estimation [bier]. We follow his plan, unpacking many of his elided steps for greater clarity.

3.1 Notation

The word vector alone means column vector by default. If a quantity is a row vector, we explicitly say so. In general, lower-case boldface symbols like $\mathbold{x}$ denote column vectors. Row vectors include a superscript transpose symbol, as in $\mathbold{a}^\intercal$. We write literal vectors in square brackets, as in $\left[a, b, \ldots\right]^\intercal$ for a column vector or $\left[a, b, \ldots\right]$ for a row vector or for cases where we don’t care whether it’s a column or row.

Upper-case boldface symbols like $\mathbold{M}$ denote matrices. Our matrices are not always square. Because vectors are special cases of matrices, some matrices are also vectors. We may use an upper-case symbol to denote a vector, but we do not use a lower-case symbol to denote a non-vector matrix.

Juxtaposition, as in $\mathbold{A}\,\mathbold{x}$ or $\mathbold{A}\,\mathbold{B}$, means matrix multiplication. When we write a product like $\mathbold{A}\,\mathbold{B}$, we assume that the number of columns of $\mathbold{A}$ matches the number of rows of $\mathbold{B}$.

Matrix multiplication is non-commutative, meaning that $\mathbold{A}\,\mathbold{B}$ does not, in general, equal $\mathbold{B}\,\mathbold{A}$. However, if a matrix $\mathbold{D}$ is a scalar multiple of the identity matrix (in particular, if it is a ${1}×{1}$ matrix and thus effectively a scalar, as the Kalman denominator is in all examples in this paper), then $\mathbold{A}\,\mathbold{D}$ equals $\mathbold{D}\,\mathbold{A}$ whenever both products are defined. We may freely use this fact without mentioning it explicitly.

Symmetric matrices do not always commute, even with other symmetric matrices. In particular, the product of two symmetric matrices is not always symmetric, as witnessed by the following counterexample:

\[
\begin{pmatrix}0&1\\1&0\end{pmatrix}
\begin{pmatrix}1&0\\0&2\end{pmatrix}
=
\begin{pmatrix}0&2\\1&0\end{pmatrix},
\qquad
\begin{pmatrix}1&0\\0&2\end{pmatrix}
\begin{pmatrix}0&1\\1&0\end{pmatrix}
=
\begin{pmatrix}0&1\\2&0\end{pmatrix}
\]
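
A quick NumPy check of this counterexample (a sketch; the matrices are just the two shown above):

#+BEGIN_SRC python
import numpy as np

# The two symmetric matrices from the counterexample above.
A = np.array([[0.0, 1.0],
              [1.0, 0.0]])
B = np.array([[1.0, 0.0],
              [0.0, 2.0]])

print(np.allclose(A @ B, B @ A))        # False: they do not commute
print(np.allclose(A @ B, (A @ B).T))    # False: the product is not symmetric
#+END_SRC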

Matrix multiplication is associative, meaning that the order in which pairwise multiplications are carried out does not matter. Thus

\[(\mathbold{A}\,\mathbold{B})\,\mathbold{C}=\mathbold{A}\,(\mathbold{B}\,\mathbold{C})=\mathbold{A}\,\mathbold{B}\,\mathbold{C}\]

\noindent and we don’t need to write parentheses. That means some expressions of long products can be hard to read. We occasionally use a center dot or $×$ symbol to make multiplication easier to read, as in $\mathbold{A}⋅\mathbold{x}$ or $\mathbold{A}×\mathbold{x}$. We also use the $×$ symbol when discussing the numbers of rows and columns of a matrix, as in “$\mathbold{A}$ is an $m× n$ matrix,” meaning that $\mathbold{A}$ has $m$ rows and $n$ columns.

We may freely exploit the following facts without mentioning them explicitly:

  • For any matrix $\mathbold{M}$, $\left(\mathbold{M}^\intercal\right)^\intercal = \mathbold{M}$
  • For any invertible matrix $\mathbold{M}$, $\left(\mathbold{M}^{-1}\right)^{-1} = \mathbold{M}$
  • For any two matrices $\mathbold{A}$ and $\mathbold{B}$, $\left(\mathbold{A}\,\mathbold{B}\right)^\intercal=\mathbold{B}^\intercal\mathbold{A}^\intercal$
  • $\left(\mathbold{A}\,\mathbold{B}\right)^{-1}=\mathbold{B}^{-1}\mathbold{A}^{-1}$ when the matrices are invertible
  • $\mathbold{P}^\intercal$ = $\mathbold{P}$ if and only if $\mathbold{P}$ is symmetric

For any matrix $\mathbold{M}$, $\mathbold{M}^2$ means $\mathbold{M}^\intercal\mathbold{M}$, the transpose of the matrix times the matrix. Such squared matrices are always square and symmetric. This notation pertains to vectors, as well, because they are just special cases of matrices. Thus, $\mathbold{x}^2=\mathbold{x}^\intercal\mathbold{x}$, the square of the Euclidean $\mbox{2-\textrm{norm}}$ of $\mathbold{x}$, a scalar; and $(\mathbold{x}^\intercal)^2 = (\mathbold{x}^\intercal)^\intercal\cdot \mathbold{x}^\intercal= \mathbold{x}\,\mathbold{x}^\intercal$ is the outer product of $\mathbold{x}$ with itself; that outer product is an $n×{n}$ square, symmetric matrix, where $n$ is the dimensionality of $\mathbold{x}$.

When $\mathbold{M}^2$ is invertible, $\mathbold{M}^{-2}$ means the inverse of $\mathbold{M}^2$, namely $\left(\mathbold{M}^\intercal\mathbold{M}\right)^{-1}$.
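
A small NumPy illustration of this squaring notation may help; the matrices and vectors here are arbitrary made-up examples:

#+BEGIN_SRC python
import numpy as np

rng = np.random.default_rng(1)

M = rng.normal(size=(5, 3))                   # an arbitrary tall matrix
M2 = M.T @ M                                  # "M squared": 3-by-3 and symmetric
print(np.allclose(M2, M2.T))                  # True

x = rng.normal(size=(4, 1))                   # a column 4-vector
print((x.T @ x).item())                       # "x squared": squared 2-norm, a scalar
print((x @ x.T).shape)                        # outer product: (4, 4), symmetric

M_minus_2 = np.linalg.inv(M2)                 # "M to the minus two"
print(np.allclose(M2 @ M_minus_2, np.eye(3))) # True
#+END_SRC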

We use the term tall to mean a matrix with more rows than columns, that is, an $m×{n}$ matrix when $m>n$. When discussing $m×{n}$ matrices, we usually assume that $m>n$. We use the term wide to mean a matrix with more columns than rows, as in an $n×{m}$ matrix. We use the term small to mean $n×{n}$, and large to mean $m×{m}$.

3.1.1 Probability and Statistics

We use the terms distribution and expectation value without definition in this paper. If $\mathbold{x}$ is a random variable, then we denote the expectation value of some function $f$ of $\mathbold{x}$ as $E[f(\mathbold{x})]$.

3.2 Definitions

$t$
is the independent variable. In many applications, $t$ represents physical time, or an integer index mapped to physical time. It is known and non-random. We treat it as a scalar, here, though it is possible to extend the theory to a vector $t$.
$\mathbold{x}$
is the (column) vector of $n$ unknown, constant states of the model. It’s a random variable, and we compute estimates and covariances via expectation values over its distribution. This symbol also means an algebraic variable standing for some particular estimate of the states.
$\mathbold{A}\,\mathbold{x}$
is the model; it predicts an observation at time $t$ given an estimate of the states $\mathbold{x}$ and a current partials matrix $\mathbold{A}$ that may depend on $t$. The model is a column vector of dimensionality $b×{1}$, the same as the dimensionality of an observation $\mathbold{z}$.
$\mathbold{A}$
is the current partials matrix, the partial derivative of the model with respect to the unknown states $\mathbold{x}$, evaluated at the current value of the independent variable $t$. We could write $\mathbold{A}$ as $\mathbold{A}(t)$; it’s an aesthetic judgment to omit explicit $t$ dependence because it would make the derivations longer and harder to read. Because the model is linear, the partials do not depend on $\mathbold{x}$. $\mathbold{A}$ is known, non-random, and may depend on $t$. Generally, its dimensionality is $b×{n}$, where $b$ is the dimensionality of an observation $\mathbold{z}$.
$˜{\mathbold{A}}$
is the prior partials matrix, a matrix that stacks all the prior rows of $\mathbold{A}$ that precede the current row. It is known, non-random, and $m b×{n}$, where $m$ is the number of prior observations, $b$ is the dimensionality of a single observation $\mathbold{z}$, and $n$ is the dimensionality of the states $\mathbold{x}$. Thus $˜{\mathbold{A}}$ is tall in the typical overdetermined case where $m>n$, more observations than states. We do not actually realize $˜{\mathbold{A}}$ in computer memory because Kalman keeps all information in the running covariance matrix. $˜{\mathbold{A}}$ is just a useful abstraction for the derivations below.
$\mathbold{z}$
is the current observation. It is known and non-random. Its dimensionality is $b×{1}$.
$˜{\mathbold{z}}$
is a stack of all prior observations. It is known, non-random, $m b×{1}$. It’s a useful abstraction in the derivations below. It’s not necessary to actually realize it in computer memory because we use all its information incrementally by folding.
${˜{\mathbold{x}}}$
the prior estimate, the estimate of $\mathbold{x}$ given all information we have prior to the current observation. It is known, non-random, $n×{1}$.
${\hat{\mathbold{x}}}$
the posterior estimate, the estimate of $\mathbold{x}$ given (1) the prior estimate ${˜{\mathbold{x}}}$, (2) the current partials $\mathbold{A}$, and (3) the current observation $\mathbold{z}$. It is known, non-random, $n×{1}$. It satisfies the Kalman update equation:

\begin{equation}
\label{eqn:kalman-update-equation}
\hat{\mathbold{x}} = {˜{\mathbold{x}}} + \mathbold{K}\,\left(\mathbold{z}-\mathbold{A}\,{˜{\mathbold{x}}}\right)
\end{equation}

\noindent which is equivalent to the recurrence $\mathbold{x}←\mathbold{x}+\mathbold{K}\,(\mathbold{z}-\mathbold{A}\,\mathbold{x})$ used in part 1 of this series.

${˜{\mathbold{P}}}$
covariance of the priors, equals ${˜{\mathbold{A}}}^{-2}$ (de-dimensionalized; proof sketch below). This is called just $\mathbold{P}$ in part one of this series. It is known, non-random, $n×{n}$.
${\hat{\mathbold{P}}}$
posterior covariance, satisfies ${\hat{\mathbold{P}}}\, {\mathbold{A}}^\intercal= \mathbold{K}= {˜{\mathbold{P}}}\,\mathbold{A}^\intercal\,\mathbold{D}^{-1}$ (de-dimensionalized; proof sketch below). We calculate it from the prior covariance $˜{\mathbold{P}}$ and the new partials matrix $\mathbold{A}$. It is known, non-random, $n×{n}$.
$\mathbold{A}\,{˜{\mathbold{x}}}$
the predicted observation given the prior estimate ${˜{\mathbold{x}}}$ and the current partials matrix $\mathbold{A}$. It is a particular evaluation of the model. It is known, non-random, $b×{1}$.
$\mathbold{z}-\mathbold{A}\,{˜{\mathbold{x}}}$
the measurement residual, the difference between the current observation $\mathbold{z}$ and the predicted observation $\mathbold{A}\,{˜{\mathbold{x}}}$.
$\mathbold{ζ}$
observation noise: random column-vector with zero mean and covariance $\mathbold{Z}$ (unity, $\mathbold{1}$, after de-dimensionalization). It has $b$ rows and $1$ column, like $\mathbold{z}$.
$\mathbold{Z}$
covariance of the observation noise, $E \left[ \mathbold{ζ}\, \mathbold{ζ}^\intercal \right]$: known, non-random $b×{b}$.
$˜{\mathbold{z}} = ˜{\mathbold{A}}\,{\mathbold{x}} + \mathbold{ζ}$
the observation equation, which equates $˜{\mathbold{z}}$, the stack of all prior observations, to the product of $˜{\mathbold{A}}$, the stack of all prior partials matrices, and an unknown random vector of states, $\mathbold{x}$, plus some unknown random observation noise $\mathbold{ζ}$. The stack of prior observations $˜{\mathbold{z}}$ is known, non-random, $m b×{1}$; the stack of prior partials matrices $˜{\mathbold{A}}$ is known, non-random, $m b×{n}$; the state vector ${\mathbold{x}}$ is unknown, random, $n×{1}$; the noise vector $\mathbold{ζ}$ is unknown, random, $m b×{1}$. The observation equation looks similar to the expression for the residual above. It’s worthwhile to take a little time to examine the notations carefully and make sure that you have a good mental picture of the meanings of these notations. The observation equation is tall in the typical, overdetermined case, whereas the residual is usually equivalent to a scalar expression.
$\mathbold{K}$
Kalman gain $= {˜{\mathbold{P}}}\, \mathbold{A}^\intercal\, {\mathbold{D}}^{-1}$ (proof sketch below). Non-random, $n×{b}$.
$\mathbold{D}$
Kalman denominator $= \mathbold{Z}+ \mathbold{A}\, {˜{\mathbold{P}}}\, \mathbold{A}^\intercal$, or $\mathbold{1}+ \mathbold{A}\, {˜{\mathbold{P}}}\, \mathbold{A}^\intercal$ de-dimensionalized. (proof sketch below). Non-random, \(b×{b}\).

3.3 Demonstration that Prior Covariance ${˜{\mathbold{P}}} = {˜{\mathbold{A}}}^{-2}$

The fact that the prior covariance, $˜{\mathbold{P}}$, equals the inverse square of the stack of prior partials matrices (de-dimensionalized), ${˜{\mathbold{A}}}^{-2}$, is the secret to Kalman’s efficient, in fact constant, use of computer memory. The stack of prior partials matrices $˜{\mathbold{A}}$ can be very tall and impractical to store. But its square, ${˜{\mathbold{A}}}^2$, is only $n×{n}$, and its inverse square is also just $n×{n}$. Kalman packs all statistical information about the model into this small matrix of constant size, and incrementally improves the statistics as observations accumulate, without increasing the size of the matrix, and thus without increasing the amount of computer memory needed to keep all important information. The Kalman filter is optimal, meaning that the small covariance matrices keep all available information. No other method can squeeze more information out of the observations and the model, at least when the noise is Gaussian. A rigorous optimality proof is out of scope for this paper, but the least-squares derivation below contains the central idea: Kalman tracks the estimate and covariance that minimize the sum of squared residuals. Kalman is optimal in the sense that no other method would find a smaller sum of squared residuals.
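
The following NumPy sketch illustrates this point numerically under stated assumptions: it compares the batch least-squares quantities built from a tall stack $˜{\mathbold{A}}$, $˜{\mathbold{z}}$ against the result of folding the de-dimensionalized accumulator over the same rows one at a time, seeding the fold with the first $n$ rows. The data and the helper step are made up for illustration.

#+BEGIN_SRC python
import numpy as np
from functools import reduce

rng = np.random.default_rng(2)
n, m = 3, 50
x_true = rng.normal(size=(n, 1))
A_stack = rng.normal(size=(m, n))                       # tall stack of partials
z_stack = A_stack @ x_true + rng.normal(size=(m, 1))    # unit (de-dimensionalized) noise

# Batch least squares over the whole stack: P = (A~^T A~)^-1, x = P A~^T z~.
P_batch = np.linalg.inv(A_stack.T @ A_stack)
x_batch = P_batch @ A_stack.T @ z_stack

# De-dimensionalized Kalman step (Z = 1, b = 1).
def step(state, observation):
    x, P = state
    A, z = observation
    D = 1.0 + A @ P @ A.T
    K = P @ A.T / D                 # D is 1-by-1 here, so division suffices
    return (x + K @ (z - A @ x), P - K @ D @ K.T)

# Seed with the first n rows solved in batch, then fold in the rest one row
# at a time.  The n-by-n accumulation (x, P) never grows.
P0 = np.linalg.inv(A_stack[:n].T @ A_stack[:n])
x0 = P0 @ A_stack[:n].T @ z_stack[:n]
rest = [(A_stack[i:i + 1], z_stack[i:i + 1]) for i in range(n, m)]
x_fold, P_fold = reduce(step, rest, (x0, P0))

print(np.allclose(x_fold, x_batch))             # True: same estimate
print(np.allclose(P_fold, P_batch))             # True: same covariance
#+END_SRC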

3.3.1 Covariance of Any Random Vector Variable

The covariance of any random column vector $\mathbold{y}$ is defined as the expectation value

\[ E \left[ \mathbold{y}\, \mathbold{y}^\intercal \right] = E \left[ ({\mathbold{y}^\intercal})^2 \right] \]

\noindent This is the expectation value of an outer product of a column vector $\mathbold{y}$ and its transpose, $\mathbold{y}^\intercal$. Therefore, it is a $q×{q}$ matrix, where $q×{1}$ is the dimensionality of $\mathbold{y}$.
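
A brief Monte Carlo sketch of this definition, assuming a made-up zero-mean Gaussian vector with known covariance:

#+BEGIN_SRC python
import numpy as np

rng = np.random.default_rng(3)
C_true = np.array([[2.0, 0.3],
                   [0.3, 0.5]])                     # made-up true covariance
samples = rng.multivariate_normal(np.zeros(2), C_true, size=200_000)

# E[y y^T] estimated as the average of the outer products y y^T.
C_hat = samples.T @ samples / len(samples)
print(np.round(C_hat, 2))                           # close to C_true
#+END_SRC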

3.3.2 Prior Estimate ${˜{\mathbold{x}}}$

One of our random variables is $\mathbold{x}$, the column \mbox{$n$-vector} of unknown states. To calculate its estimate, assume we know the values of all $m$ past partials ${˜{\mathbold{A}}}$ (tall, $m b×{n}$) and observations $˜{\mathbold{z}}$ (tall, $m b×{1}$).

Relate $\mathbold{x}$ to the known observations ${˜{\mathbold{z}}}$ and the known partials ${˜{\mathbold{A}}}$ through the normally distributed random noise column vector $\mathbold{ζ}$ and the observation equation:

\begin{equation}
\label{eqn:observation-equation}
˜{\mathbold{z}} = ˜{\mathbold{A}}\,\mathbold{x} + \mathbold{ζ}
\end{equation}

3.3.3 Sum of Squared Residuals

Consider the following performance functional, computed over the population of $\mathbold{x}$:

\[ J(\mathbold{x}) = \left( ˜{\mathbold{z}} - ˜{\mathbold{A}}\,\mathbold{x} \right)^\intercal \left( ˜{\mathbold{z}} - ˜{\mathbold{A}}\,\mathbold{x} \right) \]

\noindent $J(\mathbold{x})$ is a scalar: the sum of squared residuals. A residual is a difference between an actual observation $\mathbold{z}$ and a predicted observation $\mathbold{A}\,\mathbold{x}$. An actual observation $\mathbold{z}$ is a known, concrete \mbox{$b$-vector} of numbers, and the partials matrix $\mathbold{A}$ is a known, concrete \mbox{$(b× n)$-matrix} of numbers corresponding to that observation. The observation equation

  • stacks all prior observations (known, concrete numbers) into $˜{\mathbold{z}}$
  • stacks all prior values of the partials matrix $\mathbold{A}$ into $˜{\mathbold{A}}$ (known, concrete numbers)
  • multiplies by the unknown random state estimate $\mathbold{x}$ to get the (unknown, random) predicted observations ${˜{\mathbold{A}}}\,\mathbold{x}$
  • finally adds some unknown random noise $\mathbold{ζ}$ (column vector of height $m b$)

The performance functional collapses all that information into a scalar random variable $J(\mathbold{x})$; its randomness comes from the random state vector $\mathbold{x}$, not from the concrete numbers in $˜{\mathbold{z}}$ and $˜{\mathbold{A}}$. Recall that any random variable is, in fact, always a function, even if only the identity function, as when we say that $\mathbold{x}$ is a random variable. This is the standard nomenclature of probability and statistics established by Kolmogorov, and it admittedly can be confusing.

The job of finding the optimal estimate of the state vector $\mathbold{x}$ is the job of finding the concrete, numerical value of $\mathbold{x}$ that minimizes the performance functional $J(\mathbold{x})$, which depends on all the known, non-random, concrete numbers in $˜{\mathbold{z}}$ and $˜{\mathbold{A}}$.

To find the $\mathbold{x}$ that minimizes $J(\mathbold{x})$, we could take the classic, school approach of setting to zero the partial derivatives of $J(\mathbold{x})$ with respect to $\mathbold{x}$ and solving the resulting equations for $\mathbold{x}$. The following is an easier way. Multiply the residuals across by the wide matrix ${˜{\mathbold{A}}}^\intercal$:

\noindent producing an \mbox{$n$-vector}, and then construct a modified performance functional:

\noindent $J(\mathbold{x})$ is minimum with respect to $\mathbold{x}$ if and only if (iff) $J’(\mathbold{x})$ is minimum (this assertion needs a rigorous proof; as warned, we present only sketches in this paper). Because $J’(\mathbold{x})$ is non-negative, when $J’(\mathbold{x})$ can be zero, its minimum must be zero. $J’(\mathbold{x})$ is zero iff ${˜{\mathbold{A}}}^2$, an $n×{n}$ square matrix, is invertible (non-singular), in which case

\noindent produces that minimum value of $J’(\mathbold{x})$, because then

We call such a solution for $\mathbold{x}$ the least-squares estimate of $\mathbold{x}$: the estimate of $\mathbold{x}$ based on all prior observations. From now on, we write it as ${˜{\mathbold{x}}}$.
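
As a numerical aside, this least-squares estimate, which the derivation below writes as ${˜{\mathbold{x}}}={˜{\mathbold{P}}}\,{˜{\mathbold{A}}}^\intercal\,{˜{\mathbold{z}}}$ with ${˜{\mathbold{P}}}={˜{\mathbold{A}}}^{-2}$, can be checked against NumPy’s least-squares solver; the data in this sketch are made up:

#+BEGIN_SRC python
import numpy as np

rng = np.random.default_rng(4)
m, n = 40, 3
A_stack = rng.normal(size=(m, n))                       # stack of prior partials
z_stack = A_stack @ rng.normal(size=(n, 1)) \
          + 0.1 * rng.normal(size=(m, 1))               # noisy stacked observations

# Normal-equations form of the least-squares estimate ...
x_normal = np.linalg.inv(A_stack.T @ A_stack) @ A_stack.T @ z_stack

# ... versus NumPy's least-squares solver.
x_lstsq, *_ = np.linalg.lstsq(A_stack, z_stack, rcond=None)
print(np.allclose(x_normal, x_lstsq))                   # True
#+END_SRC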

With this solution, we get a new expression for the performance functional $J(\mathbold{x})$ that is useful below. First note that

\noindent Equation \ref{eqn:aa2at-is-one} is another assertion that requires a rigorous proof, out of scope for this paper of sketches. But, assuming it is true, we have

\noindent This has physical dimensions $\mathcal{Z}^2$ where $\mathcal{Z}$ are the physical dimensions of the observations $\mathbold{z}$.

3.3.4 Prior Covariance $˜{\mathbold{P}}$

We now want the covariance of the residuals between our least-squares estimate $˜{\mathbold{x}}$ and the random vector $\mathbold{x}$:

\begin{equation}
\label{eqn:covariance-of-x}
˜{\mathbold{P}} \stackrel{\text{\tiny def}}{=} E \left[ \left( ˜{\mathbold{x}} - \mathbold{x} \right) \left( ˜{\mathbold{x}} - \mathbold{x} \right)^\intercal \right]
\end{equation}

\noindent Get $˜{\mathbold{x}}-\mathbold{x}$ from the observations and partials at hand as follows:

\noindent Now rewrite equation \ref{eqn:covariance-of-x}, the definition of the prior covariance $˜{\mathbold{P}}$:

\noindent We can collapse the expectation value inwards because the stack of observation partials $˜{\mathbold{A}}$ is a matrix of concrete, non-random numbers.

Noise $\mathbold{ζ}$ is Gaussian, normal, with diagonal covariance matrix $\mathbold{Z}$, by hypothesis. Equation \ref{eqn:almost-final-covariance} becomes

\noindent because ${˜{\mathbold{A}}}^{-2}$ is symmetric. At this point, no further simplification is possible, in general, because $\mathbold{Z}$ is $b× b$ and can only be sandwiched between ${˜{\mathbold{A}}}^\intercal$, $n× b$, and ${˜{\mathbold{A}}}$, $b× n$. However, we can greatly simplify this and all subsequent computations by de-dimensionalizing. There are numerical benefits, as well, to be discussed in the next section.

3.3.5 De-Dimensionalizing the Observation Equation

Fully spelled out, and in the general case of \mbox{$b$-vector} observations $\mathbold{z}$, one block of height $b$ of the observation equation is

If we divide each row $i$ by the standard deviation $\sigma_{z_i}$ of the \mbox{$i$-th} component $z_i$ of the observation $\mathbold{z}$, we get

The covariance of the noise $\mathbold{ζ}$, so normalized, is non-dimensional unity and equation \ref{eqn:prior-covariance-convenient-form} collapses completely to just

\begin{equation}
\label{eqn:prior-covariance-most-convenient-form}
˜{\mathbold{P}} = {˜{\mathbold{A}}}^{-2}
\end{equation}

\noindent and the estimate of the priors, equation \ref{eqn:least-squares-estimate}, now becomes

\begin{equation}
\label{eqn:estimate-of-the-priors}
˜{\mathbold{x}} = ˜{\mathbold{P}}\,{˜{\mathbold{A}}}^\intercal\,˜{\mathbold{z}}
\end{equation}

This is remarkable. All information about the covariance of the noise is pulled into the (new, normalized) observation partials.
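
A NumPy sketch of this normalization, with made-up numbers and a diagonal noise covariance, checks that dividing each row of the partials and observations by its standard deviation reproduces the weighted normal equations:

#+BEGIN_SRC python
import numpy as np

rng = np.random.default_rng(5)
m, n = 30, 2
A = rng.normal(size=(m, n))                       # stacked scalar partials (b = 1)
sigma = rng.uniform(0.1, 10.0, size=m)            # wildly different uncertainties
Z = np.diag(sigma ** 2)                           # diagonal noise covariance
z = A @ np.array([[1.0], [2.0]]) + (sigma * rng.normal(size=m)).reshape(-1, 1)

# Divide each row of A and z by that row's standard deviation.
A_norm = A / sigma[:, None]
z_norm = z / sigma[:, None]

# The normalized normal equations carry the same information as the
# Z-weighted ones: all knowledge of the noise moved into the partials.
P_weighted = np.linalg.inv(A.T @ np.linalg.inv(Z) @ A)
P_normalized = np.linalg.inv(A_norm.T @ A_norm)
x_weighted = P_weighted @ A.T @ np.linalg.inv(Z) @ z
x_normalized = P_normalized @ A_norm.T @ z_norm
print(np.allclose(P_weighted, P_normalized))      # True
print(np.allclose(x_weighted, x_normalized))      # True
#+END_SRC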

I remember, when working in the early 1980’s at the Deep Space Network at JPL on direct measurement of tectonic drift [jplg], one difficulty was the wide disparity between uncertainties of horizontal measurements (right ascension and declination) and uncertainties in range. For instance, we knew the RA-dec position of the centroid of Saturn within 75 meters but its distance to no better than a million kilometers. That’s a disparity of seven orders of magnitude (the situation is greatly improved, now, due to the accumulation of range data for multiple spacecraft coupled with decades of orbital mechanics [folk]). At the time, this meant that we had to deal with error ellipsoids that were long, thin needles, covariance matrices with components differing by up to fourteen orders of magnitude. That’s not practical with floating-point computer arithmetic. One mitigation was de-dimensionalizing or normalizing, as described here, which brings the uncertainties of all components of an observation into the same numerical range, near unity. Another mitigation was Square Root Information Filtering (SRIF), the subject of another paper in this series.

In any event, for all subsequent calculations in this paper, we assume that the observation equation has been normalized and that $\mathbold{Z}=\mathbold{1}$.

3.4 Posterior Estimate $\hat{\mathbold{x}}$ and Covariance $\hat{\mathbold{P}}$

To effect incremental updates of $\mathbold{x}$ and $\mathbold{P}$, we need the posterior estimate $\hat{\mathbold{x}}$ and covariance $\hat{\mathbold{P}}$ in terms of the priors $˜{\mathbold{x}}$, $˜{\mathbold{P}}$, and the new partials $\mathbold{A}$ and new observation $\mathbold{z}$, all of which are matrices of known, concrete, non-random numbers. This is exactly what our kalmanStatic function from equation \ref{eqn:kalman-cume-definition} does, of course, in functional form. We derive the posteriors from scratch to seek opportunities to define $\mathbold{K}$ and $\mathbold{D}$ and to radically shorten the expressions.

First, define a new performance functional $J_1(\mathbold{x})$ as the sum of the performance of the priors $˜{J}(\mathbold{x})$ from equation \ref{eqn:performance-functional-reformed}, now written with tildes overhead, and a new term $J_2(\mathbold{x})$ for the performance of the new data:

This time, I don’t have a handy trick for minimizing the performance functional. Let’s find the minimizing $\mathbold{x}$ the classic way: by solving $d\,J_1(\mathbold{x})/d\,\mathbold{x}=0$. The usual way to write a vector derivative is with the nabla operator $∇$, which produces gradient vectors from scalar functions.

The particular scalar function we’re differentiating is, of course, the new performance functional $J_1(\mathbold{x})= {˜{J}}(\mathbold{x})+ J_2(\mathbold{x})$. Because ${˜{\mathbold{A}}^2}$ is symmetric,

\noindent an \mbox{$n$-vector}, and we similarly compute the gradient of $J_2(\mathbold{x})$, which contains the new observation and partials:

\noindent another \mbox{$n$-vector}. We can solve the resulting equation for $\mathbold{x}$ on sight, writing the new solution, the new estimate, with an overhat. Be aware that $\mathbold{A}$ is a wide matrix, in fact a row \mbox{$n$-vector} when $b=1$, a common case, and $\mathbold{A}^2$ is thus an outer product and an $n×{n}$ matrix.

Look how pretty this is. Equation \ref{eqn:estimate-of-the-priors} for the priors gave us the form $˜{\mathbold{x}}= ˜{\mathbold{P}}\, ˜{\mathbold{A}}^\intercal\,˜{\mathbold{z}}$, a covariance $˜{\mathbold{P}}$ times the prior observations $˜{\mathbold{z}}$ scaled by the prior partials, transposed, $˜{\mathbold{A}}^\intercal$. The new estimate $\hat{\mathbold{x}}$ has exactly the same form if we regard the first matrix factor $\left({˜{\mathbold{A}}}^2 + \mathbold{A}^2 \right)^{-1}$ as a covariance $\hat{\mathbold{P}}$ and if we regard all the priors ${˜{\mathbold{A}}}^2\,{˜{\mathbold{x}}}$ as a single scaled observation to add to the current scaled observation $\mathbold{A}^\intercal\,\mathbold{z}$. We may regard ${˜{\mathbold{A}}^2}\,˜{\mathbold{x}}$ as a scaled observation because equations \ref{eqn:prior-covariance-most-convenient-form} and \ref{eqn:estimate-of-the-priors} imply that ${˜{\mathbold{A}}^\intercal}\,˜{\mathbold{z}}={˜{\mathbold{A}}^2}\,˜{\mathbold{x}}$. We may view the second term above, $\mathbold{A}^\intercal\, \mathbold{z} + {˜{\mathbold{A}}}^2\, {˜{\mathbold{x}}}$, as $\mathbold{A}^\intercal\, \mathbold{z} + {˜{\mathbold{A}}}^\intercal\, {˜{\mathbold{z}}}$.

3.4.1 Posterior estimate, $\hat{\mathbold{x}}$

We must wrangle equation \ref{eqn:kalman-update-equation} from equation \ref{eqn:def-of-posterior-estimate}. Equation \ref{eqn:kalman-update-equation} is the recurrence we want, namely $\hat{\mathbold{x}}=˜{\mathbold{x}}+\mathbold{K}(\mathbold{z}-\mathbold{A}\,˜{\mathbold{x}})$, and equation \ref{eqn:def-of-posterior-estimate} is the recurrence we have, namely
\( \hat{\mathbold{x}} = \left( {˜{\mathbold{A}}}^2 + \mathbold{A}^2 \right)^{-1}\, \left( \mathbold{A}^\intercal\, \mathbold{z} + {˜{\mathbold{A}}}^2\, {˜{\mathbold{x}}} \right) \).
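
A NumPy sketch, with made-up inputs and $b=1$, checks numerically that these two forms agree:

#+BEGIN_SRC python
import numpy as np

rng = np.random.default_rng(6)
n = 3
A_stack = rng.normal(size=(10, n))                   # prior partials A~
P_prior = np.linalg.inv(A_stack.T @ A_stack)         # P~ = A~^(-2)
x_prior = rng.normal(size=(n, 1))                    # some prior estimate x~
A = rng.normal(size=(1, n))                          # new partials (b = 1)
z = rng.normal(size=(1, 1))                          # new observation

# Information form: (A~^2 + A^2)^(-1) (A^T z + A~^2 x~).
info_form = np.linalg.inv(A_stack.T @ A_stack + A.T @ A) @ (
    A.T @ z + A_stack.T @ A_stack @ x_prior)

# Recurrence form: x~ + K (z - A x~) with K = P~ A^T D^(-1), D = 1 + A P~ A^T.
D = 1.0 + A @ P_prior @ A.T
K = P_prior @ A.T @ np.linalg.inv(D)
recurrence_form = x_prior + K @ (z - A @ x_prior)

print(np.allclose(info_form, recurrence_form))       # True
#+END_SRC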

First, formally define the new, posterior covariance.

\noindent Now write equation \ref{eqn:def-of-posterior-estimate} as

The form above strongly suggests that we define

\noindent yielding

\noindent Now, to get the recurrence we want

\noindent we need only set equation \ref{eqn:first-part-of-gain-proof} equal to equation \ref{eqn:second-part-of-gain-proof}. Cancelling terms and rearranging, we get

\noindent by definition of the prior covariance, equation \ref{eqn:prior-covariance-most-convenient-form}. For arbitrary $˜{\mathbold{x}}$, this will be true if

\noindent Rearrange and right-multiply by $˜{\mathbold{P}}$ to get

\noindent showing that equations \ref{eqn:recurrence-to-prove} and \ref{eqn:kalman-update-equation} are just alternative expressions for the same thing.

Let’s write this more compactly

\noindent where

\noindent and we have one of the three equivalent recurrences for the posterior covariance from the first paper in this series

3.4.2 A Gain Matrix $\mathbold{K}$ We Can Actually Compute

Of course, the gain matrix $\mathbold{K}$ is formally defined in terms of the posterior covariance, that is, as $\hat{\mathbold{P}}\,\mathbold{A}^\intercal$, but we don’t have the posterior covariance $\hat{\mathbold{P}}$ by equation \ref{eqn:p-is-l-p} until we have the gain matrix $\mathbold{K}$. To get out of this fix, we note that

\noindent and solve for $\mathbold{K}$:

Defining the Kalman denominator matrix $\mathbold{D}$ as follows:

\noindent we finally get a form for the Kalman gain matrix $\mathbold{K}$ entirely in terms of priors, the new observation partials, and the denominator $\mathbold{D}$ (the covariance of the innovation, that is, of the measurement residual):

\noindent These are almost the same as the original definitions, equations \ref{eqn:kalman-gain-definition} and \ref{eqn:kalman-denominator-definition}, which were written in dimensional form. We leave it to the reader to show that the dimensional form for $\mathbold{D}$ is $\mathbold{Z}+ \mathbold{A}\, \mathbold{P}\, \mathbold{A}^\intercal$.
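
The following NumPy sketch, with made-up inputs, checks that this computable form of the gain matches the formal definition $\mathbold{K}=\hat{\mathbold{P}}\,\mathbold{A}^\intercal$ with $\hat{\mathbold{P}}=\left({˜{\mathbold{A}}}^2+\mathbold{A}^2\right)^{-1}$:

#+BEGIN_SRC python
import numpy as np

rng = np.random.default_rng(7)
n = 4
A_stack = rng.normal(size=(10, n))
P_prior = np.linalg.inv(A_stack.T @ A_stack)         # a prior covariance P~
A = rng.normal(size=(1, n))                          # new partials (b = 1)

# Computable form: K = P~ A^T D^(-1) with D = 1 + A P~ A^T.
D = 1.0 + A @ P_prior @ A.T
K_computable = P_prior @ A.T @ np.linalg.inv(D)

# Formal definition: K = P^ A^T with P^ = (A~^2 + A^2)^(-1).
P_post = np.linalg.inv(np.linalg.inv(P_prior) + A.T @ A)
K_formal = P_post @ A.T

print(np.allclose(K_computable, K_formal))           # True
#+END_SRC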

3.4.3 Two More Recurrences

There remain two more recurrences to derive, namely

\noindent and the canonical form,

3.4.4 Minimizing $J_1({\mathbold{x}})$

The posterior covariance is, from the statistical viewpoint,

\noindent Get our new expression for ${\hat{\mathbold{x}}}$:

\noindent where, again

\noindent Remembering the observation equation (\ref{eqn:observation-equation}), write a single instance of it $\mathbold{z} = \mathbold{A}\, \mathbold{x}+ \mathbold{ζ}$ and find

\noindent implying that \( \left( {\hat{\mathbold{x}}}- \mathbold{x} \right)= \mathbold{L}\, \left( {˜{\mathbold{x}}}- \mathbold{x} \right) + \mathbold{K}\, \mathbold{ζ} \).

Remembering that $E \left[ \mathbold{ζ} \right]=\mathbold{0}$, $E \left[ \mathbold{ζ}\, \mathbold{ζ}^\intercal \right]=\mathbold{Z}$, glibly re-dimensionalizing and skipping intermediate steps, we find that

\noindent We leave it to the reader to check, with reference to equations \ref{eqn:dimensional-breakdown}, that the physical dimensions work out. This completes the derivation of the recurrence equation \ref{eqn:p-is-lplt-plus-kzkt}.

The last form, $\hat{\mathbold{P}} = ˜{\mathbold{P}}- \mathbold{K}\, \mathbold{D}\, \mathbold{K}^\intercal$, is easy to show from what we already know, that $\hat{\mathbold{P}} = \mathbold{L}\, ˜{\mathbold{P}} = (\mathbold{1}- \mathbold{K}\, \mathbold{A})\, ˜{\mathbold{P}}$. We just need to show that $\mathbold{K}\, \mathbold{A}\, ˜{\mathbold{P}} = \mathbold{K}\, \mathbold{D}\, \mathbold{K}^\intercal$. Substitute $\mathbold{D}^{-\intercal}\, \mathbold{A}\, ˜{\mathbold{P}}^\intercal$ for $\mathbold{K}^\intercal$ by transposing equation \ref{eqn:kalman-gain-definition-2}. Note that for square matrices, the inverse of the transpose is the transpose of the inverse. Therefore $\mathbold{D}^{-\intercal} = \mathbold{D}^{-1}$ because $\mathbold{D}$ is symmetric. Likewise $˜{\mathbold{P}}^\intercal=˜{\mathbold{P}}$. The result follows:
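
A final NumPy sketch, again with made-up inputs and $\mathbold{Z}=\mathbold{1}$, checks that the three forms of the posterior covariance coincide:

#+BEGIN_SRC python
import numpy as np

rng = np.random.default_rng(8)
n = 3
A_stack = rng.normal(size=(12, n))
P_prior = np.linalg.inv(A_stack.T @ A_stack)         # a prior covariance P~
A = rng.normal(size=(1, n))                          # new partials (b = 1)

D = 1.0 + A @ P_prior @ A.T                          # de-dimensionalized denominator
K = P_prior @ A.T @ np.linalg.inv(D)                 # gain
L = np.eye(n) - K @ A

P1 = L @ P_prior                                     # P^ = L P~
P2 = P_prior - K @ D @ K.T                           # P^ = P~ - K D K^T
P3 = L @ P_prior @ L.T + K @ K.T                     # P^ = L P~ L^T + K Z K^T, Z = 1

print(np.allclose(P1, P2), np.allclose(P2, P3))      # True True
#+END_SRC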

4 Concluding Remarks

These derivations are helpful for gaining intuition into the underlying statistics and dimensional structures of the Kalman filter and its many variants. They are a bit involved, but it is worthwhile to ingest these fundamentals, especially for those who need to research new filters and applications. For more rigorous proofs built on a Bayesian perspective, see Bar-Shalom [bars]. For more careful dimensional analysis of the present derivations, see part 6 of this series [klf6].

affn https://en.wikipedia.org/wiki/Affine_transformation
bars Bar-Shalom, Yaakov, et al. Estimation with applications to tracking and navigation. New York: Wiley, 2001.
bier http://tinyurl.com/h3jh4kt
bssl https://en.wikipedia.org/wiki/Bessel’s_correction
busi https://en.wikipedia.org/wiki/Business_logic
cdot We sometimes use the center dot or the $×$ symbols to clarify matrix multiplication. They have no other significance and we can always write matrix multiplication just by juxtaposing the matrices.
clos https://en.wikipedia.org/wiki/Closure_(computer_programming)
cold This convention only models so-called cold observables, but it’s enough to demonstrate Kalman’s working over them.
cons This is quite similar to the standard — not Wolfram’s — definition of a list as a pair of a value and of another list.
cova We use the terms covariance for matrices and variance for scalars.
csoc https://en.wikipedia.org/wiki/Separation_of_concerns
ctsc https://en.wikipedia.org/wiki/Catastrophic_cancellation
dstr http://tinyurl.com/ze6qfb3
elib Brookner, Eli. Tracking and Kalman Filtering Made Easy, New York: Wiley, 1998. http://tinyurl.com/h8see8k
folk http://ipnpr.jpl.nasa.gov/progress_report/42-178/178C.pdf
fldl http://tinyurl.com/jmxsevr
fwik https://en.wikipedia.org/wiki/Fold_%28higher-order_function%29
gama https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem
intr http://introtorx.com/
jplg JPL Geodynamics Program http://www.jpl.nasa.gov/report/1981.pdf
just justified by the fact that $\mathbold{D}$ is a diagonal matrix that commutes with all other products, therefore its left and right inverses are equal and can be written as a reciprocal; in fact, $\mathbold{D}$ is a $1×{1}$ matrix — effectively a scalar — in all examples in this paper
klde B. Beckman, Kalman Folding 3: Derivations, to appear.
klf1 B. Beckman, Kalman Folding, Part 1, http://vixra.org/abs/1606.0328.
klf2 B. Beckman, Kalman Folding 2: Tracking and System Dynamics, http://vixra.org/abs/1606.0348.
klf3 B. Beckman, Kalman Folding 3: Derivations, to appear.
klf4 B. Beckman, Kalman Folding 4: Streams and Observables, to appear.
klf5 B. Beckman, Kalman Folding 5: Non-Linear Models and the EKF, to appear.
klf6 B. Beckman, Kalman Folding 6: Dimensional Analysis, to appear.
layi https://en.wikipedia.org/wiki/Fundamental_theorem_of_software_engineering
lmbd Many languages use the keyword lambda for such expressions; Wolfram uses the name Function.
lmlf https://en.wikipedia.org/wiki/Lambda_lifting
lssq https://en.wikipedia.org/wiki/Least_squares
ltis http://tinyurl.com/hhhcgca
matt https://www.cs.kent.ac.uk/people/staff/dat/miranda/whyfp90.pdf
mcmc https://en.wikipedia.org/wiki/Particle_filter
musc http://www1.cs.dartmouth.edu/~doug/music.ps.gz
ndim https://en.wikipedia.org/wiki/Nondimensionalization
patt http://tinyurl.com/j5jzy69
pseu http://tinyurl.com/j8gvlug
rasp http://www.wolfram.com/raspberry-pi/
rcrn https://en.wikipedia.org/wiki/Recurrence_relation
rsfr http://rosettacode.org/wiki/Loops/Foreach
rxbk http://www.introtorx.com/content/v1.0.10621.0/07_Aggregation.html
scan and of Haskell’s scans and folds, and Rx’s scans and folds, etc.
scla http://tinyurl.com/hhdot36
scnd A state-space form containing a position and derivative is commonplace in second-order dynamics like Newton’s Second Law. We usually employ state-space form to reduce \(n\)-th-order differential equations to first-order differential equations by stacking the dependent variable on $n-1$ of its derivatives in the state vector.
scnl http://learnyouahaskell.com/higher-order-functions
stsp https://en.wikipedia.org/wiki/State-space_representation
uncl The initial uncial (lower-case) letter signifies that we wrote this function; it wasn’t supplied by Wolfram.
wfld http://reference.wolfram.com/language/ref/FoldList.html?q=FoldList
wlf1 http://tinyurl.com/nfz9fyo
wlf2 http://rebcabin.github.io/blog/2013/02/04/welfords-better-formula/
wolf http://reference.wolfram.com/language/
zarc Zarchan and Musoff, Fundamentals of Kalman Filtering, A Practical Approach, Fourth Edition, Ch. 4