The discussion in the previous two sections already hints at the inversion of gradients being an important step for optimization and learning.
We will now integrate the update step from the simple example of {doc}`physgrad-comparison` into the training of an NN.
As outlined in the IG section of {doc}`physgrad`, we're focusing on NN solutions of inverse problems below. That means we have targets $y^*$, for which the network should infer suitable solutions $x$.
Important to keep in mind: in contrast to the previous sections and {doc}`overview-equations`, we are targeting inverse problems, and hence $y$ is the input to the network: $f(y;\theta)$. Correspondingly, it outputs $x$.
This gives the following minimization problem, with a sum over the entries $i$ of a mini-batch:
$$ \text{arg min}_\theta \sum_{i} \frac{1}{2} \big| \mathcal P\big(f(y^*_i ; \theta)\big) - y^*_i \big|_2^2 $$ (eq:unsupervised-training)
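In code, this objective can be sketched as follows; `f`, `P`, and the loop over a mini-batch are generic placeholders rather than a specific library API:

```python
import numpy as np

def unsupervised_loss(f, P, theta, y_targets):
    """Eq. (eq:unsupervised-training): sum of 1/2 |P(f(y*; theta)) - y*|^2 over a mini-batch."""
    total = 0.0
    for y_star in y_targets:
        x = f(y_star, theta)                   # network proposes a solution x
        y = P(x)                               # forward simulation maps x back to y space
        total += 0.5 * np.sum((y - y_star) ** 2)
    return total
```

Note that no ground-truth $x$ values appear anywhere; the supervision comes purely from re-simulating the prediction.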
To integrate the update step from equation {eq}`PG-def` into the training process for an NN, we consider three components: the NN itself, the physics simulator, and the loss function:
---
height: 160px
name: sip-spaces
---
A visual overview of the different spaces involved in SIP training.
To join these three pieces together, we use the following algorithm. As introduced by Holl et al. {cite}`holl2021pg`, we'll denote this training process as scale-invariant physics (SIP) training.
To update the weights $\theta$ of the NN $f$, we perform the following update step:
* Given a set of inputs $y^*$, evaluate the forward pass to compute the NN prediction $x = f(y^*; \theta)$
* Compute $y$ via a forward simulation ($y = \mathcal P(x)$) and invoke the (local) inverse simulator $\mathcal P^{-1}(y; x)$ to obtain the step $\Delta x_{\text{PG}} = \mathcal P^{-1} (y + \eta \Delta y; x) - x$ with $\Delta y = y^* - y$
* Evaluate the network loss, e.g., $L = \frac 1 2 | x - \tilde x |_2^2$ with $\tilde x = x+\Delta x_{\text{PG}}$, and perform a Newton step treating $\tilde x$ as a constant
* Use GD (or a GD-based optimizer like Adam) to propagate the change in $x$ to the network weights $\theta$ with a learning rate $\eta_{\text{NN}}$
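To make the four steps concrete, here is a minimal, self-contained sketch for a 1D toy problem. The simulator $\mathcal P(x)=x^3$ and the two-parameter "network" are stand-ins chosen only because this $\mathcal P$ is analytically invertible; all names here are our own, not a fixed API:

```python
import numpy as np

# Toy stand-ins: P(x) = x^3 is globally invertible, the "NN" is linear.
def P(x):     return x ** 3
def P_inv(y): return np.sign(y) * np.abs(y) ** (1.0 / 3.0)
def f(y, theta): return theta[0] * y + theta[1]   # NN prediction x = f(y*; theta)

def sip_update(theta, y_star, eta=1.0, eta_nn=0.01):
    x = f(y_star, theta)              # 1) forward pass of the NN
    y = P(x)                          # 2) forward simulation
    dy = y_star - y
    x_tilde = P_inv(y + eta * dy)     #    inverse simulator: proxy target in x space
    # 3) proxy loss L = 1/2 (x - x_tilde)^2 with x_tilde treated as a constant
    # 4) GD step on theta: dL/dtheta = (x - x_tilde) * df/dtheta
    grad = (x - x_tilde) * np.array([y_star, 1.0])
    return theta - eta_nn * grad
```

Iterating `sip_update` drives $\mathcal P(f(y^*;\theta))$ towards $y^*$; with $\eta=1$ the inverse simulator directly yields $\tilde x = \mathcal P^{-1}(y^*)$.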
This combined optimization algorithm depends on both the learning rate $\eta$ of the inverse simulator step and the learning rate $\eta_{\text{NN}}$ of the network optimizer.
This algorithm combines the inverse simulator to compute accurate, higher-order updates with traditional training schemes for NN representations. This is an attractive property, as we have a large collection of powerful methodologies for training NNs that stay relevant in this way. The treatment of the loss functions as "glue" between NN and physics component plays a central role here.
In the above algorithm, we have assumed an (at least locally) invertible simulator $\mathcal P^{-1}$ to be available.
The central reason for introducing a Newton step is the improved accuracy for the loss derivative.
Unlike with regular Newton or the quasi-Newton methods from equation {eq}`quasi-newton-update`, we do not need the Hessian of the full system. Instead, the Hessian is only needed for the loss function $L$.
E.g., consider the most common supervised objective function, $L(y) = \frac 1 2 | y - y^*|_2^2$, as already put to use above. $y$ denotes the predicted value, and $y^*$ the target.
We then have $\frac{\partial L}{\partial y} = y - y^*$ and $\frac{\partial^2 L}{\partial y^2} = 1$.
Using equation {eq}`quasi-newton-update`, we get $\Delta y = \eta \cdot (y^* - y)$, which can be computed right away, without evaluating any additional Hessian matrices.
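This can be verified numerically in a few lines (a small sketch of our own, not from the text):

```python
import numpy as np

# Quasi-Newton step for L(y) = 1/2 |y - y*|^2: the Hessian is the identity,
# so the step reduces to eta * (y* - y) without any matrix inversion cost.
y      = np.array([3.0, -1.0])
y_star = np.array([1.0,  2.0])
eta    = 0.5

grad    = y - y_star                              # dL/dy
hess    = np.eye(len(y))                          # d^2L/dy^2
delta_y = -eta * np.linalg.solve(hess, grad)      # generic quasi-Newton step

assert np.allclose(delta_y, eta * (y_star - y))   # identical to the closed form
```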
Once $\Delta y$ is computed, the inverse simulator propagates it back into $x$ space, as demonstrated in {doc}`physgrad-comparison`.
The loss in $x$ space then acts as a proxy: the NN is trained towards the constant target $\tilde x$ with a standard $L^2$ loss.
Hence, to summarize: with SIPs we employ a trivial Newton step for the loss in $y$ space, the inverse simulator for the physics, and GD via the proxy loss in $x$ space for the NN itself.
---
height: 220px
name: sip-training
---
A visual overview of SIP training for an entry $i$ of a mini-batch, including the two loss computations in $y$ and in $x$ space (for the proxy loss).
The above procedure describes the optimization of neural networks that make a single prediction.
This is suitable for scenarios in which we want to reconstruct the state of a system at a single point in time.
However, the SIP method can also be applied to more complex setups involving multiple objectives and multiple network interactions at different times.
Such scenarios arise e.g. in control tasks, where a network induces small forces at every time step in order to reach a certain physical state at a later time.
In these scenarios, the process above (Newton step for loss, inverse simulator step for physics, GD for the NN) is iteratively repeated, e.g., over the course of different time steps, leading to a series of additive terms in the loss function.
Let's illustrate the convergence behavior of SIP training and how it depends on characteristics of the physics simulator, following {cite}`holl2021pg`.
We consider the synthetic two-dimensional function

$$\mathcal P(x) = \left(\frac{\sin(\hat x_1)}{\xi}, \xi \cdot \hat x_2 \right) \quad \text{with} \quad \hat x = R_\phi \cdot x ,$$

where the parameter $\xi$ controls the conditioning of the problem and the rotation matrix $R_\phi$ entangles the two dimensions of $x$.
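As a quick sanity check, this function can be written out in a few lines of NumPy (the names `make_P`, `xi`, and `phi` are our own, chosen to match the symbols above):

```python
import numpy as np

def make_P(xi, phi):
    """Synthetic 2D test function: rotate x by phi, then scale the axes by 1/xi and xi."""
    R = np.array([[np.cos(phi), -np.sin(phi)],
                  [np.sin(phi),  np.cos(phi)]])
    def P(x):
        xh = R @ x                                   # entangle dimensions via rotation
        return np.array([np.sin(xh[0]) / xi,         # poorly scaled for large xi
                         xi * xh[1]])
    return P
```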
Here's an example of the resulting loss landscape:
---
height: 200px
name: physgrad-sin-loss
---
Next we train a fully-connected neural network to invert this problem via equation {eq}`eq:unsupervised-training`.
We'll compare SIP training using a saddle-free Newton solver to various state-of-the-art network optimizers.
For fairness, the best learning rate is selected independently for each optimizer.
When choosing both a well-conditioned and an ill-conditioned setting for $\mathcal P$, the differences between the optimizers become apparent:
---
height: 180px
name: physgrad-sin-time-graphs
---
Loss over time in seconds for a well-conditioned (left), and ill-conditioned case (right).
Note that the two graphs above show convergence over time. The relatively slow convergence of SIP mostly stems from it taking significantly more time per iteration than the other methods, on average 3 times as long as Adam. While the evaluation of the Hessian inherently requires more computations, the per-iteration time of SIP could likely be reduced significantly by optimizing the computations.
By increasing $\xi$, the problem becomes more and more ill-conditioned:
---
height: 180px
name: physgrad-sin-add-graphs
---
Performance when varying the conditioning (left) and the entangling of dimensions via the rotation (right).
The accuracy of all traditional network optimizers decreases because the gradients scale with $1/\xi$ along one axis and with $\xi$ along the other, so no single learning rate fits both directions.
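This scaling is easy to check numerically for the synthetic $\mathcal P$ from above, here with $\phi = 0$ (a quick sketch with our own variable names):

```python
import numpy as np

xi = 10.0
def P(x):  # synthetic test function with phi = 0 (no rotation)
    return np.array([np.sin(x[0]) / xi, xi * x[1]])

# Finite-difference Jacobian diagonal at x = 0: cos(0)/xi and xi
eps = 1e-6
J00 = (P(np.array([eps, 0.0])) - P(np.zeros(2)))[0] / eps   # ~ 1/xi
J11 = (P(np.array([0.0, eps])) - P(np.zeros(2)))[1] / eps   # ~ xi
# The two sensitivities differ by a factor of xi**2, so a single
# learning rate cannot suit both directions at the same time.
```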
By varying only the rotation $\phi$, the right graph of {numref}`physgrad-sin-add-graphs` varies how strongly the two dimensions of the problem are entangled.
Although we've only looked at smaller toy problems so far, we'll pull the discussion of SIP training forward to this point. The next chapter will illustrate it with a more complex example, but as we'll switch directly to a new algorithm afterwards, this is the better place to discuss the general properties of SIP.
Overall, the scale-invariance of SIP training allows it to find solutions exponentially faster than other learning methods for many physics problems, while keeping the computational cost relatively low.
It provably converges when enough network updates are performed per physics evaluation.
While SIP training can find vastly more accurate solutions, there are some caveats to consider.
First, an approximately scale-invariant physics solver is required. While Newton's method is a good candidate in low-dimensional $x$ spaces, high-dimensional problems require a suitable (local) inverse simulator to be available.
Second, SIP focuses on an accurate inversion of the physics part, but uses traditional first-order optimizers to determine the network updates $\Delta\theta$, which remain subject to the usual scaling issues within the NN itself.
Third, while SIP training generally leads to more accurate solutions measured in $x$ space, this does not always translate into a lower loss in $y$ space.
Interestingly, the SIP training resembles the supervised approaches from {doc}`supervised`.
It effectively yields a method that provides reliable updates which are computed on-the-fly, at training time.
The inverse simulator provides the desired inversion, possibly with a high-order method, and avoids the averaging of multi-modal solutions (cf. {doc}`intro-teaser`).
The latter is one of the main advantages of this setup: a pre-computed data set cannot take multi-modality into account, and hence inevitably leads to suboptimal solutions being learned whenever the mapping from input to reference solutions is not unique.
At the same time this illustrates a difficulty of the DP training from {doc}`diffphys`: the gradients it yields are not properly inverted, and are difficult to reliably normalize via pre-processing. Hence they can lead to the scaling problems discussed in {doc}`physgrad`, and correspondingly give vanishing and exploding gradients at training time. These problems are what we're targeting in this chapter.
In the next section we'll show a more complex example of training physics-based NNs with SIP updates from inverse simulators, before explaining a second alternative for tackling the scaling problems.