
2. Implementation

Miha Pompe edited this page Apr 30, 2024 · 1 revision

Loss function and performance metrics

Our goal is to create a network that takes a time $t$ as input and returns the concentration of nuclides at that time, $\mathbf{N}(t) \in \mathbb{R}^n$, such that the solution satisfies the differential equation and the initial conditions. This is enforced by two terms in the loss function:

$$\mathcal{L}_{ODE} = \frac{1}{T} \sum_{i = 1}^{T}\left|\left|\frac{d\mathbf{N}}{dt}(t_i) - A \mathbf{N}(t_i)\right|\right|^2,$$ $$\mathcal{L}_{IC} = ||\mathbf{N}(0) - \mathbf{N}_0||^2,$$

where $T$ is the number of time data points used for training and $\mathbf{N}_0$ is the initial concentration of nuclides. The total loss is a weighted sum of both losses:

$$\mathcal{L} = \frac{w_{ODE} \mathcal{L}_{ODE} + w_{IC} \mathcal{L}_{IC}}{w_{ODE} + w_{IC}}.$$

Since the total loss $\mathcal{L}$ is invariant under a common scaling of the weights, we can set $w_{IC} = 1$ without loss of generality. Let us define the weight ratio $w = w_{IC}/w_{ODE}$, which is now a hyperparameter of the network. Changing $w$ shifts the focus of the network toward the initial conditions ($w \gg 1$) or toward the differential equation ($w \approx 1$).
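As a minimal sketch of this loss, the function below evaluates both terms for a candidate solution and combines them with the weight ratio $w$. The names are ours, and $d\mathbf{N}/dt$ is approximated here by central finite differences; the actual implementation would use automatic differentiation through the network.

```python
import numpy as np

def losses(N, A, N0, ts, w, eps=1e-6):
    """Weighted PINN loss (sketch, assumed names).

    N   : callable t -> concentration vector (stands in for the network)
    A   : decay/burnup matrix
    N0  : initial concentrations
    ts  : training time points t_1..t_T
    w   : weight ratio w = w_IC / w_ODE
    """
    l_ode = 0.0
    for t in ts:
        dNdt = (N(t + eps) - N(t - eps)) / (2 * eps)  # central difference
        r = dNdt - A @ N(t)                            # ODE residual
        l_ode += r @ r
    l_ode /= len(ts)
    d = N(0.0) - N0
    l_ic = d @ d
    # total loss with w_ODE = 1, w_IC = w (using the scaling invariance)
    return (l_ode + w * l_ic) / (1.0 + w)
```

For the exact solution of a problem, both terms vanish, so the total loss is (numerically) zero regardless of $w$.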

The network should be able to solve large, stiff problems accurately and quickly, for example $n = 2000$ and $\kappa(A) = 10^{20}$. To evaluate the model, adequate performance and error metrics are needed. The intrinsic model metrics we will use are $\mathcal{L}$, $\mathcal{L}_{IC}$, and $\mathcal{L}_{ODE}$. But to verify our results we also need external reference solutions, which will be either analytical or numerical (CRAM). Let us define the total absolute error and the final absolute error as:

$$\Delta N = \sum_{i = 0}^{T}\left|\left|\mathbf{\hat{N}}(t_i) - \mathbf{N}(t_i)\right|\right|,$$ $$\Delta N(t_T) = \left|\left|\mathbf{\hat{N}}(t_T) - \mathbf{N}(t_T)\right|\right|,$$

where $\mathbf{\hat{N}}$ is the reference solution, $\mathbf{N}$ the network solution, and $t_T$ the upper time limit of the training data points, $t \in [0, t_T]$. Let us also define the total relative error as $\delta N = \Delta N/\sum_{i = 0}^{T}||\mathbf{\hat{N}}(t_i)||$ and the final relative error as $\delta N(t_T) = \Delta N(t_T)/||\mathbf{\hat{N}}(t_T)||$.
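The four metrics can be computed in a few lines given both solutions sampled on the same time grid (a hypothetical helper; the function name is ours):

```python
import numpy as np

def error_metrics(N_ref, N_nn):
    """Absolute and relative error metrics from the text (sketch).

    N_ref, N_nn : arrays of shape (T+1, n) -- reference and NN solutions
                  evaluated at the same time points t_0..t_T.
    """
    diff = np.linalg.norm(N_ref - N_nn, axis=1)  # ||N_hat(t_i) - N(t_i)||
    ref = np.linalg.norm(N_ref, axis=1)
    total_abs = diff.sum()                       # Delta N
    final_abs = diff[-1]                         # Delta N(t_T)
    total_rel = total_abs / ref.sum()            # delta N
    final_rel = final_abs / ref[-1]              # delta N(t_T)
    return total_abs, final_abs, total_rel, final_rel
```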

Decay problem

The decay matrix $A \in \mathbb{R}^{n\times n}$ has real eigenvalues $\lambda_i$; if these are distinct, the solution can be written as a linear combination of exponential functions $\{\mathbf{a}_i e^{\lambda_i t}\}$, as described in. We can reformulate this solution as a neural network with an input layer of one neuron, one hidden layer of $n$ neurons, and an output layer of $n$ neurons. The activation function is $\sigma(z) = e^z$, and the input-layer weights are fixed to $\mathbf{w}_1 = (\lambda_1, ..., \lambda_n)$. The output of the hidden layer is therefore $h_i = e^{\lambda_i t}$. All biases in the network are set to zero and are not updated. Given this setup we can write the solution as:

$$\mathbf{N}(t) = M\mathbf{h},\quad\text{with }\mathbf{h} = (e^{\lambda_1 t}, ..., e^{\lambda_n t}).$$

Therefore the network only needs to learn the entries of the matrix $M \in \mathbb{R}^{n\times n}$, which correspond to the coefficients $\mathbf{a}_i$ of the linear combination. A graphic representation of the network is shown in Figure. The network needs to learn $n^2$ parameters, and the size and shape of the network are fixed in advance. Compared to a regular neural network, far fewer parameters need to be learned, and no hyperparameter search for the optimal architecture is required.
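The forward pass of this constrained network is just a matrix product against the fixed exponential basis. A minimal sketch (function name is ours):

```python
import numpy as np

def decay_net(t, lambdas, M):
    """Forward pass of the constrained decay network: N(t) = M h(t),
    with basis h_i(t) = exp(lambda_i * t).

    The input-layer weights (the eigenvalues) are fixed; only M is learned.
    t can be a scalar or an array of times.
    """
    h = np.exp(np.outer(np.atleast_1d(t), lambdas))  # shape (T, n)
    return h @ M.T                                   # row i is N(t_i)
```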

Besides fixing the weights of the input layer and removing biases, we can also constrain the weights of the output layer (the entries of $M$). Every decay matrix can be permuted into triangular form: we may simultaneously swap two rows and the two columns with the same indices, and by repeating this operation the matrix can be made lower or upper triangular. Here we choose lower triangular, but the choice is arbitrary. Given a lower triangular matrix $A$, we can constrain $M$ to be lower triangular as well. This condition can be enforced by adding a new term to the loss function that penalizes non-zero upper-triangular entries:

$$\mathcal{L}_{tri} = \sum_{i=1}^n\sum_{j>i}^n |M_{ij}|^2.$$

A better approach is to initialize $M$ as lower triangular and, at each training step, set the gradients of the upper triangle to zero. This is better because the network then never has to learn the coefficients of the upper triangle, reducing the number of parameters to $\frac{n(n+1)}{2}$.
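The gradient-masking step can be sketched as follows (names are ours; in a framework like PyTorch this would typically be done with a gradient hook, but multiplying the gradient by a fixed mask is equivalent):

```python
import numpy as np

def masked_step(M, grad, lr=0.1):
    """One gradient step that keeps M lower triangular by zeroing
    the gradients of the upper triangle (sketch)."""
    mask = np.tril(np.ones_like(M))  # 1 on/below the diagonal, 0 above
    return M - lr * (grad * mask)    # the upper triangle is never updated
```

Initializing `M = np.tril(...)` and updating it only through `masked_step` guarantees the upper triangle stays exactly zero throughout training.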

Due to measurement error in the matrix $A$, it is not uncommon for two eigenvalues to coincide. This poses an issue, since the exponential basis functions alone no longer solve the problem (a repeated eigenvalue introduces solutions of the form $t e^{\lambda t}$). Moreover, for two equal eigenvalues, two outputs of the hidden layer would be indistinguishable at all times, so the solution would not be unique. To avoid this we perturb the eigenvalues slightly, by less than the measurement error. A problem also arises when an eigenvalue is zero: in that case the particular nuclide does not contribute to the reaction and can be excluded by removing its row and column.
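The perturbation step can be sketched like this (the function name and the exact deduplication strategy are ours; `eps` should be chosen below the measurement error of $A$):

```python
import numpy as np

def separate_eigenvalues(lambdas, eps=1e-12):
    """Perturb repeated eigenvalues by a small eps so that all basis
    functions exp(lambda_i * t) stay distinct (sketch)."""
    lam = np.array(lambdas, dtype=float)
    for i in range(1, len(lam)):
        # shift lam[i] until it differs from all earlier eigenvalues
        while np.any(np.abs(lam[:i] - lam[i]) <= eps / 2):
            lam[i] += eps
    return lam
```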

Burnup problem

The burnup matrix $A$ has eigenvalues with a complex part, so the solution is no longer a linear combination of pure exponentials but of exponentials paired with sines and cosines, as described in equation. The network architecture is therefore more complex. Instead of one network, two are used: one computes the linear combination of cosines (Figure, blue box) and the other of sines (Figure, red box), such that the outputs of the two hidden layers are:

$$h_i = \cos(\Im(\lambda_i) t) e^{\Re(\lambda_i) t},\quad i \in [1,n],$$ $$h_{n+i} = \sin(\Im(\lambda_i) t) e^{\Re(\lambda_i) t},\quad i \in [1,n].$$

These feed into two output layers, where the network learns the matrices $M, M' \in \mathbb{R}^{n\times n}$: $\tilde{N}_j = M_{ji} h_i$ and $\tilde{N}_{n+j} = M'_{ji} h_{n+i}$ (with summation over $i$). The number of parameters is $2n^2$. The last part (the merging layer) is a pairwise summation of the sine and cosine parts, $N_i = \tilde{N}_i + \tilde{N}_{n+i}$.
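The two-branch forward pass can be sketched in the same style as the decay network (function name is ours):

```python
import numpy as np

def burnup_net(t, lambdas, M, Mp):
    """Forward pass of the two-branch burnup network (sketch).

    lambdas : complex eigenvalues of A
    M, Mp   : learned n x n matrices for the cosine and sine branches
    """
    t = np.atleast_1d(t)[:, None]         # shape (T, 1)
    re, im = lambdas.real, lambdas.imag
    hc = np.cos(im * t) * np.exp(re * t)  # h_i,     cosine branch
    hs = np.sin(im * t) * np.exp(re * t)  # h_{n+i}, sine branch
    return hc @ M.T + hs @ Mp.T           # merging layer: pairwise sum
```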

The network weights can be further constrained. For a real eigenvalue $\lambda_i$, $h_{n+i}$ is identically zero, so the corresponding weights only provide a source of instability. To fix this, we set the $i$-th row and column of $M'$ to zero and keep the gradients of these elements at zero throughout training.

Performance improvements

To improve the performance of the model, the following techniques were tested.

The LBFGS (limited-memory BFGS) optimizer was chosen instead of more common choices such as stochastic gradient descent or Adam. The impact of the optimizer was already studied in. LBFGS was chosen for its faster convergence and because it eliminates a hyperparameter, the learning rate.

Gradual increase of stiffness. The problem is easier to solve when the stiffness of the matrix $A$ is low. The proposed method therefore increases the stiffness gradually: the network is first trained on a version of $A$ with reduced stiffness, the resulting weights are used as initial weights when training on a slightly stiffer matrix, and this is repeated until the original matrix is reached. To adjust the stiffness, every eigenvalue $\lambda_i$ is first mapped to a scaling factor $\tilde{\lambda}_i = \left(\frac{\lambda_i - \lambda_{min}}{\lambda_{max}-\lambda_{min}}(S-1)+1\right)\frac{\lambda_{min}}{\lambda_i}$, where $S$ is the desired stiffness, $\lambda_{min} = \min_i |\lambda_i|$ and $\lambda_{max} = \max_i |\lambda_i|$. The columns of $A$ are then scaled by these factors, $\tilde{A}_{ij} = A_{ij} \tilde{\lambda}_j$. Tests were performed with a matrix of stiffness $10^{16}$: the stiffness was first reduced to $10^{10}$ and then increased by one order of magnitude after each training stage. The resulting error was 8% lower compared to not using this method.
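A sketch of this rescaling, under the assumption that $A$ has already been permuted to lower triangular form (so its eigenvalues are the diagonal entries and column $j$ is paired with $\lambda_j$; the function name is ours):

```python
import numpy as np

def reduce_stiffness(A, S):
    """Rescale the columns of a triangular matrix A so that the spread
    of its eigenvalue magnitudes matches the desired stiffness S (sketch)."""
    mag = np.abs(np.diag(A))       # |lambda_i| for triangular A
    lmin, lmax = mag.min(), mag.max()
    # lambda_tilde from the text, used as a per-column scale factor
    scale = ((mag - lmin) / (lmax - lmin) * (S - 1) + 1) * lmin / mag
    return A * scale[None, :]      # scale column j by lambda_tilde_j
```

Column scaling preserves triangularity, and the new diagonal spans magnitudes $[\lambda_{min}, S\,\lambda_{min}]$, so the condition number of the rescaled matrix is $S$.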

Gradual expansion of training data. The training data of the model is a list of times $t_i\in[0, t_T]$. In this method we gradually increase the upper limit each epoch up to the desired value, $t_T(epoch) = t_T \min\left(\frac{epoch}{v_{epoch}}, 1\right)$, where $v_{epoch}$ is the epoch at which the method starts training with all the data points. This allows the model to first learn the initial conditions and the behavior of quickly decaying nuclides, and only afterwards focus on the remaining time steps. Implementing this method reduced the final relative error by at most 18% compared to not using it. The effect depends on how quickly $t_T$ grows, which is set by $v_{epoch}$; averaged over different $v_{epoch}$, the reduction of the final relative error is 8.1%.
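The schedule amounts to a one-line filter on the training times (a hypothetical helper, with the `min` cap making the limit saturate at $t_T$ after $v_{epoch}$ epochs):

```python
import numpy as np

def training_times(ts, epoch, v_epoch):
    """Subset of training times available at a given epoch:
    t_T(epoch) = t_T * min(epoch / v_epoch, 1) (sketch)."""
    t_max = ts.max() * min(epoch / v_epoch, 1.0)
    return ts[ts <= t_max]
```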

Dynamic weight ratio. The weight ratio determines which loss ($\mathcal{L}_{IC}$ or $\mathcal{L}_{ODE}$) the network focuses on. Normally this ratio is chosen by the user to achieve the best result and does not change between epochs. This method instead uses a function, $f(epoch)$, to change the weight ratio dynamically. Each function starts from a given initial ratio $w = w_{IC}/w_{ODE} = w_0$. The following functions were tested:

  • Step function
$$f(epoch) = \begin{cases} w_0, & epoch \leq \alpha, \\ w_1, & epoch > \alpha, \end{cases}$$

where $w_1$ is the final desired ratio and $\alpha$ the cutoff point. We take $w_1 < w_0$, since we first want the network to learn the initial conditions and only then the differential equation.

  • Linear function
$$f(epoch) = \begin{cases} w_0-(w_0 - w_1)\frac{epoch}{\alpha}, & epoch \leq \alpha, \\ w_1, & epoch > \alpha. \end{cases}$$
  • Geometric decrease, $f(epoch)=w_0 \beta^{epoch}$, where $\beta < 1$ is the decrease rate. The problem with this schedule is that if the network does not learn the initial conditions at the beginning, it never will. The output is very sensitive to the choice of $\beta$.
  • A function that depends not on $epoch$ but on the ratio between $\mathcal{L}_{IC}$ and $\mathcal{L}_{ODE}$, such that:
$$f(epoch) = \begin{cases} w_1, & \log(\mathcal{L}_{IC}/\mathcal{L}_{ODE}) \leq \gamma, \\ w_0, & \log(\mathcal{L}_{IC}/\mathcal{L}_{ODE}) > \gamma, \end{cases}$$

where $\gamma$ is the cutoff ratio that switches between prioritizing $\mathcal{L}_{IC}$ over $\mathcal{L}_{ODE}$ and vice versa. This method allows the network to gradually decrease one loss and then switch to decreasing the other. The process repeats when the other loss gets low enough.

The performance of the model did not improve with any of these functions, even after extensive experimentation with the parameters $\alpha$, $\beta$, and $\gamma$. In all cases, setting the ratio to the constant value $w_1$ performed as well or better.
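For reference, the epoch-based schedules above can be sketched in one helper (the function and schedule names are ours; parameter names follow the text):

```python
def weight_ratio(epoch, w0, w1, alpha=None, beta=None, schedule="step"):
    """Tested weight-ratio schedules (sketch).

    step      : w0 until epoch alpha, then w1
    linear    : linear descent from w0 to w1 over alpha epochs
    geometric : w0 * beta**epoch with beta < 1
    """
    if schedule == "step":
        return w0 if epoch <= alpha else w1
    if schedule == "linear":
        return w0 - (w0 - w1) * epoch / alpha if epoch <= alpha else w1
    if schedule == "geometric":
        return w0 * beta ** epoch
    raise ValueError(f"unknown schedule: {schedule}")
```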
