Table of Contents
- Introduction
- FMM and the structured representation of matrices
- FMM level-1
- FMM level-2
- L and U using FMM
- Learning LU
- Limitations
- Some training tricks
- Experiments
- Evaluation
- Discrete Laplacian
- Convection-Diffusion Operator (Asymmetric PD)
- Bi-harmonic operator
- RBF Kernel (dense + low-rank)
- Conclusion
- References
Introduction
In our previous work[^1], we introduced the GFMM-block, a generalized parameterization of the Fast Multipole Method's (FMM) matrix-vector computational graph. We showed that gradient descent can be used to learn FMM-like representations of (inverses of) linear operators, instead of relying on hand-crafted FMM construction algorithms. This is useful when the operator structure is complex and no FMM construction is known for the operator. Moreover, the parameterization is efficient in the number of required parameters compared to a fully dense representation of the operator.
In this post, I want to extend this idea further and explore the possibility of learning factorizations of linear operators, specifically the LU factorization. Previous work[^2] tried to achieve this with gradient descent by representing L and U as dense linear layers and masking the upper and lower triangular parts of the learnable weight matrices, which is not an efficient parameterization. Inspired by the GFMM-block, we can represent L and U more compactly and thus save on memory as well as compute. This could be useful for large matrices, as the number of parameters in the GFMM-block representation grows only as $O(N)$.
This post is structured as follows:
- GFMM-block: I will describe the GFMM-block and how it can also be used to represent block-lower and block-upper triangular matrices.
- Training to factorize: I describe the training process and some tricks we used to improve training, such as the transpose loss and block-wise training.
- Examples: Numerical examples of LU factorization of the 1D discrete Laplacian, 1D and 2D convection-diffusion operators, 1D and 2D biharmonic operators, and a dense but low-rank covariance matrix from an RBF kernel.
- Thoughts: I will discuss the limitations of the current approach and some thoughts about future work.
FMM and the structured representation of matrices
FMM level-1
Before I describe the GFMM-block, I want to take a moment to introduce the FMM representation. Consider, for example, an exponential covariance matrix defined as $A_{ij} = e^{-|x_i - x_j|}$, where the $x_i$ are points on a 1D grid. This matrix appears frequently in ML applications (Gaussian Processes with exponential kernels) and physics (William Lyons' PhD thesis[^3] contains more such rank-structured matrix examples). This is a fully dense matrix. But if we block-partition it and observe the off-diagonal blocks, we notice that they are actually rank-1 (see a numerical example below).
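A quick check of this rank-1 structure (a minimal numpy sketch; the grid and the 2x2 block partition are arbitrary choices):

```python
import numpy as np

# Exponential covariance matrix on a 1D grid: A[i, j] = exp(-|x_i - x_j|)
n = 256
x = np.linspace(0.0, 1.0, n)
A = np.exp(-np.abs(x[:, None] - x[None, :]))

# Block-partition into a 2x2 layout and inspect an off-diagonal block
m = n // 2
A12 = A[:m, m:]

# The singular values reveal its (numerical) rank: everything beyond the
# first singular value sits at machine-precision level, i.e. the block is rank-1.
s = np.linalg.svd(A12, compute_uv=False)
print(s[:4] / s[0])
```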
Thus we can compactly represent the above matrix by keeping the diagonal blocks dense and storing each off-diagonal block in a low-rank factored form.
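In generic FMM/HSS-style notation (with $D_i$ the dense diagonal blocks and $U_i B_{ij} V_j^T$ the rank-$r$ factorizations of the off-diagonal blocks; these symbols are placeholder choices), a 2x2 level-1 partition takes the form

$$
A = \begin{bmatrix} D_1 & U_1 B_{12} V_2^T \\ U_2 B_{21} V_1^T & D_2 \end{bmatrix}.
$$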
Let us see how this affects the computation of the matrix-vector product $y = Ax$. Conformally partitioning the input as $x = [x_1; x_2]$ and the output as $y = [y_1; y_2]$, we get

$$
y_1 = D_1 x_1 + U_1 \big(B_{12} (V_2^T x_2)\big), \qquad
y_2 = D_2 x_2 + U_2 \big(B_{21} (V_1^T x_1)\big),
$$

where the intermediate multiplications are shown explicitly. This resembles a feed-forward computational graph, as shown below.
This way, each multiplier is a smaller sub-matrix (of size $n/2 \times n/2$ for the diagonal blocks, and involving only the low rank $r$ of the off-diagonal blocks otherwise). The full-rank diagonal blocks are only multiplied with the input sub-vectors and added back at the end of the computational graph. In this way, an FMM matrix-vector multiplication can be compactly represented as a feed-forward network and efficiently computed.
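The level-1 computation can be written as a tiny feed-forward function (a sketch; the argument names mirror the placeholder notation above, not an existing API):

```python
import numpy as np

def fmm_level1_matvec(D1, D2, U1, U2, V1, V2, B12, B21, x1, x2):
    """Level-1 FMM matrix-vector product:
    y1 = D1 @ x1 + U1 @ B12 @ (V2.T @ x2)
    y2 = D2 @ x2 + U2 @ B21 @ (V1.T @ x1)
    """
    # "Up" step: compress each input sub-vector to r coefficients
    z1 = V1.T @ x1
    z2 = V2.T @ x2
    # Bridge: couple the compressed representations
    w1 = B12 @ z2          # far-field contribution of block 2 to block 1
    w2 = B21 @ z1
    # "Down" step plus the dense diagonal (near-field) contribution
    y1 = D1 @ x1 + U1 @ w1
    y2 = D2 @ x2 + U2 @ w2
    return y1, y2
```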
FMM level-2
We can actually do even better than the above, by recursively partitioning the diagonal blocks $D_1$ and $D_2$ in the same 2x2 fashion and obtaining low-rank representations of the level-2 off-diagonal blocks (superscripts denote the level).
Moreover, the level-1 orthogonal matrices can actually be decomposed using the level-2 matrices, as sketched below.
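This is the standard nested-basis property of FMM/HSS representations. In the placeholder notation (with small transfer matrices $R_j$, $W_j$, which are again generic symbols), it reads

$$
U^{(1)}_1 = \begin{bmatrix} U^{(2)}_1 & \\ & U^{(2)}_2 \end{bmatrix}\begin{bmatrix} R_1 \\ R_2 \end{bmatrix},
\qquad
V^{(1)}_1 = \begin{bmatrix} V^{(2)}_1 & \\ & V^{(2)}_2 \end{bmatrix}\begin{bmatrix} W_1 \\ W_2 \end{bmatrix},
$$

and similarly $U^{(1)}_2$, $V^{(1)}_2$ are built from the level-2 bases of the third and fourth blocks.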
So we actually do not need to store the larger level-1 bases; we only need to store the smaller level-2 bases together with the transfer matrices, thus further reducing the memory complexity.
The matrix-vector multiplication now proceeds level by level: compress the input blocks (up-sweep), couple the compressed representations through the bridge connections at each level, then expand back down and add the dense diagonal contributions (down-sweep). The corresponding feed-forward network is shown below.
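A minimal Python sketch of this two-level computation (four leaf blocks, sibling-only level-2 couplings; all symbol names are generic HSS/FMM-style choices, and the dictionary-based bridge layout is an illustrative simplification):

```python
import numpy as np

def fmm_level2_matvec(x, D, U2, V2, Ru, Rv, B1, B2):
    """Two-level FMM matrix-vector product with four leaf blocks.

    D      : dense diagonal blocks (list of 4)
    U2, V2 : leaf-level column/row bases (lists of 4)
    Ru, Rv : transfer matrices of the nested bases (lists of 2)
    B1, B2 : level-1 and level-2 bridge couplings, keyed by (target, source)
    """
    xb = np.split(x, 4)

    # Up-sweep: compress leaf blocks, then merge siblings via transfer matrices.
    z2 = [V2[j].T @ xb[j] for j in range(4)]
    z1 = [Rv[p].T @ np.concatenate([z2[2 * p], z2[2 * p + 1]]) for p in range(2)]

    # Bridges: couple the compressed representations at each level.
    f = [np.zeros_like(z2[j]) for j in range(4)]
    for (i, j), Bij in B2.items():                       # level-2 couplings
        f[i] += Bij @ z2[j]
    g1 = [B1[(0, 1)] @ z1[1], B1[(1, 0)] @ z1[0]]        # level-1 couplings

    # Down-sweep: expand level-1 contributions back to the leaves.
    for p in range(2):
        h = Ru[p] @ g1[p]
        r = h.size // 2
        f[2 * p] += h[:r]
        f[2 * p + 1] += h[r:]

    # Leaf outputs: dense diagonal part plus the expanded far-field part.
    return np.concatenate([D[j] @ xb[j] + U2[j] @ f[j] for j in range(4)])
```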
The up-sweep and down-sweep recursions have a binary-tree structure and resemble the encoder and decoder of a U-Net-style architecture. The bridge connections should be block-diagonal for the given example, but as shown in the figure above, they can be made more general to accommodate more complicated rank structures in the FMM representation (the figure shows a block tri-diagonal bridge; bridges can also be made block lower/upper-triangular, penta-diagonal, Toeplitz, circulant, etc., depending on the structure of the underlying linear operator).
To summarize everything so far: we looked at how dense, rank-structured matrices can be represented using FMM, and how the dense matrix-vector multiply, which requires $O(N^2)$ operations, can be implemented in $O(N)$ using an FMM computational graph that also looks like a feed-forward (U-Net-like) network. We call this generalized FMM skeleton the GFMM-block.
L and U using FMM
As already noted above, we can use the GFMM-block to represent different structured matrices. To see how we can obtain a block-triangular matrix from a GFMM-block, notice in the architecture that each output block depends on the input blocks directly through the outer-most bridge connection. Specifically, if that bridge is purely block-diagonal, then the $i$'th output block depends only on the $i$'th input block through it; if it is block-tri-diagonal, it depends only on the neighbouring input blocks as well. So, to make the operator block-triangular, it would seem to suffice to make the bridge connections block-triangular too. There is just one catch: the interior bridge connections at the coarser levels act on aggregated sub-vectors, so even their diagonal blocks couple all the input blocks gathered under one tree node to all the corresponding output blocks. This pathway leaks signal in both directions, and every block of the output ends up depending on every block of the input. To get a block-triangular structure, we therefore need to make the interior bridge connections strictly upper- or strictly lower-triangular. To see this concretely, consider the worked example below (notice the lower-bi-diagonal outer-most bridge, but strictly lower-triangular inner bridges):
This computes the matrix-vector multiplication of a block-lower-triangular matrix $L$, from which we can read off its representation in terms of the network's parameters.
Similarly, by making the interior bridge connections strictly block-upper-triangular and the outer bridge block-upper-bi-diagonal, we can construct a block-upper-triangular matrix $U$.
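Reusing fmm_level2_matvec from the earlier sketch, the masking idea can be checked numerically: dropping the "upward" couplings (the level-1 block $(0,1)$ and the super-diagonal leaf couplings) leaves a lower-bi-diagonal outer bridge and a strictly lower inner bridge, and the materialized operator is indeed block-lower-triangular. This is only an illustration of the masking idea, not necessarily the exact bridge layout used in the post.

```python
import numpy as np

rng = np.random.default_rng(0)
m, r = 8, 3                                   # leaf block size and rank (arbitrary)
D  = [rng.standard_normal((m, m)) for _ in range(4)]
U2 = [rng.standard_normal((m, r)) for _ in range(4)]
V2 = [rng.standard_normal((m, r)) for _ in range(4)]
Ru = [rng.standard_normal((2 * r, r)) for _ in range(2)]
Rv = [rng.standard_normal((2 * r, r)) for _ in range(2)]
B1 = {(1, 0): rng.standard_normal((r, r)),    # strictly lower level-1 bridge
      (0, 1): np.zeros((r, r))}               # the upward coupling is masked out
B2 = {(i, i - 1): rng.standard_normal((r, r)) for i in (1, 2, 3)}  # sub-diagonal only

# Materialize the operator column by column and check the upper block-triangle.
A = np.stack([fmm_level2_matvec(e, D, U2, V2, Ru, Rv, B1, B2)
              for e in np.eye(4 * m)], axis=1)
print(all(np.allclose(A[i * m:(i + 1) * m, j * m:(j + 1) * m], 0.0)
          for i in range(4) for j in range(4) if j > i))   # prints True
```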
Learning LU
Given that we can represent block-lower-triangular and block-upper-triangular matrices in the form of feed-forward networks that compute their corresponding matrix-vector products, we now ask the question we are actually interested in: can we learn the LU factorization of any dense matrix?
In our MNO paper[^1], we showed that the FMM representations of the inverses of the 1D and 2D discrete Laplacians can be learnt by randomly initializing the parameters of a GFMM-block and training on synthetic data pairs $(b, x)$ with $b = Ax$, where $A$ is the Laplacian operator. We used a simple MSE loss $\|\hat{x} - x\|^2$, where $\hat{x}$ is the network's predicted output for a given input $b$. In this way, we do not need to actually compute $A^{-1}$ to train the network.
Similarly, our goal in learning the LU factorization of a matrix $A$ is to use the learnt network to solve $Ax = b$ without actually computing $L$ and $U$ via Gaussian elimination (even during training). So how do we actually train this LU network? Notice that by stacking the U and L GFMM-blocks sequentially (first $\hat{U}$, then $\hat{L}$), we are actually computing $\hat{L}\hat{U}b \approx Ab$. So we use a similar procedure as before to train this joint LU network:
- Randomly initialize the parameters of the L and U GFMM-block networks.
- For the input $e_i$ (the $i$'th column of the identity matrix), the stacked network should produce the $i$'th column of $A$. Thus the pairs $(e_i, Ae_i)$ form our input-groundtruth pairs.
- Use the MSE loss $\|\hat{L}\hat{U}e_i - Ae_i\|^2$ and backpropagate using SGD.
Once the network is trained, if needed, $L$ and $U$ can be read off separately by feeding the columns $e_i$ through each GFMM-block on its own. But of course, the network representations themselves are already compact.
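A sketch of this training loop (assuming matvec_A provides matrix-vector products with $A$, and that l_net and u_net are GFMM-block modules with triangular bridge masks; the module names and the plain-SGD setup are illustrative choices):

```python
import torch

def train_lu(l_net, u_net, matvec_A, n, epochs=100, lr=1e-3):
    params = list(l_net.parameters()) + list(u_net.parameters())
    opt = torch.optim.SGD(params, lr=lr)
    I = torch.eye(n)
    for _ in range(epochs):
        for i in torch.randperm(n):
            e_i = I[:, i]                   # i-th column of the identity
            target = matvec_A(e_i)          # i-th column of A (ground truth)
            pred = l_net(u_net(e_i))        # L(U(e_i)), i.e. (LU) e_i
            loss = torch.mean((pred - target) ** 2)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return l_net, u_net
```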
Limitations
- The main limitation of this proposed approach is the lack of pivoting, because the networks are static in their representations. This likely constrains the domain of learnable matrices to positive definite matrices, in which case the advantage of this learnable method over classical Gaussian elimination is only its memory efficiency.
- Another major limitation is numerical accuracy. Because the networks are composed of sequences of matrix-matrix multiplications, floating-point errors accumulate rapidly.
Because of these limitations, this method is probably better suited to computing approximate LU factors for use as preconditioners in iterative solvers.
Some training tricks
We found that training of the LU network could be improved with a couple of additional tricks:
Transpose loss
Firstly, note that $A^T b$ could also be computed using the same GFMM-block that represents a given matrix $A$, by simple changes in the computational graph and by transposing each of the weight matrices. We noticed that adding the error of this transposed product to the loss function helped training converge faster. To do this, we perform the forward pass twice for a given input, once without the transpose and once with it, and compute the total MSE loss w.r.t. the corresponding untransposed and transposed targets. This also improved the error patterns in the final approximation; see the numerical example in the Discrete Laplacian section below.
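A sketch of the combined loss (it assumes each GFMM-block also exposes a transposed forward pass, called matvec_T here, which runs the graph in reverse with transposed weights, and that matvec_A / matvec_AT provide products with $A$ and $A^T$; all of these names are hypothetical):

```python
import torch

def lu_loss_with_transpose(l_net, u_net, matvec_A, matvec_AT, e_i):
    target   = matvec_A(e_i)                       # A e_i
    target_t = matvec_AT(e_i)                      # A^T e_i
    pred   = l_net(u_net(e_i))                     # (L U) e_i
    # Hypothetical transposed forward pass: (L U)^T e_i = U^T L^T e_i
    pred_t = u_net.matvec_T(l_net.matvec_T(e_i))
    return (torch.mean((pred - target) ** 2)
            + torch.mean((pred_t - target_t) ** 2))
```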
Block-wise training
We noticed that when training the entire network end-to-end, large errors appear in the top-left diagonal blocks of $\hat{L}\hat{U}$ (see the Discrete Laplacian section for an example). To understand why that could be the case, take a look at the expanded form of the parameterized LU product from the previous example.
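For a generic block-triangular pair (not the GFMM parameterization itself, but the same counting argument applies), the $(i,j)$ block of the product is

$$
(\hat{L}\hat{U})_{ij} \;=\; \sum_{k=1}^{\min(i,j)} \hat{L}_{ik}\,\hat{U}_{kj},
$$

so the top-left block is a single product $\hat{L}_{11}\hat{U}_{11}$, while the bottom-right block of a $4\times4$ block partition is a sum of four such products.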
We can see that the bottom-right blocks have more parameters contributing to the values at those positions. So the network has more freedom to optimize and reduce the error for those blocks, whereas the top-left blocks are more constrained by fewer parameters. Is there a way to fix this without trading off the smaller errors in the bottom-right blocks? We found that there is. To see how, first observe that the parameters have a dependency pattern running from the top-left to the bottom-right: no block of the combined matrix depends on parameters that first appear in a later block row.
So we can do sequential optimization, starting by updating only the parameters appearing in the first block row, using the MSE loss of the first block row of the output. Once the error of this first block row converges to a sufficiently small value, we freeze these parameters and backpropagate using only the MSE loss of the second block row, and continue until the last block row. We observed that this resolved the issue of larger errors at the top-left locations without compromising the smaller errors at the bottom-right locations.
This way, at each block-row position, only the necessary parameters are updated, and we also get more fine-grained control over the errors at different locations of $\hat{L}\hat{U}$. It seems to me that this could also be leveraged to do pivoting during training, but I have not yet explored it.
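A sketch of this sequential schedule (params_for_block_row is a hypothetical helper that returns the L/U parameters influencing the $k$'th block row of the product; how to enumerate them depends on the GFMM-block implementation, so this is only an outline):

```python
import torch

def block_row(v, k, n_blocks):
    return v.reshape(n_blocks, -1)[k]          # k-th block row of a vector

def train_blockwise(l_net, u_net, matvec_A, n, n_blocks, steps=2000, lr=1e-3):
    I = torch.eye(n)
    for k in range(n_blocks):
        # Parameters of earlier block rows stay frozen at their converged values;
        # only the parameters affecting block row k (hypothetical helper below)
        # are handed to the optimizer.
        opt = torch.optim.SGD(params_for_block_row(l_net, u_net, k), lr=lr)
        for _ in range(steps):
            i = int(torch.randint(0, n, (1,)))
            e_i = I[:, i]
            pred = block_row(l_net(u_net(e_i)), k, n_blocks)
            target = block_row(matvec_A(e_i), k, n_blocks)
            loss = torch.mean((pred - target) ** 2)
            opt.zero_grad()
            loss.backward()
            opt.step()
```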
Experiments
Evaluation
We can use the following three metrics to evaluate the accuracy of the learnt LU network.
- Operator approximation error: $\|A - \hat{L}\hat{U}\|$, where $\hat{L}$ and $\hat{U}$ are the block-lower and block-upper triangular matrices represented by the L and U GFMM-blocks respectively, generated by simply feeding the columns of the identity matrix through each GFMM-block. This metric measures how well the block-LU factorization approximates the original matrix. We also visually inspect the error matrix $A - \hat{L}\hat{U}$ to check which areas of the matrix are approximated better or worse.
- True solution error: for a given linear system $Ax = b$, we want to know how well the LU approximation solves for $x$, which is the usual goal of any linear solver. So we compute the predicted solution $\hat{x}$, obtained by forward and back substitution with the generated $\hat{L}$, $\hat{U}$ matrices, and compare it with the ground-truth $x$. We also compare this solution error with that of the standard LU solver in LAPACK.
- Backward error: in realistic cases we of course do not have access to the ground truth, so the best we can do is measure the backward error of the computed solution $\hat{x}$ with respect to $A$ and $b$. Again, we measure the backward error for both the learnt LU network and the baseline LAPACK solver.
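A sketch of how these three metrics can be computed (the exact norms and normalizations are not spelled out in the post, so relative 2-norms are assumed here, as is the convention of drawing the ground-truth solutions from a standard normal; L_hat and U_hat are dense factors extracted from the trained GFMM-blocks):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve, solve_triangular

def evaluate(A, L_hat, U_hat, n_samples=1000, seed=0):
    rng = np.random.default_rng(seed)
    op_err = np.linalg.norm(A - L_hat @ U_hat) / np.linalg.norm(A)
    lu_piv = lu_factor(A)                          # LAPACK baseline
    sol, sol_ref, bwd, bwd_ref = [], [], [], []
    for _ in range(n_samples):
        x_true = rng.standard_normal(A.shape[0])
        b = A @ x_true
        # Forward and back substitution with the learned factors
        y = solve_triangular(L_hat, b, lower=True)
        x_hat = solve_triangular(U_hat, y, lower=False)
        x_ref = lu_solve(lu_piv, b)
        sol.append(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
        sol_ref.append(np.linalg.norm(x_ref - x_true) / np.linalg.norm(x_true))
        bwd.append(np.linalg.norm(A @ x_hat - b) / np.linalg.norm(b))
        bwd_ref.append(np.linalg.norm(A @ x_ref - b) / np.linalg.norm(b))
    return op_err, np.mean(sol_ref), np.mean(sol), np.mean(bwd_ref), np.mean(bwd)
```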
Now let us see some numerical examples.
Discrete Laplacian
First, let us take the simplest case, a 1D discrete Laplacian matrix, to approximate. For all the 1D experiments, I use a 256x256 matrix with block size 32 in the GFMM-blocks.
The following graph shows the training losses of the LU network for three different configurations: without the transpose loss, without block-wise training, and finally with both the transpose loss and block-wise training, as described earlier. We see that with the transpose loss, training converges faster.

The following are the evaluation metrics for the three networks. For the solution and backward errors, in all experiments, the errors are averaged over 1000 samples drawn from a random normal distribution.
| Configuration | Op. approx. error | Solution error (LAPACK) | Solution error (LU net) | Backward error (LAPACK) | Backward error (LU net) |
| --- | --- | --- | --- | --- | --- |
| Without transpose loss | 2.0E-3 | 0.024 | 0.370 | 1.72E-8 | 3.59E-7 |
| Without block-wise training | 6.6E-2 | 0.024 | | 1.72E-8 | |
| Transpose loss + block-wise training | 9.6E-3 | 0.024 | 0.129 | 1.72E-8 | 4.06E-7 |
The following figure shows the absolute error of the approximated operator, $|A - \hat{L}\hat{U}|$.

When trained without the transpose loss, we can also see that the approximation is an asymmetric matrix, whereas the actual Laplacian matrix is symmetric. Using the transpose loss, not only does training converge faster, but the approximation also obeys the symmetric structure of the underlying operator. This is because the updates to the L and U networks are made symmetric by adding the transpose loss: during backpropagation, both the L and U weights appear symmetrically, and at the same time, in the computational graph.
When trained without block-wise training, we can see that large errors accumulate in the top-left part of the matrix, as described in the previous section. This actually leads to a very large condition number of the approximated matrix and, consequently, the errors blew up when solving $Ax = b$. But with block-wise training, the errors are more or less evenly distributed along the diagonal.
Now let us see some more complicated operators.
Convection-Diffusion Operator (Asymmetric PD)
The convection-diffusion operator $-\nu\,\Delta u + \mathbf{c}\cdot\nabla u$, discretized using an up-wind finite difference scheme to keep the operator positive definite; the presence of both diffusion and convection terms introduces asymmetry.
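For reference, a 1D version of this matrix with first-order upwinding can be assembled as follows (the coefficients and grid size are illustrative; the post does not specify the exact values used):

```python
import numpy as np

def convection_diffusion_1d(n, nu=1.0, c=10.0):
    h = 1.0 / (n + 1)
    A = np.zeros((n, n))
    for i in range(n):
        # Diffusion: -nu * u'' with the standard 3-point stencil
        A[i, i] += 2.0 * nu / h**2
        if i > 0:
            A[i, i - 1] -= nu / h**2
        if i < n - 1:
            A[i, i + 1] -= nu / h**2
        # Convection: c * u' with upwind (backward) differences for c > 0
        A[i, i] += c / h
        if i > 0:
            A[i, i - 1] -= c / h
    return A

A = convection_diffusion_1d(256)
print(np.linalg.cond(A))
```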
1D
| Op. approx. error | Solution error (LAPACK) | Solution error (LU net) | Backward error (LAPACK) | Backward error (LU net) |
| --- | --- | --- | --- | --- |
| 4.9E-3 | 0.345 | 29380.7 | 1.84E-8 | 1.17E-5 |

Condition number of the operator: 391199. Condition number of the learned approximation $\hat{L}\hat{U}$: 4236075520.
2D
| Op. approx. error | Solution error (LAPACK) | Solution error (LU net) | Backward error (LAPACK) | Backward error (LU net) |
| --- | --- | --- | --- | --- |
| 1.6E-4 | 1.8E-4 | 2.1E-3 | 3.15E-8 | 6.19E-7 |
Bi-harmonic operator
A 4th-order system to really stress numerical stability. In 1D, the matrix is pentadiagonal (stencil: 1, -4, 6, -4, 1). The condition number grows as $O(n^4)$, rather than $O(n^2)$ as for the Laplacian. It is used in continuum mechanics to model the bending of thin plates or the stream function in Stokes flow. The ill-conditioning of the operator makes the entries of $L$ and $U$ vary greatly in magnitude, which challenges the dynamic range of the network's weights.
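A 1D version of this matrix (with the simplest boundary truncation, just for illustration) and its conditioning:

```python
import numpy as np

def biharmonic_1d(n):
    # Pentadiagonal stencil (1, -4, 6, -4, 1)
    A = np.zeros((n, n))
    for offset, value in {0: 6.0, 1: -4.0, -1: -4.0, 2: 1.0, -2: 1.0}.items():
        A += value * np.eye(n, k=offset)
    return A

A = biharmonic_1d(256)
print(np.linalg.cond(A))   # grows like O(n^4), vs O(n^2) for the Laplacian
```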
1D
| Op. approx. error | Solution error (LAPACK) | Solution error (LU net) | Backward error (LAPACK) | Backward error (LU net) |
| --- | --- | --- | --- | --- |
| 9.5E-4 | 41.95 | 1.02E+19 | 2.11E-8 | 4.95E-5 |

Condition number of the operator: 44740336. Condition number of the learned approximation $\hat{L}\hat{U}$: 2534652160.
2D
| Op. approx. error | Solution error (LAPACK) | Solution error (LU net) | Backward error (LAPACK) | Backward error (LU net) |
| --- | --- | --- | --- | --- |
| 0.078 | 0.022 | 0.030 | 3.39E-8 | 8.68E-7 |
RBF Kernel (dense + low-rank)
This kind of matrix arises in the discretization of Fredholm integral equations and as the covariance matrix in Gaussian Processes (e.g., using a Radial Basis Function or Matérn kernel). The matrix is dense, but the off-diagonal blocks have low numerical rank. This tests whether the block-LU architecture can learn to approximate these low-rank blocks (effectively performing an H-matrix approximation). These matrices are symmetric positive definite, so pivoting is strictly not required (Cholesky factorization is stable).
| Op. approx. error | Solution error (LAPACK) | Solution error (LU net) | Backward error (LAPACK) | Backward error (LU net) |
| --- | --- | --- | --- | --- |
| 0.243 | 13.12 | | 2.29E-8 | 6.42E-7 |

Condition number of the operator: 1201921. Condition number of the learned approximation $\hat{L}\hat{U}$: 5652878336.

We see that the network is able to produce decent approximations of even the more complex operators, as shown by the absolute approximation errors and backward errors. But if the underlying matrix is ill-conditioned, then the approximate LU is not good for solving $Ax = b$, as we anticipated at the outset given the absence of pivoting.
Conclusion
Even though the lack of pivoting makes this learning-based LU approximation, built from FMM blocks, numerically inferior to classical Gaussian-elimination-based methods, these experiments provide good empirical evidence that structured factorizations can be learnt using gradient descent. For example, learning a QR factorization instead could side-step the numerical issues caused by the lack of pivoting in the training pipeline.
References
[^1]: MNO: A Multi-modal Neural Operator for Parametric Nonlinear BVPs. Link to paper.
[^2]: LU-Net: Invertible Neural Networks Based on Matrix Factorization. Link to paper.
[^3]: W. Lyons, Fast Algorithms with Applications to PDEs, PhD thesis. Link to PhD thesis.