Linear Algebra Autodiff (complex valued)

You can find the Julia implementations in BackwardsLinalg.jl and OMEinsum.jl.

Einsum

Definition of Einsum

einsum is defined as

$$O_{\vec o} = \sum_{(\vec a\cup \vec b)\backslash \vec o} A_{\vec a} B_{\vec b},$$

where $\vec a = a_1, a_2\dots$ are the labels that appear in tensor $A$, $\vec a\cup \vec b$ means the union of two sets of labels, and $\vec a\backslash \vec b$ means the set difference between two sets of labels. The above summation runs over all indices that do not appear in the output tensor $O$.
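For instance, with OMEinsum.jl (mentioned above), two concrete instances of this definition are (an illustration only, not part of the original derivation):

using OMEinsum

A, B = randn(3, 4), randn(4, 5)
@assert ein"ij,jk->ik"(A, B) ≈ A * B             # the shared label j is absent from the output, so it is summed
@assert ein"ij->i"(A) ≈ vec(sum(A, dims=2))      # single-tensor einsum: j is summed, i is kept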

Autodiff

Given $\overline O$, in order to obtain $\overline B \equiv \partial \mathcal{L}/\partial B$, consider the differentiation rule

Here, we have used the partial derivative relation

Then we define

We can readily verify

This backward rule is exactly an einsum that exchanges the output tensor $O$ and the input tensor $B$.

In conclusion, the index magic of exchanging indices gives the backward rule for a general einsum: the adjoint of an input tensor is again an einsum, with the roles of that input and the output exchanged.
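As a quick numerical illustration, here is a sketch using OMEinsum.jl; the einsum ij,jk->ik is ordinary matrix multiplication, and the tensors are kept real so the derivative/gradient distinction plays no role:

using OMEinsum

A, B = randn(3, 4), randn(4, 5)
Ō = randn(3, 5)                  # adjoint of the output O of ein"ij,jk->ik"(A, B)

# backward rule: exchange the output labels "ik" with B's labels "jk"
B̄ = ein"ij,ik->jk"(A, Ō)

# finite-difference check of one entry
ϵ = 1e-5
ΔB = zero(B); ΔB[2, 3] = ϵ
numeric = sum(Ō .* (ein"ij,jk->ik"(A, B + ΔB) .- ein"ij,jk->ik"(A, B))) / ϵ
@assert isapprox(B̄[2, 3], numeric; rtol = 1e-4)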

Thanks to Andreas Peter for helpful discussions.

Symmetric Eigenvalue Decomposition (ED)

references:

  1. https://arxiv.org/abs/1710.08717

We have

where $F_{ij}=(E_j- E_i)^{-1}$.
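A minimal Julia sketch of this backward rule may be useful; it assumes the commonly used form $\overline A = U\left[{\rm diag}(\overline E) + F\circ(U^\dagger \overline U)\right]U^\dagger$ in the Zygote gradient convention, with the diagonal of $F$ regularized to zero and a final Hermitian projection (both are implementation choices, and the function name is illustrative):

using LinearAlgebra

# E: eigenvalues, U: eigenvectors of a Hermitian A; dE, dU: their adjoints.
# η regularizes (nearly) degenerate eigenvalues and suppresses the diagonal of F.
function eigen_back(E, U, dE, dU; η=1e-40)
    F = E' .- E                     # F[i, j] = E[j] - E[i]
    F = F ./ (F .^ 2 .+ η)          # ≈ 1/(E[j] - E[i]) off the diagonal, ≈ 0 on it
    Ā = U * (Diagonal(dE) + F .* (U' * dU)) * U'
    (Ā + Ā') / 2                    # project back onto the Hermitian subspace
end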

If the spectrum $E$ is continuous, we define the density of states $\rho(E) = \sum\limits_k \delta(E-E_k)=-\frac{1}{\pi}\int_k \Im[G^r(E, k)]$ (check sign!), where $G^r(E, k) = \frac{1}{E-E_k+i\delta}$ is the retarded Green's function.
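As a quick check of the sign, using the standard limit of the Lorentzian,

$$\Im\left[\frac{1}{E-E_k+i\delta}\right] = \frac{-\delta}{(E-E_k)^2+\delta^2} \xrightarrow{\delta\rightarrow 0^+} -\pi\,\delta(E-E_k),$$

so that $-\frac{1}{\pi}\Im[G^r(E, k)] \rightarrow \delta(E-E_k)$, consistent with the sign above.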

We have

Singular Value Decomposition (SVD)

references:

  1. https://people.maths.ox.ac.uk/gilesm/files/NA-08-01.pdf
  2. https://j-towns.github.io/papers/svd-derivative.pdf
  3. https://arxiv.org/abs/1907.13422

Complex valued SVD is defined as $A = USV^\dagger$. For simplicity, we consider a full rank square matrix $A$. Differentiation gives

$$dA = dU\, S V^\dagger + U\, dS\, V^\dagger + U S\, dV^\dagger.$$

Defining the matrices $dC=U^\dagger dU$, $dD = dV^\dagger V$ and $dP = U^\dagger dA\, V$, we have

$$dP = dC\, S + dS + S\, dD.$$

Looking at the diagonal part of this equation, the diagonal elements of $dC\,S$ and $S\,dD$ have zero real part (since $dC$ and $dD$ are anti-Hermitian and $S$ is real), so that $dS = \Re[{\rm diag}(dP)]$.
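A quick numerical sanity check of the relation $dS = \Re[{\rm diag}(U^\dagger dA\, V)]$ (a sketch; the sizes and tolerances are arbitrary):

using LinearAlgebra

A  = randn(ComplexF64, 4, 4)
dA = 1e-6 * randn(ComplexF64, 4, 4)
U, S, V = svd(A)
dS_pred = real.(diag(U' * dA * V))        # first-order prediction
dS_true = svd(A + dA).S .- S              # actual change of the singular values
@assert isapprox(dS_pred, dS_true; rtol = 1e-3)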

It is easy to show that $\overline A_s = U^*\overline SV^T$. Notice that $\overline A$ here is the derivative rather than the gradient; the two differ by a complex conjugate, which is why we have a transpose rather than a conjugate here. See my complex valued autodiff blog for details.

Using the relations $dC^\dagger+dC=0$ and $dD^\dagger+dD=0$, we obtain

where $F_{ij} = \frac{1}{s_j^2-s_i^2}$, and it is easy to verify that $F^T = -F$. Notice that here the relation between the imaginary diagonal parts is lost

This missing diagonal imaginary part is definitely not trivial, but it was ignored for a long time until @refraction-ray (Shixin Zhang) pointed it out and solved it. Let's first focus on the off-diagonal contributions from $dU$

Here, we defined $J=F\circ(U^T\overline U)$.
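As an implementation aside (not part of the derivation), $F$ is usually built with a small regularization so that its diagonal and nearly degenerate singular values do not blow up; a sketch in Julia, with an illustrative function name:

# F[i, j] ≈ 1/(s[j]^2 - s[i]^2) off the diagonal, ≈ 0 on the diagonal
function fmatrix(s::AbstractVector; η=1e-40)
    s2 = s .^ 2
    F = s2' .- s2            # F[i, j] = s[j]^2 - s[i]^2
    F ./ (F .^ 2 .+ η)       # broadening trick x/(x^2 + η) ≈ 1/x
end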

By comparing with $d\mathcal L = {\rm Tr}\left[\overline{A}^TdA+h.c. \right]$, we have

Update: The missing diagonal imaginary part

Now let’s inspect the diagonal imaginary parts of $dC$ and $dD$ in Eq. 1. At first glance, the information in $dP$ is not sufficient to determine $dC$ and $dD$, but there is still a piece of information we have not used: the loss must be gauge invariant, which means

should be independent of the choice of gauge $\Lambda$, which is defined as ${\rm diag}(e^{i\phi_1}, e^{i\phi_2}, \dots)$.

Gauge invariance refers to

for any $\Lambda$, where $I$ refers to the diagonal mask matrix. It is of course valid in the limit $\Lambda\rightarrow1$, which gives $I\circ(\overline{U}^TU+\overline V^TV) = 0$.

Considering the contribution from the diagonal imaginary part, we have

where $\Lambda_J = \Im[I\circ(\overline U^TU)]= \frac 1 2I\circ(\overline U^TU)-h.c.$, with $I$ the mask for the diagonal part. Since only the real part contributes to $\delta \mathcal{L}$ (the imaginary part is canceled by the Hermitian conjugate counterpart), we can safely move $\Im$ from right to left.

Thanks to @refraction-ray (Shixin Zhang) for sharing his idea at the first opportunity. This is the issue for discussion. His arXiv preprint is coming out soon.

When $U$ is not full rank, this formula should include an extra term (Ref. 2)

Similarly, for $V$ we have

where $K=F\circ(V^T\overline V)$.

To wrap up

This result can be directly used in autograd.

For the gradient used in training, one should change the convention

This convention is used in TensorFlow and Zygote.jl, and it reads

where $J=F\circ(U^\dagger\overline{\mathcal U})$ and $K=F\circ(V^\dagger \overline{\mathcal V})$.

Rules

rule 1. ${\rm Tr} \left[A(C\circ B)\right] = \sum A^T\circ C\circ B = {\rm Tr} \left[(C\circ A^T)^TB\right]={\rm Tr}\left[(C^T\circ A)B\right]$
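A quick numerical check of rule 1 (illustration only; $\circ$ is the elementwise product, written .* in Julia):

using LinearAlgebra

A, B, C = randn(3, 3), randn(3, 3), randn(3, 3)
lhs = tr(A * (C .* B))
@assert lhs ≈ sum(transpose(A) .* C .* B)
@assert lhs ≈ tr(transpose(C .* transpose(A)) * B)
@assert lhs ≈ tr((transpose(C) .* A) * B)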

rule 2. $(C\circ A)^T = C^T \circ A^T$

rule 3. When $\mathcal L$ is real,

How to Test SVD

For example, to test the adjoint contribution from $U$, we can construct a gauge insensitive test function

using LinearAlgebra
using BackwardsLinalg: svd_back   # the SVD backward rule implemented in BackwardsLinalg.jl

# H is a random Hermitian matrix
H = let X = randn(ComplexF64, 10, 10)
    (X + X') / 2
end

# gauge insensitive loss: expectation value of H in the first left singular vector
function loss(A)
    U, S, V = svd(A)
    psi = U[:,1]
    psi'*H*psi
end

# adjoint of the loss with respect to A, pushed back through the SVD
function gradient(A)
    U, S, V = svd(A)
    dU = zero(U)
    dS = zero(S)
    dV = zero(V)
    dU[:,1] = U[:,1]'*H
    dA = svd_back(U, S, V, dU, dS, dV)
    dA
end

QR Decomposition

references:

  1. https://arxiv.org/abs/1710.08717
  2. https://arxiv.org/abs/1903.09650

The QR decomposition is defined as $A = QR$, with $Q^\dagger Q = \mathbb{I}$, so that $dQ^\dagger Q+Q^\dagger dQ=0$. $R$ is a complex upper triangular matrix, with real diagonal part. Differentiating $A=QR$ gives $dA = dQ\,R + Q\,dR$, hence

$$dC = Q^\dagger dQ + dR\, R^{-1},$$

where $dC=Q^\dagger dAR^{-1}$.

Then

Notice that $dRR^{-1}$ is upper triangular while its Hermitian conjugate is lower triangular; this restriction gives

where $U$ is a mask operator whose element value is $1$ for the upper triangular part, $0.5$ for the diagonal part and $0$ for the lower triangular part. One should also notice that both $R$ and $dR$ have real diagonal parts, as does the product $dRR^{-1}$.
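For concreteness, the mask $U$ can be written in Julia as follows (an illustrative helper, not from the references):

# upper-triangular mask: 1 above the diagonal, 0.5 on it, 0 below it
triu_mask(n) = [i < j ? 1.0 : (i == j ? 0.5 : 0.0) for i in 1:n, j in 1:n]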

Now let’s wrap up using the Zygote convention of gradients

Here, $M=R\overline{\mathcal{R}}^\dagger-\overline{\mathcal{Q}}^\dagger Q$. Plugging in $dR$, we have

where $L =U^\dagger = 1-U$ is the mask of the lower triangular part of a matrix.

Here, $\texttt{copyltu}$ takes the complex conjugate when copying elements to the upper triangular part.
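To make this concrete, here is a minimal Julia sketch of the resulting full-rank backward rule, assuming the closed form $\overline{\mathcal A} = \left[\overline{\mathcal Q} + Q\,\texttt{copyltu}(M)\right]R^{-\dagger}$ in the Zygote gradient convention (the closed form and the function names are assumptions consistent with the derivation above, not a verbatim transcription):

using LinearAlgebra

# copyltu: keep the lower triangle and copy its Hermitian conjugate to the upper
# triangle; the diagonal keeps only its real part (the imaginary part does not
# contribute, since dR R^{-1} has a real diagonal).
function copyltu(M::AbstractMatrix)
    out = copy(Matrix(M))
    n = LinearAlgebra.checksquare(out)
    for j in 1:n
        out[j, j] = real(out[j, j])
        for i in 1:j-1
            out[i, j] = conj(out[j, i])
        end
    end
    out
end

# full-rank QR backward: M = R R̄† - Q̄† Q, then Ā = (Q̄ + Q copyltu(M)) R^{-†}
function qr_back(Q, R, dQ, dR)
    M = R * dR' - dQ' * Q
    (dQ + Q * copyltu(M)) / R'      # right division applies R^{-†}
end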