$$O_{\vec o} = \sum\limits_{(\vec a \cup \vec b \cup \vec c \ldots) \backslash \vec o} A_{\vec a}B_{\vec b}C_{\vec c}\ldots,$$ where $\vec a = a_1, a_2\ldots$ are the labels that appear in tensor $A$, $\vec a\cup\vec b$ means the union of two sets of labels, and $\vec a\backslash\vec b$ means the setdiff between two sets of labels. The summation above runs over all indices that do not appear in the output tensor $O$.
Given $\overline O$, in order to obtain $\overline B \equiv \partial\mathcal{L}/\partial B$, consider the differential rule
$$\begin{align} \delta \mathcal{L} &= \sum\limits_{\vec o} \overline O_{\vec o}\, \delta O_{\vec o} \\ &=\sum\limits_{\vec o\cup\vec a \cup \vec b\cup \vec c \ldots} \overline O_{\vec o}A_{\vec a}\,\delta B_{\vec b}C_{\vec c} \ldots \end{align}$$ Here, we have used the differential relation
$$\delta O_{\vec o} = \sum\limits_{(\vec a \cup \vec b \cup \vec c \ldots) \backslash \vec o }A_{\vec a}\,\delta B_{\vec b}C_{\vec c} \ldots$$ Then we define
$$\overline B_{\vec b} = \sum\limits_{(\vec a \cup \vec b \cup \vec c \ldots) \backslash \vec b }A_{\vec a}\overline O_{\vec o}C_{\vec c} \ldots$$ We can readily verify
$$\delta \mathcal{L} = \sum\limits_{\vec b} \overline B_{\vec b}\, \delta B_{\vec b}$$ This backward rule is exactly an einsum that exchanges the output tensor $O$ and the input tensor $B$.
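As a sanity check, the exchange rule can be verified with finite differences. The sketch below uses NumPy for convenience (the Julia code later in this post plays the analogous role); the particular einsum string and the weight tensor `W` are made up for the test, with $\mathcal L = \sum W\circ O$ so that $\overline O = W$:

```python
import numpy as np

# Forward: O_il = sum_jk A_ij B_jk C_kl, i.e. einsum("ij,jk,kl->il").
# Backward rule for B: exchange O and B in the einsum,
#   Bbar_jk = einsum("ij,il,kl->jk", A, Obar, C).
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 5))
B = rng.normal(size=(5, 6))
C = rng.normal(size=(6, 3))
W = rng.normal(size=(4, 3))   # fixed weights, so Obar = W for L = sum(W * O)

def loss(B):
    return np.sum(W * np.einsum("ij,jk,kl->il", A, B, C))

Bbar = np.einsum("ij,il,kl->jk", A, W, C)   # O and B exchanged

# central finite differences over every entry of B
eps = 1e-6
num = np.zeros_like(B)
for j in range(B.shape[0]):
    for k in range(B.shape[1]):
        dB = np.zeros_like(B); dB[j, k] = eps
        num[j, k] = (loss(B + dB) - loss(B - dB)) / (2 * eps)

print(np.allclose(Bbar, num, atol=1e-6))
```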
In conclusion, the index magic of exchanging indices as a backward rule holds for einsum.
Thanks to Andreas Peter for the helpful discussion.
references:
https://arxiv.org/abs/1710.08717
Given the eigendecomposition $A = UEU^\dagger$, we have
$$\overline{A} = U\left[\overline{E} + \frac{1}{2}\left(F\circ (U^\dagger \overline{U}) + h.c.\right)\right]U^\dagger,$$ where $F_{ij}=(E_j- E_i)^{-1}$.
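This adjoint can be checked numerically in the real symmetric case, where derivative and gradient coincide. Below is a minimal NumPy sketch with a gauge-insensitive loss $\psi^\dagger H\psi$ on the ground-state eigenvector; the symmetric test matrix `H` is made up for the check:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
H = rng.normal(size=(n, n)); H = H + H.T   # fixed symmetric observable

def loss(A):
    # symmetrize inside so that entrywise finite differences are valid
    E, U = np.linalg.eigh((A + A.T) / 2)
    psi = U[:, 0]          # ground state; the sign gauge cancels in psi' H psi
    return psi @ H @ psi

A = rng.normal(size=(n, n)); A = A + A.T
E, U = np.linalg.eigh(A)
Ubar = np.zeros_like(U); Ubar[:, 0] = 2 * H @ U[:, 0]   # dL/dpsi = 2 H psi
Ebar = np.zeros(n)

# F_ij = 1/(E_j - E_i), zero on the diagonal
F = E[None, :] - E[:, None]
np.fill_diagonal(F, np.inf); F = 1 / F
inner = F * (U.T @ Ubar)                   # F ∘ (U^† Ubar), real case
Abar = U @ (np.diag(Ebar) + (inner + inner.T) / 2) @ U.T

eps = 1e-6
num = np.zeros_like(A)
for i in range(n):
    for j in range(n):
        dA = np.zeros_like(A); dA[i, j] = eps
        num[i, j] = (loss(A + dA) - loss(A - dA)) / (2 * eps)

print(np.allclose(Abar, num, atol=1e-5))
```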
If $E$ is continuous, we define the density $\rho(E) = \sum\limits_k \delta(E-E_k)=-\frac{1}{\pi}\int_k \Im[G^r(E, k)]$ (check the sign!), where $G^r(E, k) = \frac{1}{E-E_k+i\delta}$.
We have
$$\overline{A} = U\left[\overline{E} + \frac{1}{2}\left((U^\dagger\overline{U}) \circ \Re [G(E_i, E_j)] + h.c.\right)\right]U^\dagger$$

references:
https://people.maths.ox.ac.uk/gilesm/files/NA-08-01.pdf
https://j-towns.github.io/papers/svd-derivative.pdf
https://arxiv.org/abs/1907.13422
Complex-valued SVD is defined as $A = USV^\dagger$. For simplicity, we consider a full-rank square matrix $A$. Differentiation gives
$$dA = dU\,SV^\dagger + U\,dS\,V^\dagger + US\,dV^\dagger$$ $$U^\dagger dA\, V = U^\dagger dU\, S + dS + S\,dV^\dagger V$$ Defining the matrices $dC=U^\dagger dU$, $dD = dV^\dagger V$ and $dP = U^\dagger dA V$, we have
$$\begin{cases}dC^\dagger+dC=0,\\dD^\dagger +dD=0.\end{cases}$$ We have
$$dP = dC\, S + dS + S\,dD,$$ where $dCS$ and $SdD$ have zero real part on their diagonal elements, so that $dS = \Re[{\rm diag}(dP)]$.
$$\begin{aligned} d\mathcal{L} &= {\rm Tr}\left[\overline{A}^TdA+\overline{A^*}^TdA^*\right]\\ &= {\rm Tr}\left[\overline{A}^TdA+dA^\dagger\overline{A}^*\right] \quad \#rule~3 \end{aligned}$$ It is easy to show $\overline A_S = U^*\overline SV^T$. Notice that $\overline A$ here is the derivative rather than the gradient; they differ by a complex conjugate, which is why we have a transpose rather than a conjugate transpose here. See my complex-valued autodiff blog for details.
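The $\overline A_S = U^*\overline S V^T$ piece alone is easy to verify numerically. In the NumPy sketch below (the weight vector `w` is made up for the test), $\mathcal L = \sum_k w_k s_k$ so that $\overline S = w$; everything is real, so derivative and gradient coincide:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.normal(size=(n, n))
w = rng.normal(size=n)      # fixed weights: L = sum(w * S), so Sbar = w

def loss(A):
    return np.linalg.svd(A, compute_uv=False) @ w

U, s, Vh = np.linalg.svd(A)       # A = U @ diag(s) @ Vh, with Vh = V^dagger
Abar = U @ np.diag(w) @ Vh        # real case of U* Sbar V^T

eps = 1e-6
num = np.zeros_like(A)
for i in range(n):
    for j in range(n):
        dA = np.zeros_like(A); dA[i, j] = eps
        num[i, j] = (loss(A + dA) - loss(A - dA)) / (2 * eps)

print(np.allclose(Abar, num, atol=1e-5))
```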
Using the relations $dC^\dagger+dC=0$ and $dD^\dagger+dD=0$,
$$\begin{cases} dPS + SdP^\dagger = dC S^2-S^2dC,\\ SdP + dP^\dagger S = S^2dD-dD S^2, \end{cases}$$ $$\begin{cases} dC = F\circ(dPS+SdP^\dagger),\\ dD = -F\circ (SdP+dP^\dagger S), \end{cases}$$ where $F_{ij} = \frac{1}{s_j^2-s_i^2}$; it is easy to verify $F^T = -F$. Notice that here the relation between the imaginary diagonal parts is lost:
$$\color{red}{\Im[I\circ dP] = \Im[I\circ(dC+dD)]S}$$ This missing diagonal imaginary part is definitely not trivial, but it was ignored for a long time until @refraction-ray (Shixin Zhang) pointed it out and solved it. Let's first focus on the off-diagonal contributions from $dU$:
$$\begin{align} {\rm Tr}\,\overline U^TdU &= {\rm Tr}\, \overline U ^TU dC + \overline U^T (I-UU^\dagger) dA\,VS^{-1}\\ &= {\rm Tr}\,\overline U^T U (F\circ(dPS+SdP^\dagger))\\ &= {\rm Tr}(dPS+SdP^\dagger)(-F\circ (\overline U^T U)) \quad \# rule~1,2\\ &= {\rm Tr}(dPS+SdP^\dagger)J^T \end{align}$$ Here, we defined $J=F\circ(U^T\overline U)$.
$$\begin{align*} d\mathcal L &= {\rm Tr} (dPS+SdP^\dagger)(J+J^\dagger)^T\\ &= {\rm Tr}\, dP\,S(J+J^\dagger)^T+h.c.\\ &= {\rm Tr}\, U^\dagger dA\, V S(J+J^\dagger)^T+h.c.\\ &= {\rm Tr}\left[ VS(J+J^\dagger)^TU^\dagger\right] dA+h.c. \end{align*}$$ By comparing with $d\mathcal L = {\rm Tr}\left[\overline{A}^TdA+h.c. \right]$, we have
$$\begin{align} \bar A_U^{(\rm real)} &= \left[VS(J+J^\dagger)^TU^\dagger\right]^T\\ &=U^*(J+J^\dagger)SV^T \end{align}$$ Now let's inspect the diagonal imaginary parts of $dC$ and $dD$. At first glance, it seems insufficient to derive $dC$ and $dD$ from $dP$ alone, but there is still one piece of information unused: the loss must be gauge invariant, which means
$$\mathcal{L}(U\Lambda, S, V\Lambda)$$ should be independent of the choice of gauge $\Lambda$, which is defined as ${\rm diag}(e^{i\phi}, \ldots)$.
$$\begin{aligned} d\mathcal{L} &={\rm Tr}[ \overline{U\Lambda}^T d(U\Lambda) +\overline SdS+\overline{V\Lambda}^Td(V\Lambda)] + h.c.\\ &={\rm Tr}[ \overline {U\Lambda}^T (dU\Lambda+Ud\Lambda) +\overline{S}dS+ \overline{V\Lambda}^T(Vd\Lambda +dV\Lambda)] + h.c.\\ &= {\rm Tr}[(\overline{U\Lambda}^TU+\overline{V\Lambda}^TV )d\Lambda ] + \ldots + h.c. \end{aligned}$$ Gauge invariance refers to
$$\overline{\Lambda} = I\circ(\overline{U\Lambda}^TU+\overline{V\Lambda}^TV) = 0$$ for any $\Lambda$, where $I$ refers to the diagonal mask matrix. It is of course valid when $\Lambda\rightarrow 1$: $I\circ(\overline{U}^TU+\overline V^TV) = 0$.
Considering the contribution from the diagonal imaginary part, we have
$$\begin{aligned} &{\rm Tr} \left[\overline U^T U (I \circ \Im [dC])+\overline V^T V (I \circ \Im [dD^\dagger])\right] + h.c.\\ &={\rm Tr} \left[ I \circ (\overline U^T U)\Im [dC]-I\circ (\overline V^T V) \Im [dD]\right] +h.c. \quad \# rule~1\\ &={\rm Tr} \left[ I \circ (\overline U^T U)(\Im [dC]+ \Im [dD])\right] \\ &={\rm Tr}\left[I\circ (\overline U^T U)\, \Im[dP]S^{-1}\right] \\ &={\rm Tr}\left[S^{-1}\Lambda_J U^{\dagger}dA\, V\right] \end{aligned}$$ where $\Lambda_J = \Im[I\circ(\overline U^TU)]= \frac{1}{2}\left(I\circ(\overline U^TU)-h.c.\right)$, with $I$ the mask for the diagonal part. Since only the real part contributes to $\delta \mathcal{L}$ (the imaginary part will be canceled by the Hermitian conjugate counterpart), we can safely move $\Im$ from right to left.
$$\color{red}{\bar A_{U+V}^{(\rm imag)} = U^*\Lambda_J S^{-1}V^T}$$ Thanks to @refraction-ray (Shixin Zhang) for sharing his idea at an early stage; see this issue for the discussion. His arXiv preprint is coming out soon.
When $U$ is not full rank, this formula should include an extra term (Ref. 2):
$$\begin{aligned} \bar A_U^{(\rm real)} &=U^*(J+J^\dagger)SV^T + (VS^{-1}\overline U^T(I-UU^\dagger))^T \end{aligned}$$ Similarly, for $V$ we have
$$\begin{aligned} \overline A_V^{(\rm real)} &=U^*S(K+K^\dagger)V^T + (U S^{-1} \overline V^T (I - VV^\dagger))^*, \end{aligned}$$ where $K=F\circ(V^T\overline V)$.
To wrap up,
$$\overline A = \overline A_U^{\rm (real)} + \overline A_S + \overline A_V^{\rm (real)} + \overline A_{U+V}^{\rm (imag)}$$ This result can be directly used in autograd.
For the gradient used in training, one should change the convention:
$$\mathcal{\overline A} = \overline A^*,\quad \mathcal{\overline U} = \overline U^*,\quad \mathcal{\overline V}= \overline V^*.$$ This convention is used in tensorflow and Zygote.jl, which gives
$$\begin{aligned} \mathcal{\overline A} =\ & U(\mathcal{J}+\mathcal{J}^\dagger)SV^\dagger + (I-UU^\dagger)\mathcal{\overline U}S^{-1}V^\dagger\\ &+ U\overline SV^\dagger\\ &+US(\mathcal{K}+\mathcal{K}^\dagger)V^\dagger + U S^{-1} \mathcal{\overline V}^\dagger (I - VV^\dagger)\\ &\color{red}{+\frac{1}{2} U \left(I\circ(U^\dagger\mathcal{\overline U})-h.c.\right)S^{-1}V^\dagger} \end{aligned}$$ where $\mathcal{J}=F\circ(U^\dagger\mathcal{\overline U})$ and $\mathcal{K}=F\circ(V^\dagger \mathcal{\overline V})$.
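The complete formula can be finite-difference tested on a gauge-invariant loss. The NumPy sketch below assumes a square full-rank $A$ (so the $(I-UU^\dagger)$ and $(I-VV^\dagger)$ projector terms vanish) and a made-up loss $\psi^\dagger H\psi$ on the leading left singular vector; the numerical gradient is assembled as $\partial\mathcal L/\partial\Re A + i\,\partial\mathcal L/\partial\Im A$, matching the tensorflow/Zygote convention:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
cplx = lambda size: rng.normal(size=size) + 1j * rng.normal(size=size)
H = cplx((n, n)); H = H + H.conj().T         # fixed Hermitian observable

def loss(A):
    U = np.linalg.svd(A)[0]
    psi = U[:, 0]                             # the phase gauge of psi cancels in psi† H psi
    return (psi.conj() @ H @ psi).real

A = cplx((n, n))
U, s, Vh = np.linalg.svd(A)                   # A = U @ diag(s) @ Vh, V = Vh†
V = Vh.conj().T
Ubar = np.zeros_like(U); Ubar[:, 0] = 2 * H @ U[:, 0]   # gradient-convention adjoint of the loss
Vbar = np.zeros_like(V)
Sbar = np.zeros(n)

F = s[None, :]**2 - s[:, None]**2             # F_ij = 1/(s_j^2 - s_i^2), zero diagonal
np.fill_diagonal(F, np.inf); F = 1 / F
S, Sinv = np.diag(s), np.diag(1 / s)
J = F * (U.conj().T @ Ubar)
K = F * (V.conj().T @ Vbar)
D = np.diag(np.diag(U.conj().T @ Ubar))       # diagonal part I ∘ (U† Ubar)
Abar = (U @ (J + J.conj().T) @ S @ Vh
        + U @ np.diag(Sbar) @ Vh
        + U @ S @ (K + K.conj().T) @ Vh
        + 0.5 * U @ (D - D.conj().T) @ Sinv @ Vh)   # the red diagonal imaginary-part term

eps = 1e-6
num = np.zeros((n, n), dtype=complex)
for i in range(n):
    for j in range(n):
        d = np.zeros((n, n)); d[i, j] = eps
        gre = (loss(A + d) - loss(A - d)) / (2 * eps)
        gim = (loss(A + 1j * d) - loss(A - 1j * d)) / (2 * eps)
        num[i, j] = gre + 1j * gim            # dL/dRe(A) + i dL/dIm(A)

print(np.allclose(Abar, num, atol=1e-4))
```

Note that without the red term the check fails for complex $A$: the mismatch is exactly the diagonal imaginary-part contribution discussed above.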
rule 1. ${\rm Tr} \left[A(C\circ B)\right] = \sum A^T\circ C\circ B = {\rm Tr}\left( (C\circ A^T)^TB\right)={\rm Tr}\left((C^T\circ A)B\right)$
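Rule 1 is a pure index-shuffling identity and is cheap to confirm numerically (NumPy, with `*` as the Hadamard product $\circ$):

```python
import numpy as np

rng = np.random.default_rng(4)
A, B, C = (rng.normal(size=(3, 3)) for _ in range(3))

lhs = np.trace(A @ (C * B))     # Tr[A (C ∘ B)]
mid = np.sum(A.T * C * B)       # elementwise sum of A^T ∘ C ∘ B
rhs = np.trace((C.T * A) @ B)   # Tr[(C^T ∘ A) B]
print(np.allclose(lhs, mid), np.allclose(lhs, rhs))
```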
rule 2. $(C\circ A)^T = C^T \circ A^T$
rule 3. When $\mathcal L$ is real,
$$\frac{\partial \mathcal{L}}{\partial x^*} = \left(\frac{\partial \mathcal{L}}{\partial x}\right)^*$$ E.g., to test the adjoint contribution from $U$, we can construct a gauge-insensitive test function:
```julia
# H is a predefined Hermitian matrix; svd_back is the backward rule derived above.
function loss(A)
    U, S, V = svd(A)
    psi = U[:, 1]      # the phase of psi is gauge dependent,
    psi' * H * psi     # but this expectation value is not
end

function gradient(A)
    U, S, V = svd(A)
    dU = zero(U)
    dS = zero(S)
    dV = zero(V)
    dU[:, 1] = vec(U[:, 1]' * H)   # adjoint of the loss w.r.t. the first column of U
    dA = svd_back(U, S, V, dU, dS, dV)
    dA
end
```
references:
https://arxiv.org/abs/1710.08717
https://arxiv.org/abs/1903.09650
$A = QR$ with $Q^\dagger Q = \mathbb{I}$, so that $dQ^\dagger Q+Q^\dagger dQ=0$. $R$ is a complex upper triangular matrix with a real diagonal.
$$dA = dQ\,R+Q\,dR$$ $$dQ = dA\,R^{-1}-Q\,dR\,R^{-1}$$ $$\begin{cases} Q^\dagger dQ = dC - dR\,R^{-1}\\ dQ^\dagger Q =dC^\dagger - R^{-\dagger}dR^\dagger \end{cases}$$ where $dC=Q^\dagger dA\,R^{-1}$.
Then
$$dC+dC^\dagger = dR\,R^{-1} +(dR\,R^{-1})^\dagger$$ Notice that $dR\,R^{-1}$ is upper triangular while its Hermitian conjugate is lower triangular; this restriction gives
$$U\circ(dC+dC^\dagger) = dR\,R^{-1},$$ where $U$ is a mask operator whose element value is $1$ for the upper triangular part, $0.5$ for the diagonal part, and $0$ for the lower triangular part. One should also notice that both $R$ and $dR$ have real diagonal parts, as does the product $dR\,R^{-1}$.
Now let's wrap up using the Zygote convention of gradients:
$$\begin{align} d\mathcal L &= {\rm Tr}\left[\overline{\mathcal{Q}}^\dagger dQ+\overline{\mathcal{R}}^\dagger dR +h.c. \right]\\ &={\rm Tr}\left[\overline{\mathcal{Q}}^\dagger dA\, R^{-1}-\overline{\mathcal{Q}}^\dagger Q\,dR\, R^{-1}+\overline{\mathcal{R}}^\dagger dR +h.c. \right]\\ &={\rm Tr}\left[ R^{-1}\overline{\mathcal{Q}}^\dagger dA+ R^{-1}(-\overline{\mathcal{Q}}^\dagger Q +R\overline{\mathcal{R}}^\dagger) dR +h.c. \right]\\ &={\rm Tr}\left[ R^{-1}\overline{\mathcal{Q}}^\dagger dA+ R^{-1}M\, dR +h.c. \right] \end{align}$$ here, $M=R\overline{\mathcal{R}}^\dagger-\overline{\mathcal{Q}}^\dagger Q$. Plugging in $dR$, we have
$$\begin{align} d\mathcal{L}&={\rm Tr}\left[ R^{-1}\overline{\mathcal{Q}}^\dagger dA + M \left[U\circ(dC+dC^\dagger)\right] +h.c. \right]\\ &={\rm Tr}\left[ R^{-1}\overline{\mathcal{Q}}^\dagger dA + (M\circ L)(dC+dC^\dagger) +h.c. \right] \quad \# rule~1\\ &={\rm Tr}\left[ (R^{-1}\overline{\mathcal{Q}}^\dagger dA+h.c.) + (M\circ L)(dC + dC^\dagger)+ (M\circ L)^\dagger (dC + dC^\dagger)\right]\\ &={\rm Tr}\left[ R^{-1}\overline{\mathcal{Q}}^\dagger dA + (M\circ L+h.c.)dC + h.c.\right]\\ &={\rm Tr}\left[ R^{-1}\overline{\mathcal{Q}}^\dagger dA + (M\circ L+h.c.)Q^\dagger dA\,R^{-1}\right]+h.c. \end{align}$$ where $L =U^\dagger = 1-U$ is the mask of the lower triangular part of a matrix.
$$\begin{align} \mathcal{\overline A}^\dagger &= R^{-1}\left[\overline{\mathcal{Q}}^\dagger + (M\circ L+h.c.)Q^\dagger\right]\\ \mathcal{\overline A} &= \left[\overline{\mathcal{Q}} + Q(M\circ L+h.c.)\right]R^{-\dagger}\\ &=\left[\overline{\mathcal{Q}} + Q\, \texttt{copyltu}(M)\right]R^{-\dagger} \end{align}$$ Here, $\texttt{copyltu}$ takes the conjugate when copying elements to the upper triangular part.
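This final formula can also be checked against finite differences. The NumPy sketch below uses real matrices (so $\texttt{copyltu}$ copies with a plain transpose) and a made-up loss, built from the weights `W` and vector `h`, that is insensitive to the column-sign gauge of QR:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
W = rng.normal(size=(n, n))     # fixed weights for the R-dependent part of the loss
h = rng.normal(size=n)          # fixed vector for the Q-dependent part

def loss(A):
    Q, R = np.linalg.qr(A)
    # both terms are invariant under the column-sign gauge Q -> QD, R -> DR
    return 0.5 * np.sum(W * R**2) + (h @ Q[:, 0])**2

A = rng.normal(size=(n, n))
Q, R = np.linalg.qr(A)
Rbar = W * R                                        # dL/dR
Qbar = np.zeros_like(Q); Qbar[:, 0] = 2 * (h @ Q[:, 0]) * h   # dL/dQ

M = R @ Rbar.T - Qbar.T @ Q
copyltu = np.tril(M) + np.tril(M, -1).T             # copy lower triangle to the upper part
Abar = (Qbar + Q @ copyltu) @ np.linalg.inv(R).T    # R^{-dagger} = inv(R).T in the real case

eps = 1e-6
num = np.zeros_like(A)
for i in range(n):
    for j in range(n):
        d = np.zeros_like(A); d[i, j] = eps
        num[i, j] = (loss(A + d) - loss(A - d)) / (2 * eps)

print(np.allclose(Abar, num, atol=1e-5))
```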