$$O_{\vec o} = \sum\limits_{(\vec a \cup \vec b \cup \vec c \ldots) \backslash \vec o} A_{\vec a}B_{\vec b}C_{\vec c}\ldots,$$ where $\vec a = a_1, a_2\ldots$ are the labels that appear in tensor $A$, $\vec a\cup\vec b$ means the union of two sets of labels, and $\vec a\backslash\vec b$ means the setdiff between two sets of labels. The summation above runs over all indices that do not appear in the output tensor $O$.
Given $\overline O$, in order to obtain $\overline B \equiv \partial\mathcal{L}/\partial B$, consider the differential rule
$$\begin{align} \delta \mathcal{L} &= \sum\limits_{\vec o} \overline O_{\vec o}\, \delta O_{\vec o} \\ &=\sum\limits_{\vec o\cup\vec a \cup \vec b\cup \vec c \ldots} \overline O_{\vec o}A_{\vec a}\,\delta B_{\vec b}C_{\vec c} \ldots \end{align}$$ Here, we have used the differential relation
$$\delta O_{\vec o} = \sum\limits_{(\vec a \cup \vec b \cup \vec c \ldots) \backslash \vec o }A_{\vec a}\,\delta B_{\vec b}C_{\vec c} \ldots$$ Then we define
$$\overline B_{\vec b} = \sum\limits_{(\vec a \cup \vec b \cup \vec c \ldots) \backslash \vec b }A_{\vec a}\overline O_{\vec o}C_{\vec c} \ldots$$ We can readily verify
$$\delta \mathcal{L} = \sum\limits_{\vec b} \overline B_{\vec b}\, \delta B_{\vec b}$$ This backward rule is exactly an einsum that exchanges the output tensor $O$ and the input tensor $B$.
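As a sanity check, the exchange rule can be verified with finite differences. The sketch below uses NumPy for convenience (the Julia code later in this post plays the analogous role); the particular einsum string and the weight tensor `W` are made up for the test, with $\mathcal L = \sum W\circ O$ so that $\overline O = W$:

```python
import numpy as np

# Forward: O_il = sum_jk A_ij B_jk C_kl, i.e. einsum("ij,jk,kl->il").
# Backward rule for B: exchange O and B in the einsum,
#   Bbar_jk = einsum("ij,il,kl->jk", A, Obar, C).
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 5))
B = rng.normal(size=(5, 6))
C = rng.normal(size=(6, 3))
W = rng.normal(size=(4, 3))   # fixed weights, so Obar = W for L = sum(W * O)

def loss(B):
    return np.sum(W * np.einsum("ij,jk,kl->il", A, B, C))

Bbar = np.einsum("ij,il,kl->jk", A, W, C)   # O and B exchanged

# central finite differences over every entry of B
eps = 1e-6
num = np.zeros_like(B)
for j in range(B.shape[0]):
    for k in range(B.shape[1]):
        dB = np.zeros_like(B); dB[j, k] = eps
        num[j, k] = (loss(B + dB) - loss(B - dB)) / (2 * eps)

print(np.allclose(Bbar, num, atol=1e-6))
```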
In conclusion, the index magic of exchanging indices as a backward rule holds for einsum.
Thanks to Andreas Peter for the helpful discussion.
references:
https://arxiv.org/abs/1710.08717
Given the eigendecomposition $A = UEU^\dagger$, we have
$$\overline{A} = U\left[\overline{E} + \frac{1}{2}\left(F\circ (U^\dagger \overline{U}) + h.c.\right)\right]U^\dagger,$$ where $F_{ij}=(E_j- E_i)^{-1}$.
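This adjoint can be checked numerically in the real symmetric case, where derivative and gradient coincide. Below is a minimal NumPy sketch with a gauge-insensitive loss $\psi^\dagger H\psi$ on the ground-state eigenvector; the symmetric test matrix `H` is made up for the check:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
H = rng.normal(size=(n, n)); H = H + H.T   # fixed symmetric observable

def loss(A):
    # symmetrize inside so that entrywise finite differences are valid
    E, U = np.linalg.eigh((A + A.T) / 2)
    psi = U[:, 0]          # ground state; the sign gauge cancels in psi' H psi
    return psi @ H @ psi

A = rng.normal(size=(n, n)); A = A + A.T
E, U = np.linalg.eigh(A)
Ubar = np.zeros_like(U); Ubar[:, 0] = 2 * H @ U[:, 0]   # dL/dpsi = 2 H psi
Ebar = np.zeros(n)

# F_ij = 1/(E_j - E_i), zero on the diagonal
F = E[None, :] - E[:, None]
np.fill_diagonal(F, np.inf); F = 1 / F
inner = F * (U.T @ Ubar)                   # F ∘ (U^† Ubar), real case
Abar = U @ (np.diag(Ebar) + (inner + inner.T) / 2) @ U.T

eps = 1e-6
num = np.zeros_like(A)
for i in range(n):
    for j in range(n):
        dA = np.zeros_like(A); dA[i, j] = eps
        num[i, j] = (loss(A + dA) - loss(A - dA)) / (2 * eps)

print(np.allclose(Abar, num, atol=1e-5))
```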
If $E$ is continuous, we define the density $\rho(E) = \sum\limits_k \delta(E-E_k)=-\frac{1}{\pi}\int_k \Im[G^r(E, k)]$ (check the sign!), where $G^r(E, k) = \frac{1}{E-E_k+i\delta}$.
We have
$$\overline{A} = U\left[\overline{E} + \frac{1}{2}\left((U^\dagger\overline{U}) \circ \Re [G(E_i, E_j)] + h.c.\right)\right]U^\dagger$$

references:
https://people.maths.ox.ac.uk/gilesm/files/NA-08-01.pdf
https://j-towns.github.io/papers/svd-derivative.pdf
https://arxiv.org/abs/1907.13422
Complex-valued SVD is defined as $A = USV^\dagger$. For simplicity, we consider a full-rank square matrix $A$. Differentiation gives
$$dA = dU\,SV^\dagger + U\,dS\,V^\dagger + US\,dV^\dagger$$ $$U^\dagger dA\, V = U^\dagger dU\, S + dS + S\,dV^\dagger V$$ Defining the matrices $dC=U^\dagger dU$, $dD = dV^\dagger V$ and $dP = U^\dagger dA V$, we have
$$\begin{cases}dC^\dagger+dC=0,\\dD^\dagger +dD=0.\end{cases}$$ We have
$$dP = dC\, S + dS + S\,dD,$$ where $dCS$ and $SdD$ have zero real part on their diagonal elements, so that $dS = \Re[{\rm diag}(dP)]$.
$$\begin{aligned} d\mathcal{L} &= {\rm Tr}\left[\overline{A}^TdA+\overline{A^*}^TdA^*\right]\\ &= {\rm Tr}\left[\overline{A}^TdA+dA^\dagger\overline{A}^*\right] \quad \#rule~3 \end{aligned}$$ It is easy to show $\overline A_S = U^*\overline SV^T$. Notice that $\overline A$ here is the derivative rather than the gradient; they differ by a complex conjugate, which is why we have a transpose rather than a conjugate transpose here. See my complex-valued autodiff blog for details.
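The $\overline A_S = U^*\overline S V^T$ piece alone is easy to verify numerically. In the NumPy sketch below (the weight vector `w` is made up for the test), $\mathcal L = \sum_k w_k s_k$ so that $\overline S = w$; everything is real, so derivative and gradient coincide:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.normal(size=(n, n))
w = rng.normal(size=n)      # fixed weights: L = sum(w * S), so Sbar = w

def loss(A):
    return np.linalg.svd(A, compute_uv=False) @ w

U, s, Vh = np.linalg.svd(A)       # A = U @ diag(s) @ Vh, with Vh = V^dagger
Abar = U @ np.diag(w) @ Vh        # real case of U* Sbar V^T

eps = 1e-6
num = np.zeros_like(A)
for i in range(n):
    for j in range(n):
        dA = np.zeros_like(A); dA[i, j] = eps
        num[i, j] = (loss(A + dA) - loss(A - dA)) / (2 * eps)

print(np.allclose(Abar, num, atol=1e-5))
```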
Using the relations $dC^\dagger+dC=0$ and $dD^\dagger+dD=0$,
$$\begin{cases} dPS + SdP^\dagger = dC S^2-S^2dC,\\ SdP + dP^\dagger S = S^2dD-dD S^2, \end{cases}$$ $$\begin{cases} dC = F\circ(dPS+SdP^\dagger),\\ dD = -F\circ (SdP+dP^\dagger S), \end{cases}$$ where $F_{ij} = \frac{1}{s_j^2-s_i^2}$; it is easy to verify $F^T = -F$. Notice that here the relation between the imaginary diagonal parts is lost:
$$\color{red}{\Im[I\circ dP] = \Im[I\circ(dC+dD)]S}$$ This missing diagonal imaginary part is definitely not trivial, but it was ignored for a long time until @refraction-ray (Shixin Zhang) pointed it out and solved it. Let's first focus on the off-diagonal contributions from $dU$:
$$\begin{align} {\rm Tr}\,\overline U^TdU &= {\rm Tr}\, \overline U ^TU dC + \overline U^T (I-UU^\dagger) dA\,VS^{-1}\\ &= {\rm Tr}\,\overline U^T U (F\circ(dPS+SdP^\dagger))\\ &= {\rm Tr}(dPS+SdP^\dagger)(-F\circ (\overline U^T U)) \quad \# rule~1,2\\ &= {\rm Tr}(dPS+SdP^\dagger)J^T \end{align}$$ Here, we defined $J=F\circ(U^T\overline U)$.
$$\begin{align*} d\mathcal L &= {\rm Tr} (dPS+SdP^\dagger)(J+J^\dagger)^T\\ &= {\rm Tr}\, dP\,S(J+J^\dagger)^T+h.c.\\ &= {\rm Tr}\, U^\dagger dA\, V S(J+J^\dagger)^T+h.c.\\ &= {\rm Tr}\left[ VS(J+J^\dagger)^TU^\dagger\right] dA+h.c. \end{align*}$$ By comparing with $d\mathcal L = {\rm Tr}\left[\overline{A}^TdA+h.c. \right]$, we have
$$\begin{align} \bar A_U^{(\rm real)} &= \left[VS(J+J^\dagger)^TU^\dagger\right]^T\\ &=U^*(J+J^\dagger)SV^T \end{align}$$ Now let's inspect the diagonal imaginary parts of $dC$ and $dD$. At first glance, it seems insufficient to derive $dC$ and $dD$ from $dP$ alone, but there is still one piece of information unused: the loss must be gauge invariant, which means
$$\mathcal{L}(U\Lambda, S, V\Lambda)$$ should be independent of the choice of gauge $\Lambda$, which is defined as ${\rm diag}(e^{i\phi}, \ldots)$.
$$\begin{aligned} d\mathcal{L} &={\rm Tr}[ \overline{U\Lambda}^T d(U\Lambda) +\overline SdS+\overline{V\Lambda}^Td(V\Lambda)] + h.c.\\ &={\rm Tr}[ \overline {U\Lambda}^T (dU\Lambda+Ud\Lambda) +\overline{S}dS+ \overline{V\Lambda}^T(Vd\Lambda +dV\Lambda)] + h.c.\\ &= {\rm Tr}[(\overline{U\Lambda}^TU+\overline{V\Lambda}^TV )d\Lambda ] + \ldots + h.c. \end{aligned}$$ Gauge invariance refers to
$$\overline{\Lambda} = I\circ(\overline{U\Lambda}^TU+\overline{V\Lambda}^TV) = 0$$ for any $\Lambda$, where $I$ refers to the diagonal mask matrix. It is of course valid when $\Lambda\rightarrow 1$: $I\circ(\overline{U}^TU+\overline V^TV) = 0$.
Considering the contribution from the diagonal imaginary part, we have
$$\begin{aligned} &{\rm Tr} \left[\overline U^T U (I \circ \Im [dC])+\overline V^T V (I \circ \Im [dD^\dagger])\right] + h.c.\\ &={\rm Tr} \left[ I \circ (\overline U^T U)\Im [dC]-I\circ (\overline V^T V) \Im [dD]\right] +h.c. \quad \# rule~1\\ &={\rm Tr} \left[ I \circ (\overline U^T U)(\Im [dC]+ \Im [dD])\right] \\ &={\rm Tr}\left[I\circ (\overline U^T U)\, \Im[dP]S^{-1}\right] \\ &={\rm Tr}\left[S^{-1}\Lambda_J U^{\dagger}dA\, V\right] \end{aligned}$$ where $\Lambda_J = \Im[I\circ(\overline U^TU)]= \frac{1}{2}\left(I\circ(\overline U^TU)-h.c.\right)$, with $I$ the mask for the diagonal part. Since only the real part contributes to $\delta \mathcal{L}$ (the imaginary part will be canceled by the Hermitian conjugate counterpart), we can safely move $\Im$ from right to left.
$$\color{red}{\bar A_{U+V}^{(\rm imag)} = U^*\Lambda_J S^{-1}V^T}$$ Thanks to @refraction-ray (Shixin Zhang) for sharing his idea at an early stage; see this issue for the discussion. His arXiv preprint is coming out soon.
When $U$ is not full rank, this formula should include an extra term (Ref. 2):
$$\begin{aligned} \bar A_U^{(\rm real)} &=U^*(J+J^\dagger)SV^T + (VS^{-1}\overline U^T(I-UU^\dagger))^T \end{aligned}$$ Similarly, for $V$ we have
$$\begin{aligned} \overline A_V^{(\rm real)} &=U^*S(K+K^\dagger)V^T + (U S^{-1} \overline V^T (I - VV^\dagger))^*, \end{aligned}$$ where $K=F\circ(V^T\overline V)$.
To wrap up,
$$\overline A = \overline A_U^{\rm (real)} + \overline A_S + \overline A_V^{\rm (real)} + \overline A_{U+V}^{\rm (imag)}$$ This result can be directly used in autograd.
For the gradient used in training, one should change the convention:
$$\mathcal{\overline A} = \overline A^*,\quad \mathcal{\overline U} = \overline U^*,\quad \mathcal{\overline V}= \overline V^*.$$ This convention is used in tensorflow and Zygote.jl, which gives
$$\begin{aligned} \mathcal{\overline A} =\ & U(\mathcal{J}+\mathcal{J}^\dagger)SV^\dagger + (I-UU^\dagger)\mathcal{\overline U}S^{-1}V^\dagger\\ &+ U\overline SV^\dagger\\ &+US(\mathcal{K}+\mathcal{K}^\dagger)V^\dagger + U S^{-1} \mathcal{\overline V}^\dagger (I - VV^\dagger)\\ &\color{red}{+\frac{1}{2} U \left(I\circ(U^\dagger\mathcal{\overline U})-h.c.\right)S^{-1}V^\dagger} \end{aligned}$$ where $\mathcal{J}=F\circ(U^\dagger\mathcal{\overline U})$ and $\mathcal{K}=F\circ(V^\dagger \mathcal{\overline V})$.
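The complete formula can be finite-difference tested on a gauge-invariant loss. The NumPy sketch below assumes a square full-rank $A$ (so the $(I-UU^\dagger)$ and $(I-VV^\dagger)$ projector terms vanish) and a made-up loss $\psi^\dagger H\psi$ on the leading left singular vector; the numerical gradient is assembled as $\partial\mathcal L/\partial\Re A + i\,\partial\mathcal L/\partial\Im A$, matching the tensorflow/Zygote convention:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
cplx = lambda size: rng.normal(size=size) + 1j * rng.normal(size=size)
H = cplx((n, n)); H = H + H.conj().T         # fixed Hermitian observable

def loss(A):
    U = np.linalg.svd(A)[0]
    psi = U[:, 0]                             # the phase gauge of psi cancels in psi† H psi
    return (psi.conj() @ H @ psi).real

A = cplx((n, n))
U, s, Vh = np.linalg.svd(A)                   # A = U @ diag(s) @ Vh, V = Vh†
V = Vh.conj().T
Ubar = np.zeros_like(U); Ubar[:, 0] = 2 * H @ U[:, 0]   # gradient-convention adjoint of the loss
Vbar = np.zeros_like(V)
Sbar = np.zeros(n)

F = s[None, :]**2 - s[:, None]**2             # F_ij = 1/(s_j^2 - s_i^2), zero diagonal
np.fill_diagonal(F, np.inf); F = 1 / F
S, Sinv = np.diag(s), np.diag(1 / s)
J = F * (U.conj().T @ Ubar)
K = F * (V.conj().T @ Vbar)
D = np.diag(np.diag(U.conj().T @ Ubar))       # diagonal part I ∘ (U† Ubar)
Abar = (U @ (J + J.conj().T) @ S @ Vh
        + U @ np.diag(Sbar) @ Vh
        + U @ S @ (K + K.conj().T) @ Vh
        + 0.5 * U @ (D - D.conj().T) @ Sinv @ Vh)   # the red diagonal imaginary-part term

eps = 1e-6
num = np.zeros((n, n), dtype=complex)
for i in range(n):
    for j in range(n):
        d = np.zeros((n, n)); d[i, j] = eps
        gre = (loss(A + d) - loss(A - d)) / (2 * eps)
        gim = (loss(A + 1j * d) - loss(A - 1j * d)) / (2 * eps)
        num[i, j] = gre + 1j * gim            # dL/dRe(A) + i dL/dIm(A)

print(np.allclose(Abar, num, atol=1e-4))
```

Note that without the red term the check fails for complex $A$: the mismatch is exactly the diagonal imaginary-part contribution discussed above.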
rule 1. ${\rm Tr} \left[A(C\circ B)\right] = \sum A^T\circ C\circ B = {\rm Tr}\left( (C\circ A^T)^TB\right)={\rm Tr}\left((C^T\circ A)B\right)$
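Rule 1 is a pure index-shuffling identity and is cheap to confirm numerically (NumPy, with `*` as the Hadamard product $\circ$):

```python
import numpy as np

rng = np.random.default_rng(4)
A, B, C = (rng.normal(size=(3, 3)) for _ in range(3))

lhs = np.trace(A @ (C * B))     # Tr[A (C ∘ B)]
mid = np.sum(A.T * C * B)       # elementwise sum of A^T ∘ C ∘ B
rhs = np.trace((C.T * A) @ B)   # Tr[(C^T ∘ A) B]
print(np.allclose(lhs, mid), np.allclose(lhs, rhs))
```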
rule 2. $(C\circ A)^T = C^T \circ A^T$
rule 3. When $\mathcal L$ is real,
$$\frac{\partial \mathcal{L}}{\partial x^*} = \left(\frac{\partial \mathcal{L}}{\partial x}\right)^*$$ E.g., to test the adjoint contribution from $U$, we can construct a gauge-insensitive test function:
```julia
# H is a predefined Hermitian matrix; svd_back is the backward rule derived above.
function loss(A)
    U, S, V = svd(A)
    psi = U[:, 1]      # the phase of psi is gauge dependent,
    psi' * H * psi     # but this expectation value is not
end

function gradient(A)
    U, S, V = svd(A)
    dU = zero(U)
    dS = zero(S)
    dV = zero(V)
    dU[:, 1] = vec(U[:, 1]' * H)   # adjoint of the loss w.r.t. the first column of U
    dA = svd_back(U, S, V, dU, dS, dV)
    dA
end
```
references:
https://arxiv.org/abs/1710.08717
https://arxiv.org/abs/1903.09650
$A = QR$ with $Q^\dagger Q = \mathbb{I}$, so that $dQ^\dagger Q+Q^\dagger dQ=0$. $R$ is a complex upper triangular matrix with a real diagonal.
$$dA = dQ\,R+Q\,dR$$ $$dQ = dA\,R^{-1}-Q\,dR\,R^{-1}$$ $$\begin{cases} Q^\dagger dQ = dC - dR\,R^{-1}\\ dQ^\dagger Q =dC^\dagger - R^{-\dagger}dR^\dagger \end{cases}$$ where $dC=Q^\dagger dA\,R^{-1}$.
Then
$$dC+dC^\dagger = dR\,R^{-1} +(dR\,R^{-1})^\dagger$$ Notice that $dR\,R^{-1}$ is upper triangular while its Hermitian conjugate is lower triangular; this restriction gives
$$U\circ(dC+dC^\dagger) = dR\,R^{-1},$$ where $U$ is a mask operator whose element value is $1$ for the upper triangular part, $0.5$ for the diagonal part, and $0$ for the lower triangular part. One should also notice that both $R$ and $dR$ have real diagonal parts, as does the product $dR\,R^{-1}$.
Now let's wrap up using the Zygote convention of gradients:
$$\begin{align} d\mathcal L &= {\rm Tr}\left[\overline{\mathcal{Q}}^\dagger dQ+\overline{\mathcal{R}}^\dagger dR +h.c. \right]\\ &={\rm Tr}\left[\overline{\mathcal{Q}}^\dagger dA\, R^{-1}-\overline{\mathcal{Q}}^\dagger Q\,dR\, R^{-1}+\overline{\mathcal{R}}^\dagger dR +h.c. \right]\\ &={\rm Tr}\left[ R^{-1}\overline{\mathcal{Q}}^\dagger dA+ R^{-1}(-\overline{\mathcal{Q}}^\dagger Q +R\overline{\mathcal{R}}^\dagger) dR +h.c. \right]\\ &={\rm Tr}\left[ R^{-1}\overline{\mathcal{Q}}^\dagger dA+ R^{-1}M\, dR +h.c. \right] \end{align}$$ here, $M=R\overline{\mathcal{R}}^\dagger-\overline{\mathcal{Q}}^\dagger Q$. Plugging in $dR$, we have
$$\begin{align} d\mathcal{L}&={\rm Tr}\left[ R^{-1}\overline{\mathcal{Q}}^\dagger dA + M \left[U\circ(dC+dC^\dagger)\right] +h.c. \right]\\ &={\rm Tr}\left[ R^{-1}\overline{\mathcal{Q}}^\dagger dA + (M\circ L)(dC+dC^\dagger) +h.c. \right] \quad \# rule~1\\ &={\rm Tr}\left[ (R^{-1}\overline{\mathcal{Q}}^\dagger dA+h.c.) + (M\circ L)(dC + dC^\dagger)+ (M\circ L)^\dagger (dC + dC^\dagger)\right]\\ &={\rm Tr}\left[ R^{-1}\overline{\mathcal{Q}}^\dagger dA + (M\circ L+h.c.)dC + h.c.\right]\\ &={\rm Tr}\left[ R^{-1}\overline{\mathcal{Q}}^\dagger dA + (M\circ L+h.c.)Q^\dagger dA\,R^{-1}\right]+h.c. \end{align}$$ where $L =U^\dagger = 1-U$ is the mask of the lower triangular part of a matrix.
$$\begin{align} \mathcal{\overline A}^\dagger &= R^{-1}\left[\overline{\mathcal{Q}}^\dagger + (M\circ L+h.c.)Q^\dagger\right]\\ \mathcal{\overline A} &= \left[\overline{\mathcal{Q}} + Q(M\circ L+h.c.)\right]R^{-\dagger}\\ &=\left[\overline{\mathcal{Q}} + Q\, \texttt{copyltu}(M)\right]R^{-\dagger} \end{align}$$ Here, $\texttt{copyltu}$ takes the conjugate when copying elements to the upper triangular part.
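This final formula can also be checked against finite differences. The NumPy sketch below uses real matrices (so $\texttt{copyltu}$ copies with a plain transpose) and a made-up loss, built from the weights `W` and vector `h`, that is insensitive to the column-sign gauge of QR:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
W = rng.normal(size=(n, n))     # fixed weights for the R-dependent part of the loss
h = rng.normal(size=n)          # fixed vector for the Q-dependent part

def loss(A):
    Q, R = np.linalg.qr(A)
    # both terms are invariant under the column-sign gauge Q -> QD, R -> DR
    return 0.5 * np.sum(W * R**2) + (h @ Q[:, 0])**2

A = rng.normal(size=(n, n))
Q, R = np.linalg.qr(A)
Rbar = W * R                                        # dL/dR
Qbar = np.zeros_like(Q); Qbar[:, 0] = 2 * (h @ Q[:, 0]) * h   # dL/dQ

M = R @ Rbar.T - Qbar.T @ Q
copyltu = np.tril(M) + np.tril(M, -1).T             # copy lower triangle to the upper part
Abar = (Qbar + Q @ copyltu) @ np.linalg.inv(R).T    # R^{-dagger} = inv(R).T in the real case

eps = 1e-6
num = np.zeros_like(A)
for i in range(n):
    for j in range(n):
        d = np.zeros_like(A); d[i, j] = eps
        num[i, j] = (loss(A + d) - loss(A - d)) / (2 * eps)

print(np.allclose(Abar, num, atol=1e-5))
```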