Here the matrix $\mathbf W$ plays the role of the parameters $\theta$ in the formulation above. It resembles the famous Ising model.
We want to maximize the likelihood
$l = \prod\limits_y p(y)$,
or equivalently minimize the negative log-likelihood (NLL) loss
\begin{align}\mathcal{L}=\frac{L}{N} = \frac 1 N \sum\limits_y E(y) +\log Z\end{align}

The partition function $Z$ is notoriously hard to obtain; computing it is one of the fundamental problems in statistical physics.
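To get a feeling for why, here is a minimal sketch (a toy quadratic, Ising-like energy over binary variables; the model and sizes are purely illustrative) that computes $Z=\sum_x e^{-E(x)}$ by brute force. The sum runs over $2^n$ configurations, so it stops being feasible almost immediately.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 12                                  # 2**12 = 4096 configurations; n ~ 300 would be hopeless
W = rng.normal(scale=0.1, size=(n, n))  # toy Ising-like couplings (illustrative)

def energy(x):
    # E(x) = x^T W x for a binary configuration x
    return x @ W @ x

# Brute-force partition function: enumerate every binary configuration.
Z = sum(np.exp(-energy(np.array(cfg))) for cfg in itertools.product([0, 1], repeat=n))
print(f"Z = {Z:.3f}  (summed {2**n} terms)")
```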
Interestingly, $\frac{\partial \mathcal L}{\partial \theta}$ can be estimated!
\begin{align}\frac{\partial \mathcal L}{\partial\theta} &= \mathbb E_{y\in \mathcal D}[E'(y)]-\frac{\sum\limits_x E'(x)e^{-E(x)}}{Z}\\&=\mathbb E_{y\in \mathcal D}[E'(y)] - \mathbb E_{x\sim p(x)}[E'(x)]\end{align}

where $E'$ denotes $\frac{\partial E}{\partial\theta}$. However, classical spin statistics are not that simple: the second term, an expectation under the model distribution $p(x)$, is still hard to compute exactly.
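In code, this gradient estimate is just "average of $E'$ over the data minus average of $E'$ over model samples". A minimal sketch, assuming we already have some way to draw (approximate) model samples, e.g. the Gibbs chain described below; `grad_E` here is a placeholder for $\partial E_\theta/\partial\theta$:

```python
import numpy as np

def nll_grad(grad_E, data_batch, model_samples):
    """Estimate dL/dtheta = E_{y in D}[E'(y)] - E_{x ~ p(x)}[E'(x)].

    grad_E(x)     -- returns dE_theta(x)/dtheta for one configuration x
    data_batch    -- samples y from the dataset (positive phase)
    model_samples -- samples x from the model, e.g. via Gibbs sampling (negative phase)
    """
    positive = np.mean([grad_E(y) for y in data_batch], axis=0)
    negative = np.mean([grad_E(x) for x in model_samples], axis=0)
    return positive - negative
```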
Inference means: given part of $x$, guess the rest. It is based on the conditional probability $p(x_{B}\vert x_{A}) = \sum\limits_h p(x_{B}\vert h)\,p(h\vert x_{A})$, which is useful in recommender systems.
However, inference (i.e. evaluating conditional probabilities) is hard in a general energy-based model, so we turn to a Restricted Boltzmann Machine (RBM):
$$E(x, h) = x^T\mathbf W h + b^T h + x^T a$$
The conditional probability $p(x\vert h)\propto e^{-x^T(\mathbf Wh+a)}=\prod\limits_i e^{-x_i\Theta_i}$, where $\Theta_i$ is the $i$-th element of $\mathbf Wh+a$.
Since the variables $x_i$ are independent of each other given $h$, we can do pixel-wise sampling according to $p(x_i=1\vert h) = \frac{e^{-\Theta_i}}{1+e^{-\Theta_i}}=\sigma(-\Theta_i)$ (equivalently, $p(x_i=0\vert h)=\sigma(\Theta_i)$).
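A minimal NumPy sketch of these factorized conditionals (names and shapes are my own choice; $\mathbf W$ is visible-by-hidden, $a$ and $b$ are the visible and hidden biases as above). These two functions are also what the inference $p(x_B\vert x_A)$ above boils down to: clamp the observed units, sample $h$, then sample the missing units.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_x_given_h(h, W, a):
    # p(x_i = 1 | h) = sigma(-(W h + a)_i), from E(x, h) = x^T W h + b^T h + x^T a
    p = sigmoid(-(W @ h + a))
    return (rng.random(p.shape) < p).astype(float)

def sample_h_given_x(x, W, b):
    # by the same argument, p(h_j = 1 | x) = sigma(-(x^T W + b)_j)
    p = sigmoid(-(x @ W + b))
    return (rng.random(p.shape) < p).astype(float)
```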
Gibbs sampling:
Conditional sampling $x_1\rightarrow h_1\rightarrow x_2 \rightarrow \ldots\rightarrow x_n$ will converge to samples from $p(x)$ and $p(h)$ (a code sketch follows after the list below), because:
1). Detailed balance: $\frac{p(x_{t}\vert x_{t-1})}{p(x_{t-1}\vert x_t)}=\frac{p(x_t)}{p(x_{t-1})}$, because
\begin{align}p(x_t\vert x_{t-1}) &=\sum\limits_{h} p(x_t\vert h)\,p(h\vert x_{t-1})=\sum\limits_h\frac{p(x_t, h)\,p(h, x_{t-1})}{p(h)\,p(x_{t-1})},\\ p(x_{t-1}\vert x_{t}) &=\sum\limits_{h} p(x_{t-1}\vert h)\,p(h\vert x_{t})=\sum\limits_h\frac{p(x_{t-1}, h)\,p(h, x_t)}{p(h)\,p(x_t)}.\end{align}

Since $p(x_t,h)=p(h,x_t)$, the two sums differ term by term exactly by the factor $\frac{p(x_t)}{p(x_{t-1})}$, which gives the balance condition. Statistical ensemble $\rightarrow$ time ensemble.
2). Ergodicity: obvious, since all the conditional probabilities above are strictly positive (the sigmoids never reach 0 or 1 for finite weights).
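Continuing the sketch above (reusing `sample_x_given_h`, `sample_h_given_x` and `rng`), the Gibbs chain is just the alternation $x\rightarrow h\rightarrow x\rightarrow\ldots$; sizes and parameters here are purely illustrative.

```python
def gibbs_chain(x0, W, a, b, n_steps=1000):
    """Alternate x -> h -> x; after burn-in, (x, h) approximately follow p(x) and p(h)."""
    x = x0
    for _ in range(n_steps):
        h = sample_h_given_x(x, W, b)
        x = sample_x_given_h(h, W, a)
    return x, h

# Toy usage with random parameters.
n_vis, n_hid = 6, 3
W = rng.normal(scale=0.1, size=(n_vis, n_hid))
a, b = rng.normal(size=n_vis), rng.normal(size=n_hid)
x_sample, h_sample = gibbs_chain(rng.integers(0, 2, n_vis).astype(float), W, a, b)
```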
Marginalizing out the hidden units $h$ gives the effective energy of $x$ alone, with $p(x)\propto e^{-E_\theta(x)}$:

$$E_\theta(x) = x^Ta-\sum\limits_j\log\left(1+e^{(-x^T \mathbf W-b^T)_j}\right)$$

$$\begin{cases}\frac{\partial E_\theta(x)}{\partial W_{ij}} &= x_i\,\sigma\big((-x^T \mathbf W-b^T)_j\big)\\ \frac{\partial E_\theta(x)}{\partial b_{j}} &= \sigma\big((-x^T \mathbf W-b^T)_j\big)\\ \frac{\partial E_\theta(x)}{\partial a_i} &= x_i \end{cases}$$

Remark: usually we don't need to do this gradient stuff by hand; we have PyTorch and TensorFlow!
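For example, a minimal PyTorch sketch of the free energy above, where autograd reproduces exactly these derivatives (sizes and initialization are illustrative; `softplus(z)` is $\log(1+e^z)$):

```python
import torch
import torch.nn.functional as F

n_vis, n_hid = 6, 3
W = (0.1 * torch.randn(n_vis, n_hid)).requires_grad_()
a = torch.zeros(n_vis, requires_grad=True)
b = torch.zeros(n_hid, requires_grad=True)

def free_energy(x):
    # E_theta(x) = x^T a - sum_j log(1 + exp(-(x^T W + b)_j))
    return x @ a - F.softplus(-(x @ W + b)).sum()

x = torch.randint(0, 2, (n_vis,)).float()
free_energy(x).backward()   # fills W.grad, a.grad, b.grad
print(a.grad)               # equals x, matching dE_theta/da_i = x_i above
```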