
Restricted Boltzmann Machine (RBM) for Physicists

Data: Motivation

Given an i.i.d. dataset $\mathcal D$ drawn from a target probability distribution $\pi(x)$, with $x$ an input vector (or image).

We want to learn a model $p_\theta(x)\sim \pi(x)$.

Here $\mathcal D$ can be, for example, a set of binary images.

Model: What is a Boltzmann Machine?

An energy-based model describes the probability distribution

$$p_\theta(x)=\frac{e^{-E_\theta(x)}}{Z_\theta},$$

where the energy $E_\theta(x)$ is described by a graph, $Z_\theta=\sum\limits_x e^{-E_\theta(x)}$ is the partition function, $x$ is a vector with binary entries $x_i\in\{0,1\}$, and $\theta$ denotes the network parameters.

Typically, we can construct a Boltzmann Machine with the energy defined as

$$E(x) = -\frac 1 2 \sum\limits_{ij} x_iW_{ij}x_j$$

[Figure: a Boltzmann machine, as an example]

Here the matrix $W$ plays the role of the parameters $\theta$ in the formulation above. This model resembles the famous Ising model.
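As a sanity check of these definitions, here is a minimal brute-force sketch (Python/NumPy; all names are illustrative) that evaluates $E(x)$ and the partition function $Z_\theta$ by enumerating all $2^n$ configurations, which is only feasible for a handful of spins:

```python
import itertools
import numpy as np

def energy(x, W):
    """Boltzmann machine energy E(x) = -1/2 * x^T W x."""
    return -0.5 * x @ W @ x

def partition_function(W, n):
    """Z = sum of exp(-E(x)) over all 2^n binary configurations."""
    return sum(np.exp(-energy(np.array(xs), W))
               for xs in itertools.product([0, 1], repeat=n))

n = 6
rng = np.random.default_rng(0)
W = rng.normal(size=(n, n))
W = (W + W.T) / 2          # symmetric couplings, like an Ising model
np.fill_diagonal(W, 0.0)   # no self-interaction

Z = partition_function(W, n)
x = rng.integers(0, 2, size=n)
print(np.exp(-energy(x, W)) / Z)   # model probability p_theta(x)
```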

Loss: The criterion for optimization

We want to maximize the likelihood

$$l = \prod\limits_{y\in\mathcal D}p(y),$$

or equivalently minimize the negative log-likelihood (NLL) loss

$$\mathcal{L}=\frac{-\log l}{N} = \frac 1 N \sum\limits_{y\in\mathcal D} E(y) +\log Z$$
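Continuing the brute-force sketch above (again only feasible for tiny systems), the NLL of a toy dataset could be evaluated exactly:

```python
def nll(data, W, n):
    """NLL = (1/N) * sum_y E(y) + log Z, using the exact partition function."""
    log_z = np.log(partition_function(W, n))
    return np.mean([energy(y, W) for y in data]) + log_z

data = rng.integers(0, 2, size=(100, n))   # stand-in for a real dataset
print(nll(data, W, n))
```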

Training: How to minimize the NLL loss?

The partition function $Z$ is notoriously hard to obtain; computing it is one of the fundamental problems in statistical physics.

Interestingly, $\frac{\partial \mathcal L}{\partial \theta}$ can be estimated!

$$\begin{aligned}\frac{\partial \mathcal L}{\partial\theta} &= \mathbb E_{y\in \mathcal D}[E'(y)]-\frac{\sum\limits_x E'(x)\,e^{-E(x)}}{Z}\\&=\mathbb E_{y\in \mathcal D}[E'(y)] - \mathbb E_{x\sim p(x)}[E'(x)],\end{aligned}$$

where $E'(x)\equiv\frac{\partial E(x)}{\partial\theta}$.
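The second term comes from differentiating the $\log Z$ part of the loss; spelling out that step:

$$\frac{\partial \log Z}{\partial\theta} = \frac{1}{Z}\sum\limits_x \frac{\partial}{\partial\theta}e^{-E(x)} = -\sum\limits_x E'(x)\,\frac{e^{-E(x)}}{Z} = -\mathbb E_{x\sim p(x)}[E'(x)].$$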

However, classical spin models are still not simple enough: the second term (the model expectation) requires sampling from $p(x)$, which is itself hard. In practice, the model expectation is approximated with a short Gibbs chain started from the data, known as contrastive divergence (see Woodford's notes in the references); given approximate model samples, the gradient estimate is just a difference of two averages, as in the sketch below.
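A minimal sketch of that estimator (NumPy; `grad_energy`, which should return $E'(x)$ with all parameters flattened into one array, and the two sample sets are hypothetical placeholders):

```python
import numpy as np

def nll_grad(data_samples, model_samples, grad_energy):
    """Estimate dL/dtheta = E_{y in D}[E'(y)] - E_{x ~ p(x)}[E'(x)]."""
    positive = np.mean([grad_energy(y) for y in data_samples], axis=0)
    negative = np.mean([grad_energy(x) for x in model_samples], axis=0)
    return positive - negative
```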

An energy-based model suited for inference

Inference means: given part of $x$ (say $x_A$), guess the rest ($x_B$). It is based on the conditional probability $p(x_{B}\vert x_{A}) = \sum\limits_h p(x_{B}\vert h)\,p(h\vert x_{A})$, which is useful in recommender systems.

However, inference (i.e. computing conditional probabilities) is hard in a general energy-based model, so we turn to the Restricted Boltzmann Machine (RBM), with energy

$$E(x, h) = x^T\mathbf Wh + b^Th +x^Ta,$$

where $h$ is the vector of hidden units and $a, b$ are bias vectors.

[Figure: a restricted Boltzmann machine, a bipartite graph of visible units $x$ and hidden units $h$]

The conditional probability factorizes: $p(x\vert h)\propto e^{-x^T(\mathbf Wh+a)}=\prod\limits_i e^{-x_i\Theta_i}$, where $\Theta_i$ is the $i$-th element of $\mathbf Wh+a$.

Since the variables $x_i$ are conditionally independent of each other given $h$, we can do pixel-wise sampling according to the probability $p(x_i) = \frac{e^{-x_i\Theta_i}}{1+e^{-\Theta_i}}$ (i.e. $p(x_i=0)=\sigma(\Theta_i)$, with $\sigma$ the logistic sigmoid).
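A minimal sketch of this pixel-wise sampling (NumPy; the function names are illustrative, and the $h$-side sampler follows by the symmetry of the RBM energy):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_x_given_h(h, W, a, rng):
    """Pixel-wise sampling: p(x_i = 0) = sigmoid(Theta_i), Theta = W h + a."""
    theta = W @ h + a
    return (rng.random(theta.shape) < sigmoid(-theta)).astype(int)  # p(x_i = 1) = sigmoid(-Theta_i)

def sample_h_given_x(x, W, b, rng):
    """Symmetric counterpart: p(h_j = 0) = sigmoid((x^T W + b^T)_j)."""
    theta = x @ W + b
    return (rng.random(theta.shape) < sigmoid(-theta)).astype(int)
```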

Gibbs sampling:

Alternating conditional sampling $x_1\rightarrow h_1\rightarrow x_2 \rightarrow \ldots\rightarrow x_n$ converges to samples from $p(x)$ and $p(h)$, because:

1). Detailed balance: $\frac{p(x_{t}\vert x_{t-1})}{p(x_{t-1}\vert x_t)}=\frac{p(x_t)}{p(x_{t-1})}$

$$p(x_t\vert x_{t-1}) =\sum\limits_{h} p(x_t\vert h)\,p(h\vert x_{t-1})=\sum\limits_h\frac{p(x_t, h)\,p(h, x_{t-1})}{p(h)\,p(x_{t-1})}$$

$$p(x_{t-1}\vert x_{t}) =\sum\limits_{h} p(x_{t-1}\vert h)\,p(h\vert x_{t})=\sum\limits_h\frac{p(x_{t-1}, h)\,p(h, x_t)}{p(h)\,p(x_t)}$$

The two sums over $h$ are identical and cancel in the ratio, leaving exactly $\frac{p(x_t)}{p(x_{t-1})}$.

Statistical ensemble $\rightarrow$ time ensemble

2). Ergodicity: obvious, since every conditional probability is strictly positive, so any configuration can be reached.
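A sketch of the resulting Gibbs sampler, reusing the two conditional samplers from the previous sketch (`n_steps` is an illustrative burn-in length):

```python
def gibbs_sample(x0, W, a, b, rng, n_steps=1000):
    """Alternate x -> h -> x; after burn-in, x approximates a sample from p(x)."""
    x = x0
    for _ in range(n_steps):
        h = sample_h_given_x(x, W, b, rng)
        x = sample_x_given_h(h, W, a, rng)
    return x
```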

Gradient:
Marginalizing out the hidden units via $e^{-E_\theta(x)} = \sum_h e^{-E(x,h)}$ gives the free energy

$$E_\theta(x) = x^Ta-\sum\limits_j\log\left(1+e^{-(x^T \mathbf W+b^T)_j}\right),$$

whose derivatives are

$$\begin{cases}\frac{\partial E_\theta(x)}{\partial W_{ij}} &= x_i\,\sigma\left(-(x^T \mathbf W+b^T)_j\right)\\ \frac{\partial E_\theta(x)}{\partial b_{j}} &= \sigma\left(-(x^T \mathbf W+b^T)_j\right)\\ \frac{\partial E_\theta(x)}{\partial a} &= x \end{cases}$$

Remark: usually we don't need to do this gradient stuff by hand; we have PyTorch and TensorFlow!
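For instance, with PyTorch one can implement only the free energy $E_\theta(x)$ and let autograd produce all three derivatives (a sketch under the conventions above; the shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

n_visible, n_hidden = 784, 64
W = torch.randn(n_visible, n_hidden, requires_grad=True)
a = torch.zeros(n_visible, requires_grad=True)
b = torch.zeros(n_hidden, requires_grad=True)

def free_energy(x):
    """E_theta(x) = x^T a - sum_j log(1 + exp(-(x^T W + b)_j))."""
    return x @ a - F.softplus(-(x @ W + b)).sum(dim=-1)

x = torch.bernoulli(0.5 * torch.ones(32, n_visible))  # a dummy binary batch
free_energy(x).mean().backward()  # gradients accumulate in W.grad, a.grad, b.grad
```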

References

Woodford, O. (n.d.). Notes on Contrastive Divergence.

Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786), 504–507. https://doi.org/10.1126/science.1127647
