Restricted Boltzmann Machine (RBM) for Physicists

Data: Motivation

Given an i.i.d. dataset $\mathcal D$ drawn from a target probability distribution $\pi(x)$, where $x$ is an input vector (or image).

We want to learn a model $p_\theta(x)\sim \pi(x)$.

Here $\mathcal D$ can be

  • an image dataset,
  • Monte Carlo samples,
  • measurement results of a quantum circuit.

Model: What is a Boltzmann Machine?

An energy-based model describes a probability distribution

$p_\theta(x)=\frac{e^{-E_\theta(x)}}{Z_\theta}$, where $E_\theta(x)$ is an energy function described by a graph, $Z_\theta=\sum\limits_x e^{-E_\theta(x)}$ is the partition function, $x$ is a vector with binary components $x_i\in\{0,1\}$, and $\theta$ denotes the network parameters.

Typically, we can construct a Boltzmann Machine with the energy defined as

$E(x) = -\frac 1 2\sum\limits_{ij} x_iW_{ij}x_j$

[Figure: graph of a Boltzmann Machine, as an example]

Here the matrix $W$ plays the role of the parameters $\theta$ in the formulation above; the energy resembles that of the famous Ising model.
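
To make this concrete, here is a minimal NumPy sketch (the function names are illustrative, not from any particular library) that enumerates all $2^n$ states of a small Boltzmann machine with the energy above and computes $Z_\theta$ and $p_\theta(x)$ exactly; this brute-force approach is of course only feasible for a handful of units.

```python
import itertools
import numpy as np

def energy(x, W):
    """Boltzmann-machine energy E(x) = -1/2 * x^T W x."""
    return -0.5 * x @ W @ x

def exact_probabilities(W):
    """Enumerate all 2^n binary states to get the partition function Z and p(x) exactly."""
    n = W.shape[0]
    states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    energies = np.array([energy(x, W) for x in states])
    weights = np.exp(-energies)          # unnormalized Boltzmann weights
    Z = weights.sum()                    # partition function
    return states, weights / Z, Z

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(4, 4))
W = (W + W.T) / 2                        # symmetric couplings, Ising-like
np.fill_diagonal(W, 0.0)                 # no self-couplings
states, probs, Z = exact_probabilities(W)
print("Z =", Z, " sum of p(x) =", probs.sum())   # probabilities sum to 1
```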

Loss: The criterion for optimization

We want to maximize the likelihood

$\prod\limits_{x\in\mathcal D}p_\theta(x)$,

or equivalently minimize the negative log-likelihood (NLL) loss

$\mathcal L = -\frac{1}{|\mathcal D|}\sum\limits_{x\in\mathcal D}\log p_\theta(x)=\frac{1}{|\mathcal D|}\sum\limits_{x\in\mathcal D}E_\theta(x)+\log Z_\theta.$
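
Continuing the toy sketch above, the NLL can be evaluated exactly for such a small model (again, `nll` and the toy dataset are only illustrative):

```python
def nll(W, data):
    """Average negative log-likelihood -1/|D| * sum_x log p_W(x), using the exact Z."""
    states, probs, Z = exact_probabilities(W)
    log_p = np.array([-energy(x, W) for x in data]) - np.log(Z)   # log p(x) = -E(x) - log Z
    return -log_p.mean()

# toy "dataset": samples drawn from the exact distribution itself
data = states[rng.choice(len(states), size=32, p=probs)]
print("NLL:", nll(W, data))
```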

Training: How to minimize the NLL loss?

The partition function $Z_\theta$ is notoriously hard to compute; it is one of the fundamental problems in statistical physics.

Interestingly, the gradient can be estimated without knowing $Z_\theta$:

$\frac{\partial \mathcal L}{\partial \theta}=\left\langle\frac{\partial E_\theta(x)}{\partial\theta}\right\rangle_{x\sim\mathcal D}-\left\langle\frac{\partial E_\theta(x)}{\partial\theta}\right\rangle_{x\sim p_\theta},$

i.e. an average over the data minus an average over samples drawn from the model itself.

However, classical spin systems are not easy to sample from, so the second term (the model average) is still hard to estimate.
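
Continuing the toy example, here is a sketch of that estimate for the coupling matrix $W$, where $\partial E/\partial W_{ij}=-\frac12 x_ix_j$: the gradient is a data average minus a model average. For the tiny model we cheat and draw the model samples exactly; in practice they come from MCMC (e.g. Gibbs sampling, discussed below).

```python
def nll_grad_W(W, data, model_samples):
    """dL/dW_ij = <dE/dW_ij>_data - <dE/dW_ij>_model, with dE/dW_ij = -x_i x_j / 2 here."""
    data_term = -0.5 * np.einsum('ki,kj->ij', data, data) / len(data)
    model_term = -0.5 * np.einsum('ki,kj->ij', model_samples, model_samples) / len(model_samples)
    return data_term - model_term

# "negative phase" samples; a real implementation would generate these by MCMC
model_samples = states[rng.choice(len(states), size=1000, p=probs)]
grad_W = nll_grad_W(W, data, model_samples)
print(grad_W.shape)
```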

An energy-based model suited for inference

Inference means: given part of $x$ (observed units $x_A$), guess the rest ($x_B$). It is based on the conditional probability $p(x_{B}\vert x_{A}) = \sum\limits_h p(x_{B}\vert h)p(h\vert x_{A})$, which is useful, for example, in recommender systems.

However, for a general energy-based model the conditional probabilities are hard to compute, so we restrict the graph to a bipartite one between visible units $x$ and hidden units $h$: a Restricted Boltzmann Machine.

[Figure: a Boltzmann Machine vs. the bipartite graph of a Restricted Boltzmann Machine]

For an RBM with visible units $x$, hidden units $h$, weights $\mathbf W$ and biases $a$, $b$ (so that the energy is $E_\theta(x,h)=x^T\mathbf Wh+a^Tx+b^Th$), the conditional probability factorizes: $p(x\vert h)\propto e^{-x^T(\mathbf Wh+a)}=\prod\limits_ie^{-x_i\Theta_i}$, where $\Theta_i$ is the $i$-th element of $\mathbf Wh+a$.

Since, given $h$, the variables $x_i$ are independent of each other, we can do pixel-wise sampling according to $p(x_i\vert h)=\frac{e^{-x_i\Theta_i}}{1+e^{-\Theta_i}}$ (i.e. $p(x_i=0\vert h)=\sigma(\Theta_i)$ and $p(x_i=1\vert h)=\sigma(-\Theta_i)$, with $\sigma$ the logistic sigmoid).
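
A minimal sketch of this unit-wise conditional sampling (continuing with NumPy and the `rng` from the earlier sketches; here `W` is the RBM weight matrix of shape visible × hidden, not the coupling matrix of the toy Boltzmann machine above):

```python
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_x_given_h(h, W, a, rng):
    """Sample visibles unit by unit: p(x_i = 1 | h) = sigma(-Theta_i), Theta = W h + a."""
    theta = W @ h + a
    return (rng.random(theta.shape) < sigmoid(-theta)).astype(float)

def sample_h_given_x(x, W, b, rng):
    """By symmetry of the bipartite graph: p(h_j = 1 | x) = sigma(-(W^T x + b)_j)."""
    phi = W.T @ x + b
    return (rng.random(phi.shape) < sigmoid(-phi)).astype(float)
```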

Gibbs sampling:

Alternating conditional sampling $x_1\rightarrow h_1\rightarrow x_2 \rightarrow \ldots\rightarrow x_n$ converges to samples from $p(x)$ and $p(h)$, for the two reasons below (a code sketch of the chain follows them).

1). Detailed balance with respect to $p(x)$: $\frac{p(x_{t}\vert x_{t-1})}{p(x_{t-1}\vert x_t)}=\frac{p(x_t)}{p(x_{t-1})}$, because

$p(x_t\vert x_{t-1}) =\sum\limits_{h} p(x_t\vert h)p(h\vert x_{t-1})=\sum\limits_h\frac{p(x_t, h)p(h, x_{t-1})}{p(h)p(x_{t-1})}$

$p(x_{t-1}\vert x_{t}) =\sum\limits_{h} p(x_{t-1}\vert h)p(h\vert x_{t})=\sum\limits_h\frac{p(x_{t-1}, h)p(h, x_t)}{p(h)p(x_t)},$ and the sums $\sum\limits_h\frac{p(x_t,h)p(h,x_{t-1})}{p(h)}$ are identical in both expressions, so the ratio reduces to $\frac{p(x_t)}{p(x_{t-1})}$.

Statistical ensemble $\rightarrow$ time ensemble: ensemble averages over $p(x)$ become time averages along the chain.

2). Ergodicity: since the conditional probabilities of an RBM with finite weights are strictly between 0 and 1, any configuration can be reached from any other, so the chain is ergodic.
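
Here is the promised sketch of the alternating Gibbs chain, built on the two conditional samplers above; the shapes and initial values are arbitrary illustrative choices:

```python
def gibbs_chain(x0, W, a, b, n_steps, rng):
    """Alternate x -> h -> x -> ...; after burn-in, the visible states are approximately ~ p(x)."""
    x = x0.copy()
    for _ in range(n_steps):
        h = sample_h_given_x(x, W, b, rng)
        x = sample_x_given_h(h, W, a, rng)
    return x

n_vis, n_hid = 6, 3
W_rbm = 0.1 * rng.normal(size=(n_vis, n_hid))
a_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)
x0 = (rng.random(n_vis) < 0.5).astype(float)
print(gibbs_chain(x0, W_rbm, a_vis, b_hid, n_steps=100, rng=rng))
```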

Gradient:

Marginalizing out the hidden units gives an effective energy for $x$ alone,

$E_\theta(x) = a^Tx-\sum\limits_j\log\left(1+e^{-(x^T W+b^T)_j}\right),$

whose gradients are

$$\begin{cases}\frac{\partial E_\theta(x)}{\partial W_{ij}} &= x_i\,\sigma\left(-(x^T W+b^T)_j\right)\\ \frac{\partial E_\theta(x)}{\partial b_{j}} &= \sigma\left(-(x^T W+b^T)_j\right)\\ \frac{\partial E_\theta(x)}{\partial a} &= x\end{cases}$$

Note that $\sigma\left(-(x^T W+b^T)_j\right)=p(h_j=1\vert x)$, so the gradients only involve the hidden conditional probabilities. Remark: usually we don't need to do this gradient computation by hand, we have PyTorch and TensorFlow!
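
As a sanity check of that remark, here is a minimal sketch (assuming PyTorch is available) that implements the marginalized energy above and lets autograd reproduce the hand-derived gradients:

```python
import torch

def effective_energy(x, W, a, b):
    """E_theta(x) = a^T x - sum_j log(1 + exp(-(x^T W + b)_j)), per the convention above."""
    return x @ a - torch.nn.functional.softplus(-(x @ W + b)).sum()

n_vis, n_hid = 6, 3
W = torch.randn(n_vis, n_hid, requires_grad=True)
a = torch.zeros(n_vis, requires_grad=True)
b = torch.zeros(n_hid, requires_grad=True)
x = torch.bernoulli(torch.full((n_vis,), 0.5))

effective_energy(x, W, a, b).backward()              # autograd fills W.grad, a.grad, b.grad

sig = torch.sigmoid(-(x @ W + b))
print(torch.allclose(W.grad, torch.outer(x, sig)))   # dE/dW_ij = x_i * sigma(-(x^T W + b)_j)
print(torch.allclose(b.grad, sig))                   # dE/db_j  = sigma(-(x^T W + b)_j)
print(torch.allclose(a.grad, x))                     # dE/da    = x
```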
