Here the matrix $\mathbf W$ plays the role of the parameters $\theta$ in the formulation above. It resembles the famous Ising model.
We want to maximize the likelihood
$l = \prod\limits_y p(y)$,
or equivalently minimize the negative log-likelihood (NLL) loss
\begin{align}\mathcal{L}=\frac{L}{N} = \frac 1 N \sum\limits_y E(y) +\log Z\end{align}

The partition function $Z$ is notoriously hard to obtain; computing it is one of the fundamental problems in statistical physics.
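To get a feeling for why, here is a minimal sketch (a toy quadratic, Ising-like energy over binary variables; the model and sizes are purely illustrative) that computes $Z=\sum_x e^{-E(x)}$ by brute force. The sum runs over $2^n$ configurations, so it stops being feasible almost immediately.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 12                                  # 2**12 = 4096 configurations; n ~ 300 would be hopeless
W = rng.normal(scale=0.1, size=(n, n))  # toy Ising-like couplings (illustrative)

def energy(x):
    # E(x) = x^T W x for a binary configuration x
    return x @ W @ x

# Brute-force partition function: enumerate every binary configuration.
Z = sum(np.exp(-energy(np.array(cfg))) for cfg in itertools.product([0, 1], repeat=n))
print(f"Z = {Z:.3f}  (summed {2**n} terms)")
```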
Interestingly, $\frac{\partial \mathcal L}{\partial \theta}$ can be estimated!
\begin{align}\frac{\partial \mathcal L}{\partial\theta} &= \mathbb E_{y\in \mathcal D}[E'(y)]-\frac{\sum\limits_x E'(x)e^{-E(x)}}{Z}\\&=\mathbb E_{y\in \mathcal D}[E'(y)] - \mathbb E_{x\sim p(x)}[E'(x)]\end{align}

where $E'$ denotes $\frac{\partial E}{\partial\theta}$. However, classical spin statistics are not that simple: the second term, an expectation under the model distribution $p(x)$, is still hard to compute exactly.
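In code, this gradient estimate is just "average of $E'$ over the data minus average of $E'$ over model samples". A minimal sketch, assuming we already have some way to draw (approximate) model samples, e.g. the Gibbs chain described below; `grad_E` here is a placeholder for $\partial E_\theta/\partial\theta$:

```python
import numpy as np

def nll_grad(grad_E, data_batch, model_samples):
    """Estimate dL/dtheta = E_{y in D}[E'(y)] - E_{x ~ p(x)}[E'(x)].

    grad_E(x)     -- returns dE_theta(x)/dtheta for one configuration x
    data_batch    -- samples y from the dataset (positive phase)
    model_samples -- samples x from the model, e.g. via Gibbs sampling (negative phase)
    """
    positive = np.mean([grad_E(y) for y in data_batch], axis=0)
    negative = np.mean([grad_E(x) for x in model_samples], axis=0)
    return positive - negative
```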
Inference means: given part of $x$, guess the rest. It is based on the conditional probability $p(x_{B}\vert x_{A}) = \sum\limits_h p(x_{B}\vert h)\,p(h\vert x_{A})$, which is useful in recommender systems.
However, inference (i.e. evaluating conditional probabilities) is hard in a general energy-based model, so we turn to a Restricted Boltzmann Machine (RBM):
$$E(x, h) = x^T\mathbf W h + b^T h + x^T a$$
The conditional probability $p(x\vert h)\propto e^{-x^T(\mathbf Wh+a)}=\prod\limits_i e^{-x_i\Theta_i}$, where $\Theta_i$ is the $i$-th element of $\mathbf Wh+a$.
Since the variables $x_i$ are independent of each other given $h$, we can do pixel-wise sampling according to $p(x_i=1\vert h) = \frac{e^{-\Theta_i}}{1+e^{-\Theta_i}}=\sigma(-\Theta_i)$ (equivalently, $p(x_i=0\vert h)=\sigma(\Theta_i)$).
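A minimal NumPy sketch of these factorized conditionals (names and shapes are my own choice; $\mathbf W$ is visible-by-hidden, $a$ and $b$ are the visible and hidden biases as above). These two functions are also what the inference $p(x_B\vert x_A)$ above boils down to: clamp the observed units, sample $h$, then sample the missing units.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_x_given_h(h, W, a):
    # p(x_i = 1 | h) = sigma(-(W h + a)_i), from E(x, h) = x^T W h + b^T h + x^T a
    p = sigmoid(-(W @ h + a))
    return (rng.random(p.shape) < p).astype(float)

def sample_h_given_x(x, W, b):
    # by the same argument, p(h_j = 1 | x) = sigma(-(x^T W + b)_j)
    p = sigmoid(-(x @ W + b))
    return (rng.random(p.shape) < p).astype(float)
```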
Gibbs sampling:
Conditional sampling $x_1\rightarrow h_1\rightarrow x_2 \rightarrow \ldots\rightarrow x_n$ will converge to samples from $p(x)$ and $p(h)$ (a code sketch follows after the list below), because:
1). Detailed balance: $\frac{p(x_{t}\vert x_{t-1})}{p(x_{t-1}\vert x_t)}=\frac{p(x_t)}{p(x_{t-1})}$, because
\begin{align}p(x_t\vert x_{t-1}) &=\sum\limits_{h} p(x_t\vert h)\,p(h\vert x_{t-1})=\sum\limits_h\frac{p(x_t, h)\,p(h, x_{t-1})}{p(h)\,p(x_{t-1})},\\ p(x_{t-1}\vert x_{t}) &=\sum\limits_{h} p(x_{t-1}\vert h)\,p(h\vert x_{t})=\sum\limits_h\frac{p(x_{t-1}, h)\,p(h, x_t)}{p(h)\,p(x_t)}.\end{align}

Since $p(x_t,h)=p(h,x_t)$, the two sums differ term by term exactly by the factor $\frac{p(x_t)}{p(x_{t-1})}$, which gives the balance condition. Statistical ensemble $\rightarrow$ time ensemble.
2). Ergodicity: obvious, since all the conditional probabilities above are strictly positive (the sigmoids never reach 0 or 1 for finite weights).
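Continuing the sketch above (reusing `sample_x_given_h`, `sample_h_given_x` and `rng`), the Gibbs chain is just the alternation $x\rightarrow h\rightarrow x\rightarrow\ldots$; sizes and parameters here are purely illustrative.

```python
def gibbs_chain(x0, W, a, b, n_steps=1000):
    """Alternate x -> h -> x; after burn-in, (x, h) approximately follow p(x) and p(h)."""
    x = x0
    for _ in range(n_steps):
        h = sample_h_given_x(x, W, b)
        x = sample_x_given_h(h, W, a)
    return x, h

# Toy usage with random parameters.
n_vis, n_hid = 6, 3
W = rng.normal(scale=0.1, size=(n_vis, n_hid))
a, b = rng.normal(size=n_vis), rng.normal(size=n_hid)
x_sample, h_sample = gibbs_chain(rng.integers(0, 2, n_vis).astype(float), W, a, b)
```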
Marginalizing out the hidden units $h$ gives the effective energy of $x$ alone, with $p(x)\propto e^{-E_\theta(x)}$:

$$E_\theta(x) = x^Ta-\sum\limits_j\log\left(1+e^{(-x^T \mathbf W-b^T)_j}\right)$$

$$\begin{cases}\frac{\partial E_\theta(x)}{\partial W_{ij}} &= x_i\,\sigma\big((-x^T \mathbf W-b^T)_j\big)\\ \frac{\partial E_\theta(x)}{\partial b_{j}} &= \sigma\big((-x^T \mathbf W-b^T)_j\big)\\ \frac{\partial E_\theta(x)}{\partial a_i} &= x_i \end{cases}$$

Remark: usually we don't need to do this gradient stuff by hand; we have PyTorch and TensorFlow!
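For example, a minimal PyTorch sketch of the free energy above, where autograd reproduces exactly these derivatives (sizes and initialization are illustrative; `softplus(z)` is $\log(1+e^z)$):

```python
import torch
import torch.nn.functional as F

n_vis, n_hid = 6, 3
W = (0.1 * torch.randn(n_vis, n_hid)).requires_grad_()
a = torch.zeros(n_vis, requires_grad=True)
b = torch.zeros(n_hid, requires_grad=True)

def free_energy(x):
    # E_theta(x) = x^T a - sum_j log(1 + exp(-(x^T W + b)_j))
    return x @ a - F.softplus(-(x @ W + b)).sum()

x = torch.randint(0, 2, (n_vis,)).float()
free_energy(x).backward()   # fills W.grad, a.grad, b.grad
print(a.grad)               # equals x, matching dE_theta/da_i = x_i above
```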