An energy-based model describes a probability distribution
$p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z_\theta}$, where $E_\theta(x)$ is an energy function described by a graph, $Z_\theta = \sum_x e^{-E_\theta(x)}$ is the partition function, $x$ is a vector with components $x_i \in \{0, 1\}$, and $\theta$ denotes the network parameters.
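For intuition, here is a minimal sketch (not from the original notes) that evaluates this distribution exactly for a handful of binary units by enumerating all $2^n$ states; `toy_energy` is a hypothetical stand-in for any $E_\theta(x)$:

```python
import itertools
import numpy as np

def partition_function(energy, n):
    """Z = sum of exp(-E(x)) over all 2^n binary states x."""
    states = itertools.product([0, 1], repeat=n)
    return sum(np.exp(-energy(np.array(s))) for s in states)

def prob(energy, x):
    """p(x) = exp(-E(x)) / Z; exact, but feasible only for small n."""
    x = np.asarray(x)
    return np.exp(-energy(x)) / partition_function(energy, len(x))

# Toy energy E(x) = -x0*x1, which favors the two units being on together.
toy_energy = lambda x: -float(x[0] * x[1])
print(prob(toy_energy, [1, 1]))  # the most probable state
```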
Typically, we can construct a Boltzmann Machine with the energy defined as
$E(x) = -\frac{1}{2} \sum_{ij} x_i W_{ij} x_j$
[picture of a Boltzmann machine, as an example]
Here the matrix $W$ plays the role of the parameters $\theta$ in the formulation above. The model resembles the famous Ising model.
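As a small sketch of this energy (my own, under the convention above; I assume $W$ symmetric with zero diagonal, as in the Ising analogy):

```python
import numpy as np

def bm_energy(x, W):
    """E(x) = -1/2 * sum_ij x_i W_ij x_j for binary x.
    W is assumed symmetric with zero diagonal."""
    x = np.asarray(x, dtype=float)
    return -0.5 * x @ W @ x

# A positive coupling lowers the energy when both units are on,
# so the state (1, 1) becomes more probable under p(x) ~ exp(-E(x)).
W = np.array([[0.0, 1.0],
              [1.0, 0.0]])
print(bm_energy([1, 1], W))  # -1.0
```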
Inference means: given part of $x$, guess the rest. It is based on the conditional probability $p(x_B \mid x_A) = \sum_h p(x_B \mid h)\, p(h \mid x_A)$, which is useful in recommender systems.
However, inference (i.e., computing conditional probabilities) is hard in a general energy-based model, so we need a Restricted Boltzmann Machine, with energy
$E(x, h) = x^T W h + b^T h + a^T x$, where $h$ is a vector of hidden units.
The conditional probability is then $p(x \mid h) \propto e^{-x^T (W h + a)} = \prod_i e^{-x_i \Theta_i}$, where $\Theta_i$ is the $i$-th element of $W h + a$.
Since, given $h$, all variables $x_i$ are independent of each other, we can do pixel-wise sampling according to the probability $p(x_i) = \frac{e^{-x_i \Theta_i}}{1 + e^{-\Theta_i}}$ (i.e., $p(x_i = 0) = \sigma(\Theta_i)$ and hence $p(x_i = 1) = \sigma(-\Theta_i)$).
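A minimal sketch of this pixel-wise sampling step, assuming the sign convention above (so $p(x_i = 1) = \sigma(-\Theta_i)$); the function and parameter names are illustrative, not from the notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_x_given_h(h, W, a, rng):
    """Sample every visible unit independently given h.
    Theta = W h + a; under E(x, h) = x^T W h + b^T h + a^T x,
    p(x_i = 1) = sigmoid(-Theta_i)."""
    theta = W @ h + a
    p_on = sigmoid(-theta)
    return (rng.random(p_on.shape) < p_on).astype(np.float64)

# Usage: rng = np.random.default_rng(); x = sample_x_given_h(h, W, a, rng)
```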
Gibbs sampling:
Alternating conditional sampling $x^{(1)} \to h^{(1)} \to x^{(2)} \to \dots \to x^{(n)}$ converges to samples from $p(x)$ and $p(h)$.
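Put together, a block-Gibbs chain for the RBM might look like the sketch below. The hidden-unit step $p(h_j = 1 \mid x) = \sigma(-(W^T x + b)_j)$ follows by the same argument as for $x$; the sweep count and parameters are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli(p, rng):
    """Draw independent 0/1 samples with success probabilities p."""
    return (rng.random(p.shape) < p).astype(np.float64)

def gibbs_chain(x0, W, a, b, n_sweeps, rng):
    """Alternate x -> h -> x. Under E(x, h) = x^T W h + b^T h + a^T x,
    p(h_j = 1 | x) = sigmoid(-(W^T x + b)_j) and
    p(x_i = 1 | h) = sigmoid(-(W h + a)_i).
    After enough sweeps, (x, h) approximates a draw from p(x, h)."""
    x = np.asarray(x0, dtype=np.float64)
    for _ in range(n_sweeps):
        h = bernoulli(sigmoid(-(W.T @ x + b)), rng)  # h ~ p(h | x)
        x = bernoulli(sigmoid(-(W @ h + a)), rng)    # x ~ p(x | h)
    return x, h

# Illustrative run with random parameters (not a trained model):
rng = np.random.default_rng(0)
W = rng.normal(size=(6, 3)); a = np.zeros(6); b = np.zeros(3)
x, h = gibbs_chain(rng.integers(0, 2, size=6), W, a, b, 100, rng)
```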