# Background on Probabilistic Machine Learning¶

Artificial intelligence (AI) is transforming the world. We see it every day at home, at work, when shopping, when socializing, and even when driving our cars. Machine learning algorithms operate by constructing a model with parameters that can be learned from a large amount of example input so that the trained model can make predictions about unseen data.

## Probabilistic Modeling¶

Most of the transformation that AI has brought to-date has been based on deterministic
machine learning models such as feed-forward neural networks. The real world, however,
is nondeterministic and filled with uncertainty. *Probabilistic*
models explicitly handle this uncertainty by accounting for gaps in our knowledge and
errors in data sources.

In probabilistic models, *probability distributions* represent the unobserved quantities in
a model (including noise effects) and how they relate to the data. The distribution of
the data is approximated based on a finite set of *samples*. The model infers from the
observed data, and learning occurs as it transforms the *prior* distributions, defined
before observing the data, into *posterior* distributions, defined afterward. If the
training process is successful, the learned distribution resembles the actual distribution
of the data to the extent that the model can make correct predictions about unseen
situations—correctly interpreting a previously unseen handwritten digit, for example.

In short, probabilistic modeling is a practical approach for designing machines that:

- Learn from noisy and unlabeled data
- Define confidence levels in predictions
- Allow decision making in the absence of complete information
- Infer missing data and latent correlations in data

## Boltzmann Distribution¶

A probability distribution is a mathematical function that assigns a probability value to an event.
Depending on the nature of the underlying event, this function can be defined for a continuous event
(e.g., a normal distribution) or a discrete event (e.g., a Bernoulli distribution).
A *Boltzmann distribution* belongs to the latter group. It is an energy-based distribution that
defines probability, \(p\), for each of the discrete states in a binary vector.

Let’s assume \(\vc{x}\) represents a set of \(N\) binary random variables. Conceptually, the space of \(\vc{x}\) corresponds to binary representations of all numbers from 0 to \(2^N - 1\). We can represent it as a column vector, \(\vc{x}^T = [x_1, x_2, \dots, x_N]\), where \(x_n \in \{0, 1\}\) is the state of the \(n^{th}\) binary random variable in \(\vc{x}\).

The Boltzmann distribution defines a probability for each possible state that \(\vc{x}\) can take using[1]

[1] | \(\beta\) is omitted from this equation because usually, in the context of machine learning, it is assumed to be 1. |

where \(E(\vc{x};\theta)\) is an energy function parameterized by \(\theta\), which contains the biases, and

is the *normalizing coefficient*, also known as the *partition function*, that ensures that \(p(\vc{x})\) sums to
1 over all the possible states of \(x\); that is,

Note that because of the negative sign for energy, \(E\), the states with high probability correspond to states with low energy.

The energy function \(E(\vc{x};\theta)\) is represented via a quadratic form \(\vc{x}^{\rm T} Q\vc{x}\) in which the matrix \(Q_{\theta}\) is defined by the biases (\(q_i\)) and correlation weights (\(q_{i,j}\)). So, \(Q\) is an upper-triangular matrix that encapsulates the parameters of the quadratic energy function defined by

Note

Above we have used \(x_i x_i = x_i\) as \(x_i \in \{0, 1\}\).

The off-diagonal entries of \(Q\) represent the correlation weights between the elements of \(\vc{x}\). If \(q_{i,j}\) is small (a negative value with a large magnitude), then \(x_i\) and \(x_j\) are more likely to be 1 at the same time. On the other hand, the diagonal entries of \(Q\) bias the probability of individual binary variables in \(x\). For example, if \(q_i\) is large, then \(x_i\) is more likely to be zero.

Key points:

- The energy function is given by a formula that turns each vector, \(\vc{x}\), into a number that need be neither positive nor normalized.
- The energy function is a quadratic function of \(\vc{x}\) determined by \(Q\), which is a matrix representing all the node bias (\(q_{i,i}\)) and edge weight (\(q_{i,j}\)) parameters.
- We exponentiate the energy to obtain the unnormalized probability.
- We divide by \({Z}\) to ensure normalization (that the result sums to 1).

We can create a simple *Boltzmann machine* by using two binary random variables, \(\vc{x}^T = [x_1, x_2]\). Let’s assume that \(Q\) has the
following form:

In this case we have:

We can use the energy function defined above to compute the probability of each state:

\(x_1\) \(x_2\) | \(E_Q(\vc{x})\) | \(p(\vc{x})\) |
---|---|---|

0 0 | 0 | 0.13 |

0 1 | -1 | 0.37 |

1 0 | -1 | 0.37 |

1 1 | 0 | 0.13 |

In this example, \(Z = \exp(0) + \exp(1) + \exp(1) + \exp(0) = 7.44\). Note that we chose \(Q\) such that the states in which either \(x_1\) or \(x_2\) is 1 have high probability and the states in which both or neither of them is 1 have low probability.

## Restricted Boltzmann Machines¶

A *restricted Boltzmann machine* (RBM) is a special type of Boltzmann machine with a
symmetrical *bipartite* structure; see Figure 31. It defines a
probability distribution over a set of binary variables that are divided into visible (input), \(\vc{v}\),
and hidden, \(\vc{h}\), variables, which are analogous to the retina and brain, respectively.[2]
The hidden variables allow for more complex dependencies among visible variables and are often
used to learn a stochastic generative model over a set of inputs. All visible variables connect
to all hidden variables, but no variables in the same layer are linked. This limited connectivity
makes inference and therefore learning easier because the RBM takes only a single step to reach
thermal equilibrium if we *clamp* the visible variables to particular binary states.

[2] | Analogy courtesy of Pedro Domingos in The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World.
Basic Books, 2015. |

During the learning process, each visible variable is responsible for a feature from an item in the dataset
to be learned. For example, images from the famous MNIST dataset of handwritten digits[3] have 784
pixels, so the RBMs that are training from this dataset require 784 visible variables. Each variable has
a *bias* and each connection between variables has a *weight*. These values
determine the energy of the output.

[3] | http://yann.lecun.com/exdb/mnist/ |

Without the introduction of hidden variables, the energy function \(E(\vc{x})\) by itself is not sufficiently flexible to give good models. We write \(\vc{x}=[\vc{v},\vc{h}]\) and denote the energy function as \(E(\vc{v},\vc{h})\).

Then,

and we are interested in

which we obtain by marginalizing over the hidden variables, \(\vc{h}\).

## Training¶

A standard training criterion used to determine the energy function is to *maximize* the
log likelihood (LL) of the training data—or, equivalently, to *minimize* the negative log likelihood (NLL) of the data.
Training data is repetitively fed to the RBM and corresponding improvements made to the model.

When training a Boltzmann machine, we are given \(D\) training (visible) examples \(\vc{v}^{(1)}, ..., \vc{v}^{(D)}\), and would like to find a setting for \(\theta\) under which this data is highly likely. Note that \(n^{th}\) component of the \(d^{th}\) training example is \(v_n^{(d)}\).

To find \(\theta\), we maximize the likelihood of the training data:

- The likelihood is \(L(\theta) = \prod_{d=1}^D p(v^{(d)};\theta)\)
- It is more convenient, computationally, to maximize the log likelihood:

We use the *gradient descent* method to minimize the \(NLL(\theta)\):

- Starting at an initial guess for \(\theta\) (say, all zero values) we calculate the gradient (the direction of fastest improvement) and then take a step in that direction.
- We then iterate by taking the gradient at the new point and moving downhill again.

To calculate the gradient at a particular \(\theta\), we must evaluate some expected values: \(E_{p(\vc{x};\theta)} f(\vc{x})\) for a set of functions \(f(\vc{x})\) known as the sufficient statistics. The expected values cannot be determined exactly, because we cannot sum over all \(2^N\) configurations; therefore, we approximate by only summing over the most probable configurations, which we obtain by sampling from the distribution given by the current \(\theta\).

## Ising Hamiltonian¶

Using a transform of variables, one can write the binary variables \((0,1)\) in terms of spin states \((-1,1)\). Doing so maps the QUBO problem to an equivalent Ising Hamiltonian. This \(N\)-variable system has \(2^N\) classical states, and can be represented by the classical Ising Hamiltonian operator (matrix):

where \(\sigma^z = \begin{bmatrix} -1 & 0 \\ 0 & +1 \end{bmatrix}\), \(b_a\) is \(q_i\), and \(w_{ab}\) is \(q_{i,j}\).

This Hamiltonian is a diagonal matrix with dimensions \(2^N\) by \(2^N\). The classical states are eigenvectors (that is, fundamental states) of the Hamiltonian. Each eigenvector has associated with it an eigenvalue, which in this case is the energy of the state.

So we have a \(2^N\)-dimensional Hamiltonian whose \(2^N\) eigenvectors are classical states, each with its energy as the associated eigenvalue. In these terms we can now think about the classical Boltzmann distribution: a probability distribution on the eigenvectors of a Hamiltonian such that the probability of choosing an eigenvector is related to its eigenvalue.