# Stating the Problem

This chapter presents a catalog of problems that can be used as templates: each represents a type of problem with similar solution steps on the D-Wave system.

Keep in mind that there are different ways to model a given problem; for example, a constraint satisfaction problem (CSP) can have various domains, variables, and constraints. Model selection can affect solution performance, so it may be useful to consider various approaches.

Quantum annealing processors naturally return low-energy solutions; some applications require the true minimum energy while others benefit from good low-energy samples. This approach is best suited to solving discrete optimization problems and probabilistic sampling problems.

**Optimization problems.** In an optimization problem,
we search for the best of many possible combinations.
Optimization problems include scheduling challenges, such
as “Should I ship this package on this truck or the next one?”
or “What is the most efficient route a traveling
salesperson should take to visit different cities?”

Physics can help solve these sorts of problems because we can frame them as energy minimization problems. A fundamental rule of physics is that everything tends to seek a minimum energy state. Objects slide down hills; hot things cool down over time. This behavior is also true in the world of quantum physics. Quantum annealing simply uses quantum physics to find low-energy states of a problem and therefore the optimal or near-optimal combination of elements.

**Sampling problems.** Sampling from many low-energy states and
characterizing the shape of the energy landscape is useful for
machine learning problems where we want to build a probabilistic
model of reality. The samples give us information
about the model state for a given set of parameters, which can then be
used to improve the model.

Probabilistic models explicitly handle uncertainty by accounting for gaps in our knowledge and errors in data sources. Probability distributions represent the unobserved quantities in a model (including noise effects) and how they relate to the data. The distribution of the data is approximated based on a finite set of samples. The model infers from the observed data, and learning occurs as it transforms the prior distribution, defined before observing the data, into the posterior distribution, defined afterward. If the training process is successful, the learned distribution resembles the distribution that generated the data, allowing predictions to be made on unobserved data. For example, when training on the famous MNIST dataset of handwritten digits, such a model can generate images resembling handwritten digits that are consistent with the training set.

Sampling from energy-based distributions is a computationally intensive task that is an excellent match for the way that the D-Wave system solves problems; that is, by seeking low-energy states.

## Discrete (Combinatorial) Optimization Problems and CSPs

Discrete optimization, also known as combinatorial optimization, is the optimization of an objective function defined over a set of discrete values such as Booleans.

The D-Wave system can be viewed as a hardware heuristic that minimizes Ising objective functions using a physically realized version of quantum annealing. The Reformulating a Problem chapter provides techniques to reformulate various problems as Ising (or QUBO) objective functions.
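To make the idea of an Ising/QUBO objective concrete, a QUBO over a handful of binary variables can be minimized by exhaustive search; this is a classical stand-in for what the QPU does by sampling, and the matrix values below are purely hypothetical:

```python
from itertools import product

def qubo_energy(x, Q):
    """Energy of binary vector x under QUBO matrix Q, given as a dict
    mapping (i, j) to a weight; (i, i) entries are linear biases."""
    return sum(w * x[i] * x[j] for (i, j), w in Q.items())

def brute_force_minimum(Q, n):
    """Exhaustively find a lowest-energy state; the QPU instead samples
    such low-energy states via quantum annealing."""
    return min(product((0, 1), repeat=n), key=lambda x: qubo_energy(x, Q))

# Hypothetical 2-variable QUBO: each variable is rewarded for being 1,
# but the coupling penalizes setting both.
Q = {(0, 0): -1, (1, 1): -1, (0, 1): 2}
print(brute_force_minimum(Q, 2))  # (0, 1), with energy -1
```

Exhaustive search scales exponentially in the number of variables, which is precisely why a hardware heuristic for this minimization is valuable.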

**Further Information**

- [Jue2016] discusses quantum annealing for Boolean satisfiability problems.
- [Koc2004] and related papers show that QUBOs are an effective representation for modeling and solving a variety of discrete optimization problems.
- [Bar1982] shows that Ising and QUBO problems are NP-hard.
- [Luc2013] discusses Ising formulations of NP problems.

### Job-Shop Scheduling

The job-shop scheduling problem is to schedule multiple jobs on several machines so as to maximize priority or minimize schedule length (known as the *makespan*, the time interval between starting the first job and finishing the last). A job is an ordered sequence of tasks, each performed on a particular machine, subject to the constraints that a machine executes at most one task at a time and must complete any task it starts.

**Workflow Steps**

- State this problem as a CSP instance (the Background chapter gives introductory examples to this step).
- Translate integer variables to binary (the Simple Map Coloring Problem section gives an example).
- Reformulate this CSP instance as a QUBO (using the technique described in the CSP Reformulation with Penalty Functions section, for example, if you are scheduling for priority maximization).[1]
- If the problem is too large to fit on the QPU, decompose the problem (using methods such as those in the Decomposing a Large Problem chapter).[2]
- Follow the remaining steps of the general workflow of Table 3.

[1] Scheduling for makespan minimization requires either ancillary variables, also discussed in the Reformulating a Problem chapter, or a hybrid approach, such as one discussed in the Decomposing a Large Problem chapter.

[2] Typically, recasting from integer variables creates a large number of binary variables, many of which represent impossible combinations and so are unnecessary. For this type of problem, decomposition is likely needed.
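The integer-to-binary translation step can be illustrated with a minimal sketch (the instance is hypothetical): an integer start time is replaced by one-hot binary variables, and a quadratic penalty enforces that exactly one of them is set.

```python
from itertools import product

# Hypothetical tiny instance: one task with 3 candidate start times.
# One-hot encoding: x[t] = 1 iff the task starts at time t.
def one_hot_penalty(x):
    """Quadratic penalty (1 - sum(x))^2; zero only when exactly one bit is set."""
    return (1 - sum(x)) ** 2

# Valid one-hot assignments are exactly the zero-penalty states.
feasible = [x for x in product((0, 1), repeat=3) if one_hot_penalty(x) == 0]
print(feasible)  # [(0, 0, 1), (0, 1, 0), (1, 0, 0)]
```

Penalties of this form, weighted strongly enough, are how the scheduling constraints enter the QUBO objective.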

**Further Information**

- [Ven2015] describes an implementation of job-shop scheduling on the D-Wave system, which includes formulating the problem, translating to QUBO, and applying variable reduction techniques. It also talks about direct embedding of local constraints.

### Fault Diagnosis

Fault diagnosis attempts to quickly localize failures as soon as they are detected in systems such as sensor networks, process monitoring, and safety monitoring. Circuit fault diagnosis attempts to identify failed gates during manufacturing, under the assumption that gate failure is rare enough that the minimum number of gates failing is the most likely cause of the detected problem.
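The minimum-fault principle can be sketched on a hypothetical two-gate circuit: enumerate fault assignments, keep those consistent with the observed output, and prefer the one with the fewest faults. The fault model here (a faulty gate inverts its output) is a simplifying assumption for illustration.

```python
from itertools import product

def simulate(x1, x2, f_and, f_not):
    """Simulate out = NOT(AND(x1, x2)); under this simple hypothetical fault
    model, a faulty gate (f = 1) inverts its correct output."""
    z = (x1 & x2) ^ f_and
    return (1 - z) ^ f_not

# Observed: inputs (1, 1) should give 0, but the circuit output 1.
observed = 1
consistent = [f for f in product((0, 1), repeat=2)
              if simulate(1, 1, *f) == observed]
# The most likely diagnosis is a consistent pattern with the fewest faults.
diagnosis = min(consistent, key=sum)
print(diagnosis)  # (0, 1): a single fault in the NOT gate explains it
```

On the D-Wave system this exhaustive enumeration is replaced by minimizing a penalty objective whose ground states are the minimum-fault diagnoses.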

**Further Information**

- The Simple Circuit Fault Diagnosis Problem section in the Background chapter shows the steps of solving a circuit fault diagnosis problem on the D-Wave system.
- [Bia2016] discusses embedding fault diagnosis CSPs on the D-Wave system.
- [Bis2017] discusses a problem of diagnosing faults in an electrical power-distribution system.
- [Pap1976] discusses decomposing complex systems for the problem of generating tests for digital-faults detection.

### Additional Problems

Protein Folding

Protein folding refers to the way protein chains structure themselves to perform their biological functions. Although their constituent amino acids permit many configurations, proteins rarely misfold (misfolded proteins are a cause of disease) because the standard configuration has lower energy and so is more stable.

- [Per2012] discusses using the D-Wave system to find the lowest-energy configuration for a folded protein.

Traffic-Flow Optimization

One form of the traffic-flow optimization problem is to minimize the travel time of a group of vehicles from their sources to destinations by minimizing congestion on the roads being used.

- [Flo2017] describes work done by Volkswagen to map a traffic-flow optimization problem on the D-Wave system.

Factoring

The factoring problem is to decompose a number into its factors. There is no known method to quickly factor large integers—the complexity of this problem has made it the basis of public-key cryptography algorithms.
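Factoring can be cast as minimizing an objective that vanishes exactly at a valid factorization; a brute-force sketch over a hypothetical tiny instance follows. (In a real QUBO formulation, the squared product expands to higher-order terms that must be reduced to quadratic, a technique covered in the Reformulating a Problem chapter.)

```python
from itertools import product

def factoring_energy(bits_p, bits_q, n):
    """Objective (n - p*q)^2 over binary-encoded candidate factors;
    it is zero exactly when p * q = n."""
    p = sum(b << i for i, b in enumerate(bits_p))
    q = sum(b << i for i, b in enumerate(bits_q))
    return (n - p * q) ** 2

# Hypothetical tiny instance: factor 15 using two 3-bit candidates.
n = 15
best = min(product((0, 1), repeat=6),
           key=lambda bits: factoring_energy(bits[:3], bits[3:], n))
p = sum(b << i for i, b in enumerate(best[:3]))
q = sum(b << i for i, b in enumerate(best[3:]))
print(p * q)  # 15
```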

- The Factoring section in the Extended Examples chapter is an example of factoring on the D-Wave system.
- [Dwave3] discusses integer factoring in the context of using the D-Wave Anneal Offsets feature; see also the Setting Anneal Offsets section.

Portfolio Optimization

Portfolio optimization is the problem of optimizing the allocation of a budget to a set of financial assets.

- [Els2017] discusses using Markowitz’s optimization of the financial portfolio selection problem on the D-Wave system.
- [Ros2016a] discusses solving a portfolio optimization problem on the D-Wave system.
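A minimal sketch of a Markowitz-style binary formulation (all numbers hypothetical): reward expected return, penalize pairwise risk, and enforce a budget on the number of selected assets with a quadratic penalty.

```python
from itertools import product

# Hypothetical data: expected returns and pairwise covariances (risk).
returns = [0.10, 0.07, 0.05]
cov = {(0, 1): 0.04, (0, 2): 0.01, (1, 2): 0.02}

def portfolio_energy(x, budget=2, lam=1.0, penalty=10.0):
    """Markowitz-style QUBO: minimize risk minus return, with a quadratic
    penalty enforcing that exactly `budget` assets are selected."""
    ret = sum(r * xi for r, xi in zip(returns, x))
    risk = sum(c * x[i] * x[j] for (i, j), c in cov.items())
    return lam * risk - ret + penalty * (sum(x) - budget) ** 2

best = min(product((0, 1), repeat=3), key=portfolio_energy)
print(best)  # (1, 0, 1): assets 0 and 2 balance return against covariance
```

The linear terms (returns) and quadratic terms (covariances, budget penalty) map directly onto the qubit biases and coupler strengths of the QPU.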

Database Query Minimization (SAT Filters)

A satisfiability (SAT) filter is a small data structure that enables fast querying over a huge dataset by allowing for false positives (but not false negatives).

## Machine Learning Problems

Machine learning algorithms operate by constructing a model with parameters that can be
learned from a large amount of example input so that the trained model can make
predictions about unseen data. Most implementations to date have been based on deterministic
machine learning models such as feed-forward neural networks. The real world, however,
is nondeterministic and filled with uncertainty. *Probabilistic*
models explicitly handle this uncertainty by accounting for gaps in our knowledge and
errors in data sources.

In probabilistic models, *probability distributions* represent the unobserved quantities in
a model (including noise effects) and how they relate to the data. The distribution of
the data is approximated based on a finite set of *samples*. The model infers from the
observed data, and learning occurs as it transforms the *prior* distributions, defined
before observing the data, into *posterior* distributions, defined afterward. If the
training process is successful, the learned distribution resembles the actual distribution
of the data to the extent that the model can make correct predictions about unseen
situations—correctly interpreting a previously unseen handwritten digit, for example.

A *Boltzmann distribution* is a discrete energy-based distribution that
defines probability, \(p\), for each of the discrete states in a binary vector.
For a set of \(N\) binary random variables, \(\vc{x}\), the Boltzmann
distribution defines a probability for each possible state that \(\vc{x}\)
can take using[3]

\[p(\vc{x}) = \frac{\exp(-E(\vc{x};\theta))}{Z(\theta)},\]

[3] \(\beta\) is omitted from this equation because usually, in the context of machine learning, it is assumed to be 1.

where \(E(\vc{x};\theta)\) is an energy function parameterized by \(\theta\) and

\[Z(\theta)\]

is the *normalizing coefficient*, also known as the *partition function*, that ensures that \(p(\vc{x})\) sums to
1 over all the possible states of \(\vc{x}\); that is,

\[Z(\theta) = \sum_{\vc{x}} \exp(-E(\vc{x};\theta)).\]
Note that because of the negative sign for energy, \(E\), the states with high probability correspond to states with low energy.

The energy function \(E(\vc{x};\theta)\) can be represented as a QUBO: the linear coefficients bias the probability of individual binary variables in \(\vc{x}\) and the quadratic coefficients represent the correlation weights between the elements of \(\vc{x}\). The D-Wave architecture, which natively processes information through the Ising/QUBO models (linear coefficients are represented by qubit biases and quadratic coefficients by coupler strengths), can help discrete energy-based machine learning.
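The relationship between QUBO energies and Boltzmann probabilities can be sketched directly (\(\beta = 1\), as in the text; the QUBO values below are hypothetical):

```python
from itertools import product
from math import exp

def qubo_energy(x, Q):
    """QUBO energy: diagonal entries of Q are linear biases,
    off-diagonal entries are quadratic couplings."""
    return sum(w * x[i] * x[j] for (i, j), w in Q.items())

def boltzmann_probabilities(Q, n):
    """p(x) = exp(-E(x)) / Z with beta = 1, as in the text."""
    states = list(product((0, 1), repeat=n))
    weights = [exp(-qubo_energy(x, Q)) for x in states]
    Z = sum(weights)  # the partition function
    return {x: wt / Z for x, wt in zip(states, weights)}

# Hypothetical QUBO: each variable is biased toward 1, but setting both costs energy.
Q = {(0, 0): -1.0, (1, 1): -1.0, (0, 1): 2.0}
p = boltzmann_probabilities(Q, 2)
# The low-energy states (0, 1) and (1, 0) receive the highest probability.
```

Computing the partition function by enumeration, as here, is exponential in the number of variables; sampling sidesteps that cost, which is why QPU samples are useful for these models.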

**Further Information**

- [Bia2010] discusses using quantum annealing for machine learning applications in two modes of operation: zero-temperature for optimization and finite-temperature for sampling.

### Probabilistic Sampling: RBM

A Boltzmann machine is a stochastic recurrent neural network that defines a probability distribution over binary visible and hidden variables. A restricted Boltzmann machine (RBM) has no connections among the visible variables or among the hidden variables; connections run only between the two layers.

An RBM is shown in Figure 97.

**Further Information**

- [Sal2007] describes RBMs used to model tabular data, such as users’ ratings of movies.
- [Dwave2] provides an example of an RBM.
- [Hin2012] is a tutorial on training RBMs.
- [Ben2017] discusses sampling on the D-Wave system.
- [Ada2015] describes implementing an RBM on the D-Wave system to generate samples for estimating model expectations of deep neural networks.

### Energy-Based Models

Machine learning with energy-based models (EBMs) trains a model by assigning low scalar energy to configurations of variables that best capture the dependencies in the data; the approach applies to both probabilistic and nonprobabilistic models.

**Workflow Steps**

- For an RBM used as a generative model, for example, where the gradient needed to
  maximize the log-likelihood of the data is intractable (due to the partition
  function in the energy objective function), use samples from the D-Wave system
  instead of standard Gibbs sampling. The training has steps like these:
  - Initialize variables.
  - Clamp the visible nodes to training samples.
  - Sample from the D-Wave system.
  - Update and repeat as needed.

- For embedding, it may be possible to map variables (of the reduced problem) directly to qubits (e.g., pixels to qubits for image inputs), as shown in [Ben2017].
- Follow the remaining steps of the general workflow of Table 3.
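The training loop described above can be sketched classically. This simplified version assumes a fully visible Boltzmann machine (no hidden units) so the gradient is exact, and computes model expectations by enumeration in place of QPU samples:

```python
from itertools import product
from math import exp

def energy(x, w):
    """Energy of a fully visible Boltzmann machine (hidden units omitted
    for brevity): E(x) = -sum over i < j of w[i][j] * x_i * x_j."""
    n = len(x)
    return -sum(w[i][j] * x[i] * x[j] for i in range(n) for j in range(i + 1, n))

def model_correlations(w, n):
    """Exact <x_i x_j> under the model's Boltzmann distribution; in practice
    these expectations would be estimated from QPU samples."""
    states = list(product((0, 1), repeat=n))
    weights = [exp(-energy(x, w)) for x in states]
    Z = sum(weights)  # partition function
    corr = [[0.0] * n for _ in range(n)]
    for x, wt in zip(states, weights):
        for i in range(n):
            for j in range(i + 1, n):
                corr[i][j] += x[i] * x[j] * wt / Z
    return corr

def train(data, n, steps=200, lr=0.5):
    """Gradient ascent on the log-likelihood: push the model's pairwise
    correlations toward those observed in the data."""
    w = [[0.0] * n for _ in range(n)]
    data_corr = [[sum(x[i] * x[j] for x in data) / len(data) for j in range(n)]
                 for i in range(n)]
    for _ in range(steps):
        model_corr = model_correlations(w, n)
        for i in range(n):
            for j in range(i + 1, n):
                w[i][j] += lr * (data_corr[i][j] - model_corr[i][j])
    return w

# Toy dataset in which x0 and x1 are strongly correlated.
data = [(1, 1), (1, 1), (0, 0), (1, 1)]
w = train(data, 2)  # learned coupling w[0][1] ends up positive
```

The enumeration inside `model_correlations` is the intractable piece on realistic problem sizes; that is exactly what sampling, classical or quantum, approximates.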

### Support Vector Machines

Support vector machines (SVMs) find a hyperplane that separates data into classes while maximizing the margin to each class. Structured support vector machines (SSVMs) additionally assume structure in the output labels; for example, detecting a beach in a picture increases the chance that the picture shows a sunset.

**Workflow Steps**

Initialize \(\vc w=0\) where \(\vc w=[w_1,w_2,...]^T\) are weights to be learned from the training data.

Repeat until \(\vc w\) converges:

a. Minimize a regularized objective function for each training example, recording optimal solutions \(\hat{y}\), where \(\hat{y}_i\) is the maximizer for

\[\underset{\mathbf{y}\in\{0,1\}^{L}}{\max} \left\{\Delta(\mathbf{y},\mathbf{y}_{i})+ \langle\mathbf{w},\mathbf{\Psi}(\mathbf{x}_{i},\mathbf{y}) -\mathbf{\Psi}(\mathbf{x}_{i},\mathbf{y}_{i})\rangle\right\}\]

with features \(\Psi(\mathbf{x},\mathbf{y})\) defining a compatibility function \(f(\mathbf{x},\mathbf{y}) = \langle\mathbf{w},\Psi(\mathbf{x},\mathbf{y})\rangle\) on inputs \(\vc x\) and labels \(\vc y\).

Your objective function might look something like this:

\[F(w)=\frac{1}{d}\sum_{i=1}^{d}\left[\underset{y\in\{0,1\}^{L}}{\max}\left\{\Delta(y,y_{i})+\langle w,\Psi(x_{i},y)-\Psi(x_{i},y_{i})\rangle\right\}\right]+\lambda\frac{||w||^{2}}{2}\]

where \(\lambda\) is a regularization term that prefers simple models (avoiding over-fitting). This objective can be minimized using a subgradient such as

\[\partial_{w}F(w)= \frac{1}{d}\sum_{i=1}^{d} \left[\Psi(x_{i},\hat{y})-\Psi(x_{i},y_{i})\right] +\lambda w.\]

Submit this subgradient, formulated as a QUBO (see the Reformulating a Problem chapter), to the D-Wave system for minimization.

b. Iterate over variables: calculate \(\Delta w_j=\eta \cdot (y_i - \hat{y}_i)x_j\), where \(\eta\) is the learning rate, and set the new weight \(w_j \longleftarrow w_j + \Delta w_j\).
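The inner loss-augmented maximization and one subgradient step can be sketched with a brute-force argmax standing in for the QPU; the feature map and data below are hypothetical:

```python
from itertools import product

def psi(x, y):
    """Hypothetical joint feature map: outer product of input and label bits."""
    return [xi * yi for xi in x for yi in y]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def loss_augmented_argmax(w, x, y_true):
    """Brute force over {0,1}^L; on large label spaces this inner maximization
    is the piece one would reformulate as a QUBO for the QPU."""
    def score(y):
        delta = sum(a != b for a, b in zip(y, y_true))  # Hamming loss
        return delta + dot(w, psi(x, y)) - dot(w, psi(x, y_true))
    return max(product((0, 1), repeat=len(y_true)), key=score)

def subgradient_step(w, examples, lam=0.01, lr=0.1):
    """One subgradient-descent step on the regularized objective."""
    d = len(examples)
    grad = [lam * wi for wi in w]
    for x, y_true in examples:
        y_hat = loss_augmented_argmax(w, x, y_true)
        for k, (a, b) in enumerate(zip(psi(x, y_hat), psi(x, y_true))):
            grad[k] += (a - b) / d
    return [wi - lr * gi for wi, gi in zip(w, grad)]

# One step on a single hypothetical training example.
w = subgradient_step([0.0] * 4, [((1, 1), (1, 0))])
```

The brute-force `loss_augmented_argmax` is exponential in the label length \(L\), which is why that subproblem, not the outer weight update, is the candidate for quantum annealing.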

## Opportunities and Challenges

As noted in the Introduction chapter, quantum computing has the potential
to help solve some of the most complex problems that organizations face. However,
this cutting-edge technology still presents formidable challenges to potential
users. The *Technical Description of the D-Wave Quantum Processing Unit* guide describes many of these in detail. They include integrated
control errors (ICE) such as neighbor interference by qubits, flux noise,
quantization errors in configuring values of biases and coupler strengths, errors
due to the finite bandwidth of the system’s I/O, and physical variations between
qubits. Additionally, system performance may be affected by temperature,
high-energy photons, and other factors.

Such growing pains are typical of young, ground-breaking technologies, and persistence in overcoming them to exploit such technologies has frequently been rewarded. Quantum computing poses some difficult challenges to adoption but offers great opportunities.