Stating the Problem

This chapter presents a catalog of problems that can serve as templates: each represents a type of problem whose solution on the D-Wave system follows similar steps.

Keep in mind that there are different ways to model a given problem; for example, a constraint satisfaction problem (CSP) can have various domains, variables, and constraints. Model selection can affect solution performance, so it may be useful to consider various approaches.

Quantum annealing processors naturally return low-energy solutions; some applications require the true minimum energy while others require only good low-energy samples. This approach is best suited to solving discrete optimization problems and probabilistic sampling problems.

Optimization problems. In an optimization problem, we search for the best of many possible combinations. Optimization problems include scheduling challenges, such as “Should I ship this package on this truck or the next one?” or “What is the most efficient route a traveling salesperson should take to visit different cities?”

Physics can help solve these sorts of problems because we can frame them as energy minimization problems. A fundamental rule of physics is that everything tends to seek a minimum energy state. Objects slide down hills; hot things cool down over time. This behavior is also true in the world of quantum physics. Quantum annealing simply uses quantum physics to find low-energy states of a problem and therefore the optimal or near-optimal combination of elements.

Sampling problems. Sampling from many low-energy states and characterizing the shape of the energy landscape is useful for machine learning problems where we want to build a probabilistic model of reality. The samples give us information about the model state for a given set of parameters, which can then be used to improve the model.

Probabilistic models explicitly handle uncertainty by accounting for gaps in our knowledge and errors in data sources. Probability distributions represent the unobserved quantities in a model (including noise effects) and how they relate to the data. The distribution of the data is approximated based on a finite set of samples. The model infers from the observed data, and learning occurs as it transforms the prior distribution, defined before observing the data, into the posterior distribution, defined afterward. If the training process is successful, the learned distribution resembles the distribution that generated the data, allowing predictions to be made on unobserved data. For example, when training on the famous MNIST dataset of handwritten digits, such a model can generate images resembling handwritten digits that are consistent with the training set.

Sampling from energy-based distributions is a computationally intensive task that is an excellent match for the way that the D-Wave system solves problems; that is, by seeking low-energy states.

Discrete (Combinatorial) Optimization Problems and CSPs

Discrete optimization, also known as combinatorial optimization, is the optimization of an objective function defined over a set of discrete values such as Booleans.

The D-Wave system can be viewed as a hardware heuristic that minimizes Ising objective functions using a physically realized version of quantum annealing. The Reformulating a Problem chapter provides techniques to reformulate various problems as Ising (or QUBO) objective functions.
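To make the objective concrete, the following sketch (illustrative Python, not a D-Wave library) enumerates all spin states of a small Ising objective \(E(\vc s) = \sum_i h_i s_i + \sum_{i<j} J_{ij} s_i s_j\) and returns the minimum-energy state. This is the same objective the QPU minimizes heuristically; here it is solved by brute force for a toy instance:

```python
from itertools import product

def ising_energy(s, h, J):
    """Energy of spin configuration s (dict var -> +/-1) for biases h and couplings J."""
    return (sum(h[v] * s[v] for v in h)
            + sum(J[(u, v)] * s[u] * s[v] for (u, v) in J))

def brute_force_ground_state(h, J):
    """Enumerate all 2^N spin states; return (lowest-energy state, its energy)."""
    variables = list(h)
    best = None
    for spins in product((-1, 1), repeat=len(variables)):
        s = dict(zip(variables, spins))
        e = ising_energy(s, h, J)
        if best is None or e < best[1]:
            best = (s, e)
    return best

# Illustrative 2-variable problem: the coupling J01 = -1 rewards aligned spins.
h = {0: 0.1, 1: 0.1}
J = {(0, 1): -1.0}
state, energy = brute_force_ground_state(h, J)
```

Brute-force enumeration scales as \(2^N\) and is feasible only for tiny instances; it is shown here simply to make explicit the objective being minimized.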

Further Information

  • Getting Started with Ocean Software examples such as map coloring.
  • [Jue2016] discusses quantum annealing for Boolean satisfiability problems.
  • [Koc2004] and related papers show that QUBOs are an effective representation for modeling and solving a variety of discrete optimization problems.
  • [Bar1982] shows that Ising and QUBO problems are NP-hard.
  • [Luc2013] discusses Ising formulations of NP problems.

Job-Shop Scheduling

The job-shop scheduling problem is to maximize priority or minimize schedule length (known as the makespan, the time interval between starting the first job and finishing the last) of multiple jobs done on several machines. A job is an ordered sequence of tasks performed on particular machines, subject to the constraints that a machine executes one task at a time and must complete any task it starts.

Workflow Steps

  1. State this problem as a CSP instance (the Background chapter gives introductory examples to this step).
  2. Translate integer variables to binary (the Simple Map Coloring Problem section gives an example).
  3. Reformulate this CSP instance as a QUBO (using the technique described in the CSP Reformulation with Penalty Functions section, for example, if you are scheduling for priority maximization).[1]
  4. If the problem is too large to fit on the QPU, decompose the problem (using methods such as those in the Decomposing a Large Problem chapter).[2]
  5. Follow the remaining steps of the general workflow of Table 5.
[1]Scheduling for makespan minimization needs either ancillary variables, also discussed in the Reformulating a Problem chapter, or a hybrid approach, such as discussed in the Decomposing a Large Problem chapter.
[2]Typically, recasting from integer variables creates a large number of binary variables, many of which represent impossible combinations and so are unnecessary. For this type of problem, decomposition is likely needed.
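Step 2 above (translating integer variables to binary) is commonly done with a one-hot encoding: an integer start time taking one of \(T\) values becomes \(T\) binary variables, with a penalty term forcing exactly one of them to be set. A minimal sketch (illustrative names, not a D-Wave API):

```python
from itertools import product

def one_hot_penalty(bits):
    """Penalty (sum(bits) - 1)^2: zero iff exactly one bit is set."""
    return (sum(bits) - 1) ** 2

# A task that may start at time 0, 1, 2, or 3 becomes four binary
# variables x_t; the penalty selects assignments with exactly one start time.
valid = [x for x in product((0, 1), repeat=4) if one_hot_penalty(x) == 0]
```

Expanding \((\sum_t x_t - 1)^2\) yields linear and quadratic terms that add directly into the problem QUBO; assignments with no start time, or more than one, incur a positive penalty.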

Further Information

  • Getting Started with Ocean Software example Constrained Scheduling.
  • [Ven2015] describes an implementation of job-shop scheduling on the D-Wave system, which includes formulating the problem, translating to QUBO, and applying variable reduction techniques. It also talks about direct embedding of local constraints.

Fault Diagnosis

Fault diagnosis attempts to quickly localize failures as soon as they are detected in systems such as sensor networks, process monitoring, and safety monitoring. Circuit fault diagnosis attempts to identify failed gates during manufacturing, under the assumption that gate failure is rare enough that the minimum number of gates failing is the most likely cause of the detected problem.

Additional Problems

  • Protein Folding

    Protein folding refers to the way protein chains structure themselves in the context of providing some biological function. Although their constituent amino acids enable multiple configurations, proteins rarely misfold (misfolded proteins are a cause of disease) because the standard configuration has lower energy and so is more stable.

    • [Per2012] discusses using the D-Wave system to find the lowest-energy configuration for a folded protein.
  • Traffic-Flow Optimization

    One form of the traffic-flow optimization problem is to minimize the travel time of a group of vehicles from their sources to destinations by minimizing congestion on the roads being used.

    • [Flo2017] describes work done by Volkswagen to map a traffic-flow optimization problem on the D-Wave system.
  • Factoring

    The factoring problem is to decompose a number into its factors. There is no known method to quickly factor large integers—the complexity of this problem has made it the basis of public-key cryptography algorithms.

  • Portfolio Optimization

    Portfolio optimization is the problem of optimizing the allocation of a budget to a set of financial assets.

    • [Els2017] discusses using Markowitz’s optimization of the financial portfolio selection problem on the D-Wave system.
    • [Ros2016a] discusses solving a portfolio optimization problem on the D-Wave system.
  • Database Query Minimization (SAT Filters)

    A satisfiability (SAT) filter is a small data structure that enables fast querying over a huge dataset by allowing for false positives (but not false negatives).

    • [Dou2015] discusses uses of SAT filters with a quantum annealer.
    • [Wea2014] describes the SAT filter.
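As an illustration of casting one of these problems as energy minimization, the factoring problem can be framed as minimizing \((N - pq)^2\) over binary encodings of \(p\) and \(q\): the objective reaches zero energy exactly at a valid factorization. A brute-force sketch (illustrative only, not a QPU formulation, which is more involved):

```python
from itertools import product

def factor_by_minimization(N, bits=4):
    """Minimize the energy (N - p*q)^2 over all bit assignments of p and q."""
    best = None
    for pb in product((0, 1), repeat=bits):
        for qb in product((0, 1), repeat=bits):
            p = sum(b << i for i, b in enumerate(pb))
            q = sum(b << i for i, b in enumerate(qb))
            energy = (N - p * q) ** 2
            if best is None or energy < best[0]:
                best = (energy, p, q)
    return best

energy, p, q = factor_by_minimization(15)  # zero energy means p * q == 15
```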

Machine Learning Problems

Machine learning algorithms operate by constructing a model with parameters that can be learned from a large amount of example input so that the trained model can make predictions about unseen data. Most implementations to-date have been based on deterministic machine learning models such as feed-forward neural networks. The real world, however, is nondeterministic and filled with uncertainty. Probabilistic models explicitly handle this uncertainty by accounting for gaps in our knowledge and errors in data sources.

In probabilistic models, probability distributions represent the unobserved quantities in a model (including noise effects) and how they relate to the data. The distribution of the data is approximated based on a finite set of samples. The model infers from the observed data, and learning occurs as it transforms the prior distributions, defined before observing the data, into posterior distributions, defined afterward. If the training process is successful, the learned distribution resembles the actual distribution of the data to the extent that the model can make correct predictions about unseen situations—correctly interpreting a previously unseen handwritten digit, for example.

A Boltzmann distribution is a discrete energy-based distribution that defines a probability \(p\) for each discrete state of a binary vector. For a set of \(N\) binary random variables, \(\vc{x}\), the Boltzmann distribution defines a probability for each possible state that \(\vc{x}\) can take using[3]

\begin{equation} p(\vc{x}) = \frac{1}{Z} \exp(-E(\vc{x};\theta)), \end{equation}
[3]\(\beta\) is omitted from this equation because, in the context of machine learning, it is usually assumed to be 1.

where \(E(\vc{x};\theta)\) is an energy function parameterized by \(\theta\) and

\begin{equation} Z = \sum_{\vc{x}}{\exp(-E(\vc{x};\theta))} \end{equation}

is the normalizing coefficient, also known as the partition function, that ensures that \(p(\vc{x})\) sums to 1 over all the possible states of \(\vc{x}\); that is,

\begin{equation} \sum_{\vc{x}} p(\vc{x}) = 1. \end{equation}

Note that because of the negative sign for energy, \(E\), the states with high probability correspond to states with low energy.

The energy function \(E(\vc{x};\theta)\) can be represented as a QUBO: the linear coefficients bias the probability of individual binary variables in \(\vc{x}\) and the quadratic coefficients represent the correlation weights between the elements of \(\vc{x}\). The D-Wave architecture, which natively processes information through the Ising/QUBO models (linear coefficients are represented by qubit biases and quadratic coefficients by coupler strengths), can help discrete energy-based machine learning.
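The equations above can be made concrete with a short sketch (illustrative Python, standard library only) that computes the Boltzmann probabilities of a two-variable QUBO by enumeration, including the partition function \(Z\):

```python
from itertools import product
from math import exp

def qubo_energy(x, Q):
    """QUBO energy: linear terms on the diagonal of Q, quadratic terms off-diagonal."""
    return sum(Q[(i, j)] * x[i] * x[j] for (i, j) in Q)

def boltzmann(Q, n):
    """Boltzmann probabilities p(x) = exp(-E(x)) / Z over all states of n binary variables."""
    states = list(product((0, 1), repeat=n))
    weights = [exp(-qubo_energy(x, Q)) for x in states]
    Z = sum(weights)  # partition function
    return {x: w / Z for x, w in zip(states, weights)}

# Illustrative 2-variable QUBO: negative linear biases favor x0 = x1 = 1.
Q = {(0, 0): -1.0, (1, 1): -1.0, (0, 1): 0.5}
p = boltzmann(Q, 2)
```

As the equations require, the probabilities sum to 1, and the lowest-energy state (here \(x_0 = x_1 = 1\), with energy \(-1.5\)) receives the highest probability.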

Further Information

  • [Bia2010] discusses using quantum annealing for machine learning applications in two modes of operation: zero-temperature for optimization and finite-temperature for sampling.

Probabilistic Sampling: RBM

A Boltzmann machine is a stochastic recurrent neural network that defines a probability distribution over binary visible and hidden variables. A restricted Boltzmann machine (RBM) has no connections among the visible variables or among the hidden variables; connections run only between the two layers.

An RBM is shown in Fig. 48: a two-layer neural net with visible units V0 through V3 and hidden units H0 through H2, with connections between the layers but none between units in the same layer.

Fig. 48 Bipartite neural net, with a layer of visible variables connected to a layer of hidden variables.
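For a visible vector \(\vc v\) and hidden vector \(\vc h\), an RBM's energy is commonly written \(E(\vc v, \vc h) = -\vc a^T \vc v - \vc b^T \vc h - \vc v^T W \vc h\), with biases \(\vc a, \vc b\) and inter-layer weights \(W\); because the graph is bipartite, \(W\) carries all the couplings. A minimal sketch with illustrative parameters (the values below are arbitrary, for demonstration only):

```python
def rbm_energy(v, h, a, b, W):
    """RBM energy E(v, h) = -a.v - b.h - v^T W h for binary visible v and hidden h."""
    return (-sum(a[i] * v[i] for i in range(len(v)))
            - sum(b[j] * h[j] for j in range(len(h)))
            - sum(W[i][j] * v[i] * h[j]
                  for i in range(len(v)) for j in range(len(h))))

# Illustrative parameters for 4 visible and 3 hidden units, as in the figure.
a = [0.1, -0.2, 0.0, 0.3]
b = [0.05, 0.0, -0.1]
W = [[0.2, -0.1, 0.0],
     [0.0, 0.3, -0.2],
     [0.1, 0.0, 0.0],
     [-0.3, 0.2, 0.1]]
e = rbm_energy([1, 0, 1, 1], [0, 1, 1], a, b, W)
```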

Further Information

  • [Sal2007] describes RBMs used to model tabular data, such as users’ ratings of movies.
  • [Dwave2] provides an example of an RBM.
  • [Hin2012] is a tutorial on training RBMs.
  • [Ben2017] discusses sampling on the D-Wave system.
  • [Ada2015] describes implementing an RBM on the D-Wave system to generate samples for estimating model expectations of deep neural networks.

Energy-Based Models

Machine learning with energy-based models (EBMs) trains a model by assigning low scalar energy to the configurations of variables that best capture the dependencies in the data; the framework covers both probabilistic and nonprobabilistic models.

Workflow Steps

  1. For an RBM used as a generative model, for example, the gradient needed to maximize the log-likelihood of the data is intractable (because of the partition function in the energy objective function); instead of standard Gibbs sampling, use samples from the D-Wave system. The training will have steps like these:
    1. Initialize variables.
    2. Teach visible nodes with training samples.
    3. Sample from the D-Wave system.
    4. Update and repeat as needed.
  2. For embedding, it may be possible to map variables (of the reduced problem) directly to qubits (e.g., pixels to qubits for image inputs), as shown in [Ben2017].
  3. Follow the remaining steps of the general workflow of Table 5.
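The training loop above can be sketched for a small, fully visible energy-based model (a simplification of the RBM case: with no hidden units, the log-likelihood gradient is exactly the data correlations minus the model correlations). The negative-phase expectation is where D-Wave samples would enter; here a brute-force Boltzmann average stands in for the sampler:

```python
from itertools import product
from math import exp

def energy(x, W):
    """Pairwise energy over binary variables; couplings in the upper triangle of W."""
    return -sum(W[i][j] * x[i] * x[j]
                for i in range(len(x)) for j in range(i + 1, len(x)))

def model_expectation(W, n):
    """Exact <x_i x_j> under p(x) = exp(-E(x))/Z (stand-in for QPU samples)."""
    states = list(product((0, 1), repeat=n))
    weights = [exp(-energy(x, W)) for x in states]
    Z = sum(weights)
    return [[sum(w / Z * x[i] * x[j] for x, w in zip(states, weights))
             for j in range(n)] for i in range(n)]

def train_step(W, data, lr=0.5):
    """Gradient ascent on log-likelihood: data correlations minus model correlations."""
    n = len(W)
    neg = model_expectation(W, n)
    for i in range(n):
        for j in range(i + 1, n):
            pos = sum(x[i] * x[j] for x in data) / len(data)
            W[i][j] += lr * (pos - neg[i][j])
    return W

n = 3
W = [[0.0] * n for _ in range(n)]
data = [(1, 1, 0), (1, 1, 1)]  # illustrative training set; x0 and x1 are correlated
for _ in range(20):
    W = train_step(W, data)
```

After training, the coupling between the correlated variables \(x_0\) and \(x_1\) is the strongest, reflecting the data's dependency structure.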

Further Information

  • [Lec2006] describes EBMs.
  • [Dum2013] discusses implementing an RBM using physical computation.

Support Vector Machines

Support vector machines (SVMs) find a hyperplane that separates data into classes with the maximum margin to each class. Structured support vector machines (SSVMs) assume structure in the output labels; for example, a beach in a picture increases the chance that the picture is of a sunset.

Workflow Steps

  1. Initialize \(\vc w=0\) where \(\vc w=[w_1,w_2,...]^T\) are weights to be learned from the training data.

  2. Repeat until \(\vc w\) converges:

    a. Minimize a regularized objective function for each training example, recording the optimal solutions \(\hat{y}\), where \(\hat{y}_i\) is the maximizer of

    \[\max_{\mathbf{y}\in\{0,1\}^{L}} \left\{\Delta(\mathbf{y},\mathbf{y}_{i})+ \langle\mathbf{w},\mathbf{\Psi}(\mathbf{x}_{i},\mathbf{y}) -\mathbf{\Psi}(\mathbf{x}_{i},\mathbf{y}_{i})\rangle\right\}\]

    with features \(\Psi(\mathbf{x},\mathbf{y})\) defining a compatibility function \(f(\mathbf{x},\mathbf{y}) = \langle\mathbf{w},\Psi(\mathbf{x},\mathbf{y})\rangle\) on inputs \(\vc x\) and labels \(\vc y\).

    Your objective function might look something like this:

    \[F(w)=\frac{1}{d}\sum_{i=1}^{d}\left[\max_{y\in\{0,1\}^{L}}\left\{\Delta(y,y_{i})+\langle w,\Psi(x_{i},y)-\Psi(x_{i},y_{i})\rangle\right\}\right]+\lambda\frac{||w||^{2}}{2}\]

    where \(\lambda\) is a regularization term that prefers simple models (avoiding over-fitting). \(F(w)\) can be minimized using a subgradient such as:

    \[\partial_{w}F(w)= \frac{1}{d}\sum_{i=1}^{d} \left[\Psi(x_{i},\hat{y})-\Psi(x_{i},y_{i})\right] +\lambda w.\]

    Solve the inner maximization, reformulated as a QUBO (see the Reformulating a Problem chapter), on the D-Wave system; its maximizers \(\hat{y}\) supply the feature terms of this subgradient.

    b. Iterate over variables: calculate \(\Delta w_j=\eta \cdot (y_i - \hat{y}_i)x_j\), where \(\eta\) is the learning rate, and set the new weight \(w_j \longleftarrow w_j + \Delta w_j\).
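The two steps above can be sketched for a tiny multi-label problem. The feature map \(\Psi\), the data, and the learning parameters below are illustrative assumptions; on realistically large label spaces, the inner loss-augmented maximization is the piece reformulated as a QUBO for the D-Wave system, while here it is solved by enumeration:

```python
from itertools import product

def psi(x, y):
    """Joint feature map: for each label bit y_l, include y_l * x (an illustrative choice)."""
    return [yl * xk for yl in y for xk in x]

def hamming(y, yi):
    return sum(a != b for a, b in zip(y, yi))

def loss_augmented_argmax(w, x, yi, L):
    """max over y of Delta(y, yi) + <w, psi(x, y) - psi(x, yi)>, by enumeration."""
    base = psi(x, yi)
    def score(y):
        f = psi(x, y)
        return hamming(y, yi) + sum(wk * (fk - bk) for wk, fk, bk in zip(w, f, base))
    return max(product((0, 1), repeat=L), key=score)

def subgradient_step(w, data, L, lr=0.1, lam=0.01):
    """One step on F's subgradient: mean feature difference plus lambda * w."""
    g = [lam * wk for wk in w]
    for x, yi in data:
        y_hat = loss_augmented_argmax(w, x, yi, L)
        for k, (fk, bk) in enumerate(zip(psi(x, y_hat), psi(x, yi))):
            g[k] += (fk - bk) / len(data)
    return [wk - lr * gk for wk, gk in zip(w, g)]

L, d = 2, 2                      # 2 labels, 2 input features
data = [((1.0, 0.0), (1, 0)), ((0.0, 1.0), (0, 1))]
w = [0.0] * (L * d)
for _ in range(50):
    w = subgradient_step(w, data, L)
```

After the loop, the compatibility \(f(\vc x, \vc y) = \langle \vc w, \Psi(\vc x, \vc y)\rangle\) scores each training example's true labels above the alternatives.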

Further Information

  • [Dwave1] demonstrates how the D-Wave system can be applied within the context of structured prediction.
  • [Boy2007] gives a concise introduction to subgradient methods.
  • [Wil2019] gives a method to train SVMs on a D-Wave 2000Q, and applies it to data from biology experiments.

Opportunities and Challenges

As noted in the Introduction chapter, quantum computing has the potential to help solve some of the most complex problems that organizations face. However, this cutting-edge technology still presents formidable challenges to potential users. The Technical Description of the D-Wave Quantum Processing Unit guide describes many of these in detail. They include integrated control errors (ICE), such as neighbor interference between qubits, flux noise, quantization errors in the configured values of biases and coupler strengths, errors due to the finite bandwidth of the system’s I/O, and physical variations between qubits. Additionally, system performance may be affected by temperature, high-energy photons, and other factors.

This is typical of young, ground-breaking technologies, and persistence in overcoming such early-stage problems has frequently been rewarded. Quantum computing poses some difficult challenges to adoption but offers great opportunities.