AqAOA: GPU-Accelerated QAOA Simulator
AqAOA: A Practical Framework for Fast QAOA Simulations
The Quantum Approximate Optimization Algorithm (QAOA) is one of the most promising approaches for solving combinatorial optimization problems on quantum hardware. Despite rapid progress in quantum technology, today’s devices remain limited by availability, noise, and execution stability. For both researchers and developers, simulation remains an essential tool for prototyping, benchmarking, exploring the QAOA’s potential, and validating algorithmic strategies.
Existing state-of-the-art quantum circuit simulation frameworks suffer from long execution times or lack comprehensive functionality, usability, and versatility. General-purpose frameworks like Qiskit or Pennylane offer broad functionality, but typically fall short in performance and efficiency when applied to QAOA’s structured circuits and iterative optimization loops. Specialized QAOA tools exist, but often prioritize niche use cases and provide limited support for the full range of workflows needed in modern QAOA research.
To address this gap, we introduce AqAOA — a high-performance, CUDA-accelerated QAOA simulator designed from the ground up for single-GPU systems. Written in C/C++, Rust, and CUDA, and leveraging NVIDIA’s cuStateVec library, AqAOA delivers significant speedups over existing frameworks.
This blog post introduces the key concepts behind the algorithm, which builds upon the groundwork established in our previous paper on AqAOA (formerly: CUAOA). As a first proof-of-concept, we benchmarked the algorithm on the Max-Cut problem to demonstrate its effectiveness. Upcoming features will expand its capabilities further, including support for constraint-aware optimization, as described in our work on FlexQAOA.
Why Speed Matters in QAOA
Large-scale, noise-free simulations of QAOA are essential for exploring algorithm performance and benchmarking new ideas. But as circuits increase in qubit count and depth, simulation quickly becomes a computational bottleneck.
While frameworks like Qiskit and Pennylane provide valuable tooling for a wide range of quantum algorithms, they lack optimizations that are critical for efficient QAOA simulation — particularly for large circuits and iterative workflows. Specialized simulators such as QOKit address some of these gaps with optimizations like diagonal cost operator precomputation. However, many advanced features needed for practical QAOA development are still missing, such as GPU-native gradient computation, in-place sampling, and efficient handling of deep circuit layers. These capabilities become essential when exploring new problem encodings, constraint-aware mixers, or large parameter sweeps, where simulations must be executed repeatedly across varying configurations. Without a fast and flexible simulation backend, this type of experimentation becomes prohibitively slow and limits progress.
This matters because simulation isn’t just about running a single QAOA circuit. In realistic use cases, you’re often optimizing variational parameters over many iterations. Each step requires recomputing expectation values and evaluating gradients. These tasks are repeated dozens or hundreds of times, making performance a key factor — not just in runtime, but in how quickly you can experiment and iterate.
AqAOA was built with this in mind. The next section takes a closer look at how it achieves this performance through a tightly optimized architecture.
How AqAOA Works
AqAOA is built as a high-performance, single-GPU simulation framework tailored specifically to the structure of QAOA circuits. Its architecture avoids the general-purpose overhead of full quantum simulators by implementing all performance-critical components directly in CUDA, with a supporting infrastructure written in C++ and Rust, and exposed to Python for usability.
Cost Function Representation
In standard QAOA simulation, the cost operator is represented by quantum gates modeling the objective function, which needs to be applied in every layer of the QAOA circuit. However, since the cost operator is diagonal in most practical applications (as a result of basis encoding), it can be applied more efficiently by directly multiplying the statevector with a diagonal matrix. In AqAOA, the diagonalized cost Hamiltonian is computed once and reused across all layers, reducing the amount of computation required. AqAOA performs this operation entirely on the GPU, allowing full parallelization.
To improve on previous implementations like QOKit, AqAOA uses a 0-1 based representation of binary variables and expresses the objective function in a way that significantly reduces the number of required additions.
The cost function is represented as a polynomial:

f(x) = \sum_{k} c_k \left( x \land_b \left[\sum_{v_i \in t_k} 2^i\right]_2 \Leftrightarrow \left[\sum_{v_i \in t_k} 2^i\right]_2 \right)

where each term t_k includes a subset of variable indices v_i \in [n], c_k is the coefficient of term t_k, and \Leftrightarrow denotes the logical equivalence, returning 1 or 0 depending on whether the condition is satisfied. The binary string \left[\sum_{v_i \in t_k} 2^i \right]_2 can be efficiently computed using bitwise logical AND operations (denoted \land_b) with the one-hot encoded form of each term t_k. Since, for any given bitstring x, only the terms whose variables are all set in x contribute, a large number of terms can be skipped during computation.
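As an illustration, the term-skipping evaluation described above can be sketched in Python. This is a simplified, CPU-side model of what AqAOA's CUDA kernels do in parallel over all basis states; the function name and data layout are hypothetical, not AqAOA's actual API:

```python
def evaluate_cost(x: int, terms: list[tuple[float, int]]) -> float:
    """Evaluate the polynomial cost f(x) for one basis state.

    x     -- basis state encoded as an integer bitstring
    terms -- (coefficient, mask) pairs, where mask is the one-hot
             encoding of the variable indices appearing in the term
    """
    total = 0.0
    for coeff, mask in terms:
        # The term contributes only if every one of its variables is
        # set in x, i.e. the bitwise AND reproduces the mask.
        if x & mask == mask:
            total += coeff
    return total

# Example: f(x) = 2*x0*x1 + 3*x2 as (coefficient, mask) pairs
terms = [(2.0, 0b011), (3.0, 0b100)]
```

Skipping terms whose mask is not fully contained in x is exactly the shortcut that lets cost functions with many low-degree terms be evaluated with far fewer additions than a dense term-by-term pass.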
This shortcut leads to substantial speedups — especially for cost functions made up of many low-degree terms, which are common in practice. Unlike QOKit, which evaluates all terms for all basis states, AqAOA avoids unnecessary computation by skipping zero-contributing terms.
While further acceleration is possible by exploiting problem-specific symmetries (such as repeated cost values in MaxCut), AqAOA intentionally avoids such optimizations to remain problem-agnostic.
QAOA Circuit Simulation
AqAOA starts by allocating memory for the statevector and the diagonalized cost Hamiltonian directly on the GPU.
The statevector is stored as an array of complex values and initialized in parallel. For standard QAOA with an X-mixer, each entry is set to \frac{1}{\sqrt{2^n}}, corresponding to the equal superposition state, where n is the number of qubits used in the simulation.
The cost Hamiltonian is represented as a double-precision array containing the precomputed cost values for all computational basis states. These values are calculated in parallel on the GPU using the efficient, bitwise cost function described earlier.
To apply the cost unitary, each amplitude of the statevector |\psi\rangle is multiplied by a phase factor computed using Euler's formula \exp(i\theta) = \cos{\theta} + i \sin{\theta}, resulting in the transformation \psi_i \mapsto \left[\cos(-\gamma_i f(x)) + i \sin(-\gamma_i f(x)) \right] \cdot \psi_i. The variational parameter \gamma_i is passed as an input to this operation.
Although the cuStateVec library provides a method for applying a diagonal matrix, it does not support in-place multiplication with the \gamma parameter. AqAOA avoids this inefficiency by performing the operation directly in a custom kernel, reducing the number of kernel calls.
The mixer unitary is applied using cuStateVec's built-in functionality to apply an R_X(-2 \beta_i) gate to each qubit.
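Putting the two unitaries together, a single QAOA layer can be modeled on the CPU with NumPy. This is a reference sketch of the math only, with hypothetical helper names; AqAOA performs these steps in fused CUDA kernels and via cuStateVec:

```python
import numpy as np

def rx_on_qubit(psi: np.ndarray, theta: float, q: int) -> np.ndarray:
    """Apply R_X(theta) = exp(-i*theta*X/2) to qubit q of a statevector."""
    c = np.cos(theta / 2)
    s = -1j * np.sin(theta / 2)
    flipped = np.arange(len(psi)) ^ (1 << q)  # indices with qubit q flipped
    return c * psi + s * psi[flipped]

def qaoa_layer(psi: np.ndarray, cost: np.ndarray,
               gamma: float, beta: float) -> np.ndarray:
    """One QAOA layer: diagonal cost phase followed by the X-mixer."""
    n = int(np.log2(len(psi)))
    # Cost unitary: multiply each amplitude by exp(-i*gamma*f(x)).
    psi = psi * np.exp(-1j * gamma * cost)
    # Mixer unitary: R_X(-2*beta) on every qubit, matching the convention above.
    for q in range(n):
        psi = rx_on_qubit(psi, -2 * beta, q)
    return psi

# Equal superposition over 2 qubits and a toy diagonal cost vector
psi = np.full(4, 0.5, dtype=complex)
cost = np.array([0.0, 1.0, 1.0, 0.0])
psi = qaoa_layer(psi, cost, gamma=0.5, beta=0.3)
```

Since both unitaries only read and write the statevector array, the whole layer maps naturally onto per-amplitude GPU threads.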
Gradient Computation
To compute the gradients of the QAOA circuit with respect to its variational parameters \gamma and \beta, AqAOA uses a QAOA-specific simplification of the adjoint differentiation method — a state-of-the-art technique for gradient evaluation in classical quantum circuit simulation.
The implementation takes advantage of the well-known identity \frac{\partial}{\partial t}e^{tA} = A e^{tA}, which directly implies that:

\frac{\partial}{\partial \gamma_i} e^{-i \gamma_i H_C} = -i H_C \, e^{-i \gamma_i H_C} \quad \text{and} \quad \frac{\partial}{\partial \beta_i} e^{-i \beta_i H_M} = -i H_M \, e^{-i \beta_i H_M}.
Since the cost Hamiltonian H_C is diagonal, its application is trivial. For the mixer Hamiltonian H_M, the operator reduces to a layer of X-gates, making the simulation especially efficient.
The adjoint method requires uncomputing the QAOA circuit during the backward pass of the gradient calculation. Thanks to the identity above and the fact that both H_C and H_M are Hermitian (i.e., H_C^{\dagger} = H_C and H_M^{\dagger} = H_M), the uncomputation simply amounts to applying i H_C or i H_M, respectively. AqAOA simplifies this further by removing the imaginary unit i from the numerical routines, switching from real to imaginary parts in the final expressions (as detailed in Sec. II-B of the corresponding paper).
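For a diagonal cost Hamiltonian, the derivative identity above can be checked numerically. The sketch below uses made-up toy values (not AqAOA code) and compares a central finite difference against the analytic derivative -i H_C e^{-i\gamma H_C}|\psi\rangle:

```python
import numpy as np

# Toy diagonal cost Hamiltonian on 2 qubits and an equal-superposition state
cost = np.array([0.0, 1.0, 1.0, 0.0])
psi0 = np.full(4, 0.5, dtype=complex)

def state(gamma: float) -> np.ndarray:
    # exp(-i*gamma*H_C)|psi0>; H_C is diagonal, so this is an
    # elementwise product rather than a matrix exponential.
    return np.exp(-1j * gamma * cost) * psi0

gamma, eps = 0.7, 1e-6
numeric = (state(gamma + eps) - state(gamma - eps)) / (2 * eps)
analytic = -1j * cost * state(gamma)
```

The two vectors agree to numerical precision, which is what makes the backward pass of the adjoint method so cheap for QAOA: differentiating a layer only requires one extra elementwise multiplication by the diagonal.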
Benchmarking the MaxCut Problem
The MaxCut problem is a classic challenge in combinatorial optimization, with wide applications in circuit design, image segmentation, statistical physics, and network analysis. Given an undirected graph G = (V, E) with weighted edges, the goal is to divide the set of vertices V into two disjoint subsets such that the total weight of the edges crossing between them is maximized.
In QAOA, the MaxCut objective is encoded as a cost Hamiltonian. Assigning binary values x_i \in \{0, 1\} to each vertex v_i, the cost function for the standard MaxCut problem is:

f(x) = \sum_{(i,j) \in E} w_{ij} \left( x_i + x_j - 2 x_i x_j \right)

where w_{ij} are the edge weights. In the special case where all weights are equal to 1, the problem becomes the unweighted MaxCut.
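Under this 0-1 encoding, x_i + x_j - 2 x_i x_j is simply x_i XOR x_j, so the full cost diagonal can be precomputed with bitwise operations. A NumPy sketch with a hypothetical helper (not AqAOA's API):

```python
import numpy as np

def maxcut_cost_vector(n: int,
                       edges: list[tuple[int, int, float]]) -> np.ndarray:
    """Precompute the MaxCut objective f(x) for all 2^n basis states."""
    x = np.arange(2 ** n)
    cost = np.zeros(2 ** n)
    for i, j, w in edges:
        xi = (x >> i) & 1  # value of vertex i in each bitstring
        xj = (x >> j) & 1  # value of vertex j in each bitstring
        cost += w * (xi ^ xj)  # x_i + x_j - 2*x_i*x_j == x_i XOR x_j
    return cost

# Unweighted triangle: the best cut isolates one vertex, cutting 2 edges
triangle = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0)]
cost = maxcut_cost_vector(3, triangle)
```

Each basis state is handled independently, which is why this precomputation parallelizes so well on the GPU.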
For extended variants (e.g., weighted MaxCut or node-weighted MaxCut), diagonal terms are included in the objective:

f(x) = \sum_{(i,j) \in E} w_{ij} \left( x_i + x_j - 2 x_i x_j \right) + \sum_{i \in V} w_i x_i

where w_i = w_{ii} are optional vertex weights.
To evaluate the performance of AqAOA on this problem class, we compared it against the QAOA implementations in Qiskit and Pennylane and the specialized QAOA simulator QOKit, measuring the runtime required for expectation value computation, sampling, and gradient-based parameter training. The experiments focused on three key runtime evaluations:
- computing the exact expectation value,
- sampling from the statevector,
- and computing gradients.
Runtime for expectation value computation with p=6
AqAOA achieves the fastest execution times when computing expectation values of QAOA circuits, significantly outperforming QOKit, Qiskit, and Pennylane across all tested problem sizes. For small to mid-sized instances (up to ~20 qubits), AqAOA is orders of magnitude faster, thanks to its optimized, CUDA-native implementation. The simulator maintains efficiency until around 16 qubits, where exponential runtime growth begins to appear. The primary bottleneck is the sequential application of mixer gates, which becomes dominant as the number of qubits increases due to the lack of parallelism in full-state updates.
Runtime for sampling with 1024 shots and p=6.
Sampling performance was tested using QAOA circuits at depth p=6 with 1024 shots. AqAOA once again showed high efficiency, with runtimes closely matching those for expectation value evaluation. While Qiskit also performs reasonably well in sampling, Pennylane shows degraded performance, particularly for smaller problem sizes, likely due to differences in internal sampling implementations. For comparison, a minimal Python-based workaround was used to simulate sampling in QOKit, highlighting its limitations: as the number of qubits increases, the CPU-based approach becomes uncompetitive due to the exponential growth of the statevector.
Runtime for gradient calculation using the adjoint differentiation method with p=6.
To benchmark gradient-based optimization workflows, AqAOA was compared against Pennylane — the only baseline supporting adjoint differentiation in this setup. While both simulators use the same underlying technique, AqAOA incorporates QAOA-specific optimizations that significantly reduce runtime. For problem sizes up to 18 qubits, AqAOA is consistently up to 100x faster, and still maintains a 10x speedup for larger instances. Additional tests with optimizers such as L-BFGS and BFGS confirm this trend, with AqAOA delivering up to two orders of magnitude faster performance during full training runs.
For a comprehensive overview of the theoretical foundations and experimental results, see our publication on AqAOA (formerly CUAOA) here.
Availability
AqAOA is available both in our Luna platform, for plans supporting GPU resources, and as open-source software on GitHub.
In addition to Luna, AqAOA can be launched directly from the AWS Marketplace via an Amazon Machine Image (AMI). This setup includes all required dependencies, GPU drivers, and the CUDA environment, making it ideal for running high-performance QAOA simulations out of the box — no manual setup required.
For developers and researchers looking to dive into the code or customize the simulator, AqAOA is also fully open-source and hosted on GitHub. The repository includes build instructions, example scripts, and API bindings for Python and Rust. Whether you’re experimenting with QAOA circuits, integrating simulation into larger pipelines, or building on top of the existing implementation, the codebase is designed to be both modular and extensible.
Looking Ahead
While AqAOA already delivers state-of-the-art performance on single-GPU systems, several future directions will extend its capabilities even further. One step will be expanding support for multi-GPU setups, enabling the simulation of even larger circuits through distributed memory and parallel execution — without changing the core logic of the implementation.
Another coming addition is the support of constraint-aware optimization, combining the architecture with our FlexQAOA algorithm. By implementing mixers that preserve problem-specific constraints directly, AqAOA can drastically reduce the effective simulation space. This will not only improve performance for heavily constrained problems, but also open the door to new classes of QAOA applications beyond standard encodings.
Conclusion
AqAOA is a high-performance QAOA simulator built for modern GPU hardware. It leverages the diagonal structure of QAOA’s cost operator, provides efficient expectation value and gradient computation, and includes GPU-native support for both forward execution and parameter training.
Across all benchmarks, AqAOA consistently outperforms existing tools like QOKit, Qiskit, and Pennylane — achieving up to 100x speedups in gradient evaluation and 10x improvements in circuit execution and sampling. Thanks to its performance and feature set, AqAOA sets a new baseline for single-GPU QAOA simulation.
The simulator is integrated into our Luna platform, available on GitHub, and offered via pre-configured images on the AWS Marketplace for immediate use.
About Aqarios
Aqarios, headquartered in Munich, Germany, is a leading provider of quantum computing solutions across industries such as energy, aerospace, logistics, finance, manufacturing and many more. The company delivers advanced quantum software that focuses on optimization, machine learning, and simulation, offering practical tools that address critical business challenges. Aqarios has collaborated with globally recognized organizations to deliver tailored quantum solutions that drive efficiency and innovation.
Founded in 2021 by three professors and seasoned business professionals, Aqarios is a spin-off from the QAR-Lab at LMU Munich, a globally renowned hub for quantum computing research that ranks among the world’s top quantum computing institutes. With nearly a decade of experience in quantum applications, Aqarios is at the forefront of quantum innovation, leveraging its deep expertise to bridge the gap between theoretical quantum research and real-world applications.