AqAOA: GPU-Accelerated QAOA Simulator
AqAOA: A Practical Framework for Fast QAOA Simulations
The Quantum Approximate Optimization Algorithm (QAOA) is one of the most promising approaches for solving combinatorial optimization problems on quantum hardware. Despite rapid progress in quantum technology, today’s devices remain limited by availability, noise, and execution stability. For both researchers and developers, simulation remains an essential tool for prototyping, benchmarking, exploring the QAOA’s potential, and validating algorithmic strategies.
Existing state-of-the-art quantum circuit simulation frameworks suffer from long execution times or lack comprehensive functionality, usability, and versatility. General-purpose frameworks like Qiskit or Pennylane offer broad functionality, but typically fall short in performance and efficiency when applied to QAOA’s structured circuits and iterative optimization loops. Specialized QAOA tools exist, but often prioritize niche use cases and provide limited support for the full range of workflows needed in modern QAOA research.
To address this gap, we introduce AqAOA — a high-performance, CUDA-accelerated QAOA simulator designed from the ground up for single-GPU systems. Written in C/C++, Rust, and CUDA, and leveraging NVIDIA’s cuStateVec library, AqAOA delivers significant speedups over existing frameworks.
This blog post introduces the key concepts behind the algorithm, which builds upon the groundwork established in our previous paper on AqAOA (formerly: CUAOA). As a first proof-of-concept, we benchmarked the algorithm on the Max-Cut problem to demonstrate its effectiveness. Upcoming features will expand its capabilities further, including support for constraint-aware optimization, as described in our work on FlexQAOA.
Why Speed Matters in QAOA
Large-scale, noise-free simulations of QAOA are essential for exploring algorithm performance and benchmarking new ideas. But as circuits increase in qubit count and depth, simulation quickly becomes a computational bottleneck.
While frameworks like Qiskit and Pennylane provide valuable tooling for a wide range of quantum algorithms, they lack optimizations that are critical for efficient QAOA simulation — particularly for large circuits and iterative workflows. Specialized simulators such as QOKit address some of these gaps with optimizations like diagonal cost operator precomputation. However, many advanced features needed for practical QAOA development are still missing, such as GPU-native gradient computation, in-place sampling, and efficient handling of deep circuit layers. These capabilities become essential when exploring new problem encodings, constraint-aware mixers, or large parameter sweeps, where simulations must be executed repeatedly across varying configurations. Without a fast and flexible simulation backend, this type of experimentation becomes prohibitively slow and limits progress.
This matters because simulation isn’t just about running a single QAOA circuit. In realistic use cases, you’re often optimizing variational parameters over many iterations. Each step requires recomputing expectation values and evaluating gradients. These tasks are repeated dozens or hundreds of times, making performance a key factor — not just in runtime, but in how quickly you can experiment and iterate.
AqAOA was built with this in mind. The next section takes a closer look at how it achieves this performance through a tightly optimized architecture.
How AqAOA Works
AqAOA is built as a high-performance, single-GPU simulation framework tailored specifically to the structure of QAOA circuits. Its architecture avoids the general-purpose overhead of full quantum simulators by implementing all performance-critical components directly in CUDA, with a supporting infrastructure written in C++ and Rust, and exposed to Python for usability.
Cost Function Representation
In standard QAOA simulation, the cost operator is represented by quantum gates modeling the objective function, which needs to be applied in every layer of the QAOA circuit. However, since the cost operator is diagonal in most practical applications (as a result of basis encoding), it can be applied more efficiently by directly multiplying the statevector with a diagonal matrix. In AqAOA, the diagonalized cost Hamiltonian is computed once and reused across all layers, reducing the amount of computation required. AqAOA performs this operation entirely on the GPU, allowing full parallelization.
To improve on previous implementations like QOKit, AqAOA uses a 0-1 based representation of binary variables and expresses the objective function in a way that significantly reduces the number of required additions.
The cost function is represented as a polynomial:

f(x) = \sum_{k} c_k \left( x \land_b \left[\sum_{v_i \in t_k} 2^i\right]_2 \Leftrightarrow \left[\sum_{v_i \in t_k} 2^i\right]_2 \right)

where each term t_k includes a subset of variable indices v_i \in [n], c_k is the coefficient of term t_k, and \Leftrightarrow denotes the logical equivalence, returning 1 or 0 depending on whether the condition is satisfied. The binary string \left[\sum_{v_i \in t_k} 2^i \right]_2 can be efficiently computed using bitwise logical AND operations (denoted \land_b) with the one-hot encoded form of each term t_k. Since, for any given bitstring x, only the terms whose variables are all set in x contribute, a large number of terms can be skipped during computation.
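As an illustration, the term-skipping evaluation described above can be sketched in Python. This is a simplified, CPU-side model of what AqAOA's CUDA kernels do in parallel over all basis states; the function name and data layout are hypothetical, not AqAOA's actual API:

```python
def evaluate_cost(x: int, terms: list[tuple[float, int]]) -> float:
    """Evaluate the polynomial cost f(x) for one basis state.

    x     -- basis state encoded as an integer bitstring
    terms -- (coefficient, mask) pairs, where mask is the one-hot
             encoding of the variable indices appearing in the term
    """
    total = 0.0
    for coeff, mask in terms:
        # The term contributes only if every one of its variables is
        # set in x, i.e. the bitwise AND reproduces the mask.
        if x & mask == mask:
            total += coeff
    return total

# Example: f(x) = 2*x0*x1 + 3*x2 as (coefficient, mask) pairs
terms = [(2.0, 0b011), (3.0, 0b100)]
```

Skipping terms whose mask is not fully contained in x is exactly the shortcut that lets cost functions with many low-degree terms be evaluated with far fewer additions than a dense term-by-term pass.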
This shortcut leads to substantial speedups — especially for cost functions made up of many low-degree terms, which are common in practice. Unlike QOKit, which evaluates all terms for all basis states, AqAOA avoids unnecessary computation by skipping zero-contributing terms.
While further acceleration is possible by exploiting problem-specific symmetries (such as repeated cost values in MaxCut), AqAOA intentionally avoids such optimizations to remain problem-agnostic.
QAOA Circuit Simulation
AqAOA starts by allocating memory for the statevector and the diagonalized cost Hamiltonian directly on the GPU.
The statevector is stored as an array of complex values and initialized in parallel. For standard QAOA with an X-mixer, each entry is set to \frac{1}{\sqrt{2^n}}, corresponding to the equal superposition state, where n is the number of qubits used in the simulation.
The cost Hamiltonian is represented as a double-precision array containing the precomputed cost values for all computational basis states. These values are calculated in parallel on the GPU using the efficient, bitwise cost function described earlier.
To apply the cost unitary, each amplitude of the statevector |\psi\rangle is multiplied by a phase factor computed using Euler's formula \exp(i\theta) = \cos{\theta} + i \sin{\theta}, resulting in the transformation \psi_i \mapsto \left[\cos(-\gamma_i f(x)) + i \sin(-\gamma_i f(x)) \right] \cdot \psi_i. The variational parameter \gamma_i is passed as an input to this operation.
Although the cuStateVec library provides a method for applying a diagonal matrix, it does not support in-place multiplication with the \gamma parameter. AqAOA avoids this inefficiency by performing the operation directly in a custom kernel, reducing the number of kernel calls.
The mixer unitary is applied using cuStateVec's built-in functionality to apply an R_X(-2 \beta_i) gate to each qubit.
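Putting the two unitaries together, a single QAOA layer can be modeled on the CPU with NumPy. This is a reference sketch of the math only, with hypothetical helper names; AqAOA performs these steps in fused CUDA kernels and via cuStateVec:

```python
import numpy as np

def rx_on_qubit(psi: np.ndarray, theta: float, q: int) -> np.ndarray:
    """Apply R_X(theta) = exp(-i*theta*X/2) to qubit q of a statevector."""
    c = np.cos(theta / 2)
    s = -1j * np.sin(theta / 2)
    flipped = np.arange(len(psi)) ^ (1 << q)  # indices with qubit q flipped
    return c * psi + s * psi[flipped]

def qaoa_layer(psi: np.ndarray, cost: np.ndarray,
               gamma: float, beta: float) -> np.ndarray:
    """One QAOA layer: diagonal cost phase followed by the X-mixer."""
    n = int(np.log2(len(psi)))
    # Cost unitary: multiply each amplitude by exp(-i*gamma*f(x)).
    psi = psi * np.exp(-1j * gamma * cost)
    # Mixer unitary: R_X(-2*beta) on every qubit, matching the convention above.
    for q in range(n):
        psi = rx_on_qubit(psi, -2 * beta, q)
    return psi

# Equal superposition over 2 qubits and a toy diagonal cost vector
psi = np.full(4, 0.5, dtype=complex)
cost = np.array([0.0, 1.0, 1.0, 0.0])
psi = qaoa_layer(psi, cost, gamma=0.5, beta=0.3)
```

Since both unitaries only read and write the statevector array, the whole layer maps naturally onto per-amplitude GPU threads.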
Gradient Computation
To compute the gradients of the QAOA circuit with respect to its variational parameters \gamma and \beta, AqAOA uses a QAOA-specific simplification of the adjoint differentiation method — a state-of-the-art technique for gradient evaluation in classical quantum circuit simulation.
The implementation takes advantage of the well-known identity \frac{\partial}{\partial t}e^{tA} = A e^{tA}, which directly implies that:

\frac{\partial}{\partial \gamma_i} e^{-i \gamma_i H_C} = -i H_C \, e^{-i \gamma_i H_C} \quad \text{and} \quad \frac{\partial}{\partial \beta_i} e^{-i \beta_i H_M} = -i H_M \, e^{-i \beta_i H_M}.
Since the cost Hamiltonian H_C is diagonal, its application is trivial. For the mixer Hamiltonian H_M, the operator reduces to a layer of X-gates, making the simulation especially efficient.
The adjoint method requires uncomputing the QAOA circuit during the backward pass of the gradient calculation. Thanks to the identity above and the fact that both H_C and H_M are Hermitian (i.e., H_C^{\dagger} = H_C and H_M^{\dagger} = H_M), the uncomputation simply amounts to applying i H_C or i H_M, respectively. AqAOA simplifies this further by removing the imaginary unit i from the numerical routines, switching from real to imaginary parts in the final expressions (as detailed in Sec. II-B of the corresponding paper).
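For a diagonal cost Hamiltonian, the derivative identity above can be checked numerically. The sketch below uses made-up toy values (not AqAOA code) and compares a central finite difference against the analytic derivative -i H_C e^{-i\gamma H_C}|\psi\rangle:

```python
import numpy as np

# Toy diagonal cost Hamiltonian on 2 qubits and an equal-superposition state
cost = np.array([0.0, 1.0, 1.0, 0.0])
psi0 = np.full(4, 0.5, dtype=complex)

def state(gamma: float) -> np.ndarray:
    # exp(-i*gamma*H_C)|psi0>; H_C is diagonal, so this is an
    # elementwise product rather than a matrix exponential.
    return np.exp(-1j * gamma * cost) * psi0

gamma, eps = 0.7, 1e-6
numeric = (state(gamma + eps) - state(gamma - eps)) / (2 * eps)
analytic = -1j * cost * state(gamma)
```

The two vectors agree to numerical precision, which is what makes the backward pass of the adjoint method so cheap for QAOA: differentiating a layer only requires one extra elementwise multiplication by the diagonal.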
Benchmarking the MaxCut Problem
The MaxCut problem is a classic challenge in combinatorial optimization, with wide applications in circuit design, image segmentation, statistical physics, and network analysis. Given an undirected graph G = (V, E) with weighted edges, the goal is to divide the set of vertices V into two disjoint subsets such that the total weight of the edges crossing between them is maximized.
In QAOA, the MaxCut objective is encoded as a cost Hamiltonian. Assigning binary values x_i \in \{0, 1\} to each vertex v_i, the cost function for the standard MaxCut problem is:

f(x) = \sum_{(i,j) \in E} w_{ij} \left( x_i + x_j - 2 x_i x_j \right)

where w_{ij} are the edge weights. In the special case where all weights are equal to 1, the problem becomes the unweighted MaxCut.
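Under this 0-1 encoding, x_i + x_j - 2 x_i x_j is simply x_i XOR x_j, so the full cost diagonal can be precomputed with bitwise operations. A NumPy sketch with a hypothetical helper (not AqAOA's API):

```python
import numpy as np

def maxcut_cost_vector(n: int,
                       edges: list[tuple[int, int, float]]) -> np.ndarray:
    """Precompute the MaxCut objective f(x) for all 2^n basis states."""
    x = np.arange(2 ** n)
    cost = np.zeros(2 ** n)
    for i, j, w in edges:
        xi = (x >> i) & 1  # value of vertex i in each bitstring
        xj = (x >> j) & 1  # value of vertex j in each bitstring
        cost += w * (xi ^ xj)  # x_i + x_j - 2*x_i*x_j == x_i XOR x_j
    return cost

# Unweighted triangle: the best cut isolates one vertex, cutting 2 edges
triangle = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0)]
cost = maxcut_cost_vector(3, triangle)
```

Each basis state is handled independently, which is why this precomputation parallelizes so well on the GPU.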
For extended variants (e.g., weighted MaxCut or node-weighted MaxCut), diagonal terms are included in the objective:

f(x) = \sum_{(i,j) \in E} w_{ij} \left( x_i + x_j - 2 x_i x_j \right) + \sum_{i \in V} w_i x_i

where w_i = w_{ii} are optional vertex weights.
To evaluate the performance of AqAOA on this problem class, we compared it against the QAOA implementations in Qiskit and Pennylane and the specialized QAOA simulator QOKit, measuring the runtime required for expectation value computation, sampling, and gradient-based parameter training. The experiments focused on three key runtime evaluations:
- computing the exact expectation value,
- sampling from the statevector,
- and computing gradients.
Runtime for expectation value computation with p=6
AqAOA achieves the fastest execution times when computing expectation values of QAOA circuits, significantly outperforming QOKit, Qiskit, and Pennylane across all tested problem sizes. For small to mid-sized instances (up to ~20 qubits), AqAOA is orders of magnitude faster, thanks to its optimized, CUDA-native implementation. The simulator maintains efficiency until around 16 qubits, where exponential runtime growth begins to appear. The primary bottleneck is the sequential application of mixer gates, which becomes dominant as the number of qubits increases due to the lack of parallelism in full-state updates.
Runtime for sampling with 1024 shots and p=6.
Sampling performance was tested using QAOA circuits at depth p=6 with 1024 shots. AqAOA once again showed high efficiency, with runtimes closely matching those for expectation value evaluation. While Qiskit also performs reasonably well in sampling, Pennylane shows degraded performance, particularly for smaller problem sizes, likely due to differences in internal sampling implementations. For comparison, a minimal Python-based workaround was used to simulate sampling in QOKit, highlighting its limitations: as the number of qubits increases, the CPU-based approach becomes uncompetitive due to the exponential growth of the statevector.
Runtime for gradient calculation using the adjoint differentiation method with p=6.
To benchmark gradient-based optimization workflows, AqAOA was compared against Pennylane — the only baseline supporting adjoint differentiation in this setup. While both simulators use the same underlying technique, AqAOA incorporates QAOA-specific optimizations that significantly reduce runtime. For problem sizes up to 18 qubits, AqAOA is consistently up to 100x faster, and still maintains a 10x speedup for larger instances. Additional tests with optimizers such as L-BFGS and BFGS confirm this trend, with AqAOA delivering up to two orders of magnitude faster performance during full training runs.
For a comprehensive overview of the theoretical foundations and experimental results, see our publication on AqAOA (formerly CUAOA) here.
Availability
AqAOA is available both in our Luna platform, for plans supporting GPU resources, and as open-source software on GitHub.
In addition to Luna, AqAOA can be launched directly from the AWS Marketplace via an Amazon Machine Image (AMI). This setup includes all required dependencies, GPU drivers, and the CUDA environment, making it ideal for running high-performance QAOA simulations out of the box — no manual setup required.
For developers and researchers looking to dive into the code or customize the simulator, AqAOA is also fully open-source and hosted on GitHub. The repository includes build instructions, example scripts, and API bindings for Python and Rust. Whether you’re experimenting with QAOA circuits, integrating simulation into larger pipelines, or building on top of the existing implementation, the codebase is designed to be both modular and extensible.
Looking Ahead
While AqAOA already delivers state-of-the-art performance on single-GPU systems, several future directions will extend its capabilities even further. One step will be expanding support for multi-GPU setups, enabling the simulation of even larger circuits through distributed memory and parallel execution — without changing the core logic of the implementation.
Another coming addition is the support of constraint-aware optimization, combining the architecture with our FlexQAOA algorithm. By implementing mixers that preserve problem-specific constraints directly, AqAOA can drastically reduce the effective simulation space. This will not only improve performance for heavily constrained problems, but also open the door to new classes of QAOA applications beyond standard encodings.
Conclusion
AqAOA is a high-performance QAOA simulator built for modern GPU hardware. It leverages the diagonal structure of QAOA’s cost operator, provides efficient expectation value and gradient computation, and includes GPU-native support for both forward execution and parameter training.
Across all benchmarks, AqAOA consistently outperforms existing tools like QOKit, Qiskit, and Pennylane — achieving up to 100x speedups in gradient evaluation and 10x improvements in circuit execution and sampling. Thanks to its performance and feature set, AqAOA sets a new baseline for single-GPU QAOA simulation.
The simulator is integrated into our Luna platform, available on GitHub, and offered via pre-configured images on the AWS Marketplace for immediate use.
About Aqarios
Aqarios, headquartered in Munich, Germany, is a leading provider of quantum computing solutions across industries such as energy, aerospace, logistics, finance, manufacturing and many more. The company delivers advanced quantum software that focuses on optimization, machine learning, and simulation, offering practical tools that address critical business challenges. Aqarios has collaborated with globally recognized organizations to deliver tailored quantum solutions that drive efficiency and innovation.
Founded in 2021 by three professors and seasoned business professionals, Aqarios is a spin-off from the QAR-Lab at LMU Munich, a globally renowned hub for quantum computing research that ranks among the world’s top quantum computing institutes. With nearly a decade of experience in quantum applications, Aqarios is at the forefront of quantum innovation, leveraging its deep expertise to bridge the gap between theoretical quantum research and real-world applications.