
NVIDIA CUDA-Q Integration

Luna Platform now integrates NVIDIA CUDA-Q and cuQuantum

The Aqarios Luna Platform, a cloud-based solution for solving complex optimization problems with quantum computing, now features the NVIDIA CUDA-Q platform in its latest release. This integration makes CUDA-Q’s advanced algorithms directly accessible to users of the Luna Platform. The platform also now includes a specialized simulator for FlexQAOA, Aqarios’ proprietary constraint-aware quantum optimization algorithm, built on the NVIDIA cuQuantum cuStateVec and cuTensorNet libraries.

CUDA-Q Integration in Luna

NVIDIA CUDA-Q is an open-source, backend-agnostic platform and programming model designed to accelerate hybrid quantum-classical applications across CPUs, GPUs, and QPUs (Quantum Processing Units). By integrating CUDA-Q into Luna, we’re expanding our users’ ability to experiment with gate-based quantum algorithms using CUDA-Q’s high-performance simulation stack — all with the same simple Python interface they already know.

With just a single line of code, users can now switch to the CUDA-Q backend and execute algorithms such as QAOA (Quantum Approximate Optimization Algorithm) on GPU-accelerated simulators.

QAOA is one of the most established hybrid quantum-classical algorithms for solving combinatorial optimization problems. The alternating application of cost and mixer layers prepares a quantum state that encodes high-quality solutions to binary optimization tasks. Thanks to CUDA-Q’s GPU acceleration and LunaSolve’s hybrid orchestration, users can explore large-scale QAOA circuits, tune parameters efficiently, and benchmark quantum performance against classical or hybrid baselines, all within a unified environment.
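To make the alternating cost and mixer layers concrete, here is a minimal NumPy state-vector sketch of single-layer QAOA on a four-node ring Max-Cut instance. It is purely illustrative and does not use the Luna or CUDA-Q APIs; the function names, the example graph, and the coarse grid search over the two angles are assumptions made for this sketch.

```python
import numpy as np

def maxcut_diagonal(n, edges):
    """Cut value of every bitstring, indexed by its integer encoding."""
    z = np.arange(2**n)
    bits = (z[:, None] >> np.arange(n)) & 1   # bits[z, i] = bit i of z
    cut = np.zeros(2**n)
    for i, j in edges:
        cut += bits[:, i] ^ bits[:, j]        # 1 if the edge is cut
    return cut

def apply_x_mixer(state, n, beta):
    """Apply exp(-i * beta * X) to each of the n qubits."""
    rx = np.array([[np.cos(beta), -1j * np.sin(beta)],
                   [-1j * np.sin(beta), np.cos(beta)]])
    psi = state.reshape([2] * n)
    for q in range(n):
        psi = np.moveaxis(np.tensordot(rx, psi, axes=([1], [q])), 0, q)
    return psi.reshape(-1)

def expected_cut(n, edges, gammas, betas):
    """Expected cut value of the QAOA state for the given layer angles."""
    cut = maxcut_diagonal(n, edges)
    state = np.full(2**n, 2 ** (-n / 2), dtype=complex)  # uniform superposition
    for gamma, beta in zip(gammas, betas):
        state = np.exp(1j * gamma * cut) * state   # cost layer: phase per bitstring
        state = apply_x_mixer(state, n, beta)      # mixer layer
    return float(np.sum(np.abs(state) ** 2 * cut))

# Hypothetical example: 4-node ring, one QAOA layer, coarse grid over both angles.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
grid = np.linspace(0, np.pi, 20)
best = max(expected_cut(4, ring, [g], [b]) for g in grid for b in grid)
baseline = expected_cut(4, ring, [0.0], [0.0])     # zero angles = random guessing
print(f"random guess: {baseline:.2f}, best single-layer value on grid: {best:.2f}")
```

In practice the angles are tuned by a classical optimizer rather than a grid, which is the hybrid loop that LunaSolve orchestrates.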

Here’s how to use the CUDA-Q backend in LunaSolve, one of the core services of the Luna Platform for solving complex optimization problems, to solve a maximum-cut problem in just a few lines of code:

[Code listing: How to use the CUDA-Q backend in LunaSolve]

Using CUDA-Q in LunaSolve is straightforward. Setting the backend parameter of an algorithm to backends.CudaqGpu or backends.CudaqCpu activates the CUDA-Q backend for the selected solver. All single-GPU and CPU simulator targets provided by CUDA-Q are accessible through the target parameter. In addition, LunaSolve exposes further parameters for fine-grained configuration of the QAOA algorithm, giving detailed control over optimization behavior and performance. These parameters are described in full in the Luna documentation. Further integrations are already planned, including additional CUDA-QX solvers, which will extend LunaSolve’s capabilities for hybrid quantum-classical optimization in upcoming releases.

Accelerating FlexQAOA through cuQuantum’s Specialized Simulator

While QAOA is a well-established method for quantum optimization, it is inherently limited to unconstrained problem formulations. In contrast, most industrial optimization problems involve numerous and often complex constraints. Reformulating these into unconstrained forms typically increases problem size and complexity, which quickly renders standard QAOA approaches inefficient or impractical.

To address this limitation, we developed FlexQAOA — an advanced variant of QAOA designed specifically for constrained optimization problems. FlexQAOA introduces a more expressive circuit architecture and constraint-preserving mixing strategies, allowing constraints to be handled directly within the quantum circuit rather than through costly reformulations. It is embedded in a dedicated software framework that automatically extracts the constraints of a given optimization problem, constructs the corresponding quantum circuit based on the most suitable encoding approach, and executes the solution process end-to-end.

Within LunaSolve, FlexQAOA benefits from a specialized simulator built on NVIDIA cuQuantum, significantly accelerating performance and enabling the exploration of larger, more realistic optimization instances.

For a detailed technical overview of FlexQAOA, you can refer to our paper “Efficient QAOA Architecture for Solving Multi-Constrained Optimization Problems” or read the accompanying blog article.

FlexQAOA Algorithm Architecture

Fig. 1: Circuit Diagram of the FlexQAOA algorithm. Each inequality constraint is implemented by a QPE-based indicator function. Each one-hot constraint is enforced by an XY-Ring-Mixer.

FlexQAOA employs two dedicated techniques for constraint enforcement and encoding:

  1. XY-Mixers: Specialized mixer unitaries that restrict the quantum state evolution to the feasible subspace defined by one-hot constraints, where the sum over binary variables equals one.
  2. Indicator Functions: Quantum Phase Estimation (QPE) based subroutines map the satisfaction of inequality constraints onto ancillary qubits, enabling step-function-like penalties or conditional cost function application.
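The subspace-preserving property of XY-Mixers can be checked numerically. The following NumPy sketch assumes the ring mixer is built as a sequence of pairwise gates exp(-i·β·(XX + YY)/2) on neighbouring qubits; the helper names are hypothetical and the actual FlexQAOA circuit may be constructed differently. Starting from a one-hot state, all probability mass stays on one-hot bitstrings.

```python
import numpy as np

def xy_gate(beta):
    """exp(-i * beta * (XX + YY) / 2) in the basis |00>, |01>, |10>, |11>."""
    c, s = np.cos(beta), -1j * np.sin(beta)
    return np.array([[1, 0, 0, 0],
                     [0, c, s, 0],
                     [0, s, c, 0],
                     [0, 0, 0, 1]], dtype=complex)

def apply_two_qubit(state, gate, i, j, n):
    """Apply a 4x4 gate to qubits i and j of an n-qubit state vector."""
    psi = state.reshape([2] * n)
    g = gate.reshape(2, 2, 2, 2)                       # (out_i, out_j, in_i, in_j)
    psi = np.tensordot(g, psi, axes=([2, 3], [i, j]))  # new axes come first
    return np.moveaxis(psi, [0, 1], [i, j]).reshape(-1)

def xy_ring_mixer(state, n, beta):
    """Pairwise XY gates on (0,1), (1,2), ..., (n-1,0)."""
    for i in range(n):
        state = apply_two_qubit(state, xy_gate(beta), i, (i + 1) % n, n)
    return state

d = 3                                       # one one-hot group of three variables
state = np.zeros(2**d, dtype=complex)
state[0b100] = 1.0                          # feasible start: exactly one bit set
state = xy_ring_mixer(state, d, beta=0.7)

one_hot = [1 << k for k in range(d)]        # indices with exactly one bit set
feasible_mass = float(np.sum(np.abs(state[one_hot]) ** 2))
print(f"probability on one-hot states: {feasible_mass:.6f}")
```

Because each pairwise gate only exchanges amplitude between |01> and |10>, the number of set bits never changes, which is why infeasible bitstrings keep zero amplitude.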

Figure 1 illustrates the resulting quantum circuit. Beyond the substantial improvements in optimization performance (see this paper or the accompanying blog article), these techniques yield a structured circuit that a dedicated simulation routine can exploit, making the classical simulation more efficient.

FlexQAOA’s Simulation Techniques

Fig. 2: Three-step simulation schema for a single FlexQAOA layer. First, the brute-forced cost function is applied to the state; second, ordinary X-mixers are applied using apply_matrix_batched; third, XY-Mixer matrices are contracted onto the legs of the state tensor with dimension d > 2, which correspond to the one-hot constraints.

Our earlier work, “CUAOA: A Novel CUDA-Accelerated Simulation Framework for the QAOA”, demonstrated that the cost layer of QAOA can be significantly accelerated by pre-computing the diagonal of the Hamiltonian through brute-force evaluation of the optimization objective and reapplying it as a phase in each QAOA layer. This blog article details the inner workings of the diagonal precomputation. Building on this foundation, the two techniques introduced above, XY-Mixers and Indicator Functions (IFs), can be further exploited to enhance simulation efficiency.

XY-Mixers restrict the quantum state evolution to the feasible subspace, which means that infeasible states automatically have zero amplitude. As a result, a one-hot-constrained set of d binary variables does not span the full state space of dimension 2^d, but only a subspace of dimension d. This drastically reduces the size of the state vector and thus improves simulation efficiency.
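The size reduction is easy to quantify. The sketch below compares the two dimensions for a hypothetical layout of six one-hot groups with three variables each (for example, six items assigned to one of three bins). It counts only the one-hot variables plus an optional number of unconstrained qubits, and ignores any ancillary qubits a full circuit would need, so the numbers are illustrative rather than a reproduction of the benchmark setup.

```python
from math import prod

def full_dimension(group_sizes, n_free=0):
    """State-vector length when every binary variable gets its own qubit."""
    return 2 ** (sum(group_sizes) + n_free)

def reduced_dimension(group_sizes, n_free=0):
    """State-tensor size when each one-hot group of size d becomes one leg of dimension d."""
    return prod(group_sizes) * 2 ** n_free

groups = [3] * 6                       # six one-hot groups of three variables
print(full_dimension(groups))          # 2**18 = 262144
print(reduced_dimension(groups))       # 3**6  = 729
```

The gap widens exponentially with the number of groups, since each group contributes a factor of 2^d to the full dimension but only d to the reduced one.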

Similarly, the effect of Indicator Functions — where constraint violations contribute a constant cost — can be incorporated directly into the brute-forcing step of the cost function. This eliminates the need for ancillary qubits and QPE application during simulation, further improving performance.
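As an illustration of folding an indicator function into the brute-forced cost, the following NumPy sketch precomputes a cost diagonal for a small, hypothetical knapsack-style instance: every bitstring receives its objective value, plus a constant penalty if the capacity constraint is violated. The instance data, the penalty value, and the function names are assumptions for this sketch and not part of the Aqarios implementation.

```python
import numpy as np

def penalized_diagonal(weights, values, capacity, penalty):
    """Objective of every bitstring plus a constant penalty where load > capacity."""
    n = len(weights)
    z = np.arange(2**n)
    bits = (z[:, None] >> np.arange(n)) & 1      # bits[z, i] = bit i of z
    objective = -(bits @ np.asarray(values))     # minimize the negative total value
    violated = (bits @ np.asarray(weights)) > capacity
    return objective + penalty * violated        # step-function penalty, no ancillas

def apply_cost_layer(state, diagonal, gamma):
    """Cost layer as an element-wise phase using the precomputed diagonal."""
    return np.exp(-1j * gamma * diagonal) * state

diag = penalized_diagonal(weights=[2, 3, 4], values=[3, 4, 5], capacity=5, penalty=10)
state = np.full(diag.size, diag.size ** -0.5, dtype=complex)
state = apply_cost_layer(state, diag, gamma=0.4)  # the same diagonal is reused in every layer
print(diag.tolist(), int(np.argmin(diag)))
```

The diagonal is computed once and reused in every layer, so the constraint check costs nothing extra during the layer-by-layer simulation.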

Figure 2 illustrates the complete simulation workflow. The implementation relies on NVIDIA cuQuantum, an SDK of GPU-accelerated libraries that speed up quantum circuit simulations. The mixer layer operations for unconstrained qubits are executed using cuStateVec.apply_matrix_batched to accommodate the residual shape of constrained variables. The XY-Mixer matrices are then contracted onto the constrained variable legs of the tensor using cuTensorNet.contract.
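The contraction step can be mimicked on the CPU with NumPy. This is only an analogue of the workflow, not cuQuantum code: np.tensordot stands in for the tensor contraction, and the mixer on a one-hot leg of dimension d is assumed to be the exact exponential of the ring-hopping Hamiltonian restricted to the one-hot basis. All function names here are hypothetical.

```python
import numpy as np

def ring_mixer_matrix(d, beta):
    """exp(-i * beta * A), with A the adjacency matrix of a ring on d one-hot basis states."""
    adj = np.zeros((d, d))
    for k in range(d):
        adj[k, (k + 1) % d] = adj[(k + 1) % d, k] = 1.0
    evals, evecs = np.linalg.eigh(adj)                       # A is real symmetric
    return (evecs * np.exp(-1j * beta * evals)) @ evecs.conj().T

def x_mixer_matrix(beta):
    """exp(-i * beta * X) for an unconstrained qubit."""
    return np.array([[np.cos(beta), -1j * np.sin(beta)],
                     [-1j * np.sin(beta), np.cos(beta)]])

def apply_to_leg(psi, matrix, leg):
    """Contract a (dim x dim) matrix onto one leg of the state tensor."""
    out = np.tensordot(matrix, psi, axes=([1], [leg]))
    return np.moveaxis(out, 0, leg)

# State tensor: two one-hot groups (legs of dimension 3) and one free qubit (dimension 2).
shape = (3, 3, 2)
psi = np.full(shape, np.prod(shape) ** -0.5, dtype=complex)
psi[0, 0, 0] *= -1                      # break the symmetry so the mixers act non-trivially

beta = 0.3
psi = apply_to_leg(psi, x_mixer_matrix(beta), leg=2)         # ordinary X-mixer
for leg in (0, 1):
    psi = apply_to_leg(psi, ring_mixer_matrix(3, beta), leg)  # XY-mixer on one-hot legs

print(psi.shape, float(np.sum(np.abs(psi) ** 2)))
```

Each leg is touched by a small dense matrix, so the cost of a mixer layer scales with the reduced tensor size rather than with the full qubit count.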

Overall, the state vector required for FlexQAOA is substantially smaller than in conventional quantum circuit simulations. This efficiency enabled the simulation of optimization problems with up to 88 binary variables on a single RTX 4090, as reported in our paper “Efficient QAOA Architecture for Solving Multi-Constrained Optimization Problems”.

FlexQAOA in the Luna Platform

To demonstrate the capabilities of FlexQAOA within the Luna Platform, let’s consider a practical example. The Bin-Packing Problem involves packing items into the smallest possible number of bins without exceeding individual bin capacities.

In this example, the problem is defined using the LP format, which specifies all required constraints. It includes one-hot constraints (each item can only be placed in one bin) and inequality constraints (the capacity of each bin must not be exceeded) — making it an ideal use case for FlexQAOA.
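For readers unfamiliar with the LP format, the sketch below generates a small Bin-Packing model as an LP-format string in the common CPLEX style. The variable and constraint names (x_i_b, y_b, assign_i, cap_b) and the instance data are hypothetical, and the exact model used in this example may differ; check the Luna documentation for the LP dialect it accepts.

```python
def bin_packing_lp(weights, n_bins, capacity):
    """Build an LP-format Bin-Packing model: minimize the number of bins used."""
    items, bins = range(len(weights)), range(n_bins)
    lines = ["Minimize",
             " obj: " + " + ".join(f"y_{b}" for b in bins),
             "Subject To"]
    # One-hot constraints: each item is placed in exactly one bin.
    for i in items:
        lines.append(f" assign_{i}: " + " + ".join(f"x_{i}_{b}" for b in bins) + " = 1")
    # Inequality constraints: the load of bin b must not exceed capacity if it is used.
    for b in bins:
        load = " + ".join(f"{weights[i]} x_{i}_{b}" for i in items)
        lines.append(f" cap_{b}: {load} - {capacity} y_{b} <= 0")
    names = [f"x_{i}_{b}" for i in items for b in bins] + [f"y_{b}" for b in bins]
    lines += ["Binary", " " + " ".join(names), "End"]
    return "\n".join(lines)

lp_text = bin_packing_lp(weights=[2, 3], n_bins=2, capacity=5)
print(lp_text)
```

The assign_i rows are the one-hot constraints and the cap_b rows are the inequality constraints described above.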

Once the problem is loaded, the remaining steps simply involve configuring FlexQAOA with the desired parameters and dispatching the optimization task through LunaSolve.

[Code listing: configuring FlexQAOA with the desired parameters]
Choosing backends.AqariosGpu() runs the simulation on GPU using the described simulation procedure.

Benchmarking the Simulation Performance Improvements

Fig. 3: Benchmark of the runtime between ordinary state vector simulation and the FlexQAOA tailored simulator on different sizes of Bin-Packing instances.

To illustrate the impact of the FlexQAOA simulator, we benchmarked our state-vector simulation of FlexQAOA against standard circuit-based simulators. Figure 3 compares the runtime of a conventional state-vector simulation of the Qiskit FlexQAOA circuit (accelerated by cuStateVec) with that of the custom simulator developed by Aqarios, which draws on cuTensorNet, across different configurations of the Bin-Packing Problem.

The results show a clear trend: the simulation speedup increases with problem size. For the problem instance (6, 3), corresponding to six items and three bins, the specialized simulator achieves a runtime improvement of roughly 200× compared to the standard implementation. On an RTX 4090, problem instances larger than (6, 3) can only be simulated using Aqarios’ specialized algorithm due to the steeper state-vector growth in conventional approaches.

Sign up for Luna for Free

The Luna Platform offers a free access tier, allowing you to explore FlexQAOA and the new CUDA-Q integration yourself. You can create an account here to get started.

All CPU backends are available immediately, while the GPU-accelerated backends are included in our premium plans. To learn more about these plans or discuss tailored access options, feel free to contact Aqarios.

For more information, detailed setup instructions, and usage examples, visit the documentation of our Luna platform.

About Aqarios

Aqarios, headquartered in Munich, Germany, is a leading provider of quantum computing solutions across industries such as energy, aerospace, logistics, finance, manufacturing and many more. The company delivers advanced quantum software that focuses on optimization, machine learning, and simulation, offering practical tools that address critical business challenges. Aqarios has collaborated with globally recognized organizations to deliver tailored quantum solutions that drive efficiency and innovation.

Founded in 2021 by three professors and seasoned business professionals, Aqarios is a spin-off from the QAR-Lab at LMU Munich, a globally renowned hub for quantum computing research that ranks among the world’s top quantum computing institutes. With nearly a decade of experience in quantum applications, Aqarios is at the forefront of quantum innovation, leveraging its deep expertise to bridge the gap between theoretical quantum research and real-world applications.
