Exploring the Power of Aire: A First Look at Performance Gains
Patricia Ternes
Guess what? I’m the first person to get my hands on Aire, our brand-new HPC system, and it’s already blowing my mind! 😊 In less than a day, I ran a quick test using parallel Python—and the performance jump compared to our old system is incredible! Though these initial tests are just scratching the surface—with workloads that only take seconds to complete—I'm already seeing speedups that show Aire’s potential for far larger, more complex simulations.
To give you a “sneak peek,” I re-ran a familiar code (A Journey into Parallel Python) on Aire, and even with only 40 cores (the same number I used on ARC4), the results were significantly faster. When I scaled up to Aire’s full 168-core capacity, the difference was stunning: a 76% reduction in running time. Keep in mind that this is a quick, early glimpse, and these results use only a small part of Aire’s capabilities, meaning the numbers you’re about to see are just the tip of the iceberg. Features like the new NVMe Lustre Scratch system, multi-node scalability, and GPU nodes are waiting to be explored in the future. The team is ramping up tests to unlock Aire’s full potential, and exciting discoveries are ahead. We’ll keep you posted!

[Figure: Percentage of time reduction for each test type. Each test is compared with the equivalent ARC4 test, except the 168-core test, which is compared with the best ARC4 time.]
Now, for the exciting part! Below are some of the highlights, showing how Aire handled different approaches to parallel Python, from plain multiprocessing to chunked parallelism 😊.
The Prime-Finding Challenge
To explore Aire’s capabilities, we used a prime-finding problem that’s ideal for testing different parallel approaches. The task involves searching for prime numbers across a range of integers. Those interested in more technical details can refer to our earlier blog post on parallel Python, but here’s a brief overview of this task.
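For context, the heart of the task looks something like the sketch below. This is purely illustrative (the names is_prime and RANGE_END are mine); the actual code lives in the earlier post.

import math

def is_prime(n):
    # Trial division: composites usually fail fast on a small divisor,
    # while primes cost O(sqrt(n)) checks each.
    if n < 2:
        return False
    return all(n % d for d in range(2, math.isqrt(n) + 1))

# Serial baseline: test every number in the range on a single core.
RANGE_END = 10**6  # the smallest workload in these tests
primes = [n for n in range(2, RANGE_END) if is_prime(n)]
print(f"Found {len(primes)} primes below {RANGE_END}")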
The core challenge lies in the imbalance inherent in prime-finding. For smaller ranges, the work each core performs is more uniform. In this case, a simpler parallel plain approach—where tasks are equally split among cores without further division—works effectively, as there’s less need for specialised load balancing.
However, as we scale up to larger ranges, the imbalance becomes far more pronounced. Some cores will be assigned numbers that quickly yield prime-check results, while others may process numbers that take significantly longer to check. This variability means that some cores may finish early, waiting idly for others to complete. Chunked parallelism addresses this by dividing tasks into smaller, more manageable chunks, allowing us to distribute work more evenly and reduce idle time.
Serial Performance: One Core, Big Difference
It’s often easy to overlook the importance of serial performance in the HPC world, where parallel processing reigns supreme. However, when we compare the serial performance of Aire and ARC4—one core against one core—the difference is surprisingly exciting! With no parallelisation involved, we get a clear view of the improvements simply from the power and efficiency of Aire’s hardware and software upgrades.
| System | Cores | Time (s), 10^6 range | Time (s), 10^7 range | Time (s), 10^8 range |
| --- | --- | --- | --- | --- |
| ARC4 | 1 | 2.88 | 71.11 | 1941.63 |
| Aire | 1 | 2.17 | 51.64 | 1360.24 |
These results are thrilling! 🎉 Just by running a single core on Aire, we’re seeing a 25% improvement on the small workload and a substantial 30% boost on the largest workload (the 10^8 range). This comparison highlights the effect of Aire’s upgraded capabilities. Also, note that our biggest workload is still a small, ~30-minute test (which will drop to a few seconds once parallelised); heavy real-world problems will see a much more pronounced improvement.
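(For reference, the reduction figures quoted here and below are computed as 1 - t_Aire / t_ARC4; for the 10^8 range, 1 - 1360.24 / 1941.63 ≈ 0.30, i.e. a 30% reduction.)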
Key Takeaways:
- Faster Core Performance: With no optimisations or parallelisation, a single core on Aire already performs significantly faster than on ARC4. This improvement means that even for codes that aren’t fully parallel or optimised, researchers will see substantial time savings.
- Enhanced Efficiency: Aire’s performance suggests fewer bottlenecks and a more streamlined system, from memory handling to CPU speed, making tasks that require heavy computation more manageable even without parallel methods.
This single-core improvement isn’t just a technical footnote—it’s a tangible boost that researchers will feel across all sorts of computational tasks. And remember, this is before any parallelisation! As we move into the next section, we’ll dive into Aire’s power in parallel processing, where the real magic happens.
Parallel Plain Performance: Scaling Up with Aire’s Power
After seeing the boost in serial performance, the next step is to see how Aire performs when we start scaling up with parallel processing. Using the same prime-finding code with the parallel plain approach, we distributed the workload across multiple cores without chunking, simply letting each core handle one task at a time. Here’s where Aire’s expanded core count, memory, and upgraded architecture really begin to shine!
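As a rough illustration of the plain approach (a minimal sketch with my own naming, not the exact code from the earlier post), the range is cut into exactly one contiguous slice per core:

import math
from multiprocessing import Pool

N_CORES = 40       # 40 for the like-for-like comparison; 168 for a full Aire node
RANGE_END = 10**8  # the largest workload in these tests

def is_prime(n):
    if n < 2:
        return False
    return all(n % d for d in range(2, math.isqrt(n) + 1))

def count_primes(bounds):
    lo, hi = bounds
    return sum(1 for n in range(lo, hi) if is_prime(n))

if __name__ == "__main__":
    # "Parallel plain": one slice per core, no further subdivision, so a core
    # that draws an easy slice finishes early and sits idle.
    step = RANGE_END // N_CORES
    slices = [(i * step, (i + 1) * step) for i in range(N_CORES)]
    slices[-1] = (slices[-1][0], RANGE_END)  # make the last slice reach the end
    with Pool(N_CORES) as pool:
        total = sum(pool.map(count_primes, slices))
    print(f"{total} primes below {RANGE_END}")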
Results:
| System | Cores | Time (s), 10^6 range | Time (s), 10^7 range | Time (s), 10^8 range |
| --- | --- | --- | --- | --- |
| ARC4 | 40 | 2.41 | 18.56 | 238.96 |
| Aire | 40 | 1.05 | 11.52 | 139.84 |
| Aire | 168 | 1.12 | 11.11 | 121.31 |
These results make it clear: Aire is transforming parallel processing for our workloads! Even at 40 cores, it’s already significantly outperforming ARC4. Then, when we unlock all 168 cores, the time drops even further, allowing Aire to handle even the largest datasets with efficiency ARC4 can’t match.
Highlights:
- 40-Core Comparison: With only 40 cores, Aire cuts execution times by an impressive ~40% compared to ARC4 (and by over 50% for the smallest workload). This reduction demonstrates the impact of Aire’s advanced CPU power and optimised system configuration.
- Scaling Up with 168 Cores: When running with all 168 cores, Aire showcases its full standard-node capacity. For the largest workload, it achieves a time reduction of almost 50% compared to ARC4’s 40-core run. With smaller tasks, the overhead of managing additional cores outweighs the time savings from parallelisation, as expected. This effect is a great reminder that parallel processing is most efficient for larger, more complex workloads.
These parallel plain results highlight that even without further optimisation, Aire’s raw processing power enables faster, more efficient computations. Next, we’ll dive into chunked parallelism, where we can push Aire even further by reducing communication overhead and maximising parallel potential.
Chunked Parallelism: Maximising Aire’s Power through Smarter Task Distribution
After seeing the gains from the parallel plain approach, we turn to chunked parallelism—a method that goes a step further by breaking tasks into manageable chunks for each core. This approach not only reduces time spent in core communication but also balances the workload distribution. And on Aire, chunking shows even more impressive performance boosts!
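In sketch form, the only real change from the plain version is to cut the range into many more slices than cores, so the pool can hand a fresh chunk to whichever core frees up first. Again, this is an illustrative sketch (names are mine), not the exact code from the earlier post:

import math
from multiprocessing import Pool

N_CORES = 40     # size of the worker pool
N_CHUNKS = 160   # one of the chunk counts tested below
RANGE_END = 10**8

def is_prime(n):
    if n < 2:
        return False
    return all(n % d for d in range(2, math.isqrt(n) + 1))

def count_primes(bounds):
    lo, hi = bounds
    return sum(1 for n in range(lo, hi) if is_prime(n))

if __name__ == "__main__":
    # Many small slices, dealt out dynamically: a core that finishes an easy
    # chunk immediately picks up the next one, evening out the load.
    step = RANGE_END // N_CHUNKS
    chunks = [(i * step, (i + 1) * step) for i in range(N_CHUNKS)]
    chunks[-1] = (chunks[-1][0], RANGE_END)
    with Pool(N_CORES) as pool:
        total = sum(pool.map(count_primes, chunks, chunksize=1))
    print(f"{total} primes below {RANGE_END}")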
In this comparison, we tested different chunk sizes on both 40 and 168 cores for the biggest workload (10^8 numbers range). Here’s how Aire stacks up against ARC4, with the percentage improvements discussed in the highlights below.
| System | Cores | Time (s), 40 chunks | Time (s), 160 chunks | Time (s), 400 chunks | Time (s), 4000 chunks |
| --- | --- | --- | --- | --- | --- |
| ARC4 | 40 | 97.32 | 84.70 | 88.95 | 85.67 |
| Aire | 40 | 53.01 | 43.99 | 47.64 | 44.68 |
| Aire | 168 | 51.82 | 20.56 | 20.77 | 25.30 |
Highlights:
- 40-Core Comparison: With 40 cores, Aire shows remarkable improvements over ARC4. Using 160 chunks, Aire processes the largest workload (10^8 numbers) 48% faster than ARC4. This substantial improvement can’t be explained by CPU power alone; recall that the serial test showed only a 30% improvement. Aire’s improved memory access allows each core to retrieve and process data faster, and faster intra-node communication speeds up core-to-core coordination, particularly in managing chunk assignments. Together, these upgrades give Aire an edge in handling large, parallelised tasks even more efficiently.
- Scaling Up with 168 Cores: When we scale up to 168 cores, chunked parallelism shows Aire’s full potential. Using 160 chunks, the largest workload runs 76% faster than the best ARC4 time!
Conclusion: Aire’s Bright Future for Research Computing
These results are incredibly promising and showcase Aire’s potential to transform our computing workflows. With just a few tweaks in job configuration, we’ve seen runtime reductions of up to 76% compared to ARC4. And here’s the best part—this was a simple, “vanilla” test, where we used the exact same code that was previously tested on ARC4. This means that Aire’s powerful improvements are already evident without any further code optimisations.
But we’re only scratching the surface. More advanced tools, like Dask, could further enhance how we manage and distribute tasks, providing even more control and efficiency in handling complex workflows, taking full advantage of Aire's new scheduling system. Aire’s capacity to leverage such frameworks opens up endless possibilities for customisation, ensuring each workload is configured to run as efficiently as possible.
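For a flavour of what that might look like, here is a hedged sketch of the same workload expressed as Dask delayed tasks (illustrative only; we haven’t benchmarked Dask on Aire yet, and all names are mine):

import math
from dask import delayed, compute

N_CORES = 40
N_CHUNKS = 160
RANGE_END = 10**6  # a small range, just to demonstrate the pattern

def is_prime(n):
    return n >= 2 and all(n % d for d in range(2, math.isqrt(n) + 1))

def count_primes(lo, hi):
    return sum(1 for n in range(lo, hi) if is_prime(n))

if __name__ == "__main__":
    step = RANGE_END // N_CHUNKS
    # Each chunk becomes a lazy task; Dask's scheduler decides where it runs.
    tasks = [delayed(count_primes)(i * step, min((i + 1) * step, RANGE_END))
             for i in range(N_CHUNKS)]
    total = sum(compute(*tasks, scheduler="processes", num_workers=N_CORES))
    print(f"{total} primes below {RANGE_END}")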
Additionally, this experiment didn’t even factor in Aire’s 88TB NVMe Lustre scratch system. This high-speed flash storage will enable much faster data access and storage for I/O-intensive tasks, offering a “game-changing” boost for many workflows where rapid access to large datasets is essential. We’re looking forward to seeing how this new storage solution accelerates performance across a range of applications.
And let’s not forget, these tests were done on a single standard node, so there’s even more potential to uncover as we begin running multi-node jobs on Aire. With the enhanced scheduler, OmniPath networking, and increased core count across nodes, multi-node configurations will let us tackle the largest research problems with speed and efficiency like never before. And that’s without mentioning the GPU nodes. With 28 nodes, each equipped with 3 NVIDIA L40S GPUs, the opportunities for high-performance, GPU-accelerated research are immense.
Aire’s performance in these initial tests is just a glimpse into its capabilities. We’re standing at the edge of an exciting future in research computing, where the possibilities for acceleration, innovation, and breakthrough results feel truly limitless. The best is yet to come!
Code and Environment Consistency
All codes and package versions used in this test are identical to those used in our previous ARC4 testing, as detailed in our earlier blog post. The only difference lies in the job submission script, which was updated for compatibility with Aire’s Slurm scheduler. The Slurm script is provided below.
Aire Job Submission
#!/bin/bash
#SBATCH --job-name=primes       # job name shown in the queue
#SBATCH --time=5:00:00          # wall-time limit (hh:mm:ss)
#SBATCH --export=NONE           # do not inherit environment variables from the submitting shell
#SBATCH -n 168                  # request 168 tasks (one core each, i.e. a full standard node)

module add miniforge/24.7.1     # load Miniforge, which provides conda
conda activate parallel-series  # the same environment used in the ARC4 tests
python primes.py                # run the prime-finding benchmark
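To submit, save the script and run sbatch followed by the filename, e.g. sbatch submit_primes.sh (the filename here is just for illustration).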
Author
Patricia Ternes
Research Software Engineer Manager