First look at PyTorch on Aire
John Hodrien, Patricia Ternes
How does Aire’s cutting-edge hardware stack up? Let me give you a sneak peek: three NVIDIA L40S GPUs on Aire outperformed four V100 GPUs on ARC4 by 29% in sequences per second while using less power! That’s a significant jump, especially given this was just a quick test to get a feel for Aire’s capabilities.
This experiment wasn’t about pushing the limits or showcasing optimal configurations. Instead, it was a simple “kick of the tyres” to check that Aire’s GPUs were functional and that multi-GPU workloads were running smoothly. I picked a straightforward PyTorch benchmark, ran it on both systems and let the results speak for themselves.
Here’s what I found:
- Aire’s three L40S GPUs delivered 300 sequences per second, compared to ARC4’s four V100 GPUs achieving 232 sequences per second.
- Even under load, Aire’s GPU node stayed cooler and consumed less power, demonstrating impressive efficiency.
- The job setup and execution were seamless, a testament to Aire’s well-integrated GPU nodes and updated system design.
If that’s got you excited, let’s dive into the details of how the test was set up and what we learned from this first glance at Aire’s GPU capabilities!
PyTorch Benchmark
For the experiment, I selected a straightforward, easy-to-run multi-GPU benchmark from the PyTorch Benchmarks repository, which could run on both Aire and ARC4.
Aire Experiment
Setting it up was straightforward:
```bash
module add miniforge
conda create -n pytorchbenchmark pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia
conda activate pytorchbenchmark
git clone https://github.com/aime-team/pytorch-benchmarks
cd pytorch-benchmarks
pip3 install -r requirements.txt
```
The benchmark can then be driven by a simple Slurm submission script, `submit.sh`:

```bash
#!/bin/bash
# Ignore the shell's environment
#SBATCH --export=NONE
# Run on a single node, with all 24 cores, and three GPUs, for up to an hour
#SBATCH -N 1 -c 24 -t 1:0:0 -p gpu --gres gpu:3

module add miniforge
conda activate pytorchbenchmark
python3 main.py --num_gpus 3 --model bert-large-uncased --data_name squad --global_batch_size 180 -amp --compile
```
We can then submit the job:

```bash
sbatch submit.sh
```

and see what the results look like (taking a mid-run output snippet):

```
Epoch [1 / 10], Step [490 / 493], Loss: 4.9542, Sequences per second: 300.8
GPU-ID: 0, Temperature: 59 °C, Fan speed: 0%, GPU usage: 100%, Memory used: [44.8/ 45.0] GB
GPU-ID: 1, Temperature: 60 °C, Fan speed: 0%, GPU usage: 100%, Memory used: [44.8/ 45.0] GB
GPU-ID: 2, Temperature: 61 °C, Fan speed: 0%, GPU usage: 100%, Memory used: [44.8/ 45.0] GB
```
Comparing Aire with ARC4
To understand how Aire’s GPUs compare with ARC4, I ran a similar test on ARC4 using its older V100 GPUs, adjusting the batch size (from 180 to 150) to fit their smaller memory:
```bash
python3 main.py --num_gpus 4 --model bert-large-uncased --data_name squad --global_batch_size 150 -amp --compile
```
```
Epoch [1 / 10], Step [590 / 599], Loss: 4.9348, Sequences per second: 232.2
GPU-ID: 0, Temperature: 72 °C, Fan speed: 0%, GPU usage: 99%, Memory used: [30.3/ 32.0] GB
GPU-ID: 1, Temperature: 61 °C, Fan speed: 0%, GPU usage: 99%, Memory used: [30.3/ 32.0] GB
GPU-ID: 2, Temperature: 60 °C, Fan speed: 0%, GPU usage: 99%, Memory used: [30.3/ 32.0] GB
GPU-ID: 3, Temperature: 69 °C, Fan speed: 0%, GPU usage: 98%, Memory used: [30.3/ 32.0] GB
```
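As a quick sanity check on the headline figure, the speed-up can be computed directly from the sequences-per-second values in the two output snippets:

```python
# Speed-up from the two mid-run snippets quoted in this post.
aire_sps = 300.8   # 3x L40S on Aire
arc4_sps = 232.2   # 4x V100 on ARC4

speedup = (aire_sps - arc4_sps) / arc4_sps
print(f"Aire is {speedup:.1%} faster")  # → Aire is 29.5% faster
```

Rounded, this is the roughly 29% improvement quoted above.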
Power Consumption
I used `nvidia-smi` to monitor power consumption during the runs. This tool provides real-time information on GPU usage, including power draw, temperatures, and memory usage. Aire's L40S GPUs demonstrated a headline peak power draw of 1050 W, compared with a 1200 W peak for ARC4's V100 GPUs.
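For anyone wanting to capture similar numbers on their own jobs, here is a minimal sketch of sampling and totalling per-GPU power draw via `nvidia-smi`'s CSV query output. The readings below are made-up illustrative values, not measurements from these runs:

```python
# Sketch: parse one sample of
#   nvidia-smi --query-gpu=index,power.draw --format=csv,noheader,nounits
# Run it in a loop (e.g. with `-l 1`) to sample throughout a job.
# The readings below are illustrative, not measured values.
sample = """0, 348.2
1, 351.7
2, 349.9"""

readings = {}
for line in sample.splitlines():
    idx, watts = line.split(",")
    readings[int(idx)] = float(watts)

total_w = sum(readings.values())
print(f"Total draw across {len(readings)} GPUs: {total_w:.1f} W")
```

Summing the samples at each tick gives the node-level GPU draw; the peak over a run is the figure reported above.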
Highlights
- Aire’s three NVIDIA L40S GPUs outperformed ARC4’s four V100 GPUs, delivering a 29% improvement in sequences per second while consuming less power. The L40S GPUs have a peak power draw of 1050W compared to the V100’s 1200W.
- The EPYC CPUs in Aire’s GPU nodes also consume less power than ARC4’s CPUs, further improving performance per watt. Future tests may explore additional power-saving strategies.
- The L40S GPU node demonstrated excellent stability and temperature management under load, with a design that minimises power use while maintaining performance.
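Taking the throughput and headline peak-power figures above at face value, the performance-per-watt gap can be put into numbers. This is an illustrative back-of-the-envelope calculation from the quoted figures, not a measured efficiency result:

```python
# Performance per watt, using the throughput and peak-power
# figures quoted in this post.
aire_sps, aire_peak_w = 300.8, 1050   # 3x L40S on Aire
arc4_sps, arc4_peak_w = 232.2, 1200   # 4x V100 on ARC4

aire_ppw = aire_sps / aire_peak_w
arc4_ppw = arc4_sps / arc4_peak_w
print(f"Aire:  {aire_ppw:.3f} sequences/s per W")
print(f"ARC4:  {arc4_ppw:.3f} sequences/s per W")
print(f"Ratio: {aire_ppw / arc4_ppw:.2f}x")
```

On these figures, Aire delivers roughly half as much again per watt as ARC4 at peak draw.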
Conclusion: Aire’s Performance, Efficiency, and Sustainability
These results showcase Aire’s exciting potential to revolutionise GPU-accelerated computing for research in Leeds. In this simple test, Aire’s NVIDIA L40S GPU node delivered a 29% performance improvement over ARC4’s V100 GPU node, even while using fewer GPUs. It achieved this while consuming less power and generating less heat, making Aire not just faster but also more efficient.
This efficiency is a step towards greater sustainability in research computing. By delivering more performance per watt and reducing cooling demands, Aire helps minimise the environmental footprint of computational research. As workloads grow and energy consumption becomes an increasingly pressing concern, systems like Aire demonstrate how advanced technology can balance performance and sustainability.
Authors
John Hodrien
Research Software Engineer
Patricia Ternes
Research Software Engineer Manager