Very nearly almost not quite 14 years of a university HPC service

I am about to leave the University of Leeds, having been the technical lead designing and running its central High Performance Computing (HPC) systems for just a few days short of 14 years. With my last machine, ARC4, currently in its pilot phase, I’m hoping that people might be interested in hearing more about what we’ve been up to.

When I arrived in 2005, we had ~50 active users (i.e. people actually running jobs) consuming 15 core years a month – which has now grown to around 700 active users consuming 1,000 core years a month. In that time we have helped well over 2,000 researchers from every faculty of our University and beyond who, between them, have used over 80,000 core years of computer time. At our peak, we were using around 0.5 MW of electricity for the IT and cooling.

[Figure: histogram of core hours per month, by machine, since the dawn of time]
[Figure: histogram of active (i.e. job-submitting) users by month and year]

I still find the idea that you can submit a job, let it run for a bit, and then get an automated email from the batch queue system telling you exactly how many days, weeks or even years of computer time you have just used to be a form of magical time travel. But what’s even better is getting to know the people using our services and the things they get up to.

Supporting researchers’ work and feeling like we’re part of your story is a major perk of the job and is the motivation for getting up in the morning and coming into work. In fact, your stories are always the highlight of our day: the industrial bread ovens; the optimisation of airflow in hospital wards to reduce infection; the models of how Leeds’ layout affects the behaviour of burglars; the atmospheric science of the prehistoric era; the searches for genes that increase the risk of cancer; the interstellar gas clouds; the materials science; the machine-aided translation of languages. All of them.

Thanks for being so nice and for letting me be part of it 🙂

Although it is the users, the funders (and the people who convince the funders to fund) who make it all happen, procuring, installing and running ~20 tonnes of HPC equipment plus ~20 tonnes of cooling equipment is a big undertaking involving a lot of people as well as the ARC team, who should all be thanked: the IT Datacentre, User Admin, Network and Finance teams; our colleagues in the Estates and Purchasing departments; the IT and specialist hardware vendors; the integrators specialising in racking, wiring and HPC technologies; the electrical and mechanical engineering companies; the hardware break/fix engineers. The number and rate of large procurements (more than implied in the top graph) has been punishing at times but, as my first boss here used to say, it was a nice problem to have. In addition, HPC as we know it just wouldn’t have happened without all the Free and Open Source software it is built upon (which needs to be put together just so) and the huge number of people who have contributed to it.

Much changed on the journey from HPC being mainly a minority sport used by small groups of interested academics, to being considered an essential tool widely used across many different disciplines. For us, this started in 2009 when the University was persuaded to dig even more deeply into its pockets than it had done before and commit to a substantial capital programme on the “ARC” (Advanced Research Computing) series of machines and a water cooling system – funded by all the Faculties, but free at the point of use for every researcher at the institution. It’s been a fun journey, working at a bigger scale with much more varied and complicated requirements, needing more software, better interconnects, parallel filesystems, more involved scheduling, better ways of working.

We’ve had our share of triumphs and disasters, from our colleagues in Environment running daily weather forecasts on ARC1 during a field trip in Antarctica, so that they knew where to put their plane to collect measurements of the wind coming over a ridge, to the flood directly above the machine room in early 2016 that made its way through the cracks and took out several racks of equipment.

Some characterise the use of HPC as the embodiment of impatience: people want tomorrow’s computer power today. Well, it turned out that, back in 2005, you needed to have a lot of patience to be impatient enough to try to use HPC. You really needed to know what you were doing. It was painful. Since then, we’ve done a lot of work to make these systems (hopefully) usable for as many different types of user as possible, from the hard-core developers to those who want to run packaged commercial software. There does remain the bar of being able to use Linux, but some might say that the experience of learning a Unix-style environment is good for you.

Then, as now, there was safety in numbers – we have been members of several collaborations: the White Rose Grid; the National Grid Service (only a small involvement on my part); the first phase of funding for the DiRAC consortium; and the N8 group of research-intensive universities in the north of England, for whom we ran their shared regional supercomputer (polaris).

Although the people are the most fascinating part of any workplace, I’ll leave you with a few images of past machines. But before I do, I’ll correct a misconception that Snowdon and Everest were named after mountains: they were actually part of the White Rose Grid and named after white roses, a nod to the Yorkshire Rose (between us, Sheffield, York and Leeds Universities got quite good at naming machines after roses – iceberg, chablis, akito, saratoga, pax, popcorn, pascali, etc. – although we did start to run out towards the end):

Good luck to all, and thanks for being you.

Mark Dixon

[Image: Snowdon (2002-2010), rear view]
[Image: Everest (2005-2011)]
[Image: ARC1 and DiRAC compute racks (2009-2017)]
[Image: Polaris (2012-2018) (left) and ARC2 (2013-) (right) infrastructure racks]
[Image: ARC3 (2017-) compute node]