Very nearly almost not quite 14 years of a university HPC service

I am about to leave the University of Leeds, having been the technical lead designing and running its central High Performance Computing (HPC) systems for just a few days short of 14 years. With my last machine, ARC4, currently in its pilot phase, I’m hoping that people might be interested in hearing more about what we’ve been up to.

When I arrived in 2005, we had ~50 active users (i.e. people actually running jobs) consuming 15 core years a month – which has now grown to around 700 active users consuming 1,000 core years a month. In that time we have helped well over 2,000 researchers from every faculty of our University and beyond who, between them, have used over 80,000 core years of computer time. At our peak, we were using around 0.5 MW of electricity for the IT and cooling.

Histogram plot showing growth of available core hours over time
Number of core hours per month, by machine, since the dawn of time
Histogram showing increasing users over time
Number of active (i.e. job submitting) users by month and year

I still find it a form of magical time travel that you can submit a job, let it run for a bit, and then get an automated email from the batch queue system telling you exactly how many days, weeks or even years of computer time you have just used. But what’s even better is getting to know the people using our services and the things they get up to.

Supporting researchers’ work and feeling like we’re part of your story is a major perk of the job and is the motivation for getting up in the morning and coming into work. In fact, your stories are always the highlight of our day: the industrial bread ovens; the optimisation of airflow in hospital wards to reduce infection; the models of how Leeds’ layout affects the behaviour of burglars; the atmospheric science of the prehistoric era; the searches for genes that increase the risk of cancer; the interstellar gas clouds; the materials science; the machine-aided translation of languages. All of them.

Thanks for being so nice and for letting me be part of it 🙂

Although it is the users, the funders (and the people who convince the funders to fund) who make it all happen, procuring, installing and running ~20 tonnes of HPC equipment plus ~20 tonnes of cooling equipment is a big undertaking involving a lot of people beyond the ARC team, all of whom should be thanked: the IT Datacentre, User Admin, Network and Finance teams; our colleagues in the Estates and Purchasing departments; the IT and specialist hardware vendors; the integrators specialising in racking, wiring and HPC technologies; the electrical and mechanical engineering companies; the hardware break/fix engineers. The number and rate of large procurements (more than implied in the top graph) has been punishing at times but, as my first boss here used to say, it was a nice problem to have. In addition, HPC as we know it just wouldn’t have happened without all the Free and Open Source software that it is built upon – put together just so – and the huge number of people who have contributed to it.

Much changed on the journey from HPC being mainly a minority sport used by small groups of interested academics, to being considered an essential tool widely used across many different disciplines. For us, this started in 2009 when the University was persuaded to dig even more deeply into its pockets than it had done before and commit to a substantial capital programme on the “ARC” (Advanced Research Computing) series of machines and a water cooling system – funded by all the Faculties, but free at the point of use for every researcher at the institution. It’s been a fun journey, working at a bigger scale with much more varied and complicated requirements, needing more software, better interconnects, parallel filesystems, more involved scheduling, better ways of working.

We’ve had our share of triumphs and disasters, from our colleagues in Environment on a field trip in Antarctica running daily weather forecasts on ARC1 so that they knew where to put their plane and collect measurements of the wind coming over a ridge, to when a flood directly above the machine room made its way through the cracks and took out several racks of equipment in early 2016.

Some characterise the use of HPC as the embodiment of impatience: people want tomorrow’s computer power today. Well, it turned out that, back in 2005, you needed to have a lot of patience to be impatient enough to try to use HPC. You really needed to know what you were doing. It was painful. Since then, we’ve done a lot of work to make these systems (hopefully) usable by as many different types of user as possible, from the hard-core developers to those who want to run packaged commercial software. There does remain the bar of being able to use Linux, but some might say that the experience of learning a Unix-style environment is good for you.

Then, as now, there was safety in numbers – we have been members of several collaborations: the White Rose Grid; the National Grid Service (only a small involvement on my part); the first phase of funding for the DiRAC consortium; and the N8 group of research-intensive universities in the north of England, for which we ran its shared regional supercomputer (polaris).

Although the people are the most fascinating part of any workplace, I’ll leave you with a few images of past machines. But before I do, I’ll correct a misconception that Snowdon and Everest were named after mountains: they were actually part of the White Rose Grid and named after white roses, a nod to the Yorkshire Rose (between us, Sheffield, York and Leeds Universities got quite good at naming machines after roses – iceberg, chablis, akito, saratoga, pax, popcorn, pascali, etc. – although we did start to run out towards the end):

Good luck to all, and thanks for being you.

Mark Dixon

Rear of the snowdon supercomputer
Snowdon (2002-2010)
Everest supercomputer
Everest (2005-2011)
ARC1 compute racks
ARC1 and DiRAC compute racks (2009-2017)
Polaris and ARC2 infrastructure racks
Polaris (2012-2018) (left) and ARC2 (2013-) (right) infrastructure racks
ARC3 compute node
ARC3 (2017-) compute node

Changes in Research Computing

As you know, this is a time of exciting change for Research Computing, and in more than one way.

For those who don’t already know, after fourteen years working on HPC at Leeds, Mark Dixon will shortly be leaving us to take up a post at Durham with an exciting new direction for his career. Today, however, we say goodbye to Mike Wallis who, after twenty-six years, is going up to Edinburgh to head up a new group there. Great opportunities for both of them, and I’m sure you’ll join with us in wishing them all the best for the future!

Clearly, this creates a temporary gap in provision which gives us a challenge – and an opportunity – to shape the team for the future.

We are developing the new team shape to replace the existing capabilities, and to augment the service provided to researchers in ARC and beyond.

To that end, we have already recruited four new staff – John Hodrien, Oliver Clark, Nick Rhodes and Alex Coleman – who will be working to provide both support for the existing facilities and additional Research Software Engineering support for the University. Some of those names will be familiar to you, but I’m sure you’ll make them all very welcome!

Of course, in the meantime we are in the advanced stages of finalising the pilot for ARC4 (our latest High Performance Computing cluster), readying this new facility for general use.

We hope that this will provide a real boost to the University, particularly supporting the work with GPUs which is seeing increasingly high demand in areas as diverse as Molecular Dynamics and Deep Learning.  As part of the rollout, we will arrange some additional meetings to launch the service and give you full details of what we now have available.

Additionally, from the start of the academic year there will be lots of opportunity to come along to information sessions, to contribute to our development planning and work with us as we go on this exciting journey to build what we intend to be the best University Research Computing service in the UK.

To give you a taste of what’s to come, this is what we are doing right now:

1. Establishing a University-wide Research Software Engineering service:
  • We’ve appointed four new RSEs (Research Software Engineers – see above) with skills and experience across Linux, Windows, Web, Cloud, Data Science, HPC and Scientific Applications.
  • We will be appointing more RSEs to provide additional support to the new EPSRC Centres for Doctoral Training, the Bragg Institute and High Performance Computing.
  • Building out skills in the team to support Digital Humanities and Digital Health.
  • Developing a Masters-level apprenticeship programme to help train the next generation of Research Software Engineers and Research Support Professionals.
  • Continuing to develop and update our comprehensive research skills training portfolio, working with colleagues across the University.

2. Continuing to invest heavily in High Performance Computing to support research:
  • There has been a £2 million investment (from faculties, research groups and central University funds) in the new ARC4 HPC service.
  • ARC4 will be in ‘pilot’ in the next few weeks and launched as a production service just as soon as we’ve given the tyres a good kicking.

3. Investing in agile and adaptive Cloud-based solutions for research
  • We’re building and testing a flexible cloud-based HPC service as an adjunct to our on-premise HPC. Perhaps you need stand-alone HPC for teaching, or access to test out different hardware before you commit to a purchase?
  • Piloting a Cloud-based Jupyter Notebook and R Studio service for teaching and research – accessible from anywhere with just your web browser.
…and we are just getting started.

As ever, we’re listening, so if you’ve any thoughts or comments just drop either me (Martin Callaghan) or Mark Conmy a line.

Deep Reinforcement Learning for Games

Hey, I’m Ryan Cross and for my Computer Science MEng project I applied Deep Reinforcement Learning to the video game StarCraft II, replicating some of the work that DeepMind had done at the time.

As part of this project, I needed to train a reinforcement learning model over thousands of games. It quickly became apparent that it was entirely infeasible to train my models on my own computer, despite it being a fairly powerful gaming machine. I was only able to run two copies of the game at once, which was nowhere near enough when some of my tests needed 50,000+ runs. Worse still, because of the way my model was set up, the fewer instances I ran at once, the slower my code would converge.

It was around this time that my supervisor, Dr Matteo Leonetti, pointed out that the University had some advanced computing facilities (ARC) I could use. Even better, it had a large number of GPUs, which greatly accelerate machine learning and were perfect for running StarCraft II on.

After getting an account, I set about getting my code running on ARC3. I quickly ran into an issue where StarCraft II refused to run there. After a quick Google to check it was nothing I could fix easily, I had a chat with Martin Callaghan about getting the code running. It turned out that, thanks to the way the ARC HPC clusters are set up, getting my code running was as simple as adding a few lines to a script and building myself a Singularity container. This was pretty surprising: I thought that getting a game to run on a supercomputer was going to be a giant pain; instead, it turned out to be quite easy!
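The general shape of that workflow is sketched below – the definition file, image name and training script are placeholders rather than my actual project files, and the exact Singularity commands vary a little between versions:

  # Build the container image from a definition file that installs StarCraft II,
  # pysc2 and the rest of the Python dependencies. Definition builds usually need
  # root, so this was done on my own machine and the image copied up to ARC3.
  sudo singularity build sc2-env.simg sc2-env.def

  # On the cluster, run the training code inside the container
  # (the module name here is illustrative).
  module load singularity
  singularity exec sc2-env.simg python train.py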

The container actually ended up coming in handy much later too: when I was handing my project over, I could simply ask them to run a single command, or just give them my container, and they had my entire environment ready to test my code. No more “I can’t run it because I only have Python 2.7” – just the same environment everywhere.
Better for me, and better for reproducibility!

Once I’d got that all set up, running my experiments was easy. I’d fire off a test in the morning, leave it running for eight hours playing 32 games at once, and check my results when I got in. I managed to get all the results I needed very quickly, which would have been infeasible without ARC3 and the GPUs it has. Getting results for tests was taking 30 minutes instead of multiple hours, meaning I could make changes and write up results much more quickly.
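For anyone wondering what ‘firing off a test’ looks like in practice, it’s just submitting a batch script to the queue. The sketch below is only illustrative – the GPU resource name and the script’s arguments are placeholders, not ARC3’s actual configuration:

  #!/bin/bash
  # Illustrative Grid Engine style batch script (resource names are placeholders).
  #$ -cwd                 # run from the directory the job was submitted from
  #$ -l h_rt=08:00:00     # request an 8 hour run time
  #$ -l gpu=1             # request a GPU (placeholder resource name)

  module load singularity
  # --nv makes the host's GPU driver visible inside the container.
  singularity exec --nv sc2-env.simg python train.py --instances 32

Submitting it is then a single qsub command, and you come back later to collect the results.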

Later, I started to transition my code over to help out on a PhD project, utilising transfer learning to improve my results. At this point, I had models that were bigger than most PCs’ RAM, and yet ARC3 was training them happily. With how ubiquitous machine learning is becoming, it’s great to have University resources which are both easy to use and extremely powerful.

Moving some home directories around

Over the summer we had to do some back-end work to make users’ lives slightly better, by replacing the servers the home directories were served from (let’s call them the UFAs) with something a bit newer and shinier (let’s call this the SAN). There were a few good reasons for this: the hardware was 13 years old, for a start. We had to do some consolidation work to tidy up home directories from users who were never going to return to the institution, and we needed to have a consistent policy on home directory creation.

Historically we’ve had really good service from the UFAs, with great bandwidth and throughput, and we’ve always said that a replacement service needs to at least match what we’ve had in the past. That’s the basis of all the hardware replacement we do in HPC; whatever we put in has to provide at least as good a service as what it replaces. So, we did some testing, to make sure we knew what the replacement matrix should look like.

When this project started, back in 2014, initial testing wasn’t good; in fact the performance of even a simple single-node, single-threaded write test (dd if=/dev/zero of=<testfile> bs=1M count=1024) was considerably worse than on the UFAs. However, with a newer SAN and the right mix of NFS mount options and underlying filesystems – work carried out by the servers and storage team, who did an excellent job – we were able to get an improvement on some standard tasks, like extracting a standard application.
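If you want to run that sort of quick-and-dirty test yourself, something along these lines will do (the target path is just an example; adding conv=fsync stops the write being absorbed by the local page cache, so the timing reflects the storage behind the mount):

  # Time a 1 GiB sequential write to the filesystem under test.
  time dd if=/dev/zero of=$HOME/ddtest.img bs=1M count=1024 conv=fsync

  # Tidy up afterwards.
  rm $HOME/ddtest.img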

Fig 1 – how long does it take to extract OpenFOAM? (bar chart)

You’ll see that the time taken to extract a file was considerably larger on the replacement service. We found some interesting things: with NFS, a server running synchronously waits until a file has actually been committed to stable storage before it sends the acknowledgement that it’s got the data, whereas async lets it acknowledge straight away. In these cases turning async on made everything go much quicker, at the risk of data being lost if the server crashed – however, we felt that risk was worth taking, as the situation where that would occur is most unlikely and we do have resilient backups. Single-threaded performance was equivalent, and although multithreaded performance was not an improvement, it was equivalent to or better than writing to local NFS storage.
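For the curious, async is set per export on the NFS server, along these lines in /etc/exports (the path and client name here are illustrative, not our real configuration):

  # async: reply to clients before data hits stable storage - faster,
  # but a server crash can lose recently acknowledged writes.
  /export/home1  hpc-nodes(rw,async,no_subtree_check)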

There’s also an interesting quirk relating to XFS being a 64-bit filesystem; a 32-bit application might be handed an inode number that’s bigger than it knows how to deal with, which results in an IO error. We needed to do a quick bit of work to make sure there weren’t that many 32-bit applications still being used (there are some, but not many, and we have a solution for users who might be affected by this – if you are, get in touch).
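One common workaround (not necessarily the one we’ve settled on) is the XFS inode32 mount option, which keeps inode numbers below 2^32 – roughly this sort of fstab entry, with a placeholder device and mount point:

  # inode32 allocates inodes in the lower 32-bit range, so old 32-bit
  # applications without large-file support don't get EOVERFLOW from stat().
  /dev/mapper/san-home1  /export/home1  xfs  defaults,inode32  0  2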

In the end a lot of hours were spent on the discovery phase of this project; then, as we entered the test phase (Mark & I started using the new home directory server about a month before everybody else), we found a few issues that needed sorting, especially with file locking and firewalls. Once that was sorted there was a bunch of scripting that needed to happen so that human error was minimised (one of the nice things about being an HPC systems admin is that you very quickly learn how to programmatically do the same task multiple times), and we needed to tidy up the user creation processes – some of which have been around since the early 00s. The error catching and “unusual circumstances” routines – as you’d expect – made up the bulk of that scripting!
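To give a flavour of that scripting, the core of it boils down to looping over accounts and copying each home directory in a way that can safely be re-run. This is a simplified sketch with made-up paths rather than the real migration code, which had far more error handling:

  #!/bin/bash
  # Simplified home directory migration loop (illustrative paths and user list).
  set -euo pipefail

  while read -r user; do
      src="/old_home/${user}"
      dst="/new_home/${user}"

      # Skip accounts with nothing to migrate.
      [ -d "$src" ] || { echo "no home for ${user}, skipping"; continue; }

      mkdir -p "$dst"
      # rsync is idempotent, so an interrupted run can simply be repeated.
      rsync -aH --numeric-ids "$src/" "$dst/"
  done < users.txt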

We’ve gone from 29 different home directory filesystems to three; performance is about the same, and quotas are larger. We’ve done all the tidying-up work that means future migrations will go more smoothly, and although there was a bit of disruption for everybody it was all over quickly and relatively painlessly for the users (which is the most important thing). We are still keeping an eye on things, too.

Huge thanks are due to everybody in the wider IT Service who helped out.

The Carpentries and Research Computing

I’m pleased to announce that we’ve renewed our membership of The Carpentries for another year.

For those of you that don’t know what ‘The Carpentries’ are, they (we) are an international organisation of volunteer instructors, trainers, organisations and staff who develop curricula and teach coding and data science skills to researchers worldwide.

We’re pleased to be able to support the aims of the Carpentries and, in conjunction with other UK partner organisations (and especially our friends at the Software Sustainability Institute), help the wider UK research community develop their skills.

Here at Leeds, we organise and run two and three-day workshops as part of our training programme. We have a new group of instructors on board, so do keep an eye on the training calendar for upcoming workshops. We run workshops using R, Python and occasionally Matlab.

In conjunction with our colleagues at the University of Huddersfield, we’ve also attracted some BBSRC STARS funding to run another set of workshops. You’ll find more information about this at the Next Generation Biologists website.

In previous years we have run a number of workshops in conjunction with our colleagues in the School of Earth and Environment funded by a number of NERC ATSC awards.

If you’re interested in finding out more, perhaps to be a helper at a workshop, a future instructor or you’d like to find out more about the content of a typical workshop then please get in touch.

The Julia Programming language and JuliaCon 2018

Julia is a relatively new, free and open source scientific programming language that has come out of MIT. I first played with it in 2012, back in the days when it didn’t even have release numbers – just GitHub hashes – and it has come a long way since then! In my mind, I think of it as what a language would look like if your two primary design parameters were ‘easy to use for newbies’ and ‘maximally JITable’ – this is almost certainly a gross oversimplification, but it doesn’t seem to offend some of the people who helped create the language. Another way to think about it is ‘as easy to write as Python or MATLAB, but with speed on par with C or Fortran’.

I attended Julia’s annual conference, JuliaCon, last week along with around 350 other delegates from almost every corner of the Research Computing world. While there, I gave a talk on ‘The Rise of the Research Software Engineer’. This was the first time one of my talks had been recorded, and you can see the result below.

All of the conference talks are available at https://www.youtube.com/user/JuliaLanguage/videos. If you’d like to get a flavour of what Julia can do for your computational research, a couple of the JuliaCon 2018 tutorials I’d recommend are below:

An Introduction to Julia

Machine learning with Julia

JuliaCon 2018 marked an important milestone for the language: version 1.0 was released, so now is a fantastic time to give it a try. You can install it on your own machines from https://julialang.org/downloads/ and we’ve also installed it on ARC3. You can make it available to your ARC3 session using the following module command:

module load julia/1.0.0
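Once loaded, you can check that everything works with a quick one-liner, for example:

  julia --version
  julia -e 'println(sqrt(2))'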