COVID-19 and Folding@Home

Folding@Home is a project much like Seti@Home.  It is a distributed computing project for simulating protein dynamics, and topically, it is also doing work on COVID-19.

I was curious as to whether we could get this running on some obsolete HPC systems, home systems, and modern HPC systems, and how the performance would compare, assuming I could make it work on all these different systems.

Getting it all working on ARC2

The starting point was the remnants of ARC2, which used to be the university’s main HPC before it was obsoleted by ARC3 and, more recently, ARC4.  It’s out of production now, but with 137 usable 16-core nodes it still makes for a potent CPU-based workhorse.  Just what could be achieved by using this system, and would it be worthwhile doing so?

First step was downloading the freely available client and running it on ARC2:

$ wget https://download.foldingathome.org/releases/public/release/fahclient/centos-6.7-64bit/v7.5/fahclient-7.5.1-1.x86_64.rpm
$ mkdir test && cd test && rpm2cpio ../fahclient-7.5.1-1.x86_64.rpm | cpio -id
$ usr/bin/FAHClient --configure
User name [Anonymous]: ARC
Team number [0]: 257437
Passkey: 
Enable SMP [true]: 
Enable GPU [true]: false
Name of configuration file [config.xml]:

Well that was easy.  We download the rpm, and extract it (since I’m doing this without admin rights), and set it up.  That all seems to work fine, so it should just be a case of running the client to generate results:

$ usr/bin/FAHClient --config config.xml
10:48:21:************************* Folding@home Client *************************
10:48:21:    Website: https://foldingathome.org/
10:48:21:  Copyright: (c) 2009-2018 foldingathome.org
10:48:21:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
10:48:21:       Args: --config config.xml

So the client runs, and waits for a job.  First job arrives:

10:56:16:WU00:FS00:Started FahCore on PID 9956
10:56:16:WU00:FS00:Core PID:9960
10:56:16:WU00:FS00:FahCore 0xa7 started
10:56:16:WARNING:WU00:FS00:FahCore returned: FAILED_2 (1 = 0x1)

Something’s broken, but I’m not clear what at this point.  On disk you can see it’s downloaded a particular client executable to do the solving.  This allows a generic FAHClient to download specific scientific codes later, making it easier for them to do new calculations without requiring you to update your client, or indeed to download lots of clients ahead of time.  So let’s have a look at it:

$ file ./cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7
./cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), dynamically linked (uses shared libs), for GNU/Linux 3.2.0, not stripped

It’s downloaded an executable targeting a 64-bit system that supports AVX, and given us a dynamically linked executable.  Does ARC2 support AVX, and does it have all the libraries the executable thinks it needs?

$ grep -q avx /proc/cpuinfo && echo yes it does
yes it does
$ ldd cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7
cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7: /lib64/libc.so.6: version `GLIBC_2.15' not found (required by cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7)
cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7: /lib64/libc.so.6: version `GLIBC_2.17' not found (required by cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7)
cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7)
	linux-vdso.so.1 =>  (0x00007ffdb8fe0000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003ad2c00000)
	libdl.so.2 => /lib64/libdl.so.2 (0x0000003ad2800000)
	libm.so.6 => /lib64/libm.so.6 (0x0000003ad2400000)
	libc.so.6 => /lib64/libc.so.6 (0x0000003ad2000000)
	/lib64/ld-linux-x86-64.so.2 (0x0000003ad1c00000)

Ah, so on the plus side ARC2 does support AVX, but this executable has been built for a newer system, and the glibc of the host system (CentOS 6 on ARC2) is too old for it.  So that’s not going to work then.  The answer is, as it often is, containers!  Let’s build a Singularity container for ARC2 with a more modern userland.  You can test this really quickly with Singularity:

$ singularity run docker://ubuntu:18.04
INFO:    Converting OCI blobs to SIF format
INFO:    Starting build...
Getting image source signatures
Copying blob sha256:5bed26d33875e6da1d9ff9a1054c5fef3bbeb22ee979e14b72acf72528de007b
 25.45 MiB / 25.45 MiB [====================================================] 0s
Copying blob sha256:f11b29a9c7306674a9479158c1b4259938af11b97359d9ac02030cc1095e9ed1
 34.54 KiB / 34.54 KiB [====================================================] 0s
Copying blob sha256:930bda195c84cf132344bf38edcad255317382f910503fef234a9ce3bff0f4dd
 848 B / 848 B [============================================================] 0s
Copying blob sha256:78bf9a5ad49e4ae42a83f4995ade4efc096f78fd38299cf05bc041e8cdda2a36
 162 B / 162 B [============================================================] 0s
Copying config sha256:84d0de4598f42a1cff7431817398c1649e571955159d9100ea17c8a091e876ee
 2.42 KiB / 2.42 KiB [======================================================] 0s
Writing manifest to image destination
Storing signatures
INFO:    Creating SIF file...
INFO:    Build complete: /home/home02/scsjh/.singularity/cache/oci-tmp/bec5a2727be7fff3d308193cfde3491f8fba1a2ba392b7546b43a051853a341d/ubuntu_18.04.sif
WARNING: underlay of /etc/localtime required more than 50 (65) bind mounts
FATAL: kernel too old

Right, that doesn’t work, as the host kernel is too old to use that container.  Ubuntu 16.04 has no such issue, so let’s build an Ubuntu 16.04 image instead.  Download the Ubuntu Folding@Home client, and use this recipe to make an image:

Bootstrap: docker
From: ubuntu:16.04

%files
  fahclient_7.4.4_amd64.deb /

%post
    chmod 644 /fahclient_7.4.4_amd64.deb
    apt-get update
    # Install and generate the locale used on ARC2 (see below)
    apt-get install -y locales
    locale-gen en_GB.UTF-8
    # Install the Folding@Home client itself
    apt-get install -y /fahclient_7.4.4_amd64.deb

This can be built on a virtual machine or any other system where you do have root access; whilst we could run a container from Docker Hub in the previous example without needing root, you do need root to build your own custom image.  That’s not a problem, as once you’ve built the image you can copy it onto systems where Singularity is installed, and you don’t need root to run it.  The reason I install the locales package and generate a locale for en_GB.UTF-8 (the locale I have set on ARC2) is that things get upset if there isn’t a locale configured, so this is a fairly common requirement when building containers.
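
As a sketch of what that build step looks like (assuming the recipe above is saved as fah.def; test.img is the image name used in the commands below, and the destination hostname is just a placeholder):

# On a machine where you do have root, e.g. a local VM:
$ sudo singularity build test.img fah.def
# Then copy the finished image across to the HPC system:
$ scp test.img username@hpc-login-node: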

$ module add test singularity
$ mkdir test
$ singularity exec test.img FAHClient --chdir test --config ../config.xml

...

11:27:13:WU00:FS00:0xa7:Project: 16403 (Run 203, Clone 0, Gen 10)
11:27:13:WU00:FS00:0xa7:Unit: 0x0000000a96880e6e5e8be0e24d281908
11:27:13:WU00:FS00:0xa7:Reading tar file core.xml
11:27:13:WU00:FS00:0xa7:Reading tar file frame10.tpr
11:27:13:WU00:FS00:0xa7:Digital signatures verified
11:27:13:WU00:FS00:0xa7:Calling: mdrun -s frame10.tpr -o frame10.trr -x frame10.xtc -cpt 15 -nt 16
11:27:13:WU00:FS00:0xa7:Steps: first=5000000 total=500000
11:27:13:WU00:FS00:0xa7:Completed 1 out of 500000 steps (0%)

So, that’s cheered up the client, and we’re now running the Folding@Home code on an old machine.  Hurrah!  You can see at this step I’ve told it to run in a test directory.  This will come in useful when I actually submit a big job to run this on multiple nodes, as each node can have its own directory for storing its work files.  So let’s write a job script:

#!/bin/bash
#$ -cwd -V
#$ -l h_rt=48:00:00
#$ -l nodes=1
#$ -t 1-137

module add test singularity

# Let's be properly paranoid, as LANG actually gets fed through to ARC from the SSH client.
export LANG=en_GB.UTF-8

# Give each job its own work directory
mkdir -p $SGE_TASK_ID

singularity exec test.img FAHClient --chdir $SGE_TASK_ID --config ../config.xml

Submit a task array job that will run one task per node, for the 137 nodes I know are free.  This will do, given I’m the only user on ARC2, but it might not be the most sensible way of doing it otherwise; why not is left as an exercise for the reader.
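
For completeness, submitting it looks something like this (assuming the script above is saved as submit-big.sh, which matches the output file names further down):

$ qsub submit-big.sh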

And off we go.  ARC2 lacks GPUs, so this is CPU-only processing, but they are 16-core nodes and we’ve got 137 of them to play with.  You’d reasonably think that adds up to a lot (and it’s worth noting there are CPU-only jobs in Folding@Home, so even if the scores look low it doesn’t mean the work isn’t in some way worthwhile).  How much does each node manage?  The first job completed and it says, “Final credit estimate, 2786.00 points”, and it took ~23 minutes if you ignore the time to get a job, the time to download, and the time to upload (all of which can vary).  So best case you might be looking at ~23 million points per day, using *all* of an old HPC machine.  Having run it for a couple of days, what does this really look like?  Crufty adding up of the lines that look like “08:36:39:WU01:FS00:Final credit estimate, 12329.00 points”:

$ grep Final submit-big.sh.o5377470.*|awk -F, '{print $2}'|sed 's/ points//'|paste -sd+|bc
26363222.00

So after running for two days we’ve got 26 million points, or 13 million a day.  That’s quite a bit less than expected, but in truth the upstream servers are pretty poor at handing out jobs consistently and quickly, so nodes hang around idle for long periods; if you try to improve that, you lower your reward per job by asking for jobs before you’re ready to start them.
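
As a rough back-of-the-envelope check (ignoring the variable download/upload times mentioned above), this is where the best-case and actual daily figures come from:

# Best case: points per work unit x work units per node per day x 137 nodes
$ echo "2786 * 1440 / 23 * 137" | bc
23896499
# What we actually got: the two-day total divided by two
$ echo "26363222 / 2" | bc
13181611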

Comparisons

How does that compare with an old desktop I have at home?  Turns out an i3-2120 is worth 4511 a day, with a following wind.  So ARC2 doesn’t look so bad in that light, delivering the equivalent performance of 2922 of these old desktop PCs.  That old desktop also had an Nvidia 560 Ti OEM graphics card in it.  How about that?  41509 predicted per day, so ARC2 is still worth 317 of those.  Less impressive, but still reasonable looking.

How about if we move on to a slightly more modern GPU?  I’ve also got an Nvidia GTX 1070 at home, and that reports ~700k a day.  I think that makes ARC2 look somewhat disappointing, as it would only take 19 of these four-year-old GPUs to beat an entire HPC system at these calculations.  That naturally makes you wonder what a more modern GPU setup could do, and I couldn’t resist running a test on our current HPC, ARC4.  You don’t need to jump through the same hoops on ARC4, as it properly supports the Folding@Home client out of the box.  If you did want to make a container work on both systems you could, but you’d need to make sure you had all the OpenCL libraries present in the container, along with CUDA; since this was just curiosity, I skipped that step and ran the client on the system outside the container.  There would be no performance difference, as containers do not in any way limit CPU/GPU performance.  So, using a node with 4x Nvidia V100 GPUs, how much could a single node on ARC4 generate compared to ARC2 (remembering ARC2 managed 13 million per day using 137 nodes)?  The actual result is just over 10 million, so really quite close to what the entire ARC2 cluster managed.  The power consumption difference between the two is staggering, as the V100 GPUs were only drawing ~600W between them to do this.
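
Again as a rough check of the ratios quoted above, using the ~13.18 million points/day figure from the two-day run:

$ echo "scale=1; 13181611 / 4511" | bc      # ARC2 vs the i3-2120
2922.1
$ echo "scale=1; 13181611 / 41509" | bc     # ARC2 vs the 560 Ti OEM
317.5
$ echo "scale=1; 13181611 / 700000" | bc    # ARC2 vs the GTX 1070, so ~19 of them to beat it
18.8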

Take away lessons

New code really can run on old systems using containers like Singularity, and it’s not a lot of work to make that happen.  That said, old HPC systems really should be thrown away and replaced by more modern ones: the efficiency and performance gains to be had by moving code from legacy systems onto modern GPU systems (where appropriate) are mind-blowing.

We’ve created a team for this test (257437) that you’re free to submit to as well if you wish.  As a result of this brief test, we got the University of Leeds to the position of number-one UK university team, and into the top ten listed university teams in the world.  It’s a fun factoid, but don’t read too much into it, as the scale of our contribution pales into insignificance when compared to the top contributors.

Research Computing: Working from home

Staff (and postgraduate researchers) at the University of Leeds have now been instructed to observe and follow the principle of “social distancing” by, where possible, working from home (for further details please visit coronavirus.leeds.ac.uk/staff-advice/). I (Nick Rhodes) would like to share some guidance and advice to help support teams in working effectively and comfortably from home.

I have in the past worked in organisations with teams split across a number of sites hundreds of miles apart, as well as with some home workers. I have also worked in teams that have followed Agile and Lean principles, which can be adapted for remote working with a minimum of friction. What I observed was that the teams that worked most effectively had clear and understandable rules and processes, as well as showing those typical aspects of a good working team: effective communication, honesty, trust, support, adaptability and a willingness to learn and improve.

It will come as no surprise that starting with light, simple processes that a team has agreed to follow is more effective than devising something complex. The simpler a process, the easier it is to understand, and the fewer opportunities there are for mistakes. A simpler process is also easier to observe for how well it is working, to analyse and feed back improvements on, and to adapt to any imposed changes in ways of working.

In terms of behaviours, effective communication is key to ensure we continue to work as a cohesive team, not a group of individuals and helps avoid feelings of isolation.

Honesty, trust and support are essential for ensuring a positive team environment and for helping individuals with the challenges of working remotely (at home).

There needs to be a willingness to adapt, learn and improve – there will be a need to devise and change processes, and to adjust working habits to accommodate the changes and challenges that working from home brings.  We will make mistakes, but we will make them together, learn from them and improve our processes.

As for tools for managing and collaborating: when starting out, or if you prefer, there is nothing wrong with using phones, a whiteboard, or a sheet of paper to keep track of tasks and activities (which can be shared by photo with the rest of the team).

Below I have shared a list of rules that the Research Computing Team will be starting out with. Remember to be considerate of all team members’ situations (family, accessibility) so that no-one feels left out.

Our proposed way of working rules:

  • Core working hours are 10am to 4pm, aiming for a minimum of 7 working hours a day.
    This allows for flexibility to meet people’s personal commitments.
  • Use and stay logged into Teams. Check emails at least 3 times a day (start, middle and end of day).
  • There will be a daily check-in at 10.30am each morning, starting on Monday 23rd March.
    This will include any team-level updates that could affect the current day’s work, and individual updates (summary of the previous day’s progress, today’s plans, any blockers to progress – standard ‘agile’ stuff).
  • There will be a weekly catch-up/team meeting.
    The purpose of this is to share non-urgent news, review the previous week’s progress (show and tell, summary charts), outline general activities/themes for the following week, review our agreed processes/rules and suggest tweaks/improvements, and agree working times/patterns for the team.
  • When scheduling work, assign relative priorities and/or time-box it for the week.
    If an individual feels a task is delayed and/or taking longer than planned, raise it at a daily stand-up.
  • Only do agreed/planned work/tasks/activities; if you are unsure, put it on the Kanban board (reviewed by the manager, next day or end of week, depending on urgency).
  • Ensure ticketing systems and the Kanban board are up to date at the end of the day (it’s good to update at least twice a day).

We will be using the following tools:

  • ServiceNow – pick up support tickets as normal.
  • Kanban board to track agreed planned tasks/projects (Trello – with training and support to follow)
  • Calendars will be used wisely (time box agreed time/work) and kept up to date.
  • Use the Research Computing at Leeds team for collaboration.

Update: Research Computing and HPC availability

Hello everyone,

You will have heard by now that the University is moving to a mainly remote working model.

Although our offices will be closed, we, together with our colleagues across IT Services, have prepared for this and will be continuing to work with you and run the HPC and Research Computing service as normally as possible.

All of our training and meetings are moving to online delivery until further notice, and we will be using Teams, Zoom and Blackboard Collaborate to deliver them. Our colleagues in IT Training will give you all the information you need to join in these sessions.

We’re also here to talk about future research proposals.

To help us work with you in the most structured way, please carry on using the ticketing system through our contact form: https://bit.ly/arc-help

The ARC HPC systems will continue to be accessible and available for your research. We will continue to monitor them.

You will continue to be able to access the ARC systems remotely. This IT Knowledge Base article gives more information: https://it.leeds.ac.uk/it?id=kb_article&sysparm_article=KB0013720

You’ll still be able to apply for accounts on the ARC systems although account creation may take a while longer than normal: https://arc.leeds.ac.uk/apply/getting-an-account/

We will be sharing lots of tips and ideas to support you over the coming weeks, and there will be a new Research@Home section of our website supported by regular blog posts (https://arc.leeds.ac.uk/blog) with all this information too.  Please let us know if you want to contribute a blog post to share with our community.

We’re also going to be holding regular Research Computing coffee mornings.  You can join us and each other for an informal chat, a coffee and a bit of toast (you’ll have to provide these yourselves…).  More on these soon.  Our TechTalk series will continue too, as will our Meet The Team sessions.  As ever, all our sessions are open to everyone.

Keep an eye on our mailing lists as we update our resources and arrange our dates:

arc-users@lists.leeds.ac.uk (HPC service announcements)

research-computing@lists.leeds.ac.uk (general research computing news and announcements)

With all of our best wishes,

Martin, Mark, Alex, Ollie, Nick, John, Adam, Phil and Sean

Very nearly almost not quite 14 years of a university HPC service

I am about to leave the University of Leeds, having been the technical lead designing and running its central High Performance Computing (HPC) systems for just a few days short of 14 years. With my last machine, ARC4, currently in its pilot phase, I’m hoping that people might be interested in hearing more about what we’ve been up to.

When I arrived in 2005, we had ~50 active users (i.e. people actually running jobs) consuming 15 core years a month – which has now grown to around 700 active users consuming 1,000 core years a month. In that time we have helped well over 2000 researchers from every faculty of our University and beyond who, between them, have used over 80 thousand core years of computer time. At our peak, we were using around 0.5 MW of electricity for the IT and cooling.

Histogram: number of core hours per month, by machine, since the dawn of time.
Histogram: number of active (i.e. job-submitting) users by month and year.

I still find it a form of magical time travel that you can submit a job, let it run for a bit, and then get an automated email from the batch queue system telling you exactly how many days, weeks or even years of computer time you have just used. But what’s even better is getting to know the people using our services and the things they get up to.

Supporting researchers’ work and feeling like we’re part of your story is a major perk of the job, and is the motivation for getting up in the morning and coming into work. In fact, your stories are always the highlight of our day: the industrial bread ovens; the optimisation of airflow in hospital wards to reduce infection; the models of how Leeds’ layout affects the behaviour of burglars; the atmospheric science of the prehistoric era; the searches for genes that increase the risk of cancer; the interstellar gas clouds; the materials science; the machine-aided translation of languages. All of them.

Thanks for being so nice and for letting me be part of it 🙂

Although it is the users, the funders (and the people who convince the funders to fund) that make it all happen, procuring, installing and running ~20 tonnes of HPC equipment plus ~20 tonnes of cooling equipment is a big undertaking involving a lot of people beyond the ARC team, all of whom should be thanked: the IT Datacentre, User Admin, Network and Finance teams; our colleagues in the Estates and Purchasing departments; the IT and specialist hardware vendors; the integrators specialising in racking, wiring and HPC technologies; the electrical and mechanical engineering companies; and the hardware break/fix engineers. The number and rate of large procurements (more than implied in the top graph) has been punishing at times but, as my first boss here used to say, it was a nice problem to have. In addition, HPC as we know it just wouldn’t have happened without all the Free and Open Source software that it is built upon (and which needs to be put together just so), and the huge number of people who have contributed to it.

Much changed on the journey from HPC being mainly a minority sport used by small groups of interested academics, to being considered an essential tool widely used across many different disciplines. For us, this started in 2009 when the University was persuaded to dig even more deeply into its pockets than it had done before and commit to a substantial capital programme on the “ARC” (Advanced Research Computing) series of machines and a water cooling system – funded by all the Faculties, but free at the point of use for every researcher at the institution. It’s been a fun journey, working at a bigger scale with much more varied and complicated requirements, needing more software, better interconnects, parallel filesystems, more involved scheduling, better ways of working.

We’ve had our share of triumphs and disasters, from our colleagues in Environment on a field trip in Antarctica running daily weather forecasts on ARC1 so that they knew where to put their plane and collect measurements of the wind coming over a ridge, to when a flood directly above the machine room made its way through the cracks and took out several racks of equipment in early 2016.

Some characterise the use of HPC as the embodiment of impatience: people want tomorrow’s computer power today. Well, it turned out that, back in 2005, you needed to have a lot of patience to be impatient enough to try to use HPC. You really needed to know what you were doing. It was painful. Since then, we’ve done a lot of work to make these systems (hopefully) usable for as many different types of user as possible, from the hard core developers to those who want to run packaged commercial software. There does remain the bar of being able to use Linux, but some might say that the experience of learning a Unix-style environment is good for you.

Then, as now, there was safety in numbers – we have been members of several collaborations: the White Rose Grid; the National Grid Service (only a small involvement on my part); the first phase of funding for the DiRAC consortium; and the N8 group of research-intensive universities in the north of England, for whom we ran their shared regional supercomputer (polaris).

Although the people are the most fascinating part of any workplace, I’ll leave you with a few images of past machines. But before I do, I’ll correct a misconception that Snowdon and Everest were named after mountains: they were actually part of the White Rose Grid and named after white roses, a nod to the Yorkshire Rose (between us, Sheffield, York and Leeds Universities got quite good at naming machines after roses – iceberg, chablis, akito, saratoga, pax, popcorn, pascali, etc. – although we did start to run out towards the end):

Good luck to all, and thanks for being you.

Mark Dixon

Snowdon (2002-2010)
Everest (2005-2011)
ARC1 and DiRAC compute racks (2009-2017)
Polaris (2012-2018) (left) and ARC2 (2013-) (right) infrastructure racks
ARC3 (2017-) compute node

Changes in Research Computing

As you know, this is a time of exciting change for Research Computing, and in more than one way.

For those who don’t already know, after fourteen years working on HPC at Leeds, Mark Dixon will shortly be leaving us to take up a post at Durham, an exciting new direction for his career.  Today, however, we say goodbye to Mike Wallis who, after twenty-six years, is going up to Edinburgh to head up a new group there.  These are great opportunities for both of them, and I’m sure you’ll join with us in wishing them all the best for the future!

Clearly, this creates a temporary gap in provision which gives us a challenge – and an opportunity – to shape the team for the future.

We are developing the new team shape to replace the existing capabilities, and to augment the service provided to researchers in ARC and beyond.

To that end, we have already recruited four new staff – John Hodrien, Oliver Clark, Nick Rhodes and Alex Coleman – who will be working to provide both support for the existing facilities and additional Research Software Engineering support for the University.  Some of those names will be familiar to you, but I’m sure you’ll make them all very welcome!

Of course, in the meantime we are in the advanced stages of finalising the pilot for ARC4 (our latest High Performance Computing cluster), readying this new facility for general use.

We hope that this will provide a real boost to the University, particularly supporting the work with GPUs which is seeing increasingly high demand in areas as diverse as Molecular Dynamics and Deep Learning.  As part of the rollout, we will arrange some additional meetings to launch the service and give you full details of what we now have available.

Additionally, from the start of the academic year there will be lots of opportunity to come along to information sessions, to contribute to our development planning and work with us as we go on this exciting journey to build what we intend to be the best University Research Computing service in the UK.

To give you a taste of what’s to come, this is what we are doing right now:

1. Establishing a University-wide Research Software Engineering service:
  • We’ve appointed four new RSEs (Research Software Engineers – see above) with skills and experience across Linux, Windows, Web, Cloud, Data Science, HPC and Scientific Applications.
  • We will be appointing more RSEs to provide even more support for the new EPSRC Centres for Doctoral Training, the Bragg Institute and High Performance Computing.
  • Building out skills in the team to support Digital Humanities and Digital Health.
  • Developing a Master’s-level apprenticeship programme to help train the next generation of Research Software Engineers and Research Support Professionals.
  • Continuing to develop and update our comprehensive research skills training portfolio, working with colleagues across the University.

2. Continuing to invest heavily in High Performance Computing to support research:
  • There has been a £2 million investment (from faculties, research groups and Central University funds) in the new ARC4 HPC service.
  • ARC4 will be in ‘pilot’ in the next few weeks and launched as a production service just as soon as we’ve given the tyres a good kicking.

3. Investing in agile and adaptive Cloud-based solutions for research
  • We’re building and testing a flexible cloud-based HPC service as an adjunct to our on-premise HPC.  Perhaps you need stand-alone HPC for teaching or access to test out different hardware before you commit to purchase?
  • Piloting a Cloud-based Jupyter Notebook and R Studio service for teaching and research – accessible from anywhere with just your Web browser.
…and we are just getting started.

As ever, we’re listening, so if you’ve any thoughts or comments just drop either me (Martin Callaghan) or Mark Conmy a line.

Deep Reinforcement Learning for Games

Hey, I’m Ryan Cross, and for my Computer Science MEng project I applied Deep Reinforcement Learning to the video game StarCraft II, replicating some of the work that DeepMind had done at the time.

As part of this project, I needed to train a reinforcement learning model over thousands of games. It quickly became apparent that it was entirely infeasible to train my models on my own computer, despite it being a fairly powerful gaming machine. I was only able to run 2 copies of the game at once, which was nowhere near enough when some of my tests needed 50,000+ runs. Worse still, due to the setup of my model, the fewer instances I ran at once, the slower my code would converge.

It was around this time that my supervisor, Dr Matteo Leonetti, pointed out that the University had some advanced computing facilities (ARC) I could use. Even better, there were a large number of GPUs there, which greatly accelerate machine learning and were perfect for running StarCraft II on.

After getting an account, I set about getting my code running on ARC3. I quickly ran into an issue where StarCraft II refused to run on ARC3. After a quick Google to check it was nothing I could fix easily, I had a chat with Martin Callaghan about getting the code running in any way we could. It turned out that, due to the setup of the ARC HPC clusters, getting my code running was as simple as adding a few lines to a script and building myself a Singularity container. This was pretty surprising; I thought that getting a game to run on a supercomputer was going to be a giant pain, but instead it turned out to be quite easy!
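
To give a flavour of what’s involved (this is just an illustrative sketch rather than my actual recipe, and the exact packages will vary), a Singularity definition for this kind of environment doesn’t need to be much more than:

Bootstrap: docker
From: ubuntu:18.04

%post
    # Python plus DeepMind's StarCraft II learning environment (pysc2)
    apt-get update
    apt-get install -y python3 python3-pip unzip
    pip3 install pysc2
    # The StarCraft II Linux client itself would be unpacked into the image
    # here, or bind-mounted in from the host at run time.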

The container actually ended up coming in handy much later too: when I was handing my project over, I could simply ask them to run a single command or just give them my container, and they had my entire environment ready to test my code. No more “I can’t run it because I only have Python 2.7”, just the same environment everywhere. Better for me, and better for reproducibility!

Once I’d got that all set up, running my experiments was easy. I’d fire off a test in the morning, leave it running for 8 hours playing 32 games at once, and check my results when I got in. I managed to get all the results I needed very quickly, which would have been infeasible without ARC3 and the GPUs it has. Getting results for tests was taking 30 minutes instead of multiple hours, meaning I could make changes and write up results much quicker.

Later, I started to transition my code over to help out on a PhD project, utilising transfer learning to improve my results. At this point, I had models that were bigger than most PCs’ RAM, and yet ARC3 was training them happily. With how ubiquitous machine learning is becoming, it’s great to have University resources which are both easy to use and extremely powerful.