ARC2

Operating system

ARC2 is the second phase of the ARC service here at Leeds and offers a Linux-based HPC service built on the CentOS 6 distribution. ARC2 has been in service since January 2014.

The batch scheduler includes significant improvements, particularly in the syntax for submitting parallel jobs; this is referred to as the node syntax.
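
As an illustration, a request for whole nodes using the node syntax might look like the sketch below. This is a minimal example only: the resource names (nodes, ppn, tpp), the run time and the program name are assumptions and should be checked against the ARC2 batch documentation before use.

    #!/bin/bash
    # Run from the current directory and export the environment.
    #$ -cwd -V
    # Request one hour of run time (illustrative).
    #$ -l h_rt=1:00:00
    # Node syntax: 2 whole nodes, 16 processes per node, 1 thread per process
    # (resource names are assumptions; confirm against the ARC2 documentation).
    #$ -l nodes=2,ppn=16,tpp=1
    # Launch a hypothetical MPI program across the allocated cores.
    mpirun ./my_parallel_program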

Hardware

ARC2 consists of a constellation of HP-based servers and storage. A schematic of the rack layout is below; it is separated into a high-density component geared towards computation and a low-density portion providing mostly infrastructure:

ARC2 Rack Layout
Initial configuration

Compute: HP BL460 blades (190 blades; 380 CPUs; 3040 cores). Each blade houses one Sandy Bridge server (node). Each node is dual-socket with two 8-core Intel E5-2670 (2.6GHz) processors (16 cores per node), 32GB of DDR3 1600MHz memory, a 500GB hard drive and a QDR ConnectX InfiniBand adapter.

Storage: Lustre (~170TB usable). Two fail-over pairs delivering 4GB/s via the InfiniBand network to ~170TB of usable storage on /nobackup.

Network: InfiniBand. Provides a full-Clos and 2:1 blocking 40Gbit/s interconnect to the compute blades, and access to infrastructure (e.g. Lustre storage) on the edge.

Network: Gigabit. Management and general networks facilitating system boot; all user traffic is carried over the InfiniBand network.

Network topology

All user-facing systems (login and compute) are connected to the InfiniBand network and use it to transfer all user data. This is a layered network, with the latency of communication dependent upon the number of (36-port) switch hops required to route between the source and destination devices. The diagram below shows the cluster's topology, sometimes described as a half-Clos network:

ARC2 Network Topology

Each server has a 4X quad-data-rate (QDR) connection which can send and receive data at 3.6GB/s. Each switch has two 4X QDR links up to the core, able to transfer data at ~8GB/s.

The latency between servers connected to the same switch is around 1.1 microseconds. Between servers connected to different switches, the latency is around 1.5 microseconds.

By default, jobs will be dispatched to any compute node in the cluster. The following placement attributes can be used to give a job a better distribution of nodes (see the example after the table):

attribute: -l placement=optimal
comments: Minimises the number of switch hops between the nodes allocated to the job.

attribute: -l placement=scatter (default)
comments: Ignores topology and runs the job anywhere, potentially introducing more latency than necessary into all communications.
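
As an illustration, the placement request can be combined with the other resource requests either inside the job script or on the command line at submission time. The job script name below is hypothetical.

    # Submit an existing job script, asking the scheduler to minimise
    # the number of switch hops between the job's nodes:
    qsub -l placement=optimal myjob.sh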

Lustre file system

A large amount of infrastructure is dedicated to the Lustre parallel filesystem, which is mounted on /nobackup. It is served over InfiniBand and is configured to deliver ~4GB/s from a 170TB filesystem. The filesystem could be tuned more aggressively (or more conservatively), but this configuration strikes a sensible compromise between data integrity and performance.
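
Although the filesystem-level tuning above is fixed by the service, the striping that Lustre applies to individual directories and files can be inspected, and adjusted where appropriate, with the standard lfs tool. This is a sketch only; the path and stripe count below are illustrative, and the /nobackup defaults already reflect the compromise described above.

    # Show the current Lustre stripe settings for a directory on /nobackup
    # (the path is illustrative):
    lfs getstripe /nobackup/username/results

    # Stripe new files in that directory across 4 object storage targets,
    # which can improve bandwidth for large files:
    lfs setstripe -c 4 /nobackup/username/results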