There are significant improvements to the batch scheduler, particularly relating to the syntax for submission of parallel jobs. This is referred to as the node syntax.
ARC2 consists of a constellation of HP-based servers and storage. A schematic of the rack layout is below, which is separated into a high-density component geared towards computation and a low-density portion providing mostly infrastructure:
|Compute||HP BL460 blade||Each blade houses one Sandy Bridge server (node). Each node is dual-socket with 8-core Intel E5-2670 (2.6GHz) processors (16 cores per node), 32GB of DDR3 1600MHz memory, a 500GB hard drive and QDR ConnectX InfiniBand||190 blades; 380 CPUs; 3040 cores|
|Storage||Lustre||Two fail-over pairs delivering 4GB/s via the InfiniBand network to ~170TB usable storage on /nobackup||~170TB|
|Network||InfiniBand||Provide a Full-Clos and 2:1 Blocking||40Gbit/s interconnect to compute blades and access to infrastructure (e.g. Lustre storage) on the edge|
|Gigabit||Management and general networks facilitating system boot. All user traffic carried over the InfiniBand network|
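The totals in the compute row follow directly from the per-node figures. The sketch below reproduces that arithmetic; the aggregate memory and memory-per-core values are derived here from the table and are not quoted in the original, and the function name `cluster_totals` is purely illustrative.

```python
# Per-node figures taken from the hardware table above.
BLADES = 190
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 8
MEMORY_PER_NODE_GB = 32


def cluster_totals(blades=BLADES):
    """Return aggregate CPU, core and memory counts for the compute blades."""
    cores_per_node = SOCKETS_PER_NODE * CORES_PER_SOCKET
    return {
        "cpus": blades * SOCKETS_PER_NODE,
        "cores": blades * cores_per_node,
        # Derived values (not stated in the table): total and per-core memory.
        "memory_gb": blades * MEMORY_PER_NODE_GB,
        "memory_per_core_gb": MEMORY_PER_NODE_GB / cores_per_node,
    }


if __name__ == "__main__":
    for name, value in cluster_totals().items():
        print(f"{name}: {value}")
```

The memory-per-core figure is a useful rule of thumb when deciding how many processes to place on each node.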
All user-facing systems (login and compute) are connected to the InfiniBand network and use it to transfer all user data. This is a layered network, with the latency of communication dependent upon the number of (36-port) switch hops required to route between the source and destination devices. The diagram below shows the cluster’s topology, sometimes described as a half-Clos network:
Each server has a 4X quad-data-rate (QDR) connection which can send and receive data at 3.6GB/s. Each switch has two 4X QDR links up to the core, able to transfer data at ~8GB/s.
The latency between servers connected to the same switch is around 1.1 microseconds. Between servers connected to different switches, the latency is around 1.5 microseconds.
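To get a feel for how these figures interact, a common first-order estimate of point-to-point transfer time is latency plus message size divided by bandwidth. The sketch below applies that model to the numbers quoted above. It is an approximation only: it assumes 1GB = 10^9 bytes, and it ignores contention on the uplinks, protocol overhead and the behaviour of any particular MPI library. The function name `transfer_time` is illustrative.

```python
# Figures quoted in the text above; GB is assumed to mean 10**9 bytes.
LINK_BANDWIDTH_BYTES_PER_S = 3.6e9
LATENCY_SAME_SWITCH_S = 1.1e-6
LATENCY_DIFFERENT_SWITCH_S = 1.5e-6


def transfer_time(message_bytes, same_switch=True):
    """Estimate the time in seconds to send one message between two servers.

    First-order model: time = latency + size / bandwidth.
    """
    if message_bytes < 0:
        raise ValueError("message_bytes must be non-negative")
    latency = LATENCY_SAME_SWITCH_S if same_switch else LATENCY_DIFFERENT_SWITCH_S
    return latency + message_bytes / LINK_BANDWIDTH_BYTES_PER_S


if __name__ == "__main__":
    for size in (8, 64 * 1024, 16 * 1024 * 1024):
        near = transfer_time(size, same_switch=True)
        far = transfer_time(size, same_switch=False)
        print(f"{size:>10} bytes: same switch {near:.3e} s, different switch {far:.3e} s")
```

Under this model the extra switch hops matter most for small messages, where latency dominates; for large messages the bandwidth term swamps the difference.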
By default, jobs will be dispatched to any compute node in the cluster. The following parameter can be used to request a particular distribution of nodes for a job:
|-l placement=optimal||Minimises number of switch hops|
|-l placement=scatter||Ignores topology concerns and runs anywhere, potentially introducing more latency than necessary to all communications (default)|
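As an illustration of how the placement options above might be assembled when submitting jobs from a script, the following sketch builds an argument list containing the `-l placement=...` option. Only the two option values come from the table; the submission program name `qsub`, the script name `job.sh` and the helper names are assumptions made for this example, so check them against the scheduler documentation before use.

```python
# The two placement values documented in the table above.
VALID_PLACEMENTS = ("optimal", "scatter")


def placement_args(policy="scatter"):
    """Return the scheduler arguments that select a placement policy."""
    if policy not in VALID_PLACEMENTS:
        raise ValueError(f"unknown placement policy: {policy!r}")
    return ["-l", f"placement={policy}"]


def submit_command(script, policy="scatter", submit_program="qsub"):
    """Build a full submission command as an argument list.

    'qsub' is an assumed default; the original text does not name
    the submission program.
    """
    return [submit_program, *placement_args(policy), script]


if __name__ == "__main__":
    # 'job.sh' is a hypothetical job script name.
    print(" ".join(submit_command("job.sh", policy="optimal")))
```

Returning a list rather than a single string means the result can be passed straight to `subprocess.run` without shell quoting concerns.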
Lustre file system
A large amount of infrastructure is dedicated to the Lustre parallel filesystem, which is mounted on /nobackup. This is served over InfiniBand and is configured to deliver ~4GB/s from a 170TB filesystem. It is possible to tune the filesystem in a more extreme (or more conservative) manner, but this configuration achieves a sensible compromise between data integrity and performance.