Cluster Structure

The HPC cluster at Sussex was originally formed from two separate clusters developed by IT Services (ITS) and the School of Mathematical and Physical Sciences (MPS). These were later merged into what is now the university-wide cluster.

Note

The hardware that makes up the compute nodes is quite varied, reflecting the number of different groups within the university that have contributed money and hardware at different times.

Hardware

graph G {
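  // Graphviz (DOT) source for the cluster diagram; the edges below are
  // invisible (style=invis) and exist purely to lay out the node groups.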

  graph [rankdir = "TB" ranksep=0.1 fontsize=10 fontname="Verdana" compound=true style=filled color=white];
  node [style=filled shape=record fontsize=10 fontname="Verdana"];

  subgraph cluster_0 {
    color=lightgrey;

    subgraph cluster_1_1 {
      label = "Cluster Master";
      color=white;
      "apollo-master";
    }

    subgraph cluster_1_2 {
      label = "Login Nodes";
      color=white;
      node [color=lightblue];
      "feynman" "apollo" "econ";
    }

    subgraph cluster_2_1 {
      label = "Lustre Servers";
      color=white;
      node [color=palegreen];
      "mds1" -- "oss1 ... oss8" [style=invis];
      "mds2" -- "oss1 ... oss8" [style=invis];
    }

    subgraph cluster_2_2 {
      label = "NFS File Servers";
      color=white;
      node [color=indianred1];
      "nfs001" -- "darshan" [style=invis];
      "nfs002" -- "darshan" [style=invis];
    }

    subgraph cluster_3_1 {
      label = "nodes";
      color=white;
      node [color=khaki];

      subgraph cluster_4_1 {
        style=dashed;
        color=lightgrey;
        label = "Compute";
        "node001" -- "node002" [style=invis];
        "node..." -- "node208" [style=invis];
      }

      subgraph cluster_4_2 {
        style=dashed;
        color=lightgrey;
        label = "GPU";
        "node150" -- "node151" [style=invis];
        "node152" -- "node153" [style=invis];
      }

      subgraph cluster_4_3 {
        style=dashed;
        color=lightgrey;
        label = "Grid";
        "node101" -- "node102" [style=invis];
        "node113" -- "node114" [style=invis];
      }
    }

    subgraph cluster_3_2 {
      label = "Service Nodes";
      color=white;
      node [color=lightsalmon];

      subgraph cluster_3_2_1 {
        label = "Cluster";
        style=dashed;
        color=lightgrey;
        "git" -- "svn" -- "pxe" -- "puppet" [style=invis];
        "ganglia" -- "icinga" -- "packages" -- "wiki" [style=invis];
      }

      subgraph cluster_3_2_2 {
        label = "Grid";
        style=dashed;
        color=lightgrey;
        "grid-cream-02" -- "grid-argus-02" -- "grid-apel-02" -- "grid-bdii-02" [style=invis];
        "grid-storm" -- "grid-squid-01" -- "grid-ui-01" [style=invis];
      }
    }

  "apollo-master" -- "feynman" [ltail=cluster_1_1 lhead=cluster_1_2 style=invis]
  "apollo-master" -- "apollo" [ltail=cluster_1_1 lhead=cluster_1_2 style=invis]
  "feynman" -- "nfs001" [ltail=cluster_1_2 lhead=cluster_2_2 style=invis]
  "feynman" -- "mds2" [ltail=cluster_1_2 lhead=cluster_2_1 style=invis]
  "feynman" -- "git" [ltail=cluster_1_2 lhead=cluster_3_2 style=invis]
  "feynman" -- "grid-cream-02" [ltail=cluster_1_2 lhead=cluster_3_2 style=invis]
  "grid-bdii-02" -- "node001" [ltail=cluster_3_2 lhead=cluster_3_1 style=invis]
  "puppet" -- "node001" [ltail=cluster_3_2 lhead=cluster_3_1 style=invis]


  }
}

There are 3 main parts to the cluster, as shown in the cluster diagram above:

  • Administrative nodes

    Master - Cluster management node, for privileged users only.

    Login - Your primary gateway to the cluster, used to prepare and submit your jobs.

  • Service nodes

    NFS - File servers for home and research storage.

    Lustre - Specialised distributed file system for I/O intensive jobs.

    Cluster - Deployment, monitoring and other services for the cluster.

    Grid - GridPP infrastructure servers.

  • Worker nodes

    Compute - Total figures for current hardware in operation are:

    Logical Cores       3140
    Physical Cores      3004
    Total RAM (GB)      11941.3
    Total Lustre (TB)   270.8
    Total NFS (TB)      60

    All CPUs are 64-bit (x86_64/AMD64) architecture, in a mixture of Intel and AMD nodes ranging from 8 up to 64 cores per node.

    The highest memory node in the cluster has 512GB of RAM across 64 cores.

    Tip

    It is safe to assume you can address 2GB of memory per slot when profiling jobs on the cluster; a short worked example follows the tips below. See Running Jobs for more information about slots and how to allocate resources for your job.

    Tip

    When running batch jobs, the scheduler distributes your work to the compute nodes; you do not interact with those nodes directly. Alternatively, the batch system allows you to work interactively, i.e. perform tasks in a shell environment on a compute node. You must avoid running jobs directly on the login nodes.
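
    As a worked example of the 2GB-per-slot rule above, the sketch below estimates how many slots to request for a given memory footprint. This is only an illustration: the helper name slots_needed is hypothetical, and the actual way to request slots and memory is described in Running Jobs.

        # Minimal sketch, assuming the ~2GB of addressable memory per
        # slot quoted in the tip above; slots_needed is a hypothetical
        # helper for illustration, not part of any cluster tooling.
        import math

        def slots_needed(job_memory_gb, memory_per_slot_gb=2.0):
            # Round up so the request always covers the job's footprint.
            return max(1, int(math.ceil(job_memory_gb / memory_per_slot_gb)))

        print(slots_needed(7.5))   # a ~7.5GB job needs 4 slots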

GPU - There are currently 4 GPU nodes on the cluster:

Node      Hostgroup         GPU Model              CUDA Cores      Memory
node150   @gpu_nodes_k20m   2x Nvidia Tesla K20m   2496 per card   4GB per card
node151   @gpu_nodes_k20m   2x Nvidia Tesla K20m   2496 per card   4GB per card
node152   @gpu_nodes_k20m   2x Nvidia Tesla K20m   2496 per card   4GB per card
node153   @gpu_nodes_k40m   4x Nvidia Tesla K40m   2880 per card   12GB per card
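
Once the scheduler has placed you on one of these GPU nodes, you can confirm which cards are visible to your job. The sketch below simply shells out to nvidia-smi, which ships with the Nvidia driver; it assumes nvidia-smi is on the node's PATH and is an illustrative check rather than part of the cluster's tooling.

    import subprocess

    # List each visible GPU's model name and total memory.
    # Assumes nvidia-smi (installed with the Nvidia driver) is on the
    # PATH of the GPU node; illustrative only.
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"]
    )
    for line in output.decode().strip().splitlines():
        print(line)   # e.g. "Tesla K20m, <total memory> MiB"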

Grid - Nodes dedicated to running jobs for GridPP infrastructure.

Operating System

Note

Currently all administrative and compute nodes run Scientific Linux (SL) 6.5 (Carbon). Service nodes run a mix of SL and CentOS 6.5.
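
If you need to confirm which distribution and release a particular node is running (for example, when choosing prebuilt binaries), both Scientific Linux and CentOS record it in /etc/redhat-release. The snippet below just prints that file; it is a convenience sketch, not part of the cluster tooling.

    # Print the distribution and release of the node you are logged into.
    # /etc/redhat-release exists on both Scientific Linux and CentOS,
    # as both are Red Hat derivatives.
    with open("/etc/redhat-release") as release_file:
        print(release_file.read().strip())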