There are two storage technologies utilised on the cluster:
- NFS: Relatively slow, persistent file system - Nightly incremental backups
- Lustre: Large, fast distributed scratch file system - No backups
Home & Research Filesystems¶
Astro & TPP users have a standard quota of 5 GB on their home directory due to limited resources on the file server. Users can request an increase to suit their requirements.
NFS storage is used to provide the /home and /research filesystems. These are designed to be used as filers for persistent, valuable data, and not for direct cluster job I/O.
All user home directories (/home) reside on NFS shared storage. Each user has a home directory on only one of the 3 NFS servers. Home directories are not visible to other users and as such are intended for your own personal storage, not for collaborative storage (for that, see Research Storage Areas below).
Currently only PACT users on darshan have home directory quotas of 5GB (which can be increased if necessary); this is something your department’s sysadmin will introduce you to.
Other users do not have direct quotas, but your usage will be monitored and you may be asked to reduce your usage should it become a problem for other users.
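Since most users have no hard quota, it is worth checking your own usage periodically before it becomes a problem. A simple way to do this with standard tools (no cluster-specific commands assumed):

```shell
# Summarise total usage of your home directory (-s summarise, -h human-readable)
du -sh "$HOME"

# List the largest top-level subdirectories first, to see where the space goes
du -h --max-depth=1 "$HOME" 2>/dev/null | sort -rh | head
```

The same commands work on your /research areas; just substitute the path.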
- NOT to be used for cluster jobs I/O
- Nightly backups to tape media by ITS
- Globally available across the cluster
The first point is critical. Every user talks to one of the 3 NFS servers exclusively, and each server consists of one large RAID6 device, meaning that I/O bandwidth and latency to this device are shared among all users of that server. If a user or job starts to do a large amount of I/O to their /home or /research areas, they will effectively monopolise I/O to that server, with a very large and noticeable impact on all other users of it. For example, other users’ shell sessions will become very laggy.
You will be notified if you start doing this, and your jobs will be killed. Any concerns about doing this correctly should be directed to the team.
Research Storage Areas¶
Everything under /research is a research storage area; these are also mounted from the NFS servers.
The purpose of this space is similar to that of /home (persistent data), but it is intended for collaborative use, such as a shared group code-base or data set.
Write access to these areas is controlled per department or research group, however, so if you or your research team does not have an area here, do request one from us.
Similar to the /home filesystem, there are presently no quotas applied to the research filesystems, apart from for PACT users.
Lustre Scratch Filesystem¶
- Lustre is mounted under /lustre/scratch and should be the target of all job or heavy I/O when using the cluster.
- Unlike the home and research filesystems, data on Lustre has no guarantee of persistence – it will live as long as the filesystem does, but there are no backups, as the filesystem is too large to realistically back up.
- We do not currently clean up old files automatically, instead relying on our users to sensibly clean up after themselves on lustre. However we reserve the right to make this request of people should the need arise and to ensure that sufficient scratch space is available for other users.
Lustre is an open-source, parallel file system designed for the needs of HPC users.
There is presently (Sept 2014) 270 TB of Lustre storage (see Hardware for up to date figures), with an additional ~200 TB to be installed in the near future.
There are presently no quotas on Lustre usage; however, Lustre has a built-in accounting module that you can use to find out your usage of Lustre from the command line:
$ lfs quota -u your_username /lustre/scratch
This will give you your usage in kilobytes.
As mentioned above, Lustre is provided as scratch space, so we ask you all to not leave files on Lustre indefinitely as this wastes storage space for other users on the cluster. All important data should be transferred over to the Research Storage Areas at the earliest opportunity.
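For example, once a job has finished you can copy its results off Lustre with rsync. The paths here are hypothetical (substitute your own scratch and research areas); rsync only transfers files that have changed, so it is cheap and safe to re-run:

```shell
# Hypothetical paths: replace "myproject" and "mygroup" with your own
SRC=/lustre/scratch/$USER/myproject
DEST=/research/mygroup/myproject

# -a preserves permissions and timestamps; -v lists files as they copy.
# The guard skips the copy if the job has not produced any output yet.
if [ -d "$SRC" ]; then
    rsync -av "$SRC/" "$DEST/"
fi
```

Once you have verified the copy, you can delete the originals from Lustre.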
- Use for all cluster job I/O
- No guarantee of data persistence – NO backups should the filesystem fail.
- Fast, parallel filesystem designed for many users and multi-threaded/multi-process workloads.
Lustre Speed – Throughput vs Latency
Lustre is a parallel filesystem, designed primarily so that I/O from multiple jobs does not block each other.
For speed, however, Lustre is optimised for high-throughput I/O rather than low-latency I/O. The reason lies in its architecture: an I/O request from a process first queries the Lustre metadata server for the list of object servers and the locations where its data is stored, and then contacts those object servers for the data itself.
Thus Lustre may not be the fastest for reading and writing lots of small files.
Lustre shines, however, if you want to stream very large datasets to processes: you can take advantage of Lustre’s striping feature to distribute your file across multiple object servers and then stream data to/from all of these servers concurrently, achieving much higher I/O bandwidth. You can read and write at a speed proportional to the number of object servers (or, technically, the number of object storage targets, which counts each block device attached to an object server).
For low-latency I/O, see the section on the Local Scratch Partition (particularly if you are not running MPI jobs, or do not need globally available storage present on all the nodes).
Lustre has the ability to stripe data across multiple object storage targets, which can greatly increase the I/O bandwidth obtainable.
Chapter 25 of the Lustre Manual (direct link to PDF) covers this topic in much greater detail, but I will try to summarise a couple of key points below.
Advantages of Striping¶
- Higher bandwidth - If your application requires high-bandwidth access to a single file, then without striping you are limited to the I/O bandwidth obtainable from a single object storage target (in reality a single RAID6 device) and by the capacity of a single network link (QDR InfiniBand, ~40 Gb/s). With striping you can stripe this file across a large number of object storage targets and obtain the aggregate bandwidth of all of those disks.
- Massive files - If your file is too large for a single object storage target (the largest of ours is ~27 TB and the smallest about ~9 TB) then you will need to stripe it across targets.
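To see how many object storage targets are available and how full each one is (useful when deciding how widely to stripe), you can query the filesystem with `lfs df`, a standard Lustre command; the exact output format depends on the Lustre version installed:

```shell
# Show per-OST capacity and usage for the scratch filesystem.
# The number of OST rows is the maximum stripe count available.
lfs df -h /lustre/scratch
```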
Disadvantages of Striping¶
- Increased latency - striping across more object storage targets means more network operations to more hosts on the network. Striping small files is therefore slower and less efficient than not striping them.
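Conversely, if a directory will hold many small files, you can avoid this overhead by setting a default stripe count of 1 on the directory itself; files created inside it then inherit that setting. The path below is hypothetical, and the example is a sketch using the same `lfs setstripe` options shown in the next section:

```shell
# Hypothetical directory; substitute your own area.
# New files created under it will default to a single stripe,
# avoiding extra network round-trips for small-file I/O.
lfs setstripe --count 1 /lustre/scratch/$USER/small-files
```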
Setting and Checking Striping Information¶
The manual linked above is the definitive source for doing this, so please have a quick look there. Below are some simple examples.
To create a new file with a specific striping pattern you can do:
# Create a new file with a stripe size of 4MB and striped across all
# OSTs. The --count -1 means stripe over all OSTs.
$ lfs setstripe --size 4M --count -1 /lustre/scratch/test-stripe-file

# Write 1GB to this file
$ dd if=/dev/zero of=/lustre/scratch/test-stripe-file bs=10M count=100
100+0 records in
100+0 records out
1048576000 bytes (1.0 GB) copied, 2.38365 s, 440 MB/s

# Check the striping information. We can see that the file striped across 22
# OSTs in this case
$ lfs getstripe /lustre/scratch/test-stripe-file
/lustre/scratch/test-stripe-file
lmm_stripe_count:   22
lmm_stripe_size:    4194304
lmm_stripe_offset:  12
    obdidx       objid        objid       group
        12     47674178    0x2d77342         0
         8     45824501    0x2bb39f5         0
        13     65767597    0x3eb88ad         0
        17     21272185    0x1449679         0
        14     17722362    0x10e6bfa         0
        15     20430814    0x137bfde         0
        21     20588624    0x13a2850         0
        18     21738963    0x14bb5d3         0
        16     18832892    0x11f5dfc         0
        11     50902118    0x308b466         0
         5     43303566    0x294c28e         0
        10     58960212    0x383a954         0
         3     44947841    0x2add981         0
        20     21799507    0x14ca253         0
         9     53893708    0x3365a4c         0
         4     43643836    0x299f3bc         0
        19     21118926    0x1423fce         0
         7     42855342    0x28debae         0
         1     47550982    0x2d59206         0
         2     39356676    0x2588904         0
         0     40382141    0x2682ebd         0
         6     43836389    0x29ce3e5         0
Local Scratch Partition¶
There is a small amount of local scratch space on all compute nodes in the cluster that can be used by jobs and may offer improved I/O latency when doing lots of small I/O (reading and writing many small files, for instance).
This space varies in size from node to node, but is usually >220 GB.
The cluster makes available an environment variable, $TMPDIR, that you can use from within your job scripts to address this area. You should always use this variable, but for reference the current path is /local/scratch.
It is also nearly always a single physical disk (no RAID) with a simple non-parallel filesystem (usually ext4), so the latency benefits above may disappear on a busy node if many people start hitting the space.
However you are welcome to experiment with this space if you find your I/O workload could benefit from this.
This space gets wiped on a node reboot.
If you do this, remember that this scratch space is local to the node you are running on, so you will have to manage staging data in and out of this area yourself.
You can manage the file staging from within your job script for example:
#!/bin/bash
...
# copy the data from the exported home directory to the temporary directory
cp ~/files/largedataset.csv $TMPDIR/largedataset.csv

# Run your job
...

# copy results back to user home directory or research filesystem
cp $TMPDIR/results ~/results
Data Management Strategy¶
As the cluster is a shared resource, it is vital for all research groups to have an overview of what and how their members are using the system, particularly when it comes to storage.
Storage on the cluster has been bought by various departments according to their needs and then pooled so that everyone can benefit from the increase in resources. However as this shared space becomes tighter there will be greater pressure for groups to justify their storage usage.
If your research group requires a lot of storage space, it is important to talk to the HPC admins early on so we can come up with a strategy and, if required, collect quotes for any new hardware that may need to be bought.
You need to have an idea of your temporary storage needs, your persistent storage needs, and for how long you will need the data. This information will influence what we recommend as the best storage medium for your use case.
For example, if you have very large data sets to be archived, then we could budget and arrange a tape-storage archival pipeline. Or if you need more frequent access to research data, then we can look at buying more NFS servers for your needs.