Interactive Jobs

A common need is to work on the cluster interactively. For example, you may want to use software with an interactive element, such as Matlab, Mathematica, or ROOT, which has its own interactive interpreter and graphical layer. Or perhaps you want to try something out without writing a batch submission script, or run a long compile and watch the output. There are many uses for this kind of work, and converting the job into batch form is often too onerous to be worthwhile.

The solution is to use the interactive jobs feature of the cluster.

Important

To accomplish the above, people commonly run their work on the login nodes themselves, which becomes a problem when many people do it. The login nodes are not designed for this; they are intended only as a gateway where people connect, organise their files and submit work to the cluster. Doing work directly on the login nodes harms other users, and we may kill your processes if we notice heavy load, so please read on for a way to work interactively that is better for the overall health of the system.

Starting an Interactive Session

Note

If you know you need to open a graphical display, first open an SSH connection to one of the login nodes with the additional flag -X, which enables the X11 forwarding that graphical displays need,

ssh -X username@feynman.hpc.susx.ac.uk
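As a quick sanity check after logging in, you can look at the DISPLAY environment variable, which ssh sets on the remote side when X11 forwarding is active. The sketch below wraps that check in a small function; check_x11 is a hypothetical helper name used for illustration and is not something provided by the cluster.

```shell
# Hypothetical helper, not part of the cluster setup: reports whether an
# X11 display is available in the current shell.
check_x11() {
    if [ -n "${DISPLAY:-}" ]; then
        echo "X11 forwarding looks active (DISPLAY=$DISPLAY)"
    else
        echo "DISPLAY is not set; reconnect with ssh -X" >&2
        return 1
    fi
}
```

Define the function in your shell on the login node and run check_x11 before starting graphical software.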

Once you have connected to one of the login gateways, you can open an interactive session on the cluster by doing,

module load sge
qlogin -q interactive.q

This will schedule you a single slot on the cluster, just like a batch job, but it will then transfer you to one of the nodes and give you an interactive shell there, from which you can do your work.

It looks like this,

[mb325@feynman]:~/ $ qlogin -q interactive.q
Your job 3457893 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 3457893 has been successfully scheduled.
Establishing /cm/shared/apps/sge/current/cm/qlogin_wrapper session to
host node110.cm.cluster ...
Last login: Fri Apr 11 13:10:24 2014 from feynman.cm.cluster
[mb325@node110]:~/ $

In the above, I was assigned a job number and then given a shell on node110. When you exit the session (with CTRL-D, logout or exit), you will be disconnected from the node and any processes you had running there will be terminated.

[mb325@node110]:~/ $ logout
Connection to node110.cm.cluster closed.
/cm/shared/apps/sge/current/cm/qlogin_wrapper exited with exit code 0
[mb325@feynman]:~/ $

Warning

Highly important. It’s possible to lose the connection to your qlogin session while it keeps running, leaving you unable to access it. For example, say you use your laptop to connect to the cluster, open an interactive session and start a long compile. If your laptop then loses network connectivity, your ssh connection to the cluster is dropped, but your qlogin session keeps running in a state known as orphaned.

Imagine you had done hours of work inside a session only to not be able to reach it again!

To mitigate this you should always run your interactive sessions inside screen. The section Running qlogin inside of Screen below explains how to do this.
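If a session has already been orphaned, one way to find it is to look for jobs named QLOGIN (the name shown in the submission message above) in your qstat listing. The following is a minimal sketch, assuming the standard qstat column layout in which the job ID is the first column and the job name the third; list_qlogin_jobs is a hypothetical helper, not part of SGE.

```shell
# Hypothetical helper: print the IDs of jobs named QLOGIN.
# Reads qstat output on standard input; assumes column 1 is the job ID
# and column 3 is the job name.
list_qlogin_jobs() {
    awk '$3 == "QLOGIN" { print $1 }'
}

# Usage on the cluster (assumes the sge module is loaded):
#   qstat -u "$USER" | list_qlogin_jobs
#   qdel <job-id>    # ends that session and frees its slot
```

Removing an orphaned session with qdel terminates anything still running inside it, so this only cleans up; it does not recover the work.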

Interactive Job Guidelines

Important

Interactive jobs use the cluster’s batch system, and as such you should use the resource carefully just like you would do with a batch job. They are also subject to the same limits of the queue, such as time limits for the length of a job - when the time runs out your interactive session will be closed.

This facility is not really intended for running large scale, high memory, or very long running jobs, particularly parallel or multi-threaded ones. You will be using compute node resources just like any other job, and could therefore disrupt other jobs if you use more resources than a single slot’s allocation.

Please see our job guidelines if you think you might run up against this.

Running qlogin inside of Screen

If you are not familiar with screen, we have a short introduction to the tool in Persistent sessions over SSH.

After connecting to a login node, run,

[mb325@feynman]:~/ $ screen
...
# At this point your screen will be cleared and you will be given a
# new shell session.
...
[mb325@feynman]:~/ $ qlogin -q interactive.q

Things will proceed as discussed above, but the difference now is that, by invoking screen first, your session runs under the screen process.

You can detach from the screen session by typing CTRL-A D at any time, and you’ll see your original session on the login node from before you typed screen,

[mb325@feynman]:~/ $ screen
[detached]
[mb325@feynman]:~/ $

You can then reattach to your screen session above by typing screen -r, and you will be back where you were. So if you lose your network connection after working through screen, you can recover your session by doing,

$ ssh mb325@feynman.hpc.susx.ac.uk

[mb325@feynman]:~/ $ screen -r

# And then I'll see the below, which was my terminal output before I
# got disconnected.

[mb325@feynman]:~/ $ qlogin -q interactive.q
Your job 3458090 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 3458090 has been successfully scheduled.
Establishing /cm/shared/apps/sge/current/cm/qlogin_wrapper session to
host node115.cm.cluster ...
Last login: Wed Jul 16 12:09:37 2014 from feynman.cm.cluster
[mb325@node115]:~/ $
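Because it is easy to forget to start screen first, a small guard can help. GNU screen sets the STY environment variable inside its sessions, so the sketch below refuses to start qlogin unless STY is set. The names in_screen and safe_qlogin are hypothetical helpers for illustration, not something the cluster provides.

```shell
# Hypothetical helpers, not part of the cluster setup.

# Succeeds only when the current shell is running inside GNU screen,
# which sets the STY environment variable for its sessions.
in_screen() {
    [ -n "${STY:-}" ]
}

# Starts an interactive session only from inside screen; any extra
# arguments are passed through to qlogin.
safe_qlogin() {
    if in_screen; then
        qlogin -q interactive.q "$@"
    else
        echo "Not inside screen; run 'screen' first." >&2
        return 1
    fi
}
```

You could place these functions in your shell startup file on the login node and run safe_qlogin in place of the bare qlogin command.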

Running multi-core or high-memory workloads interactively

If you wish to run multi-threaded or multi-process work interactively, you can request that more slots be assigned to your interactive session, just as when doing parallel work in batch (see Parallel Jobs),

qlogin -q interactive.q -pe openmp 4

where 4 here is the number of cores/slots you want for your job.

Similarly, if you only need a single core but know that you need more memory than a single slot provides (at most 4GB), you should use the same technique to request extra slots, so that you aren’t taking memory that should belong to other jobs on the system.
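To work out how many slots to ask for, divide the memory you need by the per-slot amount and round up. The sketch below does this arithmetic, assuming the 4GB per-slot figure quoted above; slots_for_mem is a hypothetical helper, not an SGE command.

```shell
# Hypothetical helper: number of slots needed to cover a memory
# requirement, assuming each slot provides at most 4 GB.
slots_for_mem() {
    mem_gb=$1        # memory needed, in whole GB
    per_slot_gb=4    # per-slot memory, taken from the text above
    # Integer ceiling division: round up to a whole number of slots.
    echo $(( (mem_gb + per_slot_gb - 1) / per_slot_gb ))
}

# Example: a job needing 10 GB would request 3 slots.
echo "qlogin -q interactive.q -pe openmp $(slots_for_mem 10)"
```

The last line only prints the command rather than running it, so you can check the slot count before submitting.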