Migrating from obob_condor or hnc_condor
If you previously used the HTCondor cluster at the University of Salzburg, either via the obob_condor package or via hnc_condor from obob_ownft, migrating to the new cluster is quite easy for you. This document provides an overview of what you need to do and what is different.
General
Although the underlying scheduling system changed from HTCondor to Slurm, the general principles still apply: you define your jobs by writing a class that derives from a specific Job class, specify the resources they need and submit everything to a queue, from which the jobs get executed when it is your turn.
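As a minimal sketch of that workflow in Python (the job class and its argument are made up for illustration; the JobCluster parameters are the ones discussed in the sections below):

```python
from plus_slurm import Job, JobCluster


class ExampleJob(Job):
    # Everything in run() is executed on the cluster, not on the login node.
    def run(self, subject_id):
        print(f'Processing subject {subject_id}...')


# Specify the resources every job needs...
job_cluster = JobCluster(required_ram='2G', request_cpus=2, request_time=60)

# ...add one job per subject...
for subject_id in (1, 2, 3):
    job_cluster.add_job(ExampleJob, subject_id=subject_id)

# ...and submit everything to the queue.
job_cluster.submit()
```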
However, there are some notable changes:
No more bombers
Neither the interim solution nor the new cluster has personalized virtual machines, a.k.a. bombers, anymore. They are replaced by two possibilities, depending on your use case:

- Login Nodes: These machines are shared with all cluster users. They must not be used for real computations! Submitting jobs and some light development are fine, though. If you use too much RAM or CPU, you will get kicked off the node!
- Interactive Jobs: These are not available right now but will be the replacement for the bombers. You basically run a job that you can connect to via xpra, ssh or Visual Studio Code. This is for real computation that needs to be done interactively or for development that needs more RAM and/or CPU.
Warning
Until the interactive jobs are available, please continue to use your bombers for anything interactive.
New command line commands
All the condor_* commands are, of course, not going to work anymore. Here are the most important counterparts in the Slurm world.
| What it does | HTCondor | Slurm |
|---|---|---|
| Show the jobs in the queue | `condor_q` | `squeue` |
| Get information about the nodes | `condor_status` | `sinfo` |
| Remove a queued or running job | `condor_rm` | `scancel` |
There are more commands, of course, and all of them have lots of parameters and switches. Take a look at the Slurm website to learn more.
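A few typical invocations (these are standard Slurm options; the job id is hypothetical):

```
squeue -u $USER   # show your own jobs in the queue
sinfo             # show partitions and node states
scancel 12345     # cancel the job with id 12345
```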
Memory
If your Job wanted to use more RAM than specified, the behavior of HTCondor was to kill the job, update the memory requirements and run the job again. Slurm is just going to kill your Job.
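So be generous when requesting memory. A sketch, assuming plus_slurm keeps obob_condor's required_ram parameter (the 4G value is only an example):

```python
from plus_slurm import JobCluster

# Jobs exceeding this limit are killed and NOT restarted with more RAM,
# as HTCondor used to do, so request the realistic maximum up front.
job_cluster = JobCluster(required_ram='4G', request_time=60)
```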
Time
Slurm requires you to specify how much time your jobs need to run.
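In plus_slurm, this is what the request_time parameter is for (a sketch; I am assuming here that the value is given in minutes, so 120 means two hours):

```python
from plus_slurm import JobCluster

# request_time is mandatory: jobs running longer than this get killed.
# Assuming minutes as the unit, this allows up to two hours per job.
job_cluster = JobCluster(required_ram='2G', request_time=120)
```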
If you use Python
Migrating from obob_condor is straightforward, because plus_slurm is a drop-in replacement (a before/after sketch follows the list):
1. In your `requirements.txt` or `environment.yml`, replace the obob_condor dependency with `plus_slurm`.
2. Fix all your imports: `from obob_condor import Job, JobCluster` becomes `from plus_slurm import Job, JobCluster`.
3. Make sure not to use the `adjust_mem` and `owner` parameters when instantiating `plus_slurm.JobCluster`.
4. Make sure to set the `request_time` parameter when instantiating `plus_slurm.JobCluster`.
5. Remember that you need to be on the Login Node to submit your jobs.
And that’s it!
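Put together, the lines that actually change look like this (a sketch; everything else in your script stays as it was with obob_condor):

```python
# Before (obob_condor):
#   from obob_condor import Job, JobCluster
#   job_cluster = JobCluster(required_ram='2G', adjust_mem=True, owner='me')

# After (plus_slurm): same API, minus adjust_mem/owner, plus request_time.
from plus_slurm import Job, JobCluster

job_cluster = JobCluster(required_ram='2G', request_time=60)
```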
If you use Matlab
Code migration
Migrating your code from hnc_condor to plus_slurm is also really straightforward.
1. When initializing obob_ownft, make sure to request plus_slurm as a package: `cfg.package.plus_slurm = true;`
2. Replace all `obob_condor_*` calls with `obob_slurm_*`.
3. Make sure not to use `cfg.adjust_mem` and `cfg.owner`.
4. Make sure to set `cfg.request_time`.
5. Remember that you need to be on the Login Node to submit your jobs.
And that’s it!
Submitting from an interactive Matlab
Unfortunately, there is no X2Go available on the cluster login nodes. But you can use xpra as an alternative.
Make sure to install it on your local machine. Then open a terminal and enter:
```
xpra start --ssh="ssh" ssh://bXXXXXX@login5-gui.acsc.sbg.ac.at --start-child="/mnt/obob/bin/Matlab/R2020a/bin/matlab -desktop" --exit-with-children
```
Wait a bit and you will get a full-blown Matlab window that runs on the login node…
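If your connection drops, the xpra session keeps running on the login node. You should be able to reattach with something like the following (an assumption on my part: xpra attach without an explicit display picks your session when only one is running):

```
xpra attach ssh://bXXXXXX@login5-gui.acsc.sbg.ac.at
```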