LSF in ALEPH
The ALEPH offline batch environment has been migrated
from the old NQS system to the LSF (Load Sharing Facility)
system. This page provides information about LSF for ALEPH
users.
A good introduction to LSF at CERN, including an overview
of the most useful user commands, is available at http://wwwinfo.cern.ch/pdp/lsf/LSF-at-CERN.html.
There is also a collection of user and administrator guides
at http://wwwinfo.cern.ch/pdp/lsf/index.html.
The table below describes the LSF queues that have been set up on aloha and shift50. The maximum job length in each queue is defined in terms of
"Normalised CERN Units" (NCU). The normalisation factor is 1.3 for shift50;
for aloha nodes it varies between 0.2 and 0.4. To get the maximum CPU time
allowed in each queue, divide the NCU limit by the normalisation factor. The
result for shift50 is given in the table below.
There is also a maximum real elapsed time for each queue on shift50, beyond which the job will be stopped even if it has used little CPU time. This can happen,
for example, if your job is staging in a lot of data and doing little analysis.
| Queue name | Length (NCU) | CPU time limit (shift50) | Absolute time limit (shift50) | Equivalent NQS queues |
|---|---|---|---|---|
| xu_8nm | 8 NCU minutes | 369 seconds | 185 minutes | A_xux |
| xu_1nh | 1 NCU hour | 2769 seconds | 923 minutes | B_xus + short C_xum jobs |
| xu_8nh | 8 NCU hours | 22153 seconds | 61.5 hours | long C_Xum + short D_xul jobs |
| xu_1nd | 1 NCU day | 66461 seconds | 92.3 hours | long D_xul jobs |
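The CPU limits in the table follow from the rule stated above (NCU limit divided by the normalisation factor). The following is a minimal sketch of that arithmetic, not part of any LSF tool: the queue lengths and the factor 1.3 come from this page, while the function names and the `shortest_queue_for` helper are hypothetical and introduced only for illustration.

```python
# Queue lengths in NCU seconds, taken from the table on this page.
QUEUE_NCU_SECONDS = {
    "xu_8nm": 8 * 60,
    "xu_1nh": 60 * 60,
    "xu_8nh": 8 * 60 * 60,
    "xu_1nd": 24 * 60 * 60,
}

SHIFT50_FACTOR = 1.3  # normalisation factor quoted for shift50


def cpu_limit_seconds(queue, factor):
    """CPU time limit in seconds: NCU limit divided by the normalisation factor."""
    return QUEUE_NCU_SECONDS[queue] / factor


def shortest_queue_for(cpu_seconds, factor):
    """Hypothetical helper: shortest queue whose CPU limit covers the job, or None."""
    for queue in sorted(QUEUE_NCU_SECONDS, key=QUEUE_NCU_SECONDS.get):
        if cpu_seconds <= cpu_limit_seconds(queue, factor):
            return queue
    return None


if __name__ == "__main__":
    # Print the shift50 CPU limit of each queue, truncated to whole seconds.
    for name in QUEUE_NCU_SECONDS:
        print(name, int(cpu_limit_seconds(name, SHIFT50_FACTOR)), "seconds")
```

The same function applies to aloha nodes by passing a factor between 0.2 and 0.4; with a factor of 0.2, for instance, 8 NCU minutes correspond to 2400 CPU seconds.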
LSF is installed on the majority of aloha nodes and on shift50.
The configuration can always be tuned, taking
into account feedback from users. If you have any comments or problems with
the current setup, or if you notice any strange behaviour,
please do not hesitate to
tell us. Note
that LSF is also set up to run on nodes usually used for interactive work.
LSF will automatically suspend batch jobs running on an aloha node
when it detects any interactive load on that node.
Obviously this is not instantaneous, so the initial interactive response may be
degraded, but this degradation should only last a few seconds.
Statistics on LSF queues are collected every week. See the LSF performance
graphs for aloha and shift50.
Using LSF
In order to access the LSF commands you should do the following:
- Log on to any aloha node with telnet. DO NOT USE rsh or rlogin.
- Add /usr/local/lsf/bin to your PATH: set path=( . $path /usr/local/lsf/bin)
- To test whether you have valid tokens, submit this test job by typing the following command: bsub < ~closier/public/datelsf.job
- The output "Failed in an LSF library call: External authentication failed" probably means that you did not use telnet to log in to the aloha machine. To solve the problem, type /usr/local/bin/krb/kauth to get the tokens.
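The checks in the steps above can be sketched as a small script. This is only an illustration under stated assumptions: the directory, the error message and the kauth path are those quoted on this page, while `lsf_in_path` and `diagnose` are hypothetical helper names, and the output of the test job is assumed to have been captured as a string.

```python
LSF_BIN = "/usr/local/lsf/bin"  # directory named in the setup steps
AUTH_ERROR = "External authentication failed"  # message quoted on this page


def lsf_in_path(path_value):
    """True if the LSF bin directory is one of the colon-separated PATH entries."""
    return LSF_BIN in path_value.split(":")


def diagnose(bsub_output):
    """Hypothetical helper: turn the output of the test submission into advice."""
    if AUTH_ERROR in bsub_output:
        # Typical cause given on this page: the login was not made with telnet.
        return "no valid tokens: run /usr/local/bin/krb/kauth"
    return "ok"
```

A call such as `lsf_in_path(os.environ["PATH"])` (after `import os`) would check the current session.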
You can then use any of the LSF commands to submit or monitor
jobs and queues. You can create and submit an ALPHA job using
alpharun, which has been modified to submit the job to LSF.
Once you have created a job you can submit it in the following way: bsub < myalpha.job
The .job file must be executable: chmod +x myalpha.job
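For users who drive submissions from a script, the two commands above can be sketched as follows. This is a minimal sketch, not the way alpharun does it: `make_executable` and `submit` are hypothetical names, the permission change only approximates `chmod +x` (it ignores the umask), and `bsub` is assumed to be found on the PATH.

```python
import os
import stat
import subprocess


def make_executable(job_path):
    """Roughly `chmod +x`: add execute permission for user, group and others."""
    mode = os.stat(job_path).st_mode
    os.chmod(job_path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


def submit(job_path, bsub="bsub"):
    """Feed the job file to bsub on standard input, like `bsub < myalpha.job`."""
    make_executable(job_path)
    with open(job_path, "rb") as job:
        # stdout and stderr are captured so the caller can inspect bsub's reply.
        return subprocess.run([bsub], stdin=job, capture_output=True, text=True)
```

The `bsub` parameter exists only so that the command can be replaced when trying the sketch on a machine without LSF.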
Hints
- Do not use the -e or -o options when submitting your job
from a directory on a locally mounted disk (because LSF can only return your
output by mail in this case).
- LSF will normally send the job to the first available node.
If you want to force the job to run on shift50, submit it with the option
-R shift.
You should do this if you are making extensive use of staged data.
This is the case for most ALPHA jobs and is therefore the default in alpharun.
If you want to force the job to run on aloha, submit it with the option -R aleph. You should do this if
you are using resources local to aloha (e.g. if your data or
program resides on the aloha scratch disk).
Please note that jobs submitted to the shift50 LSF queues from aloha take
precedence over jobs submitted to the shift50 LSF queues from shift50.
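The node-selection hints above can be summarised in a short sketch that builds the bsub argument list. The resource names shift and aleph are those given on this page; the `-q` option is the standard LSF way to name a queue and is not shown elsewhere on this page, and `bsub_command` is a hypothetical helper introduced for illustration.

```python
# Resource string to pass to -R for each target, as given in the hints.
RESOURCE_FOR_TARGET = {"shift50": "shift", "aloha": "aleph"}


def bsub_command(queue, target=None):
    """Build a bsub argument list; target is "shift50", "aloha" or None.

    With target=None no -R option is added, so LSF sends the job to the
    first available node.
    """
    cmd = ["bsub", "-q", queue]
    if target is not None:
        cmd += ["-R", RESOURCE_FOR_TARGET[target]]
    return cmd
```

For example, `bsub_command("xu_1nh", "shift50")` gives the arguments for a one-hour job that makes extensive use of staged data.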
Please give your feedback on LSF to Marco Cattaneo or Joel Closier.