Using the SLAC Batch Farm
If you need more cycles than you readily have at hand, you're welcome to use the SLAC Linux batch farm, whose resources let you run many jobs concurrently.
The batch system is easy to use and offers substantially more CPU horsepower than the handful of interactive servers can provide. However, there are some important issues regarding use of the batch farm, described below.
Test Drive. You can submit a simple batch job, consisting only of the UNIX command "hostname", which prints out the name of the host computer on which the job runs.
- Login to your SLAC Public account.
- At the prompt, enter: bsub hostname
This will submit a one-command batch job to the 'short' queue, which gives you about 2 minutes of CPU time (on a fell-class machine). A message listing your job number and the queue to which the job was submitted will be displayed:
Job <206422> is submitted to default queue <short>.
chuckp@noric06 $
- Check your email for a message with the name of your host machine; it should be similar to this example: email response.
Normally, a batch job is more complex and you will likely submit a shell or python script to perform the desired computation. The syntax for a more typical batch submission might look like this:
$ bsub -q long -R rhel50 myScript.py
This command submits a job to the 'long' queue (about 2 hours CPU time on a fell-class machine) and further specifies that the job be run on a machine running RedHat Enterprise Linux version 5. A log of the batch job is returned to you as email, and any files created will be directed per your script.
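Beyond `-q` and `-R`, it is often convenient to name the job and capture its output in log files rather than email. The sketch below composes such a command and echoes it for inspection before submission; `-J`, `-o`, and `-e` are standard LSF options (with `%J` replaced by the job id), while the queue, resource string, log directory, and script name are placeholders to adjust for your own setup.

```shell
# Build a fuller bsub command line; echo it first to sanity-check it.
LOGDIR=$HOME/batchlogs    # hypothetical log directory
mkdir -p "$LOGDIR"
CMD="bsub -q long -R rhel50 -J myAnalysis -o $LOGDIR/myAnalysis.%J.log -e $LOGDIR/myAnalysis.%J.err myScript.py"
echo "$CMD"               # inspect, then run it for real when it looks right
```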
Shared Resources
Once you log in to AFS, your SLAC Public environment enables you to share a great many resources. For example, when you submit a job to the batch farm, the shared resources you are using include:
- Interactive machines (e.g., noric, rhel5-32, etc.)
Note: For a complete list, see Public Machines at SLAC.
- Batch farm (non-interactive)
- NFS disks (Fermi group space)
- AFS disks (home directories and some Fermi group space)
- Xroot disks (Fermi storage for bulk data)
- Network facilities
Monitoring Resources
Remember, there are many users. When sharing these resources, it is important to avoid overloading them, thereby degrading response times and causing jobs to fail for everybody. For example, if you are simultaneously running hundreds of jobs that do a lot of I/O, the file server can struggle to keep up with the requests.
When running jobs on the batch farm, you can — and should — monitor the activity and state of these resources to ensure that you haven't inadvertently overloaded them. Ganglia monitors many of them, and serves as an early alert to a problem. Once you find the relevant monitoring page(s), select an appropriate time period for the plots, and don't forget to refresh the page with new data periodically.
Disk access is the single most likely bottleneck in a swarm of batch jobs. To avoid having colleagues beating down your door because your jobs are causing problems for them, take the time to do a bit of homework first, be prepared to *carefully* monitor your project as it ramps up, and be equally prepared to kill those batch jobs if a problem materializes.
First, assess which disks you will be reading and writing from; examples include:
- Your $HOME directory in AFS (e.g., "dot" files, pfiles).
- Other AFS directories you own where code might be kept.
- Fermi NFS directories (e.g., /nfs/farm/g/glast/uXX and /afs/slac/g/glast/users).
- Xroot directories (e.g., results of a DataCatalog query).
Next, discover which servers are involved:
- For AFS disks, use something like this:
$ fs whereis /afs/slac/g/glast/ground/releases/analysisFiles/
File /afs/slac/g/glast/ground/releases/analysisFiles/ is on host afs00.slac.stanford.edu
The server is "afs00".
- For NFS disks, use something like this:
$ cd /nfs/farm/g/glast/u33
$ pwd
/a/sulky36/g.glast.u33
The server is "sulky36".
- For xroot, one does not know (even from day to day) on which server a particular file is stored; you will just need to monitor the entire system for overloading.
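If you have several NFS paths to check, a tiny helper can pull the server name out of the resolved automounter path (the /a/<server>/<volume> form shown above). This is just a convenience sketch, not a SLAC-provided tool:

```shell
# Extract the server name from an automounter path of the form
# /a/<server>/<volume>, as printed by `pwd` on an NFS disk.
server_of() {
    echo "$1" | cut -d/ -f3
}

server_of /a/sulky36/g.glast.u33    # prints: sulky36
```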
Then, bring up the Ganglia web pages mentioned above, and zero in on the server(s) of interest.
Finally, how do you know if you are stressing the system? Before you begin to submit jobs, take a look at the CPU utilization and the disk and network I/O plots for each server to get an idea of the instantaneous baseline.
- As your jobs begin to run, any abrupt and significant increase in those metrics indicates that your jobs are adding a significant load.
How much is too much? That's hard to say exactly, but here is where the situation can become very painful both for your jobs and for other users:
- CPU > 90%
- Disk I/O > ~50 MB/s on a single disk
- A large number of "nfs_server_badcalls" is evidence of problems with the server, and if it is correlated with your batch jobs, it is probably your fault!
If a problem occurs, contact helpsoftlist@glast.stanford.edu.
Best Practices
There are some basic guidelines to keep in mind before you submit a batch job.
- Put analysis code and scripts in your AFS home directories (which are backed up), and ultimately put your output files in NFS, e.g., the user disk
/afs/slac/g/glast/users/<username>
- Create a unique directory in /scratch for your batch job, e.g.,
mkdir -p /scratch/<userid>/${LSB_JOBID}
- Define this directory as your $HOME and then go there prior to running any ScienceTools/Ftools/etc.:
export HOME=/scratch/<userid>/${LSB_JOBID}
cd ${HOME}
This will automatically take care of PFILES being unique for your job and avoid overloading the /nfs user disk with large numbers of opens and closes. Create any new files in $HOME and then copy anything you wish to save at the end of your job.
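Put together, the scratch-directory steps above can be sketched as a single job script. Paths and the final copy destination are illustrative, and the fallback to /tmp is only for trying the script outside the farm (it should not be used for large files in real jobs):

```shell
#!/bin/sh
# Skeleton batch job: unique scratch dir, $HOME redirected there, cleanup.
SCRATCHBASE=/scratch
[ -d "$SCRATCHBASE" ] && [ -w "$SCRATCHBASE" ] || SCRATCHBASE=${TMPDIR:-/tmp}
WORKDIR=$SCRATCHBASE/${USER:-nobody}/${LSB_JOBID:-$$}   # unique per job
mkdir -p "$WORKDIR"
export HOME="$WORKDIR"      # pfiles and "dot" files now land in scratch
cd "$HOME"

# ... set up and run ScienceTools/Ftools here, writing into $HOME ...
echo done > result.txt

# Copy anything worth keeping, then clean up the scratch area.
# cp result.txt /afs/slac/g/glast/users/<username>/    # illustrative target
cd /
rm -rf "$WORKDIR"
```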
If you are planning a large batch operation, please inform and coordinate with SAS management (Richard Dubois).
- PFILES. When running multiple, simultaneous jobs on either interactive machines or on the
SLAC batch farm, be sure that each job is given a unique, local PFILE path in which to write its
parameter files. (One .par file is created by each ScienceTool or Ftool.) This is accomplished by
setting the $PFILES environment variable appropriately.
Why bother? If you do not specify a unique PFILE path for each job, these parameter files will
be created in your $HOME directory, i.e., $HOME/pfiles, and each job will attempt to write its .par
files to the same directory, causing an unfortunate and painful conflict if the same tools run
simultaneously. Not only will your jobs fail to give reliable results, but this sort of activity is
very demanding on file servers and can cause severely degraded performance for all users.
Ideally, you should
direct your pfiles to a local scratch space (the resulting parameter files are not something you
likely need to keep upon completion of the job). Whatever you do, DO NOT direct your writable PFILE
path to one of the GLAST user disks! (Such anti-social behavior will not go unnoticed.)
Tip: All SLAC-managed Linux machines are configured with scratch space. Most public
interactive machines have a space called '/usr/work'. Most (all?) batch machines have a large space
called '/scratch'. Desktop machines may also have '/scratch'. All machines have '/tmp', but this
space should not be used for large files as the space is usually limited and filling up /tmp can
cause a machine to crash. When using scratch space on public machines, always create a directory
with your own username, e.g., /scratch/<username>, into which all of your temporary files are
written. It is also crucial on batch machines to always clean up any temporary files you create or
that area will fill up.
Note: Remember that the batch farm is not interactive and it inherits whatever environment you
happen to have set up when you submit the job. Thus, all desired non-default ScienceTool/Ftool
parameters must be specified explicitly.
Below are examples of two approaches to managing $PFILES. Note that $PFILES consists of two lists
separated by a semicolon, ";". The first list contains one or more directory paths for the
ScienceTools (and FTOOLS) to use for writing and preferentially for reading parameter files, while
the second list specifies directories to be used as read-only reference, as needed.
Example 1 — Explicitly set $PFILES to a unique writable space (in conjunction with SCons
ScienceTools-09-15-05 build):
PFILES=/scratch/<uniqueIdentifier>/pfiles;/nfs/farm/g/glast/u35/ReleaseManagerBuild/redhat3-i686-32bit-gcc32/Optimized/ScienceTools/09-15-05/burstFit/pfiles
Note: This environment variable tells the ScienceTools (and FTOOLS) to use the first path in the list for writing and preferentially for reading, but use the second as a read-only reference, as needed. (Note the semi-colon between the r/w and read-only path elements.)
Example 2 — Set the home directory to the current directory:
- Move to a unique directory:
mkdir uniqueDirectory
cd uniqueDirectory
- Set home directory to current working directory:
HOME=$PWD
- Run your environment setup script for the version of ScienceTools (and/or Ftools) that you wish to use.
- Cleanup. Be sure to perform a cleanup on /scratch after your jobs have completed!
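One defensive way to handle the cleanup step is a shell trap, which removes the scratch area even if the job exits early. In this sketch, mktemp stands in for a /scratch path:

```shell
# Register cleanup at exit, so the per-job pfiles area never lingers.
WORKDIR=$(mktemp -d "${TMPDIR:-/tmp}/pfiles.XXXXXX")
trap 'rm -rf "$WORKDIR"' EXIT

export PFILES="$WORKDIR"    # unique, writable pfiles area for this job
# ... run ScienceTools/Ftools here ...
```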
Python Script. For an example of setup and cleanup routines from a python script, each with a unique environment variable, see new.py.txt and note the two FILE.write blocks. The first block creates the unique environment variables for each job, and the second block cleans up the scratch directory after the batch jobs have completed.
Commands You Need to Know
When using the batch farm, you need to know:
- bsub: submit a job
- bjobs (-l): get a summary of running jobs (-l gives a longer summary)
- bkill: kill an errant job
- bqueues: get info on the queues
Logging into a noric system is your first step. Jobs submitted from these machines will automatically go to the batch farm.
An easy way to operate is to use the GLAST NFS user space
/afs/slac/g/glast/users/
The batch machines can access this space.
Submitting a job
The syntax is simple:
bsub -q [queueName] <command>
Monitoring a Job
bjobs [-l]
Provides a summary of running jobs.
Tip: It also gives you a batch id for the job, which you can use with bkill.
bpeek <job-id>
Provides a 'peek' at the log file for a running batch job.
Tip: There are 'man' pages for all the batch commands, e.g., 'man bsub'. The batch system is
formally known as LSF (Load Sharing Facility), and a brief overview to the system can be read with
'man lsfintro'.
Other useful batch monitoring commands include:
- bqueues: summary status of the entire batch system, organized by queue
- bqueues -l long: detailed summary status of the 'long' queue
- lshosts: very long listing of all batch machines along with their resources
- bmod: change the queue for a submitted job
- bhist: get history information for completed jobs
- lsinfo: list all 'resources' defined in the batch system
- busers: summary of your batch activity
Error Codes on exit
Exit codes from batch jobs should be interpreted like those from any other Unix program. Codes:
- 1-128 are exit codes generated by the job itself.
- 129-255 usually indicate that a signal was received and the value is the signal+128.
For example, an exit code of 131 means the job received signal 3 (= 131-128) which is SIGQUIT.
Note: A brief summary of the signal codes can be seen with the command: kill -l
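The rule above can be wrapped in a tiny shell function for convenience (this helper is illustrative, not part of LSF):

```shell
# Decode a batch job's exit status.  Codes above 128 mean the job was
# killed by signal (exit code - 128); `kill -l N` names that signal.
decode_exit() {
    rc=$1
    if [ "$rc" -gt 128 ]; then
        echo "killed by signal $((rc - 128)) ($(kill -l $((rc - 128))))"
    else
        echo "program exit code $rc"
    fi
}

decode_exit 131    # signal 3, i.e., SIGQUIT
decode_exit 1
```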
Queue Information
bqueues (-l)
The standard queues we can submit to are:
| Queue  | CPU time   |
| short  | 2 minutes  |
| medium | 15 minutes |
| long   | 2 hours    |
| xlong  | 16 hours   |
| xxl    | 130 hours  |
Note: Times are specified for fell-class machines. (See Queue Limits and CPU Factors for more information.)
Killing a Bad Job
bkill <id>
Assuming a batch id of 217304, the command to kill a bad job would be:
bkill 217304
Note: An id = 0 is a wildcard. To cancel all of your jobs, enter: bkill 0
Queue Limits and CPU Factors
(As of February 2, 2010)
| QUEUE  | MAX  | JL/U | JL/H | CPULIMIT | don  | cob  | yili | boer | fell | hequ |
| short  | -    | -    | -    | 21m      | 4m   | 2m   | 2m   | 2m   | 2m   | 1.4m |
| medium | -    | 1000 | -    | 168m     | 31m  | 22m  | 20m  | 17m  | 15m  | 12m  |
| long   | -    | 1000 | -    | 22.3h    | 4.1h | 2.9h | 158m | 134m | 121m | 92m  |
| xlong  | 2800 | -    | 4    | 177.6h   | 33h  | 23h  | 21h  | 18h  | 16h  | 12h  |
| xxl    | 800  | 400  | 2    | 1428h    | 261h | 187h | 169h | 142h | 130h | 98h  |
CPULIMIT is in "SLAC time" = wall-clock time * CPU_Factor; the per-machine columns (don through hequ) give the equivalent wall-clock limits for each machine class.
|
CPU Factors:
| MODEL_NAME | CPU_FACTOR | Machine Names           |
| RS6k-370   | 0.19       | [morgan]                |
| Ultra5     | 0.46       | [pinto]                 |
| UT1_440    | 1.00       | [bronco]                |
| VA_867     | 2.11       | [barb]                  |
| PC_2660    | 2.81       | [tori, orlov]           |
| PC_1400    | 3.36       | [noma, morab]           |
| G5_2000    | 4.82       | fuji (MacOS)            |
| AMD_1800   | 5.47       | don, noric              |
| AMD_2000   | 7.65       | cob, coma               |
| AMD_2200   | 8.46       | yili, sdc, noric        |
| AMD-2600   | 10.00      | boer, bali, orange, sdc |
| INTEL_2660 | 11.00      | fell, simes             |
| INTEL_3000 | 12.00      | sdc                     |
| INTEL_2930 | 14.58      | hequ                    |
[ ] = no longer in service
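As a worked example of the conversion, the wall-clock limit on a given machine class is the queue's CPULIMIT divided by that class's CPU factor. For the 'long' queue (22.3 h of SLAC time) on a fell-class node (factor 11.00):

```shell
# 22.3 h * 60 min/h, divided by the fell CPU factor of 11.00
awk 'BEGIN { printf "%.1f minutes\n", 22.3 * 60 / 11.00 }'
# about 121.6 minutes, consistent with the ~121m fell entry in the queue table
```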
For more information about the various machine types, see: Public Machines at SLAC.
Owned by: Tom Glanzman
Last updated by: Chuck Patterson
01/31/2011