About:
The NIH Biowulf cluster is a GNU/Linux parallel processing system designed and built at the National Institutes of Health and managed by the Helix Systems Staff. Biowulf consists of a main login node and 2300 compute nodes with a combined processor core count of over 12000. The computational nodes are connected to high-speed networks and have access to high-performance fileservers. |
The continued growth and support of NIH's Biowulf cluster is dependent upon its demonstrable value to the intramural program. In the past we have been successful in obtaining support by citing published work which involved the use of our systems.
- When publishing data resulting from calculations done on the NIH Biowulf
cluster, please use the following citation in the article:
This study utilized the high-performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, Md. (http://biowulf.nih.gov).
- Please report the publication of articles which made use of the cluster to the Helix Systems Staff by citing the reference in email to staff@biowulf.nih.gov.
Biowulf accounts require a pre-existing Helix account. A Biowulf account can be obtained by registering on the web here. Biowulf accounts not accessed within a 60 day period are automatically locked as a security precaution. A user may request to have his account unlocked by sending email to staff@biowulf.nih.gov.
The Biowulf system is accessed from the NIH network via ssh to biowulf.nih.gov. We recommend the PuTTY SSH client for Windows users as it is free, feature-rich, and widely used. Mac and Unix/Linux users can use the ssh client available from the command line in a terminal window.
Xwindows connection software is also recommended for GUI/visualization applications that are run on the login or computational nodes (e.g. x-povray or interactive SAS) display on your desktop workstation. We provide free Xwindows software for Macs and PCs to Helix Systems users.
Users log in to Helix and Biowulf using their NIH domain username and password.
The Bourne-Again SHell (bash) is the default shell on most Linux systems including Biowulf. Other available shells, including csh and tcsh, are listed in /etc/shells. To change your default shell, use the command:
chsh -s shell_name
where shell_name is the full pathname of the desired shell (e.g., /bin/csh).
Biowulf home directories are in a shared NFS (Network File System) filespace, therefore the access to files in your home directory is identical from any node in Biowulf. Note that your Biowulf home directory is the same directory as your Helix home directory.
E-mail can be sent from any node on Biowulf. However, all mail sent out from Biowulf will have a return address of helix.nih.gov (rather than the address of the node). In addition, mail sent to a user on a Biowulf node will be automatically forwarded to that user on helix.nih.gov. In other words, you can send, but not receive mail on Biowulf.
If you do not read mail on helix, be sure to set up a .forward file in your helix home directory.
You can communicate with the Biowulf/Helix staff by sending email to: staff@biowulf.nih.gov
# of nodes | processor cores per node | memory | network |
24 |
16 x 2.6 GHz Intel E5-2670
hyperthreading enabled
20 MB secondary cache
|
256 GB
|
1 Gb/s Ethernet
|
328 |
12 x 2.8 GHz Intel X5660
hyperthreading enabled
12 MB secondary cache
|
24 GB
|
1 Gb/s Ethernet
|
352 |
8 x 2.67 GHz Intel X5550
hyperthreading enabled
8 MB secondary cache
|
320 x 24 GB
32 x 72 GB
|
1 Gb/s Ethernet
|
250 |
8 x 2.8 GHz Intel E5462
12 MB secondary cache
|
8 GB
|
1 Gb/s Ethernet
16 Gb/s Infiniband
|
1 |
32 x 2.5 GHz AMD 0pteron 8380
512 KB secondary cache
|
512 GB
|
1 Gb/s Ethernet
|
232 |
4 x 2.8 GHz AMD Opteron 290
1 MB secondary cache
|
8 GB | 1 Gb/s Ethernet |
471 |
2 x 2.8 GHz AMD Opteron 254
1 MB secondary cache
|
40 x 8 GB
226 x 4 GB
91 x 2 GB
|
1 Gb/s Ethernet
160 x 8 Gb/s Infinipath
|
289 |
4 x 2.6 GHz AMD Opteron 285
1 MB secondary cache
|
8 GB | 1 Gb/s Ethernet |
389 |
2 x 2.2 GHz AMD Opteron 248
1 MB secondary cache
|
129 x 2 GB
66 x 4 GB
|
1 Gb/s ethernet
|
91 |
2 x 2.0 GHz AMD Opteron 246
1 MB secondary cache
|
48 x 1 GB
43 x 2 GB
|
1 Gb/s Ethernet
|
All nodes are connected to a 1 Gb/s switched Ethernet network while sub-sets of nodes are on high-performance Infinipath or Infiniband networks. The hostnames for these interfaces are p2 through p2576. The hostname for biowulf's internal interface is biowulf-e0 (i.e., nodes wishing to communicate with the login machine, should use the name biowulf-e0 rather than biowulf).
The ethernet switches are Brocade (Foundry) FCX 648S, and FESx 448 switches. The backbone switch is a Brocade (Foundry) RX-32 using 10 Gb/s fiber links to the FCX/FESxs edge switches.
For those applications that can benefit from increased bandwidth and lower latency than Ethernet, there are two high performance switched networks available:
- 250 8-core nodes are on a DDR Infiniband network connected to a Voltaire ISR 2012 infiniband switch. DDR Infiniband provides 16 Gb/s bandwidth at very low latencies.
- 134 dual-processor nodes are connected to and Infiniband network via QLogic InfiniPath adapters. This network is similar to the above mentioned Infiniband network but with two major differences: it is slightly lower latency but runs at a lower bandwidth as well (8 Gb/s max).
The login node (biowulf.nih.gov) is an HP DL380 (two 2.67 GHz Xeon X5550, eight cores) with 24 GB RAM. It is a minimal system designed for log-in, light-weight tasks, 64-bit compiling/development and the submission of jobs to the batch system for runs on the cluster. No compute intensive jobs are permitted on the login node.
There are several options for disk storage on Biowulf; please review this section carefully to decide where to place your data. Contact the Biowulf systems staff if you have any questions. Except where noted, there are no quotas, time limits or other restrictions placed on the use of space on the Biowulf system, but please use the space responsibly; even hundreds of terabytes won't last forever if files are never deleted. Disk space on the Biowulf system should never be used as archival storage. |
|
Users who require more than the default disk storage quota should fill out the online storage request form. |
Location | Creation | Backups | Speed | Space | Available from | |
/home | network (NFS) | with Helix account | yes | high | 8 GB default quota | B,C,H |
/scratch (nodes) | local | created by user | no | medium | 30 - 130 GB dedicated while node is allocated (**) | C |
/scratch (biowulf) | network (NFS) | created by user | no | low | 500 GB shared | B,H |
/data | network (NFS) | with Biowulf account | yes | high | 100 GB default quota | B,C,H |
Each Biowulf node has a directly attached disk containing a /scratch filesystem. Note that scratch space is not backed up to tape, and thus, users should store any programs and data of importance in their home directories. Use of /scratch on the batch nodes should be for the duration of your job only. It is your job's responsibility to check for sufficient disk space. Your job may delete any and all files from /scratch to make space available (use the clearscratch command). Please use /scratch instead of /tmp for storage of temporary files.
/scratch on the login node (biowulf.nih.gov) is actually a shared (NFS) filesystem, accessible from both the login node and Helix. Files on this filesystem which have not been accessed for 14 days are automatically deleted by the system.
These are RAID-6 filesystems mounted over NFS from one of the following, all of which are configured for high availability: two NetApp FAS3050 and eight FAS3170 filers, and a DataDirect Networks S2A9900 storage system with eight fileservers. This storage offers high performance NFS access, and is exported to Biowulf over a dedicated high-speed network. /data is accessible from all computational nodes as well as Biowulf and Helix, and will be the filesystem of choice for most users to store their large datasets. Biowulf users are assigned an initial quota of 100 GB on /data; please contact the Biowulf staff if you need to increase your quota.
Note: your /data directory is actually physically located on filesystems named /spin1, /gs1, or /gs2. The /data directory consists of links to one of those filesystems. ALWAYS refer to your data directory through the /data links as opposed to the physical location because the physical location is subject to change based on administrator needs. In other words, use /data/yourname rather than (for example) /gs1/users/yourname in your scripts.
In addition to all data being regularly backed up to tape, snapshots are available that allow users to recover files that have been inadvertently deleted. For more information on backups and snapshots, please refer to the File Backups section on the Helix website.
If a user's data directory is located on the /spin1 storage area, it is possible that the size of snapshots in the user's data directory may cause a user to have less available storage space than he believes he should.
Each data directory on /spin1 has its own snapshot space that is separate from a user's regular file or data space. However, if a user deletes too much data within a short period of time (a week or less), it is possible that the snapshot space for that data directory can be exceeded. When this occurs, a user's regular file space (allocated quota space) will be used as additional snapshot space. Unfortunately there is no way for a user to either determine whether he has encountered this situation or to actually delete his snapshots. However, if a user does not see an expected increase in available space in the output of the 'checkquota' command following the cleanup of large files, the possibility exists that he has encountered this situation and should contact the Helix/Biowulf staff for assistance.
Use the checkquota command to determine how much disk space you are using:
$ checkquota Mount Used Quota Percent Files Limit /data: 70.4 GB 100.0 GB 70.41% 307411 6225917 /data(SharedDir): 11.4 GB 20.0 GB 56.95% 6273 1245180 /home: 2.0 GB 8.0 GB 25.19% 11125 n/a mailbox: 74.7 MB 1.0 GB 7.29%
The Biowulf Cluster is highly heterogeneous with respect to processor speed, memory size and networks. Nodes can contain between 2 and 32 processor cores with memory sizes from 1G to 512G, on message-passing networks that range from very fast (16 Gb/s) to moderate (1 Gb/s) in bandwidth. This can initially create confusion on the part of users in deciding on what nodes the need to run their work. However, once the cluster hardware is understood it should be quite easy for users to ascertain what nodes are suitable for their purpose without over-using resources or denying resources to others. |
Here are some common errors that beginners make when using the Biowulf cluster:
- Running fewer processes on nodes than there are processors. Please make sure you understand that the cluster nodes each have 2,4 or 8 processors, and that you need to be using all of them to maximize your productivity. There is one exception: when you need more memory per process than is possible to allocate without leaving processors idle (e.g. you need 8G of RAM per process to do your job - you can allocate dual-processor 8G nodes, leaving one processor idle.)
- Not monitoring your jobs. If your MPI job is running poorly because it doesn't scale well to the requested number of processors, then you're wasting resources and denying them to others. This applies to single-process jobs as well. If you're not sufficiently loading the system(s) you're running on, then you need to know about it to correct the problem(s).
- Consuming resources with minimal gain. Please keep jobs running efficiently: doubling the number of processors in a parallel job to take 3% off of your total wall-time is not efficient computing. Cluster space/time is valuable and it is the goal of this staff to maximize work done on the cluster by all users.
- Not understanding the batch system. Please read the documentation on this page so that you understand how to request relevant resources for a job while avoiding the irrelevant ones (e.g. requesting Infinband nodes when not using the Infiniband network or large-memory nodes when running small-memory tasks).
- Not asking for help. When in doubt, drop us an email or give us a call, it's what we're here for.
Clarification of terms used in this guide.
- Node: A node is a computer containing one or more processors. Sometimes simply called a "machine" or "box".
- Processor: A processor is a cpu, containing one or more "cores". Each core can run application process at any given time.
- process: The execution instance of an application. An MPI program will typically consist of 2 or more processes. Some program processes can have more than one thread of execution.
Node categories on the Biowulf cluster:
- Login nodes: These are nodes used for program development, compiling, debugging and submission of jobs to the batch system. There is currently one login node (biowulf.nih.gov). Please do not run compute-intensive code on login nodes.
- Batch nodes: the vast majority of nodes in the cluster are used for production jobs run under the control of the batch queueing system. These nodes are dedicated to your job.
Batch (or queueing) systems allow the user to submit jobs to the cluster without regard for what resources (cpu, memory, networks) are available. When those resources become available the batch system will start the job. There are many batch systems available; the one used on the Biowulf cluster is Altair's PBS Pro.
In order to run a job under batch you must submit a script which contains the commands to be executed. The command to do this is qsub. The qsub command on Biowulf is heavily customized and this documentation should be used in lieu of that provided by Altair. The batch script (or batch command file) is simply a Linux shell script with optional commands to PBS (called directives).
A very minimal batch script example:
#!/bin/bash # # this file is myjob.sh # #PBS -N MyJob #PBS -m be #PBS -k oe # cd $PBS_O_WORKDIR myprog < /data/me/mydata
One could simply run this as a shell script on the command line, however when executed by the batch system, the lines beginning with #PBS have special meaning. These directives are as follows:
-N: name of the batch job. -m: send mail when the job begins ("b") and when it ends ("e"). -k: keep the STDOUT ("o") and STDERR ("e") files.
Other directives can be found on the qsub man page.
The environment variable $PBS_O_WORKDIR is the directory which was the current directory when the qsub command was issued.
Note that if the "myprog" program above ordinarily sends its output (STDOUT) to the terminal, that output will appear in a file called MyJob.o<jobnumber>. Any errors (STDOUT) will appear as MyJob.e<jobnumber>.
The simplest form of the qsub command is:
% qsub -l nodes=1 myjob.bat
The resource list (-l) here consists of the number of nodes (1). You must specify a number of nodes, even if it is one.
The batch system allocation is done by nodes (not processors). When you are allocated a node by the batch system, your job is the only job running on the node. Since a node may contain between 2 and 24 processing cores, you should ensure that you utilize the node fully (see the swarm program below).
Parallel jobs running on more than 2 cpus will need to specify more than 1 node. Additionally, since you will want a parallel job running on nodes of identical clock speed, you will probably want to specify a node type as well:
qsub -l nodes=8:o2800 myparalleljob
Node types currently supported are (from fastest to slowest):
x2800 2.8 GHz Xeon e2666 2.67 GHz Xeon o2800 2.8 GHz Opteron o2600 2.6 GHz Opteron o2200 2.2 GHz Opteron o2000 2.0 GHz Opteron
Nodes can also be selected based on memory size. There are two ways to select nodes based on memory
mN: the amount of memory per processor core, where N is 1, 2, or 4 gN: the total amount of memory in a node, where N is 4, 8, 24 or 72Specifying memory per core. Use the "m" property where is the size of memory in gigabytes. Most nodes in the cluster are "m1" (ie, 1 GB/core). You should use this property for most jobs requiring a specific memory size. It is particularly useful for submitting swarms, since 'swarm' may place your jobs on nodes with differing numbers of cores. This property ensures that each process will have access to the same amount of memory regardless of the number of cores on the node.
Specifying memory per node. Use the "g" property where is the total memory in the node in gigabytes. This property should be used when your job requires the entire memory of the node for a single process.
Examples:
m2 2 GB/core m4 4 GB/core g4 4 GB/node g72 72 GB/node
(Note: Since the primary purpose of the g72 nodes is to provide a large memory, it is permissible to underutilize the number of processors. That is, it's ok to run less than 8 processes per node).
Nodes can also be selected based on network:
gige gigabit ethernet ib infiniband
Properties based on cores per node:
c2 2 cores/node c4 4 cores/node c16 16 cores/node c24 24 cores/node
Note: Don't make additional specifications unless you need to! In most cases, the less you specify the better. The more flexibility the scheduler has in allocating nodes for you job, the less likely your job will wind up having to wait for specific resources.
Not all combinations of processor, memory and network are available. The chart below shows available configurations (in green) along with the PBS node properties (in bold). To see how many nodes are available in each configuration, use the freen command (below).
Processor | Cores | m1 1 GB mem. per core | m2 2 GB mem. per core | m4 4 GB mem. per core | g4 4 GB total mem. | g8 8 GB total mem. | g24 24 GB total mem. | g72 72 GB total mem. | Network |
x2800 2.8 GHz Xeon | c24 | gige Gigabit Ethernet | |||||||
e2666 2.67 GHz Xeon | c16 | ||||||||
o2800:dc 2.8 GHz Opteron | c4 | ||||||||
o2600:dc 2.6 GHz Opteron | c4 | ||||||||
o2800 2.8 GHz Opteron | c2 | ||||||||
o2200 2.2 GHz Opteron | c2 | ||||||||
o2000 2.0 GHz Opteron | c2 | ||||||||
2.8 GHz Xeon | 8 | ib Infiniband | |||||||
2.8 GHz Opteron | 2 | ipath Infinipath |
- At this time, any node the system allocates to you will be a 2 core node. You have to explicitly request a 4-, 8-, 16- or 24-core node with the appropriate "c" property. The exception to this is when nodes are allocated by the swarm command, which ensures that the correct number of processes are started on each type of node.
- Your job may not run properly if your startup file (.bashrc, .cshrc, etc)
contains commands which attempt to set terminal characteristics or do output to
STDOUT. Such commands can be skipped by testing the environment variable
PBS_ENVIRONMENT:
#csh if ( ! $?PBS_ENVIRONMENT ) then do terminal stuff here endif
- If your default shell is csh or tcsh, your job.o###### files will contain
the warning:
Warning: no access to tty (Bad file descriptor). Thus no job control in this shell.
This warning can be ignored. Alternatively, you can change your shell to bash (with the chsh command) or tell qsub to use a different shell with:qsub -l nodes=1 -S /bin/sh myscript
For jobs requiring between 72 and 512 GB of memory there are twenty-five very large memory (VLM) nodes: 24 are configured with 256 GB and a single one with 512 GB. Each node has 32 cores and the 512 GB node has a 3 TB local /scratch. In contrast to most nodes, these nodes can be shared amongst jobs. To allocate VLM nodes the memory size requested, and optionally, the number of processors requested must be specified to qsub:
% qsub -l nodes=1,mem=128gb myvlmjob % qsub -l nodes=1,mem=384gb,ncpus=4 bigjob.batJobs requiring between 72-250 GB allocate a 256 GB node and jobs larger than 250 GB allocate the 512 GB node.
-r y|n this switch controls whether your job is restartable. If one of the nodes that your job is running on should hang, the operations staff will often restart the job. If you do not want your job to be restarted, use "-r n" with qsub. -v varlist this switch allows you to pass one or more variables from the qsub command line to your batch script. For instance, with "qsub -v np=4 myjob", the myjob script can use the $np variable with a value of 4. You can also list an environment variable without a value, and that envvar will be exported with its current value. -I for interactive jobs, see below. -V export all environment variables to the batch script. Useful when running X11 clients on interactive nodes, see below.
See the qsub man page for additional options.
Using the batch system for interactive use may sound like an oxymoron, but this allows you to run compute-intensive processes without overloading the Biowulf login node.
The following example shows how to allocate an interactive node:
biobos$ qsub -I -V -l nodes=1 qsub: waiting for job 2011.biobos to start qsub: job 2011.biobos ready p139$ exit logout qsub: job 2011.biobos completed biobos$
Note the following:
- The qsub command, used with the -I switch, automatically logs you in to the first allocated node.
- Use the -V switch to copy your env variables to the interactive batch session (useful when running X11 applications)
- As with all batch jobs, the node is dedicated to the user's job.
- The default allocation is a 2-core, 4 GB node. Other nodes may be allocated using the appropriate node specification to the qsub command.
- The maximum allowable number of interactive jobs is 2.
- By default, the job will be terminated after 8 hours. You may extend this time to a maximum of 36 hours by using the "walltime" attribute to the qsub command.
- The -V switch exports all current environment variables to the batch session. This is required in order to run an X11 client on the interactive node.
- As in batch, the $PBS_NODEFILE variable is the name of a file which contains the allocated nodes.
- The nodes are unallocated when the interactive session is ended using the exit command. Please remember to exit your interactive sessions so that node resources are freed up.
See the man pages for details on the following PBS commands:
qstat -u [username] list jobs belonging to username qdel [jobid] delete job jobid qdel -Wforce [jobid] delete a job on a hung node qselect -u [user] -s [state] select jobs based on criteria
For example, to delete all of your current jobs:
qdel `qselect -u myname`
To delete only your queued jobs:
qdel `qselect -u myname -s Q`
There are no time limits for jobs running in the batch queue. However, while debugging a program, or if there is otherwise a possibility that your job could "runaway" due to a programming error, please use the walltime switch to limit the time your job can run before it is terminated by the batch system. For example, to limit your job to 72 hours use "-l walltime=72:00:00" as an argument to the qsub command.
Scheduling on Biowulf is done using a Fair Share algorithm. This means that, when more jobs are waiting to run than can be started, the next job to run will be the one belonging to the user with the least amount of system usage during the previous 7 days. This should allow users to submit as many jobs to the queue as they would like without concern that they will take an unfair amount of processing time.
The batch system enforces a maximum number of jobs and cpus (or cores) allocated to each user. This number can vary depending on system load and other factors. To see the current limits, use the batchlim command:
$ batchlim Batch system max jobs per user: 4000 Batch system max cores per user: 1280 Max Cores Queue Per User ----------- --------- norm 1280 ib 384 vol 256 norm3 128 g72 128 ipath 128 gpu 96 volib 64 Max Mem Max Mem Per User on System ---------- ----------- DL-785 mem 512gb 512gb
Most user jobs run in the norm queue, so in the example above, the maximum per user allocation is 300 cores (which could be 150 x 2p nodes or 75 x 4p nodes).
Note: You should never specify a queue when submitting a job. Jobs will be automatically directed to the appropriate queue by the batch system. You should only specify the minimum resources needed by your job, e.g. memory or cores.
Because each Biowulf node has between two and 24 processor cores, but the PBS batch system allocates nodes rather than cores, submitting multiple single process jobs to the system results in a poor utilization of processors (i.e., only one processor per node).
In the case of simply running two processes on a single dual-processor node, the following example uses the wait command to prevent the batch command script from exiting before the application processes have finished:
#!/bin/bash # # this file is myjob.sh # #PBS -N MyJob #PBS -m be #PBS -k oe # myprog -arg arg1 < infile1 > outfile1 & myprog -arg arg2 < infile2 > outfile2 & wait
Note how this script runs 2 instances of a program by putting them in the background (using the ampersand "&"), and then using the shell wait command to make the script wait for each background process before exiting.
When running many single-threaded jobs, setting up many batch command files can be cumbersome. The swarm command can be used to automatically generate batch command files and submit them to the batch system.
The swarm Command
swarm allows you to submit an arbitrary number of serial jobs to the batch system by simply creating a command file with one command per line. swarm automatically creates batch command files and submits them to the batch system. Two commands are submitted for each node, making optimal use of the processors.
Here is an example command file:
myprog -param a < infile-a > outfile-a myprog -param b < infile-b > outfile-b myprog -param c < infile-c > outfile-c myprog -param d < infile-d > outfile-d myprog -param e < infile-e > outfile-e myprog -param f < infile-f > outfile-f myprog -param g < infile-g > outfile-g
The command file is submitted using the following command:
swarm -f cmdfile
The result is 4 jobs submitted to the batch system, 3 jobs with 2 processes each, and the last with a single process.
When submitting a very large swarm (1000s of processes), the bundle option to swarm should be used:
swarm -f cmdfile -b 40
If cmdfile contains 2500 commands, approximately 63 bundles of 40 commands each will be created and submitted as 32 batch jobs (2 bundles per job, one for each processor on a node).
Note1: swarm will correctly allocate the correct number of processes to 2-, 4-, 16- or 24-core nodes.
Note2: if the programs you are running via swarm either (1) require more than 1 GB memory, or (2) are multi-threaded, you must specify additional information to swarm.
See the swarm documentation for more details.
Resubmitting part of a swarm: If a few jobs in your swarm fail to complete, you can determine the batch script associated with those jobs using the 'submitinfo' command. You can then resubmit just those jobs. For example, 2 jobs in this swarm, with job ids 4104732 and 4104738, did not complete successfully due to a typo in the original swarm command file. You can edit the batch scripts and resubmit just these two jobs.
biowulf% submitinfo 4104732 4104739: (/home/susanc) qsub -l nodes=1:c16 ./.swarm/ctrl-swarm48n32646 biowulf% submitinfo 4104738 4104738: (/home/susanc) qsub -l nodes=1:c16 ./.swarm/ctrl-swarm47n32646 [...edit the 2 batch scripts ctrl-swarm48n32646 and ctrl-swarm47n32646...] biowulf% qsub -l nodes=1:c16 /home/susanc/.swarm/ctrl-swarm48n32646 410836.biobos biowulf% qsub -l nodes=1:c16 /home/susanc/.swarm/ctrl-swarm47n32646 410837.biobos
The multirun command
If you wish to submit more than 2 single-threaded jobs but want them under control of a single job, then an mpi "shell" can be used (note: In many cases this will not be an optimal use of resources. Unless all processes exit at roughly the same time, idle nodes will not be freed by the batch system until the last process has exited).
The basic procedure is as follows (generation of these scripts can be done automatically by writing a higher order script):
1. Create an executable shell script which will run multiple instances of your program. Which will run depends on the "mpi task id" of the instance.
#!/bin/tcsh # # this file is run6.sh # switch ($MP_CHILD) case 0: your_prog with args0 breaksw case 1: your_prog with args1 breaksw case 2: your_prog with args2 breaksw case 3: your_prog with args3 breaksw case 4: your_prog with args4 breaksw case 5: your_prog with args5 breaksw endsw
2. Use mpirun in your batch command file to run the mpi shell program (multirun):
#!/bin/tcsh # # this file is myjob.sh # #PBS -N MyJob #PBS -m be #PBS -k oe # set path=(/usr/local/mpich/bin $path) mpirun -machinefile $PBS_NODEFILE -np 6 \ /usr/local/bin/multirun -m /home/me/run6.sh
3. Submit the job to the batch system:
qsub -l nodes=3 myjob.sh
This job will run 6 instances of the program on 3 nodes.
File Transfers
A modified version of the 'scp' file transfer program called 'scp-hpn' is available for high-performance transfers averaging over 100 MB/s on 1 Gb/s network paths. Since the program requires connecting to a special network port on the biowulf login node, the client should be installed on your workstation or server and files pushed/pulled from that system.Download scp-hpn from http://biowulf.nih.gov/hpn.tar.gz (or hpn_i386.tar.gz for 32-bit systems), place the hpn.tar.gz file in the top level of your home directory on your Linux workstation or server and run the command "tar xfvz hpn.tar.gz". See the README file included in the distribution for additional details.
Please note that the provided version of the program was built on Biowulf which runs RedHat Linux 5. As such, it will work fine on other RedHat or derivative linux distributions. However, it is not guaranteed to work on other Linux distributions (or Mac) and you will likely need to build your own program binaries if you run a non-RedHat system. Additional information may be found here: http://www.psc.edu/networking/projects/hpn-ssh/.
Some licensed products such as Matlab are available on the Biowulf cluster. There are a limited number of licenses. Thus, each user who wants to submit a job that requires a license needs to specify that resource on the qsub command line. For example, a job requiring a Matlab license would be submitted as
biowulf% qsub -l nodes=1,matlab=1 jobscript
The individual application pages contain details about the resources that need to be specified.
The utility 'licenses' can be used on Biowulf to get a list of total and free licenses. Some licenses are limited so that an individual user can only run a certain number of simultaneous jobs; these limits are also listed in the 'licenses' output.
[user@biowulf]$ licenses License Resource Total Free Per-User Limit --------------------------------------------------------------------------- Mathematica math 7 5 4 Matlab matlab 23 20 5 Bioinformatics Toolbox matlab-bio 2 0 - Compiler - 1 1 - Curve Fitting Toolbox matlab-curve 3 2 2 Identification Toolbox matlab-ident 2 2 - Image Toolbox matlab-image 4 3 2 Neural Network Toolbox matlab-neural 2 2 - Optimization Toolbox matlab-opt 4 3 2 PDE Toolbox matlab-pde 2 2 - Signal Toolbox matlab-signal 6 6 3 SimBiology Toolbox matlab-simbio 1 1 - Simulink matlab-simulink 1 1 - Statistics Toolbox matlab-stat 7 4 4 Symbolic Toolbox matlab-symb 2 2 - Wavelet Toolbox matlab-wave 2 2 - SAS sas 48 48 48
Once a batch job has been submitted it can be monitored using both command line and web-based tools.
Using either the jobload command or the user job monitor (see below), you can determine the overall behavior of your job based on the load of each node. Perfectly behaving jobs will have loads of near 100% on all nodes. If the nodes in a parallel job are running at loads below 75% (or are green), the job probably isn't scaling well, and you should rerun the job with fewer nodes. Loads of less than 75% for non-parallel jobs may mean that not all processors are being used, or that the job is i/o bound. Contact the Biowulf staff for help in deciding whether your job is making best use of its resources.
The sections below provide information about various ways of monitoring your jobs.
qstat
The qstat command reports the status of jobs submitted to the batch system. 'qstat -a' shows the status of all batch jobs; 'qstat -n' shows, in addition, the nodes assigned to each job. Qstat with the "-f" switch gives detailed information about a specific job, including the assigned nodes and resources used. See the man page for 'qstat' for more details.
freen
The freen command can be used to determine the currently available nodes (free/total):
% freen Free Nodes by memory-per-core Cores m0.5 m1 m2 m4 Total ------------------ DefaultPool ----------------------- o2800 2 / / 13/211 / 13/211 o2200 2 / 117/300 62/62 / 179/362 o2000 2 7/42 38/38 / / 45/80 ------------------ Infiniband ----------------------- ib 8 / 24/250 / / 24/250 ipath 2 / / 69/126 / 69/126 ------------------ OnDemand ----------------------- e2666 8 / / 0/320 / 0/320 o2800:dc 4 / / 19/226 / 19/226 o2800 2 / 53/90 / 40/40 93/130 o2600:dc 4 / 0/41 37/212 / 37/253 Free nodes by total node memory Cores g4 g8 g24 g72 Total ------------------ DefaultPool ----------------------- o2800 2 13/211 / / / 13/211 o2200 2 62/62 / / / 62/62 ------------------ OnDemand ----------------------- e2666 8 / / 0/320 11/32 11/352 e2400 8 / / 0/42 / 0/42 o2800:dc 4 / 19/226 / / 19/226 o2600:dc 4 / 37/253 / / 37/253 -------------------- DL-785 ------------------------ Available: 31 processors, 504.2 GB memory
jobload
The jobload command is used to report the load on nodes for a job or for a user:
$ jobload juser Jobs for juser Node Cores Load 2383856.biobos p1922 4 100% 2383858.biobos p1926 4 100% 2383859.biobos p1918 4 100% 2383862.biobos p1920 4 99% 2383863.biobos p1928 4 79% 2383865.biobos p1919 4 85% Core Total: 24 User Average: 93%
All but 2 nodes allocated to this user are running to capacity.
$ jobload 869186 869186.biobos Node Load p554 100% p555 100% p557 100% p564 100% p565 100% p566 101% p567 100% p568 100% Job Average: 100%
This shows a well-behaved parallel job.
# jobload 763334 763334.biobos Node Load p1495 25% p1497 0% p1498 0% p1500 0% p1501 0% p1503 0% p1504 0% p1507 0% Job Average: 3%
This last example shows a improperly running parallel job.
Monitoring memory usage
If your jobs require less than 0.5 GB of memory per process, they will run successfully on any Biowulf node. If they require a minimum per-core or total node memory, you will need to specify the node type to ensure that the jobs run on a node with sufficient memory.Running jobload with the -m option will show the memory usage of your jobs.
biowulf% jobload -m user1 Jobs for user1 Node Cores Load Memory Used/Total (G) 1898152.biobos p1747 4 25% 5.7/8.1 1898156.biobos p1752 4 25% 4.7/8.1 1898157.biobos p1753 4 25% 5.9/8.1
The jobs above for user1 are appropriately sized for the allocated nodes.
biowulf% jobload -m user2 Jobs for user2 Node Cores Load Memory Used/Total (G) 2016779.biobos p889 8 12% 2.2/74.2 2016780.biobos p895 8 12% 2.1/74.2 2016781.biobos p887 8 12% 2.1/74.2 2016782.biobos p881 8 12% 2.1/74.2
The jobs for user2 have allocated the g72 nodes, but are only using ~2 GB of memory on each node. This is an inappropriate use of resources.
Some jobs may fluctuate in their memory usage. The batch system will track the resources used for each job, and if the -m e flag to qsub is used, this information will be emailed to the user at the end of the job. The value resources_used.mem will report the peak memory used by the job. This value should be used to allocate nodes with appropriate memory.
For example, the job below was submitted with qsub -m be -l nodes=1:g8 myscript. At the end of the job, the user will get a mail message along the following lines:
From: admDate: October 13, 2010 9:39:29 AM EDT To: user3@biobos.nih.gov Subject: PBS JOB 2024340.biobos PBS Job Id: 2024340.biobos Job Name: memhog2.bat Execution terminated Exit_status=0 resources_used.cpupercent=111 resources_used.cput=00:18:32 resources_used.mem=7344768kb resources_used.ncpus=1 resources_used.vmem=7494560kb resources_used.walltime=00:17:05
The batch system reports that the job has used 7344768kb, or 7.3 GB. This job is appropriately sized for the 'g8' nodes.
jobcheck
The jobcheck command can be used to check the status of exited jobs. In particular it can be used to identify jobs that have been deleted due to their having exceeded node memory limits.Show info for a particular job (the "D" indicates the job was probably deleted due to exceeding memory):
$ jobcheck 1019880 Job ID Job name Walltime cpu% MemLim MemUsed Estat ------------ ---------------- --------- ---- ------- ------- ----- 1019880 sw3n27536 19:15:45 106 22592 22598 271 D
Show all jobs exiting with a non-zero exit status for the past 2 days:
$ jobcheck -d 2 Showing jobs for the past 2 days. Job ID Job name Walltime cpu% MemLim MemUsed Estat ------------ ---------------- --------- ---- ------- ------- ----- 1019880 sw3n27536 19:15:45 106 22592 22598 271 D 1019899 sw12n29227 1:00:25 96 22592 22721 271 D 1024544 sw1n25060 0:03:32 420 22592 8835 271
Show all jobs which have exited during the past day:
$ jobcheck -d 1 -a Showing ALL jobs including those with a zero exit status. Showing jobs for the past 1 days. Job ID Job name Walltime cpu% MemLim MemUsed Estat ------------ ---------------- --------- ---- ------- ------- ----- 1019880 sw3n27536 19:15:45 106 22592 22598 271 D 1024543 sw1n23515 4:08:50 1210 22592 10148 0 1024544 sw1n25060 0:03:32 420 22592 8835 271 1024588 sw1n26558 5:02:38 1409 22592 10339 0 1024599 sw1n27070 1:31:56 504 22592 6436 0
There are other options as well:
$ jobcheck Usage: jobcheck jobnumber jobcheck [-a] -d days jobcheck [-a] -s start [-e end] where: -a show all jobs including zero exit status -d last N days to search -s datetime to begin search in 'YYYY-MM-DD hh:mm' format -e datetime to end search (optional) -v display start/end times
cluster monitor (web-based)
The web page at http://biowulf.nih.gov/sysmon lists several ways of monitoring the Biowulf cluster. The "matrix" view of the system shows the load of each node in discrete blocks of 100. For instance, block 17, row 3, column 4 (node p1734) shows a green dot, indicating that the system is 50% loaded (in this case a load of 2, where a load of 4 would be ideal). A healthy job would show yellow dots, an over-loaded job shows red and unused or severely under-loaded systems are indicated with a blue dot.
The color of the dot indicates the load percentage:
- white - the node is down
- blue - idle (load <10%)
- green - load >10% <60%
- yellow - load >60% <110%
- red - load >110%
Rolling over a dot will expose the node name and the system load (2-processor systems are at 100% load when the system load is 2, 4 processor systems are at 100% at 4, 8-processors systems at 8, etc).
Clicking on the dot corresponding to a node will result in a display of the process, cpu, disk and memory status for that node (output from the top and df commands).
The status of specific batch jobs can be monitored by first clicking on List Status of Batch Jobs which gives output similar to qstat, and then clicking on the Job ID of the job of interest. This results in the display of a matrix which contains dots only for the nodes allocated to the job. The loads of those nodes can be monitored in same way as for the system as a whole. |
Finally, the sum total of nodes allocated across all jobs to a user can be monitored by clicking on Username. Again, for both JobID and Username monitoring, clicking on a dot corresponding to a node results in a display of information about the node. |
|
On clusters like Biowlf, nodes are physically separated from one another with the exception of a high-level network link (Ethernet, Infiniband etc). Since it is impossible to "share" memory between processes on separate nodes, processes must explicitly pass memory, or "messages" between themselves over the network link via a common protocol. In most cases, the protocol used is MPI (Message Passing Interface). |
There are four possible network links available on Biowulf that parallel applications can be built for (see node configurations chart above for numbers and types of nodes) :
- Ethernet - most common, available on all nodes but is the least scalable. Most nodes run at 1 Gb/s.
- Infinipath - Available on a subset of nodes, 8 Gb/s very low-latency.
- Infiniband - Available on a subset of nodes, 16 Gb/s very low-latency.
For more information on MPI on Biowulf, see the Libraries and Development page. This page pertains mostly to developers but is also useful to users that want to know a little more about the compilers, MPI implementations and libraries available on Biowulf.
The Biowulf staff maintains two major Ethernet MPI implementations for our users. The first is MPICH2, an implementation developed at Argonne National Laboratories. The second is OpenMPI, a popular implementation with very active support and development. MPI-based software built and maintained by the Biowulf staff will use one of these implementations. The instructions here are generic, if you're running a program maintained by us, check the Applications Page for specific instructions on launching that application.
Running OpenMPI ApplicationsRunning an application built with OpenMPI for Ethernet requires only an "mpirun" command. On Biowulf, a qsub wrapper for launching such an application will look something like this:
#!/bin/bash #PBS -N test_job #PBS -k oe cd $PBS_O_WORKDIR /usr/local/OpenMPI/current/gnu/eth/bin/mpirun -n $np -machinefile $PBS_NODEFILE \ /full/path/to/executable > output_fileThe qsub submission will look something like this:
[janeuser@biowulf ~]$ qsub -v np=32 -l nodes=16:gige /path/to/qsub/script/qsub.shOpenMPI is very intelligent about run-time linking however it is necessary to provide the full path to mpirun when executing parallel programs. This explicitly tells OpenMPI to use the run-time from the same installation regardless of the system's default library paths. This is extremely helpful on systems like Biowulf which support many interconnects and architectures.
Refer to Using the Biowulf Cluster for complete information on using the batch system.
Running MPICH2 Applications:Using MPICH on Biowulf requires one element of special setup: the mpd.conf file.
Setting the MPD password fileIf this is the first time you've used MPICH2 you'll need to create a file in your home directory (~/.mpd.conf) that consists of a single line. This command should do the trick:
[janeuser@biowulf ~]$ echo 'password=<password>' > ~/.mpd.confThis password is used internally by the process controllers and does not have to be remembered. Indeed, you'll quickly forget that it exists. Once it's created you'll need to set the permissions so that only you can read it:
[janeuser@biowulf ~]$ chmod 600 ~/.mpd.conf
Running the MPI processesOur MPICH2 installations use the MPD process manager to launch and control MPI processes. This requires three discrete steps to run a job: First, start the mpd managers on each node in your job with the "mpdboot" command. Then launch your job with "mpiexec." The last step is to tear-down your mpd managers with "mpdallexit."
Note: you will need to set your PATH to include the MPI installation used to build the application (the applications page entry for the program you're trying to run will tell you this if you did not build the program yourself). While users familiar with MPI will note that this is not strictly true, it is the only sure way to know you're getting the correct versions of the process managers and mpiexe scripts.
A minimal qsub script will look something like this:
#!/bin/bash
#PBS -N test_job
#PBS -k oe
export PATH=/usr/local/mpich2-intel64/bin:$PATH
mpdboot -f $PBS_NODEFILE -n `cat $PBS_NODEFILE | wc -l`
mpiexec -n $np /full/path/to/executable
mpdallexitThe qsub submission will look something like this:
[janeuser@biowulf ~]$ qsub -v np=8 -l nodes=4:gige /path/to/qsub/script/qsub.sh
A note on Infiniband vs. Infinipath: The Biowulf cluster has two different high-speed networks on two subsets of cluster nodes that are based on the Infiniband standard: Infinipath and Infiniband (often referred to as IPath and IB respectively). While the hardware that drives the two distinct networks is very similar, the underling software is not. Programs compiled for one will not work on the other (see the Development page under "MPI" for information on building for these target networks).
The Infinipath network was designed specifically as a message-passing network: the manufacturer implemented only those portions of the Infiniband standard necessary to create a network for parallel applications. The result is an ultra low-latency network with relatively high bandwidth (~8 Gb/s) that they chose to call "Infinipath" (Pathscale being the company that designed the product).
The Infiniband network is the more recent addition to the Biowulf cluster and it implements the full Infiniband standard. This network is also very low-latency (though theoretically not as low as the Infinipath network) with very high bandwidth (~16 Gb/s). The real-word performance difference between these two networks will often (perhaps usually) be negligible.
The Infinipath nodes on Biowulf use OpenMPI and a vendor-supplied MPI implementation which is grounded in code from the MPICH1 project. OpenMPI is the recommended implementation for new development and staff supplied programs will usually be built with it.
OpenMPI for InfinipathTo execute an MPI job on the Infinipath cluster via OpenMPI, you need to explicitly call the appropriate mpirun program:
#!/bin/bash #PBS -N MyJob #PBS -k oe cd $PBS_O_WORKDIR /usr/local/OpenMPI/current/gnu/ipath/bin/mpirun -machinefile $PBS_NODEFILE \ -np $np MyProg < MyProg.in >& MyProg.outThe qsub line will look something like this:
% qsub -v np=64 -l nodes=32:ipath myjob.batOpenMPI is very intelligent about run-time linking however it is necessary to provide the full path to mpirun when executing parallel programs. This explicitly tells OpenMPI to use the run-time from the same installation regardless of the system's default library paths. This is extremely helpful on systems like Biowulf which support many interconnects and architectures.
Pathscale Infinipath MPIdepriciated: use OpenMPI above unless you have a dependancy on Pathscale-related run-times. This MPI uses /usr/bin/mpirun, found in the defaul PATH, to launch jobs. A qsub wrapper for an Infinipath job using the depriciated Infinipath MPI implimentation will look something like this:
#!/bin/bash #PBS -N MyJob #PBS -k oe cd $PBS_O_WORKDIR mpirun -machinefile $PBS_NODEFILE -np $np MyProg < MyProg.in >& MyProg.outA request for Infinipath nodes will look like this:
% qsub -v np=64 -l nodes=32:ipath myjob.batRefer to Using the Biowulf Cluster for complete information on using the batch system.
The Biowulf staff maintains two MPI implementations for the Infiniband network. This first is OpenMPI, the second is MVAPICH2, an implementation derived from MPICH2.
OpenMPI for InfinibandTo execute an MPI job on the Infiniband cluster via OpenMPI, you need to explicitly call the appropriate mpirun program:
#!/bin/bash #PBS -N MyJob #PBS -k oe cd $PBS_O_WORKDIR /usr/local/OpenMPI/current/gnu/ib/bin/mpirun -machinefile $PBS_NODEFILE \ -np $np MyProg < MyProg.in >& MyProg.outThe qsub line will look something like this:
% qsub -v np=128 -l nodes=16:ib myjob.batOpenMPI is very intelligent about run-time linking however it is necessary to provide the full path to mpirun when executing parallel programs. This explicitly tells OpenMPI to use the run-time from the same installation regardless of the system's default library paths. This is extremely helpful on systems like Biowulf which support many interconnects and architectures.
Running Infiniband MPI apps using MVAPICH2As with the Ethernet MPICH implementation the user will need set up an mpd.conf file that includes a password for use between process managers. You can set up this file with the following commands. It's not necessary to maintain this file and it is not advisable to use a password that you may use anywhere else. A simple word or even blank string is all that's necessary since this is a closed cluster and process managers are assumed to be trusted.
% echo 'password=password' > ~/.mpd.conf % chmod 400 ~/.mpd.confA minimal batch command file for an Infiniband MPI job might look like this:
#!/bin/bash #PBS -N MyJob #PBS -k oe cd $PBS_O_WORKDIR mpdboot -f $PBS_NODEFILE -n `cat $PBS_NODEFILE | wc -l` mpiexec -n $np MyProg < MyProg.in >& MyProg.out mpdallexitThe batch job is submitted with a qsub command that might look something like this. Note that the "ib" property is used to request an Infiniband node and that the IB nodes have 8 cores each.
% qsub -v np=64 -l nodes=8:ib myjob.bat
Refer to Using the Biowulf Cluster for complete information on using the batch system.
See the Programming Tools and Libraries page.
Particularly with parallel programs, it's important that you benchmark your job to make sure subsequent runs perform with an acceptable degree of efficiency. Specific benchmarking information can be found, if applicable, on the applications page, if the program is maintained by us.
Users who run parallel programs on multiple nodes must run their own benchmarks to determine the optimal number of nodes. To run a benchmark, a short job should be run on 1, 2, 4, 8, 16, 32 etc. processors. The '/usr/bin/time' command should be used in the batch script to measure the wallclock usage, e.g.
/usr/bin/time mpirun -machinefile $PBS_NODEFILE -np $np /usr/local/gromacs/bin/mdrun_mpi \ -s topol.tpr -o traj.trr -c out.after_md -v > output
time for 1 processor Efficiency = --------------------------- * 100 n * time for n processorsThe optimal number of processors is the point where the efficiency is at least 70%. An example of the results from a benchmark can be found at the Biowulf NAMD benchmark page.
Biowulf nodes are heterogeneous with respect to architecture (x86_64, i686), processor speed, memory size, and networks. If you are running benchmarks, you will want to specify processor speed and memory size so that all your benchmark runs are on the same type of node:
qsub -l nodes=4:o2200:g4 myjob.bat # run on 2.8 GHz Opteron/4 GB nodes qsub -l nodes=4:o2800:m2 myjob.bat # run on 2.8 GHz Opteron/2 GB/core nodes