Biowulf User Guide

		Search:

User Guide Contents

Programming tools/libraries

Benchmarking

About:

The NIH Biowulf cluster is a GNU/Linux parallel processing system designed and built at the National Institutes of Health and managed by the Helix Systems Staff. Biowulf consists of a main login node and 2300 compute nodes with a combined processor core count of over 12000. The computational nodes are connected to high-speed networks and have access to high-performance fileservers.

Citing Biowulf and Scientific Publications

top

The continued growth and support of NIH's Biowulf cluster is dependent upon its demonstrable value to the intramural program. In the past we have been successful in obtaining support by citing published work which involved the use of our systems.

When publishing data resulting from calculations done on the NIH Biowulf cluster, please use the following citation in the article:
This study utilized the high-performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, Md. (http://biowulf.nih.gov).
Please report the publication of articles which made use of the cluster to the Helix Systems Staff by citing the reference in email to staff@biowulf.nih.gov.

Accounts

top

Biowulf accounts require a pre-existing Helix account. A Biowulf account can be obtained by registering on the web here. Biowulf accounts not accessed within a 60 day period are automatically locked as a security precaution. A user may request to have his account unlocked by sending email to staff@biowulf.nih.gov.

Logging in

top

The Biowulf system is accessed from the NIH network via ssh to biowulf.nih.gov. We recommend the PuTTY SSH client for Windows users as it is free, feature-rich, and widely used. Mac and Unix/Linux users can use the ssh client available from the command line in a terminal window.

Xwindows connection software is also recommended for GUI/visualization applications that are run on the login or computational nodes (e.g. x-povray or interactive SAS) display on your desktop workstation. We provide free Xwindows software for Macs and PCs to Helix Systems users.

Password Information

Users log in to Helix and Biowulf using their NIH domain username and password.

Shells

The Bourne-Again SHell (bash) is the default shell on most Linux systems including Biowulf. Other available shells, including csh and tcsh, are listed in /etc/shells. To change your default shell, use the command:

chsh -s shell_name

where shell_name is the full pathname of the desired shell (e.g., /bin/csh).

Home directories

Biowulf home directories are in a shared NFS (Network File System) filespace, therefore the access to files in your home directory is identical from any node in Biowulf. Note that your Biowulf home directory is the same directory as your Helix home directory.

Mail

E-mail can be sent from any node on Biowulf. However, all mail sent out from Biowulf will have a return address of helix.nih.gov (rather than the address of the node). In addition, mail sent to a user on a Biowulf node will be automatically forwarded to that user on helix.nih.gov. In other words, you can send, but not receive mail on Biowulf.

If you do not read mail on helix, be sure to set up a .forward file in your helix home directory.

You can communicate with the Biowulf/Helix staff by sending email to: staff@biowulf.nih.gov

Biowulf Hardware Configuration

top

Diagram of Biowulf's network

Node configurations:

# of nodes	processor cores per node	memory	network
24	16 x 2.6 GHz Intel E5-2670 hyperthreading enabled 20 MB secondary cache	256 GB	1 Gb/s Ethernet
328	12 x 2.8 GHz Intel X5660 hyperthreading enabled 12 MB secondary cache	24 GB	1 Gb/s Ethernet
352	8 x 2.67 GHz Intel X5550 hyperthreading enabled 8 MB secondary cache	320 x 24 GB 32 x 72 GB	1 Gb/s Ethernet
250	8 x 2.8 GHz Intel E5462 12 MB secondary cache	8 GB	1 Gb/s Ethernet 16 Gb/s Infiniband
1	32 x 2.5 GHz AMD 0pteron 8380 512 KB secondary cache	512 GB	1 Gb/s Ethernet
232	4 x 2.8 GHz AMD Opteron 290 1 MB secondary cache	8 GB	1 Gb/s Ethernet
471	2 x 2.8 GHz AMD Opteron 254 1 MB secondary cache	40 x 8 GB 226 x 4 GB 91 x 2 GB	1 Gb/s Ethernet 160 x 8 Gb/s Infinipath
289	4 x 2.6 GHz AMD Opteron 285 1 MB secondary cache	8 GB	1 Gb/s Ethernet
389	2 x 2.2 GHz AMD Opteron 248 1 MB secondary cache	129 x 2 GB 66 x 4 GB	1 Gb/s ethernet
91	2 x 2.0 GHz AMD Opteron 246 1 MB secondary cache	48 x 1 GB 43 x 2 GB	1 Gb/s Ethernet

All nodes are connected to a 1 Gb/s switched Ethernet network while sub-sets of nodes are on high-performance Infinipath or Infiniband networks. The hostnames for these interfaces are p2 through p2576. The hostname for biowulf's internal interface is biowulf-e0 (i.e., nodes wishing to communicate with the login machine, should use the name biowulf-e0 rather than biowulf).

The ethernet switches are Brocade (Foundry) FCX 648S, and FESx 448 switches. The backbone switch is a Brocade (Foundry) RX-32 using 10 Gb/s fiber links to the FCX/FESxs edge switches.

For those applications that can benefit from increased bandwidth and lower latency than Ethernet, there are two high performance switched networks available:

250 8-core nodes are on a DDR Infiniband network connected to a Voltaire ISR 2012 infiniband switch. DDR Infiniband provides 16 Gb/s bandwidth at very low latencies.
134 dual-processor nodes are connected to and Infiniband network via QLogic InfiniPath adapters. This network is similar to the above mentioned Infiniband network but with two major differences: it is slightly lower latency but runs at a lower bandwidth as well (8 Gb/s max).

The login node (biowulf.nih.gov) is an HP DL380 (two 2.67 GHz Xeon X5550, eight cores) with 24 GB RAM. It is a minimal system designed for log-in, light-weight tasks, 64-bit compiling/development and the submission of jobs to the batch system for runs on the cluster. No compute intensive jobs are permitted on the login node.

Disk Storage

top

Storage Contents

Summary of file storage options

/scratch

/data

Hierarchical Storage Management

Checking disk usage

Quota increase request

There are several options for disk storage on Biowulf; please review this section carefully to decide where to place your data. Contact the Biowulf systems staff if you have any questions.

Except where noted, there are no quotas, time limits or other restrictions placed on the use of space on the Biowulf system, but please use the space responsibly; even hundreds of terabytes won't last forever if files are never deleted. Disk space on the Biowulf system should never be used as archival storage.

Users who require more than the default disk storage quota should fill out the online storage request form.

Summary of file storage options

	Location	Creation	Backups	Speed	Space	Available from
/home	network (NFS)	with Helix account	yes	high	8 GB default quota	B,C,H
/scratch (nodes)	local	created by user	no	medium	30 - 130 GB dedicated while node is allocated (**)	C
/scratch (biowulf)	network (NFS)	created by user	no	low	500 GB shared	B,H
/data	network (NFS)	with Biowulf account	yes	high	100 GB default quota	B,C,H

H = helix, B = biowulf login node, C = biowulf compute nodes

(**) 3 TB on DL-785

/scratch (nodes)

Each Biowulf node has a directly attached disk containing a /scratch filesystem. Note that scratch space is not backed up to tape, and thus, users should store any programs and data of importance in their home directories. Use of /scratch on the batch nodes should be for the duration of your job only. It is your job's responsibility to check for sufficient disk space. Your job may delete any and all files from /scratch to make space available (use the clearscratch command). Please use /scratch instead of /tmp for storage of temporary files.

/scratch on the login node (biowulf.nih.gov) is actually a shared (NFS) filesystem, accessible from both the login node and Helix. Files on this filesystem which have not been accessed for 14 days are automatically deleted by the system.

/data

These are RAID-6 filesystems mounted over NFS from one of the following, all of which are configured for high availability: two NetApp FAS3050 and eight FAS3170 filers, and a DataDirect Networks S2A9900 storage system with eight fileservers. This storage offers high performance NFS access, and is exported to Biowulf over a dedicated high-speed network. /data is accessible from all computational nodes as well as Biowulf and Helix, and will be the filesystem of choice for most users to store their large datasets. Biowulf users are assigned an initial quota of 100 GB on /data; please contact the Biowulf staff if you need to increase your quota.

Note: your /data directory is actually physically located on filesystems named /spin1, /gs1, or /gs2. The /data directory consists of links to one of those filesystems. ALWAYS refer to your data directory through the /data links as opposed to the physical location because the physical location is subject to change based on administrator needs. In other words, use /data/yourname rather than (for example) /gs1/users/yourname in your scripts.

In addition to all data being regularly backed up to tape, snapshots are available that allow users to recover files that have been inadvertently deleted. For more information on backups and snapshots, please refer to the File Backups section on the Helix website.

If a user's data directory is located on the /spin1 storage area, it is possible that the size of snapshots in the user's data directory may cause a user to have less available storage space than he believes he should.

Each data directory on /spin1 has its own snapshot space that is separate from a user's regular file or data space. However, if a user deletes too much data within a short period of time (a week or less), it is possible that the snapshot space for that data directory can be exceeded. When this occurs, a user's regular file space (allocated quota space) will be used as additional snapshot space. Unfortunately there is no way for a user to either determine whether he has encountered this situation or to actually delete his snapshots. However, if a user does not see an expected increase in available space in the output of the 'checkquota' command following the cleanup of large files, the possibility exists that he has encountered this situation and should contact the Helix/Biowulf staff for assistance.

Checking your disk storage usage

Use the checkquota command to determine how much disk space you are using:

$ checkquota

Mount                   Used      Quota  Percent    Files    Limit 
/data:               70.4 GB   100.0 GB   70.41%   307411  6225917
/data(SharedDir):    11.4 GB    20.0 GB   56.95%     6273  1245180
/home:                2.0 GB     8.0 GB   25.19%    11125      n/a
mailbox:             74.7 MB     1.0 GB    7.29%

Using the Biowulf Cluster

top

Contents

Common terms

Batch system

Job submission/management

Very Large Memory Jobs

The Biowulf Cluster is highly heterogeneous with respect to processor speed, memory size and networks. Nodes can contain between 2 and 32 processor cores with memory sizes from 1G to 512G, on message-passing networks that range from very fast (16 Gb/s) to moderate (1 Gb/s) in bandwidth. This can initially create confusion on the part of users in deciding on what nodes the need to run their work. However, once the cluster hardware is understood it should be quite easy for users to ascertain what nodes are suitable for their purpose without over-using resources or denying resources to others.

Here are some common errors that beginners make when using the Biowulf cluster:

Running fewer processes on nodes than there are processors. Please make sure you understand that the cluster nodes each have 2,4 or 8 processors, and that you need to be using all of them to maximize your productivity. There is one exception: when you need more memory per process than is possible to allocate without leaving processors idle (e.g. you need 8G of RAM per process to do your job - you can allocate dual-processor 8G nodes, leaving one processor idle.)
Not monitoring your jobs. If your MPI job is running poorly because it doesn't scale well to the requested number of processors, then you're wasting resources and denying them to others. This applies to single-process jobs as well. If you're not sufficiently loading the system(s) you're running on, then you need to know about it to correct the problem(s).
Consuming resources with minimal gain. Please keep jobs running efficiently: doubling the number of processors in a parallel job to take 3% off of your total wall-time is not efficient computing. Cluster space/time is valuable and it is the goal of this staff to maximize work done on the cluster by all users.
Not understanding the batch system. Please read the documentation on this page so that you understand how to request relevant resources for a job while avoiding the irrelevant ones (e.g. requesting Infinband nodes when not using the Infiniband network or large-memory nodes when running small-memory tasks).
Not asking for help. When in doubt, drop us an email or give us a call, it's what we're here for.

Common Terms:

Clarification of terms used in this guide.

Node: A node is a computer containing one or more processors. Sometimes simply called a "machine" or "box".
Processor: A processor is a cpu, containing one or more "cores". Each core can run application process at any given time.
process: The execution instance of an application. An MPI program will typically consist of 2 or more processes. Some program processes can have more than one thread of execution.

Node categories on the Biowulf cluster:

Login nodes: These are nodes used for program development, compiling, debugging and submission of jobs to the batch system. There is currently one login node (biowulf.nih.gov). Please do not run compute-intensive code on login nodes.
Batch nodes: the vast majority of nodes in the cluster are used for production jobs run under the control of the batch queueing system. These nodes are dedicated to your job.

Using the Batch System on Biowulf

Batch (or queueing) systems allow the user to submit jobs to the cluster without regard for what resources (cpu, memory, networks) are available. When those resources become available the batch system will start the job. There are many batch systems available; the one used on the Biowulf cluster is Altair's PBS Pro.

Submitting Jobs

In order to run a job under batch you must submit a script which contains the commands to be executed. The command to do this is qsub. The qsub command on Biowulf is heavily customized and this documentation should be used in lieu of that provided by Altair. The batch script (or batch command file) is simply a Linux shell script with optional commands to PBS (called directives).

Sample Batch Script

A very minimal batch script example:

#!/bin/bash
#
# this file is myjob.sh
#
#PBS -N MyJob
#PBS -m be
#PBS -k oe
#
cd $PBS_O_WORKDIR
myprog < /data/me/mydata

One could simply run this as a shell script on the command line, however when executed by the batch system, the lines beginning with #PBS have special meaning. These directives are as follows:

-N: name of the batch job.
-m: send mail when the job begins ("b") and when it ends ("e").
-k: keep the STDOUT ("o") and STDERR ("e") files.

Other directives can be found on the qsub man page.

The environment variable $PBS_O_WORKDIR is the directory which was the current directory when the qsub command was issued.

Note that if the "myprog" program above ordinarily sends its output (STDOUT) to the terminal, that output will appear in a file called MyJob.o<jobnumber>. Any errors (STDOUT) will appear as MyJob.e<jobnumber>.

Submitting a Job with qsub

The simplest form of the qsub command is:

% qsub -l nodes=1 myjob.bat

The resource list (-l) here consists of the number of nodes (1). You must specify a number of nodes, even if it is one.

The batch system allocation is done by nodes (not processors). When you are allocated a node by the batch system, your job is the only job running on the node. Since a node may contain between 2 and 24 processing cores, you should ensure that you utilize the node fully (see the swarm program below).

Parallel jobs running on more than 2 cpus will need to specify more than 1 node. Additionally, since you will want a parallel job running on nodes of identical clock speed, you will probably want to specify a node type as well:

qsub -l nodes=8:o2800 myparalleljob

Node types currently supported are (from fastest to slowest):

x2800   2.8 GHz Xeon
e2666   2.67 GHz Xeon
o2800   2.8 GHz Opteron
o2600   2.6 GHz Opteron
o2200   2.2 GHz Opteron
o2000   2.0 GHz Opteron

Nodes can also be selected based on memory size. There are two ways to select nodes based on memory

mN: the amount of memory per processor core, where N is 1, 2, or 4
gN: the total amount of memory in a node, where N is 4, 8, 24 or 72

Specifying memory per core. Use the "m" property where is the size of memory in gigabytes. Most nodes in the cluster are "m1" (ie, 1 GB/core). You should use this property for most jobs requiring a specific memory size. It is particularly useful for submitting swarms, since 'swarm' may place your jobs on nodes with differing numbers of cores. This property ensures that each process will have access to the same amount of memory regardless of the number of cores on the node.

Specifying memory per node. Use the "g" property where is the total memory in the node in gigabytes. This property should be used when your job requires the entire memory of the node for a single process.

Examples:

m2      2 GB/core
m4      4 GB/core
g4      4 GB/node
g72     72 GB/node

(Note: Since the primary purpose of the g72 nodes is to provide a large memory, it is permissible to underutilize the number of processors. That is, it's ok to run less than 8 processes per node).

Nodes can also be selected based on network:

gige    gigabit ethernet
ib      infiniband

Properties based on cores per node:

c2      2 cores/node
c4      4 cores/node
c16     16 cores/node
c24     24 cores/node

Note: Don't make additional specifications unless you need to! In most cases, the less you specify the better. The more flexibility the scheduler has in allocating nodes for you job, the less likely your job will wind up having to wait for specific resources.

Not all combinations of processor, memory and network are available. The chart below shows available configurations (in green) along with the PBS node properties (in bold). To see how many nodes are available in each configuration, use the freen command (below).

Processor	Cores	m1 1 GB mem. per core	m2 2 GB mem. per core	m4 4 GB mem. per core	g4 4 GB total mem.	g8 8 GB total mem.	g24 24 GB total mem.	g72 72 GB total mem.	Network
x2800 2.8 GHz Xeon	c24								gige Gigabit Ethernet
e2666 2.67 GHz Xeon	c16
o2800:dc 2.8 GHz Opteron	c4
o2600:dc 2.6 GHz Opteron	c4
o2800 2.8 GHz Opteron	c2
o2200 2.2 GHz Opteron	c2
o2000 2.0 GHz Opteron	c2
2.8 GHz Xeon	8								ib Infiniband
2.8 GHz Opteron	2								ipath Infinipath

Additional Notes:

At this time, any node the system allocates to you will be a 2 core node. You have to explicitly request a 4-, 8-, 16- or 24-core node with the appropriate "c" property. The exception to this is when nodes are allocated by the swarm command, which ensures that the correct number of processes are started on each type of node.
Your job may not run properly if your startup file (.bashrc, .cshrc, etc) contains commands which attempt to set terminal characteristics or do output to STDOUT. Such commands can be skipped by testing the environment variable PBS_ENVIRONMENT:
```
#csh
if ( ! $?PBS_ENVIRONMENT ) then
   do terminal stuff here
endif
```
If your default shell is csh or tcsh, your job.o###### files will contain the warning:
```
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
```
This warning can be ignored. Alternatively, you can change your shell to bash (with the chsh command) or tell qsub to use a different shell with:
```
qsub -l nodes=1 -S /bin/sh myscript
```

Very Large Memory Jobs

For jobs requiring between 72 and 512 GB of memory there are twenty-five very large memory (VLM) nodes: 24 are configured with 256 GB and a single one with 512 GB. Each node has 32 cores and the 512 GB node has a 3 TB local /scratch. In contrast to most nodes, these nodes can be shared amongst jobs. To allocate VLM nodes the memory size requested, and optionally, the number of processors requested must be specified to qsub:

% qsub -l nodes=1,mem=128gb myvlmjob

% qsub -l nodes=1,mem=384gb,ncpus=4 bigjob.bat

Jobs requiring between 72-250 GB allocate a 256 GB node and jobs larger than 250 GB allocate the 512 GB node.

Other Options for qsub

-r y|n       this switch controls whether your job is restartable.  If 
             one of the nodes that your job is running on should hang,
             the operations staff will often restart the job.  If you
             do not want your job to be restarted, use "-r n" with qsub.

-v varlist   this switch allows you to pass one or more variables from
             the qsub command line to your batch script.  For instance,
             with "qsub -v np=4 myjob", the myjob script can use the $np
             variable with a value of 4.  You can also list an environment
             variable without a value, and that envvar will be exported
             with its current value.

-I           for interactive jobs, see below.

-V           export all environment variables to the batch script.  Useful
             when running X11 clients on interactive nodes, see below.

See the qsub man page for additional options.

Allocating Nodes for Interactive Use with PBS

Using the batch system for interactive use may sound like an oxymoron, but this allows you to run compute-intensive processes without overloading the Biowulf login node.

The following example shows how to allocate an interactive node:

biobos$ qsub -I -V -l nodes=1
qsub: waiting for job 2011.biobos to start
qsub: job 2011.biobos ready
p139$ exit
logout
qsub: job 2011.biobos completed
biobos$

Note the following:

The qsub command, used with the -I switch, automatically logs you in to the first allocated node.
Use the -V switch to copy your env variables to the interactive batch session (useful when running X11 applications)
As with all batch jobs, the node is dedicated to the user's job.
The default allocation is a 2-core, 4 GB node. Other nodes may be allocated using the appropriate node specification to the qsub command.
The maximum allowable number of interactive jobs is 2.
By default, the job will be terminated after 8 hours. You may extend this time to a maximum of 36 hours by using the "walltime" attribute to the qsub command.
The -V switch exports all current environment variables to the batch session. This is required in order to run an X11 client on the interactive node.
As in batch, the $PBS_NODEFILE variable is the name of a file which contains the allocated nodes.
The nodes are unallocated when the interactive session is ended using the exit command. Please remember to exit your interactive sessions so that node resources are freed up.

Other PBS Commands

See the man pages for details on the following PBS commands:

qstat -u [username]            list jobs belonging to username
qdel [jobid]                   delete job jobid
qdel -Wforce [jobid]           delete a job on a hung node
qselect -u [user] -s [state]   select jobs based on criteria

For example, to delete all of your current jobs:

qdel `qselect -u myname`

To delete only your queued jobs:

qdel `qselect -u myname -s Q`

Resource Limits

There are no time limits for jobs running in the batch queue. However, while debugging a program, or if there is otherwise a possibility that your job could "runaway" due to a programming error, please use the walltime switch to limit the time your job can run before it is terminated by the batch system. For example, to limit your job to 72 hours use "-l walltime=72:00:00" as an argument to the qsub command.

Scheduling on Biowulf is done using a Fair Share algorithm. This means that, when more jobs are waiting to run than can be started, the next job to run will be the one belonging to the user with the least amount of system usage during the previous 7 days. This should allow users to submit as many jobs to the queue as they would like without concern that they will take an unfair amount of processing time.

The batch system enforces a maximum number of jobs and cpus (or cores) allocated to each user. This number can vary depending on system load and other factors. To see the current limits, use the batchlim command:

$ batchlim
Batch system max jobs  per user:  4000
Batch system max cores per user:  1280

            Max Cores
Queue        Per User
----------- ---------
norm            1280
ib               384
vol              256
norm3            128
g72              128
ipath            128
gpu               96
volib             64

            Max Mem    Max Mem
            Per User   on System
            ---------- -----------
DL-785 mem  512gb      512gb

Most user jobs run in the norm queue, so in the example above, the maximum per user allocation is 300 cores (which could be 150 x 2p nodes or 75 x 4p nodes).

Note: You should never specify a queue when submitting a job. Jobs will be automatically directed to the appropriate queue by the batch system. You should only specify the minimum resources needed by your job, e.g. memory or cores.

Running Multiple Serial Jobs and the swarm Command

Because each Biowulf node has between two and 24 processor cores, but the PBS batch system allocates nodes rather than cores, submitting multiple single process jobs to the system results in a poor utilization of processors (i.e., only one processor per node).

In the case of simply running two processes on a single dual-processor node, the following example uses the wait command to prevent the batch command script from exiting before the application processes have finished:

#!/bin/bash
#
# this file is myjob.sh
#
#PBS -N MyJob
#PBS -m be
#PBS -k oe
#
myprog -arg arg1 < infile1 > outfile1 &
myprog -arg arg2 < infile2 > outfile2 &
wait

Note how this script runs 2 instances of a program by putting them in the background (using the ampersand "&"), and then using the shell wait command to make the script wait for each background process before exiting.

When running many single-threaded jobs, setting up many batch command files can be cumbersome. The swarm command can be used to automatically generate batch command files and submit them to the batch system.

The swarm Command

swarm allows you to submit an arbitrary number of serial jobs to the batch system by simply creating a command file with one command per line. swarm automatically creates batch command files and submits them to the batch system. Two commands are submitted for each node, making optimal use of the processors.

Here is an example command file:

myprog -param a < infile-a > outfile-a
myprog -param b < infile-b > outfile-b
myprog -param c < infile-c > outfile-c
myprog -param d < infile-d > outfile-d
myprog -param e < infile-e > outfile-e
myprog -param f < infile-f > outfile-f
myprog -param g < infile-g > outfile-g

The command file is submitted using the following command:

swarm -f cmdfile

The result is 4 jobs submitted to the batch system, 3 jobs with 2 processes each, and the last with a single process.

When submitting a very large swarm (1000s of processes), the bundle option to swarm should be used:

swarm -f cmdfile -b 40

If cmdfile contains 2500 commands, approximately 63 bundles of 40 commands each will be created and submitted as 32 batch jobs (2 bundles per job, one for each processor on a node).

Note1: swarm will correctly allocate the correct number of processes to 2-, 4-, 16- or 24-core nodes.

Note2: if the programs you are running via swarm either (1) require more than 1 GB memory, or (2) are multi-threaded, you must specify additional information to swarm.

See the swarm documentation for more details.

Resubmitting part of a swarm: If a few jobs in your swarm fail to complete, you can determine the batch script associated with those jobs using the 'submitinfo' command. You can then resubmit just those jobs. For example, 2 jobs in this swarm, with job ids 4104732 and 4104738, did not complete successfully due to a typo in the original swarm command file. You can edit the batch scripts and resubmit just these two jobs.

biowulf% submitinfo 4104732
4104739:  (/home/susanc) qsub -l nodes=1:c16 ./.swarm/ctrl-swarm48n32646

biowulf% submitinfo 4104738
4104738:  (/home/susanc) qsub -l nodes=1:c16 ./.swarm/ctrl-swarm47n32646

[...edit the 2 batch scripts ctrl-swarm48n32646 and ctrl-swarm47n32646...]

biowulf% qsub -l nodes=1:c16 /home/susanc/.swarm/ctrl-swarm48n32646
410836.biobos
biowulf% qsub -l nodes=1:c16 /home/susanc/.swarm/ctrl-swarm47n32646
410837.biobos

The multirun command

If you wish to submit more than 2 single-threaded jobs but want them under control of a single job, then an mpi "shell" can be used (note: In many cases this will not be an optimal use of resources. Unless all processes exit at roughly the same time, idle nodes will not be freed by the batch system until the last process has exited).

The basic procedure is as follows (generation of these scripts can be done automatically by writing a higher order script):

1. Create an executable shell script which will run multiple instances of your program. Which will run depends on the "mpi task id" of the instance.

#!/bin/tcsh
#
# this file is run6.sh
#
switch ($MP_CHILD)
 case 0:
 your_prog with args0
breaksw
case 1:
your_prog with args1
breaksw
case 2:
your_prog with args2
breaksw
case 3:
your_prog with args3
breaksw
case 4:
your_prog with args4
breaksw
case 5:
your_prog with args5
breaksw
endsw

2. Use mpirun in your batch command file to run the mpi shell program (multirun):

#!/bin/tcsh
#
# this file is myjob.sh
#
#PBS -N MyJob
#PBS -m be
#PBS -k oe
#
set path=(/usr/local/mpich/bin $path)
mpirun -machinefile $PBS_NODEFILE -np 6 \
 /usr/local/bin/multirun -m /home/me/run6.sh

3. Submit the job to the batch system:

qsub -l nodes=3 myjob.sh

This job will run 6 instances of the program on 3 nodes.

File Transfers

A modified version of the 'scp' file transfer program called 'scp-hpn' is available for high-performance transfers averaging over 100 MB/s on 1 Gb/s network paths. Since the program requires connecting to a special network port on the biowulf login node, the client should be installed on your workstation or server and files pushed/pulled from that system.

Download scp-hpn from http://biowulf.nih.gov/hpn.tar.gz (or hpn_i386.tar.gz for 32-bit systems), place the hpn.tar.gz file in the top level of your home directory on your Linux workstation or server and run the command "tar xfvz hpn.tar.gz". See the README file included in the distribution for additional details.

Please note that the provided version of the program was built on Biowulf which runs RedHat Linux 5. As such, it will work fine on other RedHat or derivative linux distributions. However, it is not guaranteed to work on other Linux distributions (or Mac) and you will likely need to build your own program binaries if you run a non-RedHat system. Additional information may be found here: http://www.psc.edu/networking/projects/hpn-ssh/.

Licensed Software

Some licensed products such as Matlab are available on the Biowulf cluster. There are a limited number of licenses. Thus, each user who wants to submit a job that requires a license needs to specify that resource on the qsub command line. For example, a job requiring a Matlab license would be submitted as

biowulf% qsub -l nodes=1,matlab=1 jobscript

The individual application pages contain details about the resources that need to be specified.

The utility 'licenses' can be used on Biowulf to get a list of total and free licenses. Some licenses are limited so that an individual user can only run a certain number of simultaneous jobs; these limits are also listed in the 'licenses' output.

[user@biowulf]$ licenses
 License                     Resource         Total    Free  Per-User Limit
---------------------------------------------------------------------------
Mathematica                  math                 7      5      4
Matlab                       matlab              23     20      5
Bioinformatics Toolbox       matlab-bio           2      0      -
Compiler                     -                    1      1      -
Curve Fitting Toolbox        matlab-curve         3      2      2
Identification Toolbox       matlab-ident         2      2      -
Image Toolbox                matlab-image         4      3      2
Neural Network Toolbox       matlab-neural        2      2      -
Optimization Toolbox         matlab-opt           4      3      2
PDE Toolbox                  matlab-pde           2      2      -
Signal Toolbox               matlab-signal        6      6      3
SimBiology Toolbox           matlab-simbio        1      1      -
Simulink                     matlab-simulink      1      1      -
Statistics Toolbox           matlab-stat          7      4      4
Symbolic Toolbox             matlab-symb          2      2      -
Wavelet Toolbox              matlab-wave          2      2      -
SAS                          sas                 48     48     48

Monitoring Your Jobs

top

Once a batch job has been submitted it can be monitored using both command line and web-based tools.

Using either the jobload command or the user job monitor (see below), you can determine the overall behavior of your job based on the load of each node. Perfectly behaving jobs will have loads of near 100% on all nodes. If the nodes in a parallel job are running at loads below 75% (or are green), the job probably isn't scaling well, and you should rerun the job with fewer nodes. Loads of less than 75% for non-parallel jobs may mean that not all processors are being used, or that the job is i/o bound. Contact the Biowulf staff for help in deciding whether your job is making best use of its resources.

The sections below provide information about various ways of monitoring your jobs.

qstat

The qstat command reports the status of jobs submitted to the batch system. 'qstat -a' shows the status of all batch jobs; 'qstat -n' shows, in addition, the nodes assigned to each job. Qstat with the "-f" switch gives detailed information about a specific job, including the assigned nodes and resources used. See the man page for 'qstat' for more details.

freen

The freen command can be used to determine the currently available nodes (free/total):

% freen
Free Nodes by memory-per-core
      Cores     m0.5      m1      m2      m4   Total
------------------ DefaultPool -----------------------
o2800    2       /       /     13/211    /     13/211
o2200    2       /    117/300  62/62     /    179/362
o2000    2      7/42   38/38     /       /     45/80 
------------------ Infiniband  -----------------------
ib       8       /     24/250    /       /     24/250
ipath    2       /       /     69/126    /     69/126
------------------  OnDemand   -----------------------
e2666    8       /       /      0/320    /      0/320
o2800:dc 4       /       /     19/226    /     19/226
o2800    2       /     53/90     /     40/40   93/130
o2600:dc 4       /      0/41   37/212    /     37/253


Free nodes by total node memory
      Cores       g4      g8     g24     g72   Total
------------------ DefaultPool -----------------------
o2800    2     13/211    /       /       /     13/211
o2200    2     62/62     /       /       /     62/62 
------------------  OnDemand   -----------------------
e2666    8       /       /      0/320  11/32   11/352
e2400    8       /       /      0/42     /      0/42 
o2800:dc 4       /     19/226    /       /     19/226
o2600:dc 4       /     37/253    /       /     37/253

--------------------  DL-785  ------------------------
Available: 31 processors, 504.2 GB memory

jobload

The jobload command is used to report the load on nodes for a job or for a user:

$ jobload juser
Jobs for  juser     Node    Cores   Load    

2383856.biobos      p1922     4     100%
2383858.biobos      p1926     4     100%
2383859.biobos      p1918     4     100%
2383862.biobos      p1920     4      99%
2383863.biobos      p1928     4      79%
2383865.biobos      p1919     4      85%
Core Total: 24     User Average:     93%

All but 2 nodes allocated to this user are running to capacity.

$ jobload 869186
869186.biobos       Node    Load    
                    p554    100%
                    p555    100%
                    p557    100%
                    p564    100%
                    p565    100%
                    p566    101%
                    p567    100%
                    p568    100%
            Job Average:    100%

This shows a well-behaved parallel job.

# jobload 763334
763334.biobos       Node    Load    
                    p1495    25%
                    p1497     0%
                    p1498     0%
                    p1500     0%
                    p1501     0%
                    p1503     0%
                    p1504     0%
                    p1507     0%
            Job Average:      3%

This last example shows a improperly running parallel job.

Monitoring memory usage

If your jobs require less than 0.5 GB of memory per process, they will run successfully on any Biowulf node. If they require a minimum per-core or total node memory, you will need to specify the node type to ensure that the jobs run on a node with sufficient memory.

Running jobload with the -m option will show the memory usage of your jobs.

biowulf% jobload -m user1
Jobs for  user1     Node    Cores   Load    Memory  
                                          Used/Total (G)
1898152.biobos      p1747     4      25%    5.7/8.1 
1898156.biobos      p1752     4      25%    4.7/8.1 
1898157.biobos      p1753     4      25%    5.9/8.1

The jobs above for user1 are appropriately sized for the allocated nodes.

biowulf% jobload -m user2
Jobs for  user2      Node    Cores   Load    Memory  
                                          Used/Total (G)
2016779.biobos      p889      8      12%    2.2/74.2
2016780.biobos      p895      8      12%    2.1/74.2
2016781.biobos      p887      8      12%    2.1/74.2
2016782.biobos      p881      8      12%    2.1/74.2

The jobs for user2 have allocated the g72 nodes, but are only using ~2 GB of memory on each node. This is an inappropriate use of resources.

Some jobs may fluctuate in their memory usage. The batch system will track the resources used for each job, and if the -m e flag to qsub is used, this information will be emailed to the user at the end of the job. The value resources_used.mem will report the peak memory used by the job. This value should be used to allocate nodes with appropriate memory.

For example, the job below was submitted with qsub -m be -l nodes=1:g8 myscript. At the end of the job, the user will get a mail message along the following lines:

From: adm 
Date: October 13, 2010 9:39:29 AM EDT
To: user3@biobos.nih.gov
Subject: PBS JOB 2024340.biobos

PBS Job Id: 2024340.biobos
Job Name:   memhog2.bat
Execution terminated
Exit_status=0
resources_used.cpupercent=111
resources_used.cput=00:18:32
resources_used.mem=7344768kb
resources_used.ncpus=1
resources_used.vmem=7494560kb
resources_used.walltime=00:17:05

The batch system reports that the job has used 7344768kb, or 7.3 GB. This job is appropriately sized for the 'g8' nodes.

jobcheck

The jobcheck command can be used to check the status of exited jobs. In particular it can be used to identify jobs that have been deleted due to their having exceeded node memory limits.

Show info for a particular job (the "D" indicates the job was probably deleted due to exceeding memory):

$ jobcheck 1019880
   Job ID        Job name      Walltime cpu%  MemLim MemUsed Estat
------------ ---------------- --------- ---- ------- ------- -----
1019880      sw3n27536         19:15:45  106   22592   22598   271 D

Show all jobs exiting with a non-zero exit status for the past 2 days:

$ jobcheck -d 2
Showing jobs for the past 2 days.
   Job ID        Job name      Walltime cpu%  MemLim MemUsed Estat
------------ ---------------- --------- ---- ------- ------- -----
1019880      sw3n27536         19:15:45  106   22592   22598   271 D
1019899      sw12n29227         1:00:25   96   22592   22721   271 D
1024544      sw1n25060          0:03:32  420   22592    8835   271

Show all jobs which have exited during the past day:

$ jobcheck -d 1 -a
Showing ALL jobs including those with a zero exit status.
Showing jobs for the past 1 days.
   Job ID        Job name      Walltime cpu%  MemLim MemUsed Estat
------------ ---------------- --------- ---- ------- ------- -----
1019880      sw3n27536         19:15:45  106   22592   22598   271 D
1024543      sw1n23515          4:08:50 1210   22592   10148     0
1024544      sw1n25060          0:03:32  420   22592    8835   271
1024588      sw1n26558          5:02:38 1409   22592   10339     0
1024599      sw1n27070          1:31:56  504   22592    6436     0

There are other options as well:

$ jobcheck
Usage:  jobcheck jobnumber
        jobcheck [-a] -d days
        jobcheck [-a] -s start [-e end]

where:  -a  show all jobs including zero exit status
        -d  last N days to search
        -s  datetime to begin search in 'YYYY-MM-DD hh:mm' format
        -e  datetime to end search (optional)
        -v  display start/end times

cluster monitor (web-based)

The web page at http://biowulf.nih.gov/sysmon lists several ways of monitoring the Biowulf cluster. The "matrix" view of the system shows the load of each node in discrete blocks of 100. For instance, block 17, row 3, column 4 (node p1734) shows a green dot, indicating that the system is 50% loaded (in this case a load of 2, where a load of 4 would be ideal). A healthy job would show yellow dots, an over-loaded job shows red and unused or severely under-loaded systems are indicated with a blue dot.

Biowulf load monitor showing the loads of each node on the cluster.

The color of the dot indicates the load percentage:

white - the node is down
blue - idle (load <10%)
green - load >10% <60%
yellow - load >60% <110%
red - load >110%

Rolling over a dot will expose the node name and the system load (2-processor systems are at 100% load when the system load is 2, 4 processor systems are at 100% at 4, 8-processors systems at 8, etc).

Clicking on the dot corresponding to a node will result in a display of the process, cpu, disk and memory status for that node (output from the top and df commands).

The status of specific batch jobs can be monitored by first clicking on List Status of Batch Jobs which gives output similar to qstat, and then clicking on the Job ID of the job of interest. This results in the display of a matrix which contains dots only for the nodes allocated to the job. The loads of those nodes can be monitored in same way as for the system as a whole.

Biowulf load monitor showing a job monitor matrix.

Finally, the sum total of nodes allocated across all jobs to a user can be monitored by clicking on Username.

Again, for both JobID and Username monitoring, clicking on a dot corresponding to a node results in a display of information about the node.

Running Parallel Jobs

top

Contents

Running MPI over Ethernet

Running MPI over Infinipath

Running MPI over Infiniband

On clusters like Biowlf, nodes are physically separated from one another with the exception of a high-level network link (Ethernet, Infiniband etc). Since it is impossible to "share" memory between processes on separate nodes, processes must explicitly pass memory, or "messages" between themselves over the network link via a common protocol. In most cases, the protocol used is MPI (Message Passing Interface).

There are four possible network links available on Biowulf that parallel applications can be built for (see node configurations chart above for numbers and types of nodes) :

Ethernet - most common, available on all nodes but is the least scalable. Most nodes run at 1 Gb/s.
Infinipath - Available on a subset of nodes, 8 Gb/s very low-latency.
Infiniband - Available on a subset of nodes, 16 Gb/s very low-latency.

For more information on MPI on Biowulf, see the Libraries and Development page. This page pertains mostly to developers but is also useful to users that want to know a little more about the compilers, MPI implementations and libraries available on Biowulf.

Running Ethernet MPI Applications

The Biowulf staff maintains two major Ethernet MPI implementations for our users. The first is MPICH2, an implementation developed at Argonne National Laboratories. The second is OpenMPI, a popular implementation with very active support and development. MPI-based software built and maintained by the Biowulf staff will use one of these implementations. The instructions here are generic, if you're running a program maintained by us, check the Applications Page for specific instructions on launching that application.

Running OpenMPI Applications

Running an application built with OpenMPI for Ethernet requires only an "mpirun" command. On Biowulf, a qsub wrapper for launching such an application will look something like this:

#!/bin/bash #PBS -N test_job #PBS -k oe cd $PBS_O_WORKDIR /usr/local/OpenMPI/current/gnu/eth/bin/mpirun -n $np -machinefile $PBS_NODEFILE \ /full/path/to/executable > output_file

The qsub submission will look something like this:

[janeuser@biowulf ~]$ qsub -v np=32 -l nodes=16:gige /path/to/qsub/script/qsub.sh

OpenMPI is very intelligent about run-time linking however it is necessary to provide the full path to mpirun when executing parallel programs. This explicitly tells OpenMPI to use the run-time from the same installation regardless of the system's default library paths. This is extremely helpful on systems like Biowulf which support many interconnects and architectures.

Refer to Using the Biowulf Cluster for complete information on using the batch system.

Running MPICH2 Applications:

Using MPICH on Biowulf requires one element of special setup: the mpd.conf file.

Setting the MPD password file

If this is the first time you've used MPICH2 you'll need to create a file in your home directory (~/.mpd.conf) that consists of a single line. This command should do the trick:

[janeuser@biowulf ~]$ echo 'password=<password>' > ~/.mpd.conf

This password is used internally by the process controllers and does not have to be remembered. Indeed, you'll quickly forget that it exists. Once it's created you'll need to set the permissions so that only you can read it:

[janeuser@biowulf ~]$ chmod 600 ~/.mpd.conf

Running the MPI processes

Our MPICH2 installations use the MPD process manager to launch and control MPI processes. This requires three discrete steps to run a job: First, start the mpd managers on each node in your job with the "mpdboot" command. Then launch your job with "mpiexec." The last step is to tear-down your mpd managers with "mpdallexit."

Note: you will need to set your PATH to include the MPI installation used to build the application (the applications page entry for the program you're trying to run will tell you this if you did not build the program yourself). While users familiar with MPI will note that this is not strictly true, it is the only sure way to know you're getting the correct versions of the process managers and mpiexe scripts.

A minimal qsub script will look something like this:

#!/bin/bash
#PBS -N test_job
#PBS -k oe

export PATH=/usr/local/mpich2-intel64/bin:$PATH
mpdboot -f $PBS_NODEFILE -n `cat $PBS_NODEFILE | wc -l`
mpiexec -n $np /full/path/to/executable
mpdallexit

The qsub submission will look something like this:

[janeuser@biowulf ~]$ qsub -v np=8 -l nodes=4:gige /path/to/qsub/script/qsub.sh

Infinipath and Infiniband MPI

A note on Infiniband vs. Infinipath: The Biowulf cluster has two different high-speed networks on two subsets of cluster nodes that are based on the Infiniband standard: Infinipath and Infiniband (often referred to as IPath and IB respectively). While the hardware that drives the two distinct networks is very similar, the underling software is not. Programs compiled for one will not work on the other (see the Development page under "MPI" for information on building for these target networks).

The Infinipath network was designed specifically as a message-passing network: the manufacturer implemented only those portions of the Infiniband standard necessary to create a network for parallel applications. The result is an ultra low-latency network with relatively high bandwidth (~8 Gb/s) that they chose to call "Infinipath" (Pathscale being the company that designed the product).

The Infiniband network is the more recent addition to the Biowulf cluster and it implements the full Infiniband standard. This network is also very low-latency (though theoretically not as low as the Infinipath network) with very high bandwidth (~16 Gb/s). The real-word performance difference between these two networks will often (perhaps usually) be negligible.

Running Infinipath MPI apps

The Infinipath nodes on Biowulf use OpenMPI and a vendor-supplied MPI implementation which is grounded in code from the MPICH1 project. OpenMPI is the recommended implementation for new development and staff supplied programs will usually be built with it.

OpenMPI for Infinipath

To execute an MPI job on the Infinipath cluster via OpenMPI, you need to explicitly call the appropriate mpirun program:

#!/bin/bash #PBS -N MyJob #PBS -k oe cd $PBS_O_WORKDIR /usr/local/OpenMPI/current/gnu/ipath/bin/mpirun -machinefile $PBS_NODEFILE \ -np $np MyProg < MyProg.in >& MyProg.out

The qsub line will look something like this:

% qsub -v np=64 -l nodes=32:ipath myjob.bat

OpenMPI is very intelligent about run-time linking however it is necessary to provide the full path to mpirun when executing parallel programs. This explicitly tells OpenMPI to use the run-time from the same installation regardless of the system's default library paths. This is extremely helpful on systems like Biowulf which support many interconnects and architectures.

Pathscale Infinipath MPI

depriciated: use OpenMPI above unless you have a dependancy on Pathscale-related run-times. This MPI uses /usr/bin/mpirun, found in the defaul PATH, to launch jobs. A qsub wrapper for an Infinipath job using the depriciated Infinipath MPI implimentation will look something like this:

#!/bin/bash #PBS -N MyJob #PBS -k oe cd $PBS_O_WORKDIR mpirun -machinefile $PBS_NODEFILE -np $np MyProg < MyProg.in >& MyProg.out

A request for Infinipath nodes will look like this:

% qsub -v np=64 -l nodes=32:ipath myjob.bat

Refer to Using the Biowulf Cluster for complete information on using the batch system.

Running Infiniband MPI apps

The Biowulf staff maintains two MPI implementations for the Infiniband network. This first is OpenMPI, the second is MVAPICH2, an implementation derived from MPICH2.

OpenMPI for Infiniband

To execute an MPI job on the Infiniband cluster via OpenMPI, you need to explicitly call the appropriate mpirun program:

#!/bin/bash #PBS -N MyJob #PBS -k oe cd $PBS_O_WORKDIR /usr/local/OpenMPI/current/gnu/ib/bin/mpirun -machinefile $PBS_NODEFILE \ -np $np MyProg < MyProg.in >& MyProg.out

The qsub line will look something like this:

% qsub -v np=128 -l nodes=16:ib myjob.bat

OpenMPI is very intelligent about run-time linking however it is necessary to provide the full path to mpirun when executing parallel programs. This explicitly tells OpenMPI to use the run-time from the same installation regardless of the system's default library paths. This is extremely helpful on systems like Biowulf which support many interconnects and architectures.

Running Infiniband MPI apps using MVAPICH2

As with the Ethernet MPICH implementation the user will need set up an mpd.conf file that includes a password for use between process managers. You can set up this file with the following commands. It's not necessary to maintain this file and it is not advisable to use a password that you may use anywhere else. A simple word or even blank string is all that's necessary since this is a closed cluster and process managers are assumed to be trusted.

% echo 'password=password' > ~/.mpd.conf % chmod 400 ~/.mpd.conf

A minimal batch command file for an Infiniband MPI job might look like this:

#!/bin/bash #PBS -N MyJob #PBS -k oe cd $PBS_O_WORKDIR mpdboot -f $PBS_NODEFILE -n `cat $PBS_NODEFILE | wc -l` mpiexec -n $np MyProg < MyProg.in >& MyProg.out mpdallexit

The batch job is submitted with a qsub command that might look something like this. Note that the "ib" property is used to request an Infiniband node and that the IB nodes have 8 cores each.

% qsub -v np=64 -l nodes=8:ib myjob.bat

Refer to Using the Biowulf Cluster for complete information on using the batch system.

Programming Tools and Libraries

See the Programming Tools and Libraries page.

Benchmarking

Particularly with parallel programs, it's important that you benchmark your job to make sure subsequent runs perform with an acceptable degree of efficiency. Specific benchmarking information can be found, if applicable, on the applications page, if the program is maintained by us.

Users who run parallel programs on multiple nodes must run their own benchmarks to determine the optimal number of nodes. To run a benchmark, a short job should be run on 1, 2, 4, 8, 16, 32 etc. processors. The '/usr/bin/time' command should be used in the batch script to measure the wallclock usage, e.g.

/usr/bin/time  mpirun -machinefile $PBS_NODEFILE -np $np /usr/local/gromacs/bin/mdrun_mpi \
      -s topol.tpr -o traj.trr -c out.after_md -v >  output

The wallclock time will then be reported in the job standard output. The efficiency should be calculated using the formula below for each data point.

                 time for 1 processor
Efficiency = --------------------------- * 100
              n * time for n processors

The optimal number of processors is the point where the efficiency is at least 70%. An example of the results from a benchmark can be found at the Biowulf NAMD benchmark page.

Biowulf nodes are heterogeneous with respect to architecture (x86_64, i686), processor speed, memory size, and networks. If you are running benchmarks, you will want to specify processor speed and memory size so that all your benchmark runs are on the same type of node:

qsub -l nodes=4:o2200:g4 myjob.bat   # run on 2.8 GHz Opteron/4 GB nodes

qsub -l nodes=4:o2800:m2 myjob.bat   # run on 2.8 GHz Opteron/2 GB/core nodes