xCAT HPC Benchmark HOWTO (WIP)

It is generally a good idea to verify that the cluster you just built actually can do work.  This can be accomplished by running a few industry accepted benchmarks.  The purpose of benchmarking is not to get the best results, but to get consistent repeatable accurate results that are also the best results.

With a goal of "consistent repeatable accurate results" it is best to start with as few variables as possible.  I recommend starting with single node benchmarks, e.g. STREAM.  If all machines have similar STREAM results, then memory can be ruled out as a factor with other benchmark anomalies.  Next work your way up to processor and disk benchmarks, then multinode (parallel) benchmarks, then finally HPL.  After each more complicated benchmark runs, check for "consistent repeatable accurate results" before continuing.
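One way to put a number on "consistent repeatable" is to run the same benchmark several times and look at the spread.  The following is only a sketch: run_stats is a name made up for this example, it expects one result per line on stdin, and how much spread is acceptable is your call.

```shell
#!/bin/sh
# run_stats: hypothetical helper, not part of xCAT or any benchmark.
# stdin: one numeric result per line (e.g. one MB/s figure per run)
# Prints the mean, the population standard deviation, and the
# coefficient of variation (stddev as a percentage of the mean).
run_stats() {
    awk '
        { x = $1 + 0; n++; sum += x; sumsq += x * x }
        END {
            if (n == 0) exit 1
            mean = sum / n
            var = sumsq / n - mean * mean
            if (var < 0) var = 0        # guard against rounding error
            sd = sqrt(var)
            cv = (mean != 0) ? sd * 100 / mean : 0
            printf "mean %.2f stddev %.2f cv %.1f%%\n", mean, sd, cv
        }'
}

# Usage (results.txt is a hypothetical file of per-run results):
#   run_stats < results.txt
```

A small coefficient of variation across runs suggests the result is repeatable; a large one means there is still a variable to track down before moving to the next benchmark.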

Outlined below is a recommended path.

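Single Node (serial) Benchmarks:

STREAM
NPB Serial
NPB OpenMP
bonnie++

Parallel Benchmarks:

Ping-Pong
Pallas MPI Benchmark
NAS Parallel
High Performance Linpack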

Prerequisites:

Feel free to follow your own path, but please start with the Sanity Checks and PBS IANS.

Sanity Checks

PBS IANS

Benchmarks

Sanity Checks

Hardware Check

No network, no cluster.  The stability of the network is critical.  As root type:

ppping noderange

Where noderange is a list of all nodes in the queuing system, plus any nodes that users will be submitting jobs from.  ppping (parallel parallel ping) is an xCAT utility that will tell each node in the noderange to ping all the nodes in the noderange.  No news is good news: only errors are displayed, so no output means every node can reach every other node.

If you have Myrinet, ssh to a Myrinet node and type:

ppping -i myri0 noderange

Where noderange is a list of all Myrinet nodes in the cluster.

Both tests are critical and must succeed; if they do not, you will never launch a job.
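Because ppping prints nothing on success, the check is easy to wrap in a script.  A minimal sketch, assuming only what is stated above (any output at all indicates a problem); fail_on_output is a name made up for this example and is not part of xCAT:

```shell
#!/bin/sh
# fail_on_output: hypothetical helper, not part of xCAT.
# Runs the given command, echoes whatever it printed, and returns
# non-zero if it printed anything at all (stdout or stderr).
fail_on_output() {
    out=$("$@" 2>&1)
    if [ -n "$out" ]; then
        printf '%s\n' "$out"
        return 1
    fi
    return 0
}

# Usage (as root, noderange as described above):
#   fail_on_output ppping noderange && echo "network OK"
```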

Setup User Environment

Never run benchmarks as root.  Use the addclusteruser command or equivalent to create a user on your primary user/login node.  If you are using NIS you are ready to go.  If you are using password synchronization then type:

cd /etc
prsync -craz passwd group noderange,-$(hostname -s):/etc

Where noderange is a list of all nodes in the queuing system, plus any nodes that users will be submitting jobs from.

Verify with psh that all nodes have mounted the correct /home directory, that your added user is visible, and that the directory permissions are correct:

psh -s noderange ls -l /home | grep username

Where noderange is a list of all nodes in the queuing system, plus any nodes that users will be submitting jobs from, and username is the user you are posing as.
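On a large cluster it is easy to miss the one node that did not answer.  The sketch below lists expected nodes that are absent from the psh output.  It assumes psh prefixes each line with the node name followed by a colon and a space (check this on your system); missing_nodes and the nodes.txt file are made up for this example.

```shell
#!/bin/sh
# missing_nodes: hypothetical helper, not part of xCAT.
# $1: file with one expected node name per line
# stdin: psh output, ASSUMED to look like "node01: <line of output>"
# Prints every expected node that produced no matching line.
missing_nodes() {
    awk -F': ' '
        NR == FNR { expected[$1] = 1; next }   # first file: expected nodes
        { seen[$1] = 1 }                        # stdin: nodes that answered
        END { for (n in expected) if (!(n in seen)) print n }
    ' "$1" - | sort
}

# Usage:
#   psh -s noderange ls -l /home | grep username | missing_nodes nodes.txt
```

No output means every node in nodes.txt showed the user's home directory.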

PBS/Maui Check

Using pbstop and showq, verify that all of your nodes are visible and ready.

Log in as your sample user (e.g. bob) and test PBS/Maui.

bob@head01:~> qsub -l nodes=2,walltime=1:00:00 -I
qsub: waiting for job 0.head01.foobar.org to start
qsub: job 0.head01.foobar.org ready

----------------------------------------
Begin PBS Prologue Thu Dec 19 14:17:53 MST 2002
Job ID: 0.head01.foobar.org
Username: bob
Group: users
Nodes: node10 node9
End PBS Prologue Thu Dec 19 14:17:54 MST 2002
----------------------------------------


Note the Nodes: line.  Try to ssh from node to node and back to the user node that started qsub:

bob@node10:~> ssh node9
bob@node9:~> exit
logout
Connection to node9 closed.
bob@node10:~> ssh head01
bob@head01:~> exit
logout
Connection to head01 closed.
bob@node10:~> exit
logout

qsub: job 0.head01.foobar.org completed

Now try to ssh back to the nodes that were assigned; you should be denied:

bob@head01:~> ssh node9
14653: Connection close by 199.88.179.209

Software Install/Check

The following is the required base software.  Benchmark software installation will be covered in the Benchmark section of this document.

For best results use Intel Compilers for IA64, PGI Compilers for x86_64, and for IA32 use Intel Compilers or PGI.  I have also had good results using PGI for just Fortran and GCC 3.3 for C on x86_64.  Before building or downloading GCC 3.3, verify that it is not already installed:

# rpm -qa | grep gcc

You may need to use the full path to gcc33, e.g.:

# /opt/gcc33/bin/gcc


Intel 7.1 Compilers (IA32/IA64)
PGI 5.0 Compilers (Opteron/IA32)
GNU Compilers
Intel Math Kernel Libraries 6.0
AMD Core Math Libraries
Goto Libraries
ATLAS Libraries
MPICH
MPICH-GM

Intel 7.1 Compilers

PGI 5.0 Compilers

GNU Compilers

Intel Math Kernel Libraries 6.0

AMD Core Math Libraries

Goto Libraries

ATLAS Libraries

Optimally building ATLAS is beyond the scope of this document; however, current binary distributions are available.  It is recommended that you try both.

MPICH

MPICH is a freely available MPI implementation that runs over IP.  There are many ways to build MPICH.  Feel free to build it your way or the xCAT way:

MPICH-GM

MPICH-GM is a freely available MPI implementation that runs over Myrinet.  There are many ways to build MPICH-GM.  Feel free to build it your way or the xCAT way:

PBS IANS

If you are already familiar with starting and monitoring jobs through PBS, you may still want to read this section to understand how PBS is set up by xCAT.

Attributes

xCAT and PBS share the same attributes.  xCAT attributes are stored in $XCATROOT/etc/nodelist.tab and PBS attributes are stored in /var/spool/pbs/server_priv/nodes.  The PBS nodes file is generated by xCAT's makepbsnodefile command.

This document will assume that you assigned an attribute of ia64 to all IA64 nodes and an attribute of x86 to all ia32/i686 nodes.  Examples using the attribute of compute refer to any compute node (ia64 or x86).

It will be necessary to edit supplied PBS scripts with the correct node assignment attributes for your cluster.
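To see which nodes carry a given attribute, you can read the PBS nodes file directly.  A minimal sketch, assuming the usual PBS layout of one node per line with the node name first, followed by its attributes and an optional np= entry.  I have not checked this against what makepbsnodefile writes on your system, and nodes_with_attr is a made-up name.

```shell
#!/bin/sh
# nodes_with_attr: hypothetical helper, not part of xCAT or PBS.
# $1: attribute to look for (e.g. ia64, x86, compute)
# stdin: PBS nodes file, ASSUMED format "nodename attr1 attr2 ... np=N"
# Prints the name of every node that lists the attribute.
nodes_with_attr() {
    awk -v attr="$1" '
        NF == 0 { next }                         # skip blank lines
        {
            for (i = 2; i <= NF; i++)
                if ($i == attr) { print $1; break }
        }'
}

# Usage:
#   nodes_with_attr ia64 < /var/spool/pbs/server_priv/nodes
```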

Submitting a Job

qsub is the PBS command to submit a job.  Job submission is not allowed for root.  At a minimum you must specify how many nodes you need and for how long.

To request 16 IA64 nodes with 2 processors/node for 10 minutes interactively, assuming that PBS has ia64 assigned as an attribute to all the IA64 nodes, type:

$ qsub -l nodes=16:ia64:ppn=2,walltime=10:00 -I

To request 10 nodes of any attribute with 2 processors/node for 24 hours to run your PBS script foobar.pbs, type:

$ qsub -l nodes=10:ppn=2,walltime=24:00:00 foobar.pbs
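For reference, a script like foobar.pbs can carry its own resource request in #PBS lines; options given on the qsub command line take precedence over them.  What follows is a sketch of what such a script might look like, not a script shipped with xCAT.  The job name, the summarize_allocation function, and the commented-out benchmark command are all made up for this example.  PBS_O_WORKDIR and PBS_NODEFILE are set by PBS when the job starts.

```shell
#!/bin/sh
#PBS -l nodes=10:ppn=2,walltime=24:00:00
#PBS -N foobar

# summarize_allocation: hypothetical helper.
# $1: PBS node file (one line per allocated processor)
# Prints how many distinct nodes and how many processors were assigned.
summarize_allocation() {
    awk '{ procs++; if (!($1 in seen)) { seen[$1] = 1; nodes++ } }
         END { printf "%d nodes, %d processors\n", nodes, procs }' "$1"
}

# Only act when actually running under PBS.
if [ -n "${PBS_NODEFILE:-}" ]; then
    # PBS starts the job in $HOME; go back to where qsub was run.
    cd "$PBS_O_WORKDIR" || exit 1
    summarize_allocation "$PBS_NODEFILE"
    # ./my_benchmark    # hypothetical: your real work goes here
fi
```

With the request in the script, submitting it is just qsub foobar.pbs.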

Monitoring and Job Output

Use the tools pbstop, qstat, and showq to monitor PBS/Maui.

If PBS has the appropriate patches and your home directory contains a .pbs_spool subdirectory, then you can monitor your job's stdout and stderr in real time with tail -f.  If you do not have this patch applied, your output will be in the directory you submitted from, named jobname.ojobnumber for stdout and jobname.ejobnumber for stderr.  You will have to wait for the job to complete before the files are present.
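As an illustration of the naming scheme above, the sketch below builds the two file names from a job name and the job ID that qsub prints.  It assumes the job number is the part of the job ID before the first dot, and that the job name is the script name unless you set one; pbs_output_files is a made-up name.

```shell
#!/bin/sh
# pbs_output_files: hypothetical helper, not part of PBS.
# $1: job name (ASSUMED to default to the script name)
# $2: job ID as printed by qsub, e.g. 0.head01.foobar.org
# Prints the expected stdout file name, then the stderr file name.
pbs_output_files() {
    seq=${2%%.*}            # job number = text before the first dot
    printf '%s.o%s\n%s.e%s\n' "$1" "$seq" "$1" "$seq"
}

# Usage:
#   pbs_output_files foobar.pbs 0.head01.foobar.org
```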

PBS Help

RTFM.

E.g.

$ man qsub

Advice

For each benchmark, first run a small test interactively to test out all the scripts, then run a small test through PBS/Maui before submitting larger tests.

E.g. interactively:

$ cd ~/bench/benchmark
$ qsub -l nodes=2,walltime=1:00:00 -I
qsub: waiting for job 0.head01.foobar.org to start
qsub: job 0.head01.foobar.org ready

----------------------------------------
Begin PBS Prologue Thu Dec 19 14:17:53 MST 2002
Job ID: 0.head01.foobar.org
Username: bob
Group: users
Nodes: node10 node9
End PBS Prologue Thu Dec 19 14:17:54 MST 2002
----------------------------------------

node10:~> cd $PBS_O_WORKDIR
node10:~/bench/benchmark> ./benchmark.pbs
node10:~/bench/benchmark> exit

Benchmarks

STREAM

"The STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels." -- http://www.cs.virginia.edu/stream/ref.html

IANS, STREAM just bangs on memory.  TRIAD results are usually what to look at.  STREAM is not a general purpose memory exerciser and utilizes little memory.  However, if you do have memory anomalies, STREAM can be affected.

STREAM results can be affected by:

When analyzing STREAM output only compare apples-to-apples.  Every system setting (BIOS) must be identical.  Every hardware setting, configuration, and location of DIMMs must be identical.  Exact clones.
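Once every node has a STREAM result, comparing the TRIAD numbers by eye gets old fast.  The sketch below pulls the TRIAD rate out of STREAM output and reports the spread across nodes.  It assumes the stock stream.c report, where the result line starts with "Triad:" and the rate in MB/s is the second field.  The function names are made up, and it is up to you to decide how much spread is too much.

```shell
#!/bin/sh
# triad_rate: hypothetical helper.
# stdin: output of one STREAM run
# Prints the TRIAD rate (MB/s), ASSUMING a line like
# "Triad:  2850.5  0.0169  0.0168  0.0170".
triad_rate() {
    awk '$1 == "Triad:" { print $2 }'
}

# triad_spread: hypothetical helper.
# stdin: "node rate" pairs, one per line
# Prints the slowest node, the fastest node, and the spread
# as a percentage of the fastest rate.
triad_spread() {
    awk '
        { r = $2 + 0 }
        NR == 1 || r < min { min = r; minnode = $1 }
        NR == 1 || r > max { max = r; maxnode = $1 }
        END {
            if (NR == 0 || max == 0) exit 1
            printf "min %s %.1f\nmax %s %.1f\nspread %.1f%%\n",
                   minnode, min, maxnode, max, (max - min) * 100 / max
        }'
}

# Usage (triad.txt is a hypothetical file of "node rate" lines):
#   triad_spread < triad.txt
```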

Building and running STREAM:

NPB Serial

"The NAS Parallel Benchmarks (NPB) are a set of 8 programs designed to help evaluate the performance of parallel supercomputers. The benchmarks, which are derived from computational fluid dynamics (CFD) applications, consist of five kernels and three pseudo-applications. The NPB come in several flavors. NAS solicits performance results for each from all sources." -- http://www.nas.nasa.gov/NAS/NPB

The NAS Serial Benchmarks are the same as the NAS Parallel Benchmarks except that MPI calls have been taken out and they run on one processor.

Like the STREAM benchmark each node should be identical (cloned) when checking for variability.  Consistent STREAM results should be achieved first.

NPB OpenMP

"The NAS Parallel Benchmarks (NPB) are a set of 8 programs designed to help evaluate the performance of parallel supercomputers. The benchmarks, which are derived from computational fluid dynamics (CFD) applications, consist of five kernels and three pseudo-applications. The NPB come in several flavors. NAS solicits performance results for each from all sources." -- http://www.nas.nasa.gov/NAS/NPB

The NAS OpenMP Benchmarks are the same as the NAS Parallel Benchmarks except that MPI calls have been replaced with OpenMP calls to run on multiple processors of a shared memory system.

Like the STREAM benchmark each node should be identical (cloned) when checking for variability.  Consistent STREAM and NPB Serial results should be achieved first.

bonnie++

"Bonnie++ is a benchmark suite that is aimed at performing a number of simple tests of hard drive and file system performance. Then you can decide which test is important and decide how to compare different systems after running it." -- http://www.coker.com.au/bonnie++

Like the STREAM benchmark each node should be identical (cloned) when checking for variability.  Consistent STREAM and NPB Serial results should be achieved first.

Ping-Pong

Ping-Pong is a simple benchmark that measures latency and bandwidth for different message sizes.  There are many ping-pong benchmarks available.  The origin of the pingpong.c included with xCAT is unknown.
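Whatever ping-pong program you use, the arithmetic is the same: time a number of round trips for one message size, take half of a round trip as the one-way latency, and divide the message size by that time to get bandwidth.  The sketch below does just that.  It assumes nothing about the output format of the pingpong.c shipped with xCAT; you supply the three numbers yourself, and pingpong_stats is a made-up name.

```shell
#!/bin/sh
# pingpong_stats: hypothetical helper.
# $1: message size in bytes
# $2: number of round trips timed
# $3: total elapsed time in seconds
# Prints one-way latency in microseconds and bandwidth in MB/s
# (1 MB = 1000000 bytes here).
pingpong_stats() {
    awk -v bytes="$1" -v reps="$2" -v secs="$3" 'BEGIN {
        if (reps <= 0 || secs <= 0) exit 1
        oneway = secs / (2 * reps)          # half of one round trip
        printf "latency_us %.2f\n", oneway * 1000000
        printf "bandwidth_MBps %.2f\n", bytes / oneway / 1000000
    }'
}

# Usage: 1000000-byte messages, 100 round trips, 2.0 seconds total
#   pingpong_stats 1000000 100 2.0
```

Latency is best read from the smallest message sizes and bandwidth from the largest.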

Pallas MPI Benchmark

Pallas MPI Benchmark (PMB) provides a concise set of benchmarks targeted at measuring the most important MPI functions.

Like the STREAM benchmark each node should be identical (cloned) when checking for variability.  Consistent STREAM, NPB Serial, and Ping-Pong results should be achieved first.

NAS Parallel

"The NAS Parallel Benchmarks (NPB) are a set of 8 programs designed to help evaluate the performance of parallel supercomputers. The benchmarks, which are derived from computational fluid dynamics (CFD) applications, consist of five kernels and three pseudo-applications. The NPB come in several flavors. NAS solicits performance results for each from all sources." -- http://www.nas.nasa.gov/NAS/NPB

Like the STREAM benchmark each node should be identical (cloned) when checking for variability.  Consistent STREAM and NPB Serial results should be achieved first and PMB should pass.

High Performance Linpack

"HPL is a software package that solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers. It can thus be regarded as a portable as well as freely available implementation of the High Performance Computing Linpack Benchmark." -- http://www.netlib.org/benchmark/hpl

Like the STREAM benchmark each node should be identical (cloned) when checking for variability.  Consistent STREAM and NPB Serial and Parallel results should be achieved first and PMB should pass.
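Since HPL solves a dense N x N system in 64-bit arithmetic, the matrix alone takes 8 * N * N bytes spread across the nodes.  The sketch below turns that into the largest N that fits in a given share of memory.  The share to use is an assumption you must supply (around 0.8 is a common starting point, leaving room for the OS), and hpl_max_n is a made-up name.

```shell
#!/bin/sh
# hpl_max_n: hypothetical helper, not part of HPL.
# $1: total memory across all nodes in the run, in MB (1 MB = 1048576 bytes)
# $2: fraction of that memory to give to the matrix (an assumption, e.g. 0.8)
# Prints the largest N such that 8 * N * N bytes fits.
hpl_max_n() {
    awk -v mb="$1" -v frac="$2" 'BEGIN {
        bytes = mb * 1048576 * frac
        printf "%d\n", int(sqrt(bytes / 8))
    }'
}

# Usage: 16 nodes with 4096 MB each, using 80% of memory
#   hpl_max_n 65536 0.8
```

Start well below this value for the first small test, then work up.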

Support

http://xcat.org


Egan Ford
egan@us.ibm.com
July 2003