xCAT HPC Benchmark HOWTO (WIP)
It is generally a good idea to verify that the cluster you just built actually can do work. This can be accomplished by running a few industry accepted benchmarks. The purpose of benchmarking is not to get the best results, but to get consistent repeatable accurate results that are also the best results.
With a goal of "consistent repeatable accurate results" it is best to start with as few variables as possible. I recommend starting with single node benchmarks, e.g. STREAM. If all machines have similar STREAM results, then memory can be ruled out as a factor in other benchmark anomalies. Next work your way up to processor and disk benchmarks, then multinode (parallel) benchmarks, then finally HPL. After each more complicated benchmark runs, check for "consistent repeatable accurate results" before continuing.
Outlined below is a recommended path.
Single Node (serial) Benchmarks:
Parallel Benchmarks:
Prerequisites:
Feel free to follow your own path, but please start with the Sanity Checks and PBS IANS.
No network, no cluster. The stability of the network is critical. As root type:
ppping noderange
Where noderange is a list of all nodes in the queuing system, plus any nodes that users will be submitting jobs from. ppping (parallel parallel ping) is an xCAT utility that tells each node in the noderange to ping every node in the noderange. No news is good news: only errors are displayed, so no output means every ping succeeded.
If you have Myrinet, ssh to a Myrinet node and type:
ppping -i myri0 noderange
Where noderange is a list of all Myrinet nodes in the cluster.
Both tests are critical and must succeed. If they do not, you will never launch a job.
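Because ppping reports only failures, the check is easy to script. The following is a minimal sketch and not part of xCAT: check_quiet is a hypothetical helper that runs any command and treats any output at all as a failure. It looks only at the output, not at the exit status of the command.

```shell
# check_quiet CMD [ARGS...]
# Hypothetical helper (not an xCAT command): run CMD and succeed only if it
# prints nothing, following the "no output is good" rule described above.
check_quiet() {
    out=$("$@" 2>&1)
    if [ -n "$out" ]; then
        printf 'FAILED: %s\n%s\n' "$*" "$out"
        return 1
    fi
    printf 'OK: %s\n' "$*"
}
```

For example, check_quiet ppping noderange and, from a Myrinet node, check_quiet ppping -i myri0 noderange can be chained with && so that a setup script stops at the first network problem.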
Never run benchmarks as root. Use the addclusteruser command or equivalent to create a user on your primary user/login node. If you are using NIS you are ready to go. If you are using password synchronization then type:
cd /etc
prsync -craz passwd group noderange,-$(hostname -s):/etc
Where noderange is a list of all nodes in the queuing system, plus any nodes that users will be submitting jobs from.
Verify with psh that all nodes have mounted the correct /home directory and that your added user is visible and that the directory permissions are correct:
psh -s noderange ls -l /home | grep username
Where noderange is a list of all nodes in the queuing system, plus any nodes that users will be submitting jobs from and username is the user you are posing as.
Using pbstop and showq verify that all of your nodes are visible and ready.
Login as your sample user (e.g. bob) and test PBS/Maui.
bob@head01:~> qsub -l nodes=2,walltime=1:00:00 -I
qsub: waiting for job 0.head01.foobar.org to start
qsub: job 0.head01.foobar.org ready
----------------------------------------
Begin PBS Prologue Thu Dec 19 14:17:53 MST 2002
Job ID: 0.head01.foobar.org
Username: bob
Group: users
Nodes: node10 node9
End PBS Prologue Thu Dec 19 14:17:54 MST 2002
----------------------------------------
Note the Nodes: line.
Try to ssh from node to node and back to the user node that started qsub:
bob@node10:~> ssh node9
bob@node9:~> exit
logout
Connection to node9 closed.
bob@node10:~> ssh head01
bob@head01:~> exit
logout
Connection to head01 closed.
bob@node10:~> exit
logout
qsub: job 0.head01.foobar.org completed
Now try to ssh back to the nodes that were assigned. You should be denied:
bob@head01:~> ssh node9
14653: Connection close by 199.88.179.209
The following is the required base software. Benchmark software installation will be covered in the Benchmark section of this document.
For best results use Intel Compilers for IA64, PGI Compilers for x86_64, and for IA32 use Intel Compilers or PGI. I have also had good results using PGI for just Fortran and GCC 3.3 for C on x86_64. Before building or downloading GCC 3.3 verify that it is not already installed:
# rpm -qa | grep gcc
You may need to use the full path to gcc33, e.g.:
# /opt/gcc33/bin/gcc
Intel 7.1 Compilers (IA32/IA64)
PGI 5.0 Compilers (Opteron/IA32)
GNU Compilers
Intel Math Kernel Libraries 6.0
AMD Core Math Libraries
Goto Libraries
ATLAS Libraries
MPICH
MPICH-GM
Intel Math Kernel Libraries 6.0
Optimally building ATLAS is beyond the scope of this document, however current binary distributions are available. It is recommended that you try both.
MPICH is a freely available MPI implementation that runs over IP. There are many ways to build MPICH. Feel free to build it your way or the xCAT way:
# export MPICHROOT=/usr/local/mpich
MPICH-GM is a freely available MPI implementation that runs over Myrinet. There are many ways to build MPICH-GM. Feel free to build it your way or the xCAT way:
# export MPICHROOT=/usr/local/mpich
If you are already familiar with starting and monitoring jobs through PBS you may still want to read this section to understand how PBS is set up by xCAT.
Attributes
xCAT and PBS share the same attributes. xCAT attributes are stored in $XCATROOT/etc/nodelist.tab and PBS attributes are stored in /var/spool/pbs/server_priv/nodes. The PBS nodes file is generated by xCAT's makepbsnodefile command.
This document assumes that you assigned an attribute of ia64 to all IA64 nodes and an attribute of x86 to all ia32/i686 nodes. Examples using the attribute of compute refer to any compute node (ia64 or x86).
It will be necessary to edit supplied PBS scripts with the correct node assignment attributes for your cluster.
Submitting a Job
qsub is the PBS command to submit a job. Job submission is not allowed for root. At a minimum you must specify how many nodes you need and for how long.
To request 16 IA64 nodes with 2 processors/node for 10 minutes interactively, assuming that PBS has ia64 assigned as an attribute to all the IA64 nodes, type:
$ qsub -l nodes=16:ia64:ppn=2,walltime=10:00 -I
To request 10 nodes of any attribute with 2 processors/node for 24 hours to run your PBS script foobar.pbs, type:
$ qsub -l nodes=10:ppn=2,walltime=24:00:00 foobar.pbs
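The two requests above differ only in the pieces of the resource list. As a sketch, the hypothetical helper below (pbs_resources is not a PBS or xCAT command) assembles that string from a node count, an optional attribute, processors per node, and a walltime, which can help keep scripts for different clusters consistent.

```shell
# pbs_resources NODES ATTR PPN WALLTIME
# Hypothetical helper: print a resource list for "qsub -l".
# Pass an empty string for ATTR to request nodes of any attribute.
pbs_resources() {
    nodes=$1 attr=$2 ppn=$3 walltime=$4
    spec="nodes=$nodes"
    [ -n "$attr" ] && spec="$spec:$attr"
    printf '%s:ppn=%s,walltime=%s\n' "$spec" "$ppn" "$walltime"
}
```

It would be used as qsub -l "$(pbs_resources 16 ia64 2 10:00)" -I, which expands to the first command shown above.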
Monitoring and Job Output
Use the tools pbstop, qstat, and showq to monitor PBS/Maui.
If PBS has the appropriate patch and your home directory contains a .pbs_spool subdirectory, then you can monitor your job's stdout and stderr in real time with tail -f. If the patch is not applied, your output will be written to the directory you submitted the job from, as jobname.ojobnumber for stdout and jobname.ejobnumber for stderr. You will have to wait for the job to complete before the files are present.
PBS Help
RTFM.
E.g.
$ man qsub
Advice
For each benchmark run a small test first to test out all the scripts interactively, then a small test through PBS/Maui before submitting larger tests.
E.g. interactively:
$ cd ~/bench/benchmark
$ qsub -l nodes=2,walltime=1:00:00 -I
qsub: waiting for job 0.head01.foobar.org to start
qsub: job 0.head01.foobar.org ready
----------------------------------------
Begin PBS Prologue Thu Dec 19 14:17:53 MST 2002
Job ID: 0.head01.foobar.org
Username: bob
Group: users
Nodes: node10 node9
End PBS Prologue Thu Dec 19 14:17:54 MST 2002
----------------------------------------
node10:~> cd $PBS_O_WORKDIR
node10:~/bench/benchmark> ./benchmark.pbs
node10:~/bench/benchmark> exit
STREAM
"The STREAM benchmark is a simple synthetic benchmark program that measures
sustainable memory bandwidth (in MB/s) and the corresponding computation rate
for simple vector kernels." --
http://www.cs.virginia.edu/stream/ref.html
IANS, STREAM just bangs on memory. TRIAD results are usually what to look at. STREAM is not a general purpose memory exerciser and utilizes little memory. However, if you do have memory anomalies, STREAM can be affected.
STREAM results can be affected by:
When analyzing STREAM output only compare apples-to-apples. Every system setting (BIOS) must be identical. Every hardware setting, configuration, and location of DIMMs must be identical. Exact clones.
Building and running STREAM:
stream_d results
benchmark       jobs       low      high      %      mean    median   std dev
c_ia64            16   3393.26   3488.88   2.81   3434.70   3441.13     28.74
c_omp_ia64        16   3811.37   3847.14   0.93   3831.52   3836.47     12.51
f_ia64            16   3394.98   3488.65   2.75   3433.98   3440.29     28.07
f_tuned_ia64      16   3390.86   3480.46   2.64   3432.07   3430.85     24.51
The number of jobs should equal the number of nodes even if you specified a ppn > 1. % (max variation from low to high) should be < 5%. Usually for STREAM it is 1-2%. Compiler options can have a large impact on variability and performance. E.g. after changing the optimization level from -O3 to -O2 stability increased, but performance dropped:
stream_d results
benchmark       jobs       low      high      %      mean    median   std dev
c_ia64            16   1038.20   1044.71   0.62   1041.85   1041.93      1.55
c_omp_ia64        16   1980.83   2005.46   1.24   1992.89   1992.90      6.73
f_ia64            16   1038.35   1046.26   0.76   1042.57   1042.57      1.93
f_tuned_ia64      16   1038.84   1044.78   0.57   1041.77   1041.76      1.47
Edit any of the makefiles and rebuild and retest until you get the desired
performance and stability. If you cannot reduce variability to < 5% you
may have non-identical hardware configurations or hardware problems.
Analyze each output file in
~/bench/stream/output to find the
irregularities.
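The % column in the results tables is consistent with (high - low) / low x 100. The sketch below recomputes jobs, low, high, % and mean from a plain list of per-node results, one number per line. It is an illustration only, not the analyze script shipped with xCAT; summarize is a hypothetical name, and because printf rounds, the last digit can differ from the tables, which appear to truncate.

```shell
# summarize: read one result per line on stdin and print the job count,
# low, high, % variation from low to high, and mean.
# Hypothetical helper, not the xCAT analyze script.
summarize() {
    awk '
        NR == 1 { low = high = $1 + 0 }
        {
            v = $1 + 0
            if (v < low)  low = v
            if (v > high) high = v
            sum += v
        }
        END {
            if (NR == 0) exit 1
            printf "jobs=%d low=%.2f high=%.2f pct=%.2f mean=%.2f\n",
                   NR, low, high, (high - low) / low * 100, sum / NR
        }'
}
```

Feeding it the TRIAD numbers pulled from each file in ~/bench/stream/output gives a quick check against the < 5% target without rerunning the whole analysis.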
Like the STREAM benchmark each node should be identical (cloned) when checking for variability. Consistent STREAM results should be achieved first.
NPB Serial
benchmark       jobs       low      high      %      mean    median   std dev
bt.A              16   1132.79   1142.69   0.87   1139.64   1140.16      2.41
cg.A              16    238.53    241.72   1.33    240.42    240.32      0.74
ep.A              16     21.96     21.99   0.13     21.97     21.99      0.01
ft.A              16    619.41    625.72   1.01    623.12    623.10      1.76
is.A              16     49.17     50.01   1.70     49.82     49.87      0.21
lu.A              16    830.55    859.95   3.53    839.37    836.44      9.15
mg.A              16   1056.95   1069.20   1.15   1062.30   1062.17      3.67
sp.A              16    713.53    719.78   0.87    716.31    716.83      2.01
The number of jobs should equal the n you passed to ./runit n. % (max variation from low to high) should be < 5%. Compiler options can have a large impact on variability and performance.
NPB OpenMP
"The NAS Parallel Benchmarks (NPB)
are a set of 8 programs designed to help evaluate the performance of parallel
supercomputers. The benchmarks, which are derived from computational fluid
dynamics (CFD) applications, consist of five kernels and three
pseudo-applications. The NPB come in several flavors. NAS solicits performance
results for each from all sources." --
http://www.nas.nasa.gov/NAS/NPB
The NPB OpenMP benchmarks are the same as the NAS Parallel Benchmarks except
that MPI calls have been replaced with OpenMP calls to run on multiple processors
of a shared memory system.
Like the STREAM benchmark each node should be identical (cloned) when checking for variability. Consistent STREAM and NPB Serial results should be achieved first.
NPB OpenMP
benchmark       jobs       low      high      %      mean    median   std dev
bt.A              16   2044.48   2064.61   0.98   2054.67   2054.68      4.60
cg.A              16    418.72    441.48   5.43    429.96    430.39      4.72
ep.A              16     44.43     44.46   0.06     44.45     44.45      0.00
ft.A              16   1042.91   1060.65   1.70   1053.08   1053.53      4.66
lu.A              16   1739.11   1963.40  12.89   1856.81   1873.96     67.51
mg.A              16   1467.50   1506.04   2.62   1486.37   1486.54     11.85
sp.A              16   1087.26   1104.33   1.57   1096.15   1095.74      5.68
The number of jobs should equal the n you passed to ./runit n. % (max variation from low to high) should be < 5%, however multiprocessor tests can have higher variability. You may want to analyze the output and track down the nodes that are increasing the variability, then rerun. If the variability moves from node to node between runs, you do not have a hardware problem. E.g.:
lu.A has a high degree of variability.
$ cd ~/bench/NPB3.0/NPB3.0-OMP/output
$ grep total *lu* | sort -n -k 5
lu.A.node012.88.master: Mop/s total = 1739.11
lu.A.node009.91.master: Mop/s total = 1748.24
lu.A.node008.92.master: Mop/s total = 1766.47
lu.A.node005.95.master: Mop/s total = 1800.81
lu.A.node002.98.master: Mop/s total = 1803.29
lu.A.node016.84.master: Mop/s total = 1819.56
lu.A.node007.93.master: Mop/s total = 1850.71
lu.A.node010.90.master: Mop/s total = 1866.03
lu.A.node011.89.master: Mop/s total = 1873.96
lu.A.node004.96.master: Mop/s total = 1899.82
lu.A.node001.99.master: Mop/s total = 1903.54
lu.A.node013.87.master: Mop/s total = 1903.75
lu.A.node006.94.master: Mop/s total = 1914.18
lu.A.node015.85.master: Mop/s total = 1916.82
lu.A.node014.86.master: Mop/s total = 1939.39
lu.A.node003.97.master: Mop/s total = 1963.40
Note that the output is evenly distributed rather than clustered around a few slow nodes. This usually points to software, and is not necessarily a problem. Rerun the benchmark.
$ cd ~/bench/NPB3.0/NPB3.0-OMP/bin
Remove the benchmarks that you do not need to retest.
$ rm bt.A cg.A ...
$ cd ..
$ mv output output1
$ ./runit large n (you want multiple runs/node)
$ cd output
$ grep total *lu* | sort -n -k 5 | grep node013
lu.A.node013.172.master: Mop/s total = 1683.24
lu.A.node013.161.master: Mop/s total = 1834.06
lu.A.node013.151.master: Mop/s total = 1867.71
lu.A.node013.135.master: Mop/s total = 1943.91
Checking a random node shows that its output varies from run to run. Higher variation for this benchmark may be normal. Compiler options can also have a large impact on variability and performance.
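Checking every node by hand gets tedious after multiple runs per node. The sketch below groups the grep output by node and prints the spread of each node's own runs, so it is easier to see whether variability stays with particular nodes or moves around. It assumes the file naming visible in the output above (benchmark.class.node.jobid.master, so the node is the third dot-separated part); per_node_spread is a hypothetical name.

```shell
# per_node_spread: read "grep total" lines (file: Mop/s total = value) on
# stdin and print, per node, the run count, low, high and % variation.
# Assumes the node name is the third dot-separated part of the file name.
per_node_spread() {
    awk '
        {
            split($1, part, ".")
            node = part[3]; val = $NF + 0
            if (!(node in n) || val < low[node])  low[node] = val
            if (!(node in n) || val > high[node]) high[node] = val
            n[node]++
        }
        END {
            for (node in n)
                printf "%s runs=%d low=%.2f high=%.2f pct=%.2f\n",
                       node, n[node], low[node], high[node],
                       (high[node] - low[node]) / low[node] * 100
        }' | LC_ALL=C sort
}
```

It would be run from the output directory as grep total *lu* | per_node_spread. If most nodes show a similar spread across their own runs, the variation is unlikely to be tied to one piece of hardware.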
"Bonnie++ is a benchmark suite that is aimed at performing a number of simple tests of hard drive and file system performance. Then you can decide which test is important and decide how to compare different systems after running it." -- http://www.coker.com.au/bonnie++
Like the STREAM benchmark each node should be identical (cloned) when checking for variability. Consistent STREAM and NPB Serial results should be achieved first.
Writing with putc()...done
Writing intelligently...done
Rewriting...done
Reading with getc()...done
Reading intelligently...done
start 'em...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version 1.03        ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
node10           8G  9951  91 42199  39 19159   6 10102  88 44015   8 194.5   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  7894  98 +++++ +++ 10136  99  8426  99 +++++ +++  8539 100
node10,8G,9951,91,42199,39,19159,6,10102,88,44015,8,194.5,0,16,7894,98,+++++,+++,10136,99,8426,99,+++++,+++,8539,100
Do this for each file system you plan to test. Record the total time.
bonnie
benchmark       jobs       low      high      %      mean    median   std dev
/tmp(w)           16     45433     47299   4.10     46578     46986    671.51
/tmp(r)           16     33037     47387  43.43     40879     42200   4990.12
/scr/PBS(w)       16     54113     55015   1.66     54567     54588    258.93
/scr/PBS(r)       16     45238     50240  11.05     46558     46037   1404.58
This analyze script only reports sequential block read and write performance
in K/s. The number of jobs should equal the number of nodes even if you
specified a ppn > 1.
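The sequential block numbers can also be pulled straight from the comma-separated line that bonnie++ prints at the end of a run. This is a sketch that assumes the 1.03 layout in the sample output above, where field 5 is sequential block output (write) and field 11 is sequential block input (read), both in K/sec. bonnie_block is a hypothetical name and the analyze script shipped with xCAT may work differently.

```shell
# bonnie_block: read bonnie++ CSV result lines on stdin and print
# machine, sequential block write K/sec, sequential block read K/sec.
# Field positions are taken from the bonnie++ 1.03 sample output above.
bonnie_block() {
    awk -F, 'NF >= 11 { print $1, $5, $11 }'
}
```

For the sample line above, this prints node10 42199 44015.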
% (variation from low to high) should be < 5%. If you cannot reduce
variability to < 5% you may have non-identical hardware configurations,
hardware, or software problems. Analyze each output file in
~/bench/bonnie/output to find the
irregularities. Rerun tests manually to validate software or hardware
irregularities.
In the example above /tmp is on the system disk with swap. bonnie++ uses all memory and will increase swap usage, which may create anomalies when benchmarking a system disk. The /scr file system, however, is closer to < 5%. Its mean and median are very similar and close to the low, indicating that there may be a few exceptionally high values. To check type:
$ cd ~/bench/bonnie/output
$ tail -q -n 1 *scr* | awk -F, '{print $1 " " $5}' | sort -n -k 2
node011 45238
node001 45782
node014 45831
node008 45885
node012 45933
node006 45955
node015 45985
node010 46035
node003 46037
node009 46046
node005 46076
node016 46195
node013 46782
node004 46852
node002 50064
node007 50240
Nodes 2 and 7 are significantly faster than the rest, which throws off the variability. Multiple reruns should be performed to determine if it is software or hardware. It is also possible that bonnie++ is not a good benchmark.
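Spotting such nodes by eye works for 16 nodes but not for hundreds. As a sketch, the hypothetical helper below reads "node value" pairs like the listing above and prints only the nodes that differ from the median by more than a given percentage. For an even number of nodes it takes the upper of the two middle values as the median, an arbitrary choice made for this example.

```shell
# flag_outliers PCT: read "node value" pairs on stdin and print the nodes
# whose value differs from the median by more than PCT percent.
# Hypothetical helper; the median is the upper middle value for even counts.
flag_outliers() {
    LC_ALL=C sort -n -k 2 | awk -v pct="$1" '
        { node[NR] = $1; val[NR] = $2 + 0 }
        END {
            if (NR == 0) exit 1
            median = val[int(NR / 2) + 1]
            for (i = 1; i <= NR; i++) {
                dev = (val[i] - median) / median * 100
                if (dev > pct || dev < -pct)
                    printf "%s %s %.2f%%\n", node[i], val[i], dev
            }
        }'
}
```

With the listing above and a threshold of 5, only node002 and node007 are reported.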
Ping-Pong is a simple benchmark that measures latency and bandwidth for different message sizes. There are many ping-pong benchmarks available. The origin of the pingpong.c included with xCAT is unknown.
pingpong
benchmark       jobs       low      high      %      mean    median   std dev
latency            8     11.74     11.79   0.42     11.76     11.77      0.02
4000.bw            8     64.90     65.00   0.15     64.97     65.00      0.04
8000.bw            8     85.40     85.50   0.11     85.46     85.50      0.05
16000.bw           8    100.50    100.60   0.09    100.52    100.50      0.04
32000.bw           8    163.40    163.50   0.06    163.43    163.40      0.05
64000.bw           8    191.70    191.90   0.10    191.81    191.80      0.06
128000.bw          8    209.80    209.90   0.04    209.85    209.90      0.05
256000.bw          8    219.20    219.40   0.09    219.30    219.30      0.05
512000.bw          8    224.40    224.60   0.08    224.46    224.50      0.07
1024000.bw         8    227.30    227.50   0.08    227.40    227.40      0.05
2048000.bw         8    229.70    229.80   0.04    229.75    229.80      0.05
The number of jobs should equal half of the nodes= value in pingpong.pbs, even if you specified a ppn > 1. % (variation from low to high) should be < 5%. If you cannot reduce variability to < 5% you may have non-identical hardware configurations, hardware, or software problems. Analyze each output file in ~/bench/pingpong/output to find the irregularities. Rerun tests manually to validate software or hardware irregularities.
Pallas MPI Benchmark (PMB) provides a concise set of benchmarks targeted at measuring the most important MPI functions.
Like the STREAM benchmark each node should be identical (cloned) when checking for variability. Consistent STREAM, NPB Serial, and Ping-Pong results should be achieved first.
Download PMB2.2 from http://www.pallas.com/e/products/pmb and save to /tmp, then extract in ~/bench/PMB.
$ cd ~/bench/PMB
$ tar zxvf /tmp/PMB2.2.tar.gz
Build:
$ cd ~/bench/PMB/SRC
$ cp -f Makefile.xcat Makefile
$ make clean
$ make
$ cp PMB-MPI1 ..
Edit pmb.pbs and correct the settings for the PBS command options (usually just the attributes). nodes should equal the number of nodes you want to test, and ppn should equal the total number of processors in each node (usually 2). The walltime should be very large for a large cluster; 24 hours is good. Edit the MPICH environment variable to point to the location of MPICH.
Manual test:
$ cd ~/bench/PMB
$ qsub -l nodes=2:compute:ppn=2,walltime=1:00:00 -I
$ cd $PBS_O_WORKDIR
$ ./pmb.pbs
Check pbm.2x2.out for errors.
Submit:
$ cd ~/bench/PMB
$ qsub pmb.pbs
Analyze. Check the pbm*.out file for errors. The purpose of this test is to validate that it runs and that there are no problems with MPI.
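A first pass over the output can be automated with grep. The sketch below lists lines that mention common failure words. The word list is an assumption and is not taken from the PMB documentation, so a clean result does not replace reading the output; scan_for_errors is a hypothetical name.

```shell
# scan_for_errors FILE...: print file:line:text for every line that mentions
# a common failure word. Returns non-zero if nothing matches.
# The word list is an assumption, not taken from the PMB documentation.
scan_for_errors() {
    # /dev/null as an extra file forces grep to print the file name
    grep -i -n -E 'error|fail|abort|killed' "$@" /dev/null
}
```

It would be run in ~/bench/PMB against the .out files named above, e.g. followed by || echo "no error lines found".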
"The NAS Parallel Benchmarks (NPB) are a set of 8 programs designed to help evaluate the performance of parallel supercomputers. The benchmarks, which are derived from computational fluid dynamics (CFD) applications, consist of five kernels and three pseudo-applications. The NPB come in several flavors. NAS solicits performance results for each from all sources." -- http://www.nas.nasa.gov/NAS/NPB
Like the STREAM benchmark each node should be identical (cloned) when checking for variability. Consistent STREAM and NPB Serial results should be achieved first and PMB should pass.
NPB MPI
benchmark       jobs       low      high      %      mean    median   std dev
bt.C.16            8   4962.76   5019.85   1.15   5001.53   5009.26     19.22
cg.C.16            8   2099.15   2206.02   5.09   2168.93   2193.77     38.09
ep.C.16            8    351.20    352.30   0.31    352.01    352.29      0.47
ft.C.16            8   3875.62   4082.68   5.34   3994.40   4047.88     79.27
is.C.16            8    132.17    144.62   9.41    139.09    141.56      4.95
lu.C.16            8  12285.40  12461.64   1.43  12382.44  12388.76     50.90
mg.C.16            8   9305.53   9711.40   4.36   9449.80   9462.39    131.44
sp.C.16            8   4373.73   4457.04   1.90   4434.64   4448.91     28.32
The number of jobs should equal the n you passed to ./runit n. % (variation from low to high) should be < 5%, however multiprocessor and multinode tests can have higher variability. You may want to analyze the output and track down the jobs that are increasing the variability, then rerun. If the variability moves around between runs, you do not have a hardware problem. Higher variation for this benchmark may be normal. Compiler options can also have a large impact on variability and performance.
In this example is.C.16 has high variability. It also ran in < 10 seconds on IA64. This benchmark may be too small to get consistency, and unfortunately the IS benchmark does not have a larger (class D) problem size.
"HPL is a software package that solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers. It can thus be regarded as a portable as well as freely available implementation of the High Performance Computing Linpack Benchmark." -- http://www.netlib.org/benchmark/hpl
Like the STREAM benchmark each node should be identical (cloned) when checking for variability. Consistent STREAM and NPB Serial and Parallel results should be achieved first and PMB should pass.


Support
Egan Ford
egan@us.ibm.com
July 2003