2003-06-27
| Revision History | ||
|---|---|---|
| Revision 0.9.1 | 2003-06-27 | NH |
| Updates & cleanups. We are working towards a release ready document | ||
| Revision 0.9 | 2003-03-07 | NH |
| add section to cmsubmit about placing the arguments directly in the script | ||
| Revision 0.8 | 2003-02-12 | NH |
| re-organize into a section per type type | ||
| Revision 0.7.1 | 2002-11-01 | NH |
| add MPI and Interactive job help | ||
| Revision 0.7 | 2002-10-21 | NH |
| first draft | ||
Abstract
This guide will help you to get familiar with the user environment, scheduling commands, and job parameters that are used by Clubmask. If you have any ideas or questions about this document, please feel free to email us at clubmask-users AT lists DOT sourceforge DOT net.
Table of Contents
There are a few basic ideas that need to be understood before using a cluster.
A node is a machine. The machine may have multiple processors, differnet network adapters, and varying amounts of ram and diskspace, but all in all, it is just a machine. Simple eh?
A job is a basic 'unit of work'. Jobs typically consist of common computations grouped together.
Job Types
The environment that your job will see is the same envrionment that was used to submit the job. In addition, there are a few environment variables that Clubmask uses to inform your job script.
The JOBID is the numerical ID that was assigned to your job at submission time. EX: 1015142526.538906
NODES: space delimited string of node numbers that your job has been assigned. The NODES variable is how you determine which nodes of the cluster your job will occupy. There is no guarantee that NODES will contain contiguous nodes, as the nodes may be scattered across the cluster. Example: NODES="0 0 5 5 2 2 9 9"
Note: If your job has been assigned a dual cpu machine, the node number for it will appear twice, similarily for nodes with quad cpus, it will appear 4 times. |
Clubmask is based on the Single System Image (SSI) paradigm of Beowulf clusters. When using the SSI method of executing processes on the nodes, the cluster utilzes the head node as a starting point for all of the processes. Yes that is right -- the head node is used to start any process you wish to run on the nodes. There are 2 ways that the processes can be started, either via 'bpsh', or 'ssh'.
Clubmask uses a distributed process control layer called bproc. Bproc allows you to see all of the processes running on remote nodes as if they were running on the main login server. To start programs on nodes using Bproc, you will need to use the command 'bpsh'. In Clubmask, the nodes are numbered starting at 0 and increase as per your setup. Bpsh uses that node number when executing a process.
Example |
[henken@alpha henken]$ bpsh 2 hostname
node2
|
When executing your processes on remote nodes, Bproc places a ghost process on the front-end machine that are used to communicate with the remote process. This ghost process is basically a place holder for the real remote process. For more inforamation, refer to the Bproc documentation.
This is easier to see in the following example, as you can see the process in the ps output on the front-end, but the output indicates the process is actually running on the remote node.
Example |
[root@struggles clubmask]# bpsh 1 sleep 5
|
Here is the output from ps:
Example |
[root@struggles autoconf-2.57]# ps -axf
PID TTY STAT TIME COMMAND
5289 pts/9 S 0:00 bpsh 1 sleep 5
5291 pts/9 SW 0:00 \_ [sleep]
|
As you can see, the sleep process is running as if it was a kernel thread ( the square brackets ). This sleep process is actually running on node 1.
Ssh is the standard secure shell for linux. As stated above, the nodes are numbered starting at 0, but for ssh to correctly resolve the nodenumber to a host, you must prefix the word 'node' to the beginning of the nodenumber.
Example |
[henken@alpha henken]$ ssh node2 hostname
node2
|
Important: Your cluster may not be setup to use SSH, or your administrator may have setup a different node number prefix than 'node'. |
There are several tradeoffs between bpsh and ssh. First and foremost is the difference in the time in which each takes to start executing the process on the node. Bpsh is several times faster than ssh, and it supports complex node list arguments that allow the processes to be started in near parallel fashion.
Example |
[henken@testcluster henken]$ time bpsh 1-5,8,19-31 hostname
node1.internal
node2.internal
node3.internal
node4.internal
node5.internal
node8.internal
node19.internal
node20.internal
node21.internal
node22.internal
node23.internal
node24.internal
node25.internal
node26.internal
node27.internal
node28.internal
node29.internal
node30.internal
node31.internal
real 0m0.082s
user 0m0.040s
sys 0m0.170s
|
Example |
[henken@testcluster henken]$ time for node in 1 2 3 4 5 8 19 20 21 22 23 24 25 26 27 28 29 30 31; do ssh node$node hostname; done
node1.internal
node2.internal
node3.internal
node4.internal
node5.internal
node8.internal
node19.internal
node20.internal
node21.internal
node22.internal
node23.internal
node24.internal
node25.internal
node26.internal
node27.internal
node28.internal
node29.internal
node30.internal
node31.internal
real 0m8.004s
user 0m3.830s
sys 0m0.150s
|
Now that I have sung the praises of bpsh, it does have one downfall. Java does not like it. There are a few solutions, my favorite being to just not use Java :). The other is to use ssh for anything that you need to do in Java.
Here we will walk you through the necessary steps to create, submit, and run jobs. The 3 basic styles of jobs that will be covered are Batch, MPI, and Interactive. Each job style has it's own set of required steps, so be sure to read all of the steps for the type of job you want to run.
In both MPI and Batch style jobs, there is a script that is written and submitted to the scheduler that does ALL of the work that the job needs to do. In comparison, in an Interactive job, the scheduler merely reserves the nodes and changes the permissions on each so that you can run commands on the nodes. Interactive jobs are usually used so that a person can become familiar with the computing environment and test their code before attempting a long production run.
When a script is submitted to the scheduler, the scheduler will take the request, and process it through it's algorithms and determine on which nodes it will run, and at what time the job will start. The same is done for an Interactive job as well.
Note: A common misconception is that a Batch or MPI style job will just setup the nodes for a job, and then allow you to run commands from the command line. This is simply not true. If you are going to use Batch or MPI jobs, the script that is submitted to the scheduler must do all of the work. The scheduler will automatically detect when the script has exited, and return the nodes to the scheduler for further use. |
In a script job, there is always some output that is written to either standard out, stdout, or standard error,stderr. In Clubmask, the two output streams are combined into one stream, and redirected to a file. This file is named $JOBID.stdout. If you submitted a job, and were assigned JOBID 0212111336.261588, the output would be redirected to a file named 0212111336.261588.stdout. This file is usually placed in the home directory of the user, but the placement can be changed with job submission arguments. See cmsubmit for more details.
The most basic Batch job will take a given set of nodes and run a command on each node. The following Batch job will use bpsh to run 'hostname' on each node.
A Batch job script needs to parse the environemnt variables given by the scheduler, and start the processes.
Example |
#!/bin/bash # Echo variables from scheduler /bin/echo "Starting job:$JOBID on $NUMNODES nodes:$NODES" # now run bpsh on each node for node in $NODES; do bpsh $node hostname done |
Note: It is worthy to mention that the above script executes the the bpsh's in serial, not in parallel. If this concerns you, please refer to the advanced section on parallelism |
Now that we have a script for the scheduler to run, we need to submit it to be run. There are 2 important arguments that are needed when submitting a script. These are the number of nodes on which to run, and the wall clock duration of the job. To run a job on 4 processors for 10 minutes:
Example |
[henken@testcluster jobs]$ cmsubmit -p 4 -t 10 basic.sh
_______________________________________________________________
Welcome to Clubmask!!!
_______________________________________________________________
==================================================
Job Summary
==================================================
Job Identification Number: 0212141641.546417
Username: henken
Group: 2940
Queue: Batch
Number of Processors: 4
Maximum Wall-clock Run-Time (minutes): 10
Program: /mnt/io1/genomics/home/henken/jobs/basic.sh
Program Arguments:
Initial Working Directory: /mnt/io1/genomics/home/henken/jobs
Job Output Directory: /home/henken
==================================================
Commit or Abort: ('c', 'a') ? c
please wait while database is updated...
please wait while database connection is closed...
|
Example |
[henken@testcluster jobs]$ showq
ACTIVE JOBS--------------------
JOBNAME USERNAME STATE PROC REMAINING STARTTIME
0212141641.546417 henken Running 4 0:10:00 Wed Feb 12 14:17:31
1 Active Job 4 of 64 Processors Active (6.25%)
2 of 32 Nodes Active (6.25%)
IDLE JOBS----------------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
0 Idle Jobs
BLOCKED JOBS----------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
Total Jobs: 1 Active Jobs: 1 Idle Jobs: 0 Blocked Jobs: 0
|
Now that the job is running, you may wish to watch the progress of the job. This can be done by viewing the stdout file for the job.
Example |
[henken@testcluster jobs]$ tail -f ~/0212141641.546417.stdout Starting job:0212141641.546417 with command:/usr/local/clubmask/packages/maui/run/0212141641.546417.program and args: to run on nodes:1 1 21 21 Starting job:0212141641.546417 on 4 nodes:1 1 21 21 node1.internal node1.internal node21.internal node21.internal |
Note: The stdout file above will contain all of the output for your job. It is there you should look for any errors. |
The MPI environment in Clubmask is lam-mpi. In order to write a script for a MPI job, the basics described above in the Batch job section will be used.
The difference between a basic Batch job and an MPI job lie in the script that is submitted to the scheduler. In order to use lma-mpi, there are a few specific setup steps, as well as a different method of running the program. Below is a sample MPI script that will run the classic MPI sample program cpi. This program can be found in the samples section. Please pay carefull attention to the comments in the script for a description of what each step is doing.
Example |
#!/bin/bash
# prog name
PROG="/home/henken/pi"
# Your jobid from the scheduler
/bin/echo "JOBID:$JOBID"
# The nodes the scheduler has assigned your job
/bin/echo "NODES:$NODES"
# create lamhosts file
# This tells the lam environment what nodes you can use.
# The '-1' is neccessary to tell lam that we are going to start the mpi
# processes from the master node.
echo "-1" >> /tmp/lamnodes.$JOBID
for i in $NODES; do
echo "$i">> /tmp/lamnodes.$JOBID
done
# boot up lam -- absoluteley necessary
# This will start a lamd on each node.
lamboot -b -x -v /tmp/lamnodes.$JOBID
# run the mpi program with some args
# see mpirun --help for more details
# The recommended way of running MPI jobs utilizes one of the following two
# mpirun commands. Notice the use of only nodes 1-X and cpus 1-x. This tells
# lam to NOT execute a copy of the program on the master node.
# run one copy per cpu
mpirun c1-$NUMNODES $PROG
# run on copy per node
mpirun n1-$NUMNODES $PROG
# clean up -- not absoluteley necessary, but a darn good idea
lamclean -v
#end lam session -- absoluteley necessary
# if not done, running another mpi job may fail
lamhalt -v
|
Example |
[henken@beta alpha]$ mpicc -o cpi ~/cpi.c
[henken@beta alpha]$ cmsubmit -p 4 -t 10 cpi.sh
_______________________________________________________________
Welcome to Clubmask!!!
_______________________________________________________________
==================================================
Job Summary
==================================================
Job Identification Number: 0212153059.262914
Username: henken
Group: 2940
Job Classification: Batch
Number of Processors: 4
Maximum Wall-clock Run-Time (minutes): 10
Program: /home/henken/alpha/cpi.sh
Program Arguments:
Initial Working Directory: /home/henken/alpha
Job Output Directory: /home/henken
==================================================
Commit or Abort: ('c', 'a') ? c
please wait while database is updated...
please wait while database connection is closed...
|
Example |
[henken@beta alpha]$ more ~/0212153059.262914.stdout Starting job:0212153059.262914 with command:/usr/local/clubmask/packages/maui/run/0212153059.262914.program and args:./cpi to run on nodes:11 11 16 16 JOBID:0212153059.262914 NODES:11 11 16 16 NUMNODES: 4 LAM 6.6b2cvs/MPI 2 C++/ROMIO/bproc - Indiana University Executing hboot on n0 (-1 - 1 CPU)... Executing hboot on n1 (11 - 2 CPUs)... Executing hboot on n2 (16 - 2 CPUs)... topology done Process 2 on 16 Process 0 on 11 Process 1 on 11 Process 3 on 16 pi is approximately 3.1416009869231249, Error is 0.0000083333333318 wall clock time = 2.386186 killing processes, done closing files, done sweeping traces, done cleaning up registered objects, done sweeping messages, done LAM 6.6b2cvs/MPI 2 C++/ROMIO/bproc - Indiana University Shutting down LAM LAM halted |
The baic premise behind an Interactive job is asking the scheduler to reserve some set of nodes so that commands may be issued to the nodes from the command line. The first step is the request the nodes from the scheduler, as the following does for 4 processors for 30 minutes.
Example |
[henken@beta henken]$ cmsubmit -p 4 -I -t 30
_______________________________________________________________
Welcome to Clubmask!!!
_______________________________________________________________
==================================================
Job Summary
==================================================
Job Identification Number: 0212155550.389526
Username: henken
Group: 2940
Job Classification: Interactive
Number of Processors: 4
Maximum Wall-clock Run-Time (minutes): 30
Program:
Program Arguments:
Initial Working Directory: /home/henken
Job Output Directory: /home/henken
==================================================
Commit or Abort: ('c', 'a') ? c
please wait while database is updated...
please wait while database connection is closed..
|
Example |
ACTIVE JOBS--------------------
JOBNAME USERNAME STATE PROC REMAINING STARTTIME
0212155550.389526 henken Running 4 0:30:00 Wed Feb 12 15:55:54
15 Active Jobs 4 of 40 Processors Active (10.00%)
2 of 20 Nodes Active (10.00%)
IDLE JOBS----------------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
0 Idle Jobs
BLOCKED JOBS----------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
|
Example |
[henken@beta henken]$ cmgetnodes 0212160055.851849 11 11 16 16 |
Example |
[henken@beta henken]$ bpsh 11,16 hostname node11.internal node16.internal |
Example |
[henken@beta henken]$ canceljob 0212160055.851849 job '0212160055.851849' cancelled |
Note: The use of canceljob is not entirely necessary here, as the job will be stopped at the end of the allotted time, but it is a courtesy to the other users to return them when done. If your cluster administrator notices abuse of this 'nicety', the scheduler may be instructed to limit the time and number of Interactive jobs to prevent users from tying up the entire system and preventing the execution of scripted jobs. |
The scheduler is setup by your system administrator with a set of policies that will determine where and when your job can run.
cmsubmit is the program that is used to submit a job to the scheduler. The job will be added to the queue(s), and run when the scheduler determines the job is fit for execution.
There are two required options for cmsubmit, the number of processors and the estimated maximum time (in minutes) the job will take.
Note: Your cluster may contain nodes with multiple processors. |
Warning: Your cluster may be configured to kill jobs that exceed their wallclock limit, so you might want to overestimate by a bit. |
The program that you will use to submit your job is 'cmsubmit'. Below is a sample usage of cmsubmit. You can see that the job is requesting 128 processors and has a timelimit of 2160 minutes. The program to be run is noop.sh with the argument of 1000.
Example |
[henken@alpha ~/jobs]$ cmsubmit -p 128 -t 2160 noop.sh 1000 |
When you submit a job, cmsubmit will output some information and ask for confirmation. This is what you would see if you entered the above command. You will see that the job id for this job is 1926351674. You should also double check the data to make sure you entered in the variables correctly.
Example |
==================================================
Job Summary
==================================================
Job Identification Number: 1926351674
Username: henken
Job Classification: Batch
Number of Nodes: 128
Maximum Wall-clock Run-Time (minutes): 2160
Program/Job Script: /home/henken/jobs/noop.sh
Command Line Arguments: 2000
User henken: 128 processors for 2160 minutes.
==================================================
Commit or Abort: ('c', 'a') ? c
|
There are several optional arguments that you can give cmsubmit to control it's behavior a bit more.
Cmsubmit will search your script for lines starting with the text marker # CM, and use the arguments in those lines just as if you would have typed them in on the command line. This allows you to just have one arguement to cmsubmit, without having to remember the parameters for that job. Also, when running lots of different jobs, only one script for each job needs to be maintained, not a script to submit the job, as well as a script that is the job.
Example |
#!/bin/bash # CM -p 32 # CM -y # CM -t 123 # Echo variables from scheduler /bin/echo "Starting job:$JOBID on $NUMNODES nodes:$NODES" # now run bpsh on each node for node in $NODES; do bpsh $node hostname done |
You could now submit the above script by typing cmsubmit job.sh, with the args in the script would be the equivalent of doing cmsubmit -p 32 -y -t 123 job.sh on the command line.
Note: If you specify arguements on the command line and in the script, the arguments from the command line will override those in the script. |
Once you have submitted your job, the next command you will use is 'showq'. Showq will show you the running and idle queues in the scheduler.
Note: Your job may take up to one minute to appear in the queue after submission. |
Example |
ACTIVE JOBS-------------------- JOBNAME USERNAME STATE PROC REMAINING STARTTIME 1021213635.183226 feisha Running 2 1:51:50 Mon Oct 21 21:36:46 1021192537.570371 libin Running 2 2:07:00:54 Mon Oct 21 19:25:50 2 Active Jobs 4 of 46 Processors Active (8.70%) 2 of 23 Nodes Active (8.70%) IDLE JOBS---------------------- JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME 0 Idle Jobs BLOCKED JOBS---------------- JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME Total Jobs: 2 Active Jobs: 2 Idle Jobs: 0 Blocked Jobs: 0 |
To cancel a job, you will use the 'canceljob' canceljob. Canceljob takes the jobid of a job as it's argument.
Example |
[henken@beta henken]$ canceljob 0212155550.389526 job '0212155550.389526' cancelled |
There are a few more scheduler commands that you will find usefull. The links for each command are to the official Maui scheduler webpages.