Clubmask User Guide

Author: Nicholas Henke

University of Pennsylvania
Liniac Project
<henken@seas.upenn.edu>

Author: Daniel Widyono

University of Pennsylvania
Liniac Project
<widyono AT cis.upenn.edu>

2003-06-27

Revision History
Revision 0.9.1 2003-06-27 NH
Updates & cleanups. We are working towards a release ready document
Revision 0.9 2003-03-07 NH
add section to cmsubmit about placing the arguments directly in the script
Revision 0.8 2003-02-12 NH
re-organize into a section per type type
Revision 0.7.1 2002-11-01 NH
add MPI and Interactive job help
Revision 0.7 2002-10-21 NH
first draft

Abstract

This guide will help you to get familiar with the user environment, scheduling commands, and job parameters that are used by Clubmask. If you have any ideas or questions about this document, please feel free to email us at clubmask-users AT lists DOT sourceforge DOT net.


Table of Contents

1. Introduction
1.1. Nodes
1.2. What is a job?
2. Environment Variables
2.1. JOBID
2.2. NODES
2.3. NUMNODES
3. Clubmask Process Environment
3.1. bpsh
3.2. ssh
3.3. bpsh vs ssh
4. Running Jobs
4.1. Script jobs vs Interactive
4.2. Batch Job
4.3. MPI Job
4.4. Interactive job
5. Job Submission and Cancelation
5.1. cmsubmit
5.2. showq
5.3. canceljob
5.4. Additional Commands
6. Samples

1. Introduction

There are a few basic ideas that need to be understood before using a cluster.

1.1. Nodes

A node is a machine. The machine may have multiple processors, differnet network adapters, and varying amounts of ram and diskspace, but all in all, it is just a machine. Simple eh?

1.2. What is a job?

A job is a basic 'unit of work'. Jobs typically consist of common computations grouped together.

Job Types

Batch Job
A Batch job is a job that consists of computations that are run from a shell script which runs individual computations on nodes. Batch jobs are usually scripts that start the computations on the nodes, without any user interaction. A Batch job can also be a script to setup and run an MPI program, or someother type of paralell job.
Interactive Job
An Interactive job is a job where the nodes are reserved for the user to log into and issues commands by hand. Interactive jobs should only be used for testing batch scripts, as they will decrease the utilization of your cluster.

2. Environment Variables

The environment that your job will see is the same envrionment that was used to submit the job. In addition, there are a few environment variables that Clubmask uses to inform your job script.

2.1. JOBID

The JOBID is the numerical ID that was assigned to your job at submission time. EX: 1015142526.538906

2.2. NODES

NODES: space delimited string of node numbers that your job has been assigned. The NODES variable is how you determine which nodes of the cluster your job will occupy. There is no guarantee that NODES will contain contiguous nodes, as the nodes may be scattered across the cluster. Example: NODES="0 0 5 5 2 2 9 9"

Note: If your job has been assigned a dual cpu machine, the node number for it will appear twice, similarily for nodes with quad cpus, it will appear 4 times.

2.3. NUMNODES

NUMNODES is the number of nodes in NODES. If NODES="1 1 2 2 3 3", NUMNODES will be 6.

3. Clubmask Process Environment

Clubmask is based on the Single System Image (SSI) paradigm of Beowulf clusters. When using the SSI method of executing processes on the nodes, the cluster utilzes the head node as a starting point for all of the processes. Yes that is right -- the head node is used to start any process you wish to run on the nodes. There are 2 ways that the processes can be started, either via 'bpsh', or 'ssh'.

3.1. bpsh

Clubmask uses a distributed process control layer called bproc. Bproc allows you to see all of the processes running on remote nodes as if they were running on the main login server. To start programs on nodes using Bproc, you will need to use the command 'bpsh'. In Clubmask, the nodes are numbered starting at 0 and increase as per your setup. Bpsh uses that node number when executing a process.

Example

[henken@alpha henken]$ bpsh 2 hostname
node2
        

3.1.1. Ghost Processes

When executing your processes on remote nodes, Bproc places a ghost process on the front-end machine that are used to communicate with the remote process. This ghost process is basically a place holder for the real remote process. For more inforamation, refer to the Bproc documentation.

This is easier to see in the following example, as you can see the process in the ps output on the front-end, but the output indicates the process is actually running on the remote node.

Example

[root@struggles clubmask]# bpsh 1 sleep 5
            

Here is the output from ps:

Example

[root@struggles autoconf-2.57]# ps -axf
  PID TTY      STAT   TIME COMMAND
  5289 pts/9    S      0:00  bpsh 1 sleep 5
  5291 pts/9    SW     0:00  \_ [sleep]
            

As you can see, the sleep process is running as if it was a kernel thread ( the square brackets ). This sleep process is actually running on node 1.

3.2. ssh

Ssh is the standard secure shell for linux. As stated above, the nodes are numbered starting at 0, but for ssh to correctly resolve the nodenumber to a host, you must prefix the word 'node' to the beginning of the nodenumber.

Example

[henken@alpha henken]$ ssh node2 hostname
node2
        

Important: Your cluster may not be setup to use SSH, or your administrator may have setup a different node number prefix than 'node'.

3.3. bpsh vs ssh

There are several tradeoffs between bpsh and ssh. First and foremost is the difference in the time in which each takes to start executing the process on the node. Bpsh is several times faster than ssh, and it supports complex node list arguments that allow the processes to be started in near parallel fashion.

Example

[henken@testcluster henken]$ time bpsh 1-5,8,19-31 hostname
node1.internal
node2.internal
node3.internal
node4.internal
node5.internal
node8.internal
node19.internal
node20.internal
node21.internal
node22.internal
node23.internal
node24.internal
node25.internal
node26.internal
node27.internal
node28.internal
node29.internal
node30.internal
node31.internal

real    0m0.082s
user    0m0.040s
sys     0m0.170s
        
To achieve the same thing via ssh you would need to invoke a separate ssh for each node.

Example

[henken@testcluster henken]$ time for node in 1 2 3 4 5 8 19 20 21 22 23 24 25 26 27 28 29 30 31; do ssh node$node hostname; done
node1.internal
node2.internal
node3.internal
node4.internal
node5.internal
node8.internal
node19.internal
node20.internal
node21.internal
node22.internal
node23.internal
node24.internal
node25.internal
node26.internal
node27.internal
node28.internal
node29.internal
node30.internal
node31.internal

real    0m8.004s
user    0m3.830s
sys     0m0.150s
        
As you can see the ssh version took just over 8 seconds, where the bpsh invocation took just under .1 seconds.

Now that I have sung the praises of bpsh, it does have one downfall. Java does not like it. There are a few solutions, my favorite being to just not use Java :). The other is to use ssh for anything that you need to do in Java.

4. Running Jobs

Here we will walk you through the necessary steps to create, submit, and run jobs. The 3 basic styles of jobs that will be covered are Batch, MPI, and Interactive. Each job style has it's own set of required steps, so be sure to read all of the steps for the type of job you want to run.

4.1. Script jobs vs Interactive

In both MPI and Batch style jobs, there is a script that is written and submitted to the scheduler that does ALL of the work that the job needs to do. In comparison, in an Interactive job, the scheduler merely reserves the nodes and changes the permissions on each so that you can run commands on the nodes. Interactive jobs are usually used so that a person can become familiar with the computing environment and test their code before attempting a long production run.

When a script is submitted to the scheduler, the scheduler will take the request, and process it through it's algorithms and determine on which nodes it will run, and at what time the job will start. The same is done for an Interactive job as well.

Note: A common misconception is that a Batch or MPI style job will just setup the nodes for a job, and then allow you to run commands from the command line. This is simply not true. If you are going to use Batch or MPI jobs, the script that is submitted to the scheduler must do all of the work. The scheduler will automatically detect when the script has exited, and return the nodes to the scheduler for further use.

4.1.1. Job Output

In a script job, there is always some output that is written to either standard out, stdout, or standard error,stderr. In Clubmask, the two output streams are combined into one stream, and redirected to a file. This file is named $JOBID.stdout. If you submitted a job, and were assigned JOBID 0212111336.261588, the output would be redirected to a file named 0212111336.261588.stdout. This file is usually placed in the home directory of the user, but the placement can be changed with job submission arguments. See cmsubmit for more details.

4.2. Batch Job

The most basic Batch job will take a given set of nodes and run a command on each node. The following Batch job will use bpsh to run 'hostname' on each node.

4.2.1. Batch Job Script

A Batch job script needs to parse the environemnt variables given by the scheduler, and start the processes.

Example

#!/bin/bash
# Echo variables from scheduler
/bin/echo "Starting job:$JOBID on $NUMNODES nodes:$NODES"

# now run bpsh on each node
for node in $NODES;
do
	bpsh $node hostname
done
	

Note: It is worthy to mention that the above script executes the the bpsh's in serial, not in parallel. If this concerns you, please refer to the advanced section on parallelism

4.2.2. Submitting the Batch Job

Now that we have a script for the scheduler to run, we need to submit it to be run. There are 2 important arguments that are needed when submitting a script. These are the number of nodes on which to run, and the wall clock duration of the job. To run a job on 4 processors for 10 minutes:

Example

[henken@testcluster jobs]$ cmsubmit -p 4 -t 10 basic.sh 
_______________________________________________________________

Welcome to Clubmask!!!
_______________________________________________________________

==================================================
Job Summary
==================================================
Job Identification Number:  0212141641.546417
Username:  henken
Group:  2940
Queue:  Batch
Number of Processors:  4
Maximum Wall-clock Run-Time (minutes):  10
Program:  /mnt/io1/genomics/home/henken/jobs/basic.sh
Program Arguments:  
Initial Working Directory: /mnt/io1/genomics/home/henken/jobs
Job Output Directory: /home/henken
==================================================
Commit or Abort: ('c', 'a') ? c
please wait while database is updated...
please wait while database connection is closed...
	
Now that the job is submitted, we want to know if the job is executing. The use of the showq command will show the state of the scheduler.

Example

[henken@testcluster jobs]$ showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

0212141641.546417    henken    Running     4     0:10:00  Wed Feb 12 14:17:31

     1 Active Job        4 of   64 Processors Active (6.25%)
                         2 of   32 Nodes Active      (6.25%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


Total Jobs: 1   Active Jobs: 1   Idle Jobs: 0   Blocked Jobs: 0
	
You can see that the job is executing.

4.2.3. Batch Job Output

Now that the job is running, you may wish to watch the progress of the job. This can be done by viewing the stdout file for the job.

Example

[henken@testcluster jobs]$ tail -f ~/0212141641.546417.stdout 
Starting job:0212141641.546417 with command:/usr/local/clubmask/packages/maui/run/0212141641.546417.program and args: to run on
nodes:1 1 21 21
Starting job:0212141641.546417 on 4 nodes:1 1 21 21
node1.internal
node1.internal
node21.internal
node21.internal
	
You can now watch the your job produce output as it executes. The job will run to completion, and when the scheduler detects that all of the proceeses are finished, your job will be marked as done, with the nodes that were assigned placed back in the queue for further use.

Note: The stdout file above will contain all of the output for your job. It is there you should look for any errors.

4.3. MPI Job

The MPI environment in Clubmask is lam-mpi. In order to write a script for a MPI job, the basics described above in the Batch job section will be used.

The difference between a basic Batch job and an MPI job lie in the script that is submitted to the scheduler. In order to use lma-mpi, there are a few specific setup steps, as well as a different method of running the program. Below is a sample MPI script that will run the classic MPI sample program cpi. This program can be found in the samples section. Please pay carefull attention to the comments in the script for a description of what each step is doing.

Example

#!/bin/bash

# prog name
PROG="/home/henken/pi"

# Your jobid from the scheduler
/bin/echo "JOBID:$JOBID"

# The nodes the scheduler has assigned your job
/bin/echo "NODES:$NODES"

# create lamhosts file 
# This tells the lam environment what nodes you can use.
# The '-1' is neccessary to tell lam that we are going to start the mpi
# processes from the master node.

echo "-1" >> /tmp/lamnodes.$JOBID
for i in $NODES; do
        echo "$i">> /tmp/lamnodes.$JOBID
done

# boot up lam -- absoluteley necessary
# This will start a lamd on each node.
lamboot -b -x -v /tmp/lamnodes.$JOBID

# run the mpi program with some args
# see mpirun --help for more details

# The recommended way of running MPI jobs utilizes one of the following two
# mpirun commands. Notice the use of only nodes 1-X and cpus 1-x. This tells
# lam to NOT execute a copy of the program on the master node.

# run one copy per cpu
mpirun c1-$NUMNODES $PROG

# run on copy per node
mpirun n1-$NUMNODES $PROG

# clean up -- not absoluteley necessary, but a darn good idea
lamclean -v

#end lam session  -- absoluteley necessary
# if not done, running another mpi job may fail
lamhalt -v
		

The job can be submitted as the Batch job was.

Example

[henken@beta alpha]$ mpicc -o cpi ~/cpi.c 
[henken@beta alpha]$ cmsubmit -p 4 -t 10 cpi.sh              
_______________________________________________________________

Welcome to Clubmask!!!
_______________________________________________________________

==================================================
Job Summary
==================================================
Job Identification Number:  0212153059.262914
Username:  henken
Group:  2940
Job Classification:  Batch
Number of Processors:  4
Maximum Wall-clock Run-Time (minutes):  10
Program:  /home/henken/alpha/cpi.sh
Program Arguments:
Initial Working Directory: /home/henken/alpha
Job Output Directory: /home/henken
==================================================
Commit or Abort: ('c', 'a') ? c
please wait while database is updated...
please wait while database connection is closed...
		
And now for the ever interesting output.

Example

[henken@beta alpha]$ more ~/0212153059.262914.stdout 
Starting job:0212153059.262914 with command:/usr/local/clubmask/packages/maui/run/0212153059.262914.program and args:./cpi to run on
nodes:11 11 16 16
JOBID:0212153059.262914
NODES:11 11 16 16
NUMNODES: 4

LAM 6.6b2cvs/MPI 2 C++/ROMIO/bproc - Indiana University

Executing hboot on n0 (-1 - 1 CPU)...
Executing hboot on n1 (11 - 2 CPUs)...
Executing hboot on n2 (16 - 2 CPUs)...
topology done      
Process 2 on 16
Process 0 on 11
Process 1 on 11
Process 3 on 16
pi is approximately 3.1416009869231249, Error is 0.0000083333333318
wall clock time = 2.386186
killing processes, done      
closing files, done      
sweeping traces, done      
cleaning up registered objects, done      
sweeping messages, done      

LAM 6.6b2cvs/MPI 2 C++/ROMIO/bproc - Indiana University

Shutting down LAM
LAM halted
		

4.4. Interactive job

The baic premise behind an Interactive job is asking the scheduler to reserve some set of nodes so that commands may be issued to the nodes from the command line. The first step is the request the nodes from the scheduler, as the following does for 4 processors for 30 minutes.

Example

[henken@beta henken]$ cmsubmit -p 4 -I -t 30
_______________________________________________________________

Welcome to Clubmask!!!
_______________________________________________________________

==================================================
Job Summary
==================================================
Job Identification Number:  0212155550.389526
Username:  henken
Group:  2940
Job Classification:  Interactive
Number of Processors:  4
Maximum Wall-clock Run-Time (minutes):  30
Program:  
Program Arguments:  
Initial Working Directory: /home/henken
Job Output Directory: /home/henken
==================================================
Commit or Abort: ('c', 'a') ? c
please wait while database is updated...
please wait while database connection is closed..
	
Now that the nodes are requested, you must wait until the job is started. The use of the showq and checkjob commands will give you a good idea when it will start.

Example

ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

0212155550.389526    henken    Running     4     0:30:00  Wed Feb 12 15:55:54

    15 Active Jobs      4 of   40 Processors Active (10.00%)
                        2 of   20 Nodes Active      (10.00%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

	
Now that the job has reserved some nodes for us, we need to find what nodes we own. The cmgetnodes function takes a JOBID as an arguments, and returns the nodes for that job

Example

[henken@beta henken]$ cmgetnodes 0212160055.851849
11 11 16 16
	
Now that we 'own' nodes 11 and 16, it is time to do some work on them.

Example

[henken@beta henken]$ bpsh 11,16 hostname
node11.internal
node16.internal
	
With that exhausting amount of computation done, we will release the nodes back into the queue.

Example

[henken@beta henken]$ canceljob 0212160055.851849


job '0212160055.851849' cancelled

	

Note: The use of canceljob is not entirely necessary here, as the job will be stopped at the end of the allotted time, but it is a courtesy to the other users to return them when done. If your cluster administrator notices abuse of this 'nicety', the scheduler may be instructed to limit the time and number of Interactive jobs to prevent users from tying up the entire system and preventing the execution of scripted jobs.

5. Job Submission and Cancelation

The scheduler is setup by your system administrator with a set of policies that will determine where and when your job can run.

5.1. cmsubmit

cmsubmit is the program that is used to submit a job to the scheduler. The job will be added to the queue(s), and run when the scheduler determines the job is fit for execution.

5.1.1. Required Options

There are two required options for cmsubmit, the number of processors and the estimated maximum time (in minutes) the job will take.

-p #procs
The number of processors that your job requires to run.

Note: Your cluster may contain nodes with multiple processors.

-t #time_in_min
This is the maximum time in minutes that your job will require.

Warning: Your cluster may be configured to kill jobs that exceed their wallclock limit, so you might want to overestimate by a bit.

5.1.2. Submitting the job.

The program that you will use to submit your job is 'cmsubmit'. Below is a sample usage of cmsubmit. You can see that the job is requesting 128 processors and has a timelimit of 2160 minutes. The program to be run is noop.sh with the argument of 1000.

Example

[henken@alpha ~/jobs]$ cmsubmit -p 128 -t 2160 noop.sh 1000
		

When you submit a job, cmsubmit will output some information and ask for confirmation. This is what you would see if you entered the above command. You will see that the job id for this job is 1926351674. You should also double check the data to make sure you entered in the variables correctly.

Example

==================================================
Job Summary
==================================================
Job Identification Number:  1926351674
Username:  henken
Job Classification:  Batch
Number of Nodes:  128
Maximum Wall-clock Run-Time (minutes):  2160
Program/Job Script:  /home/henken/jobs/noop.sh
Command Line Arguments:  2000
User henken: 128 processors for 2160 minutes.
==================================================
Commit or Abort: ('c', 'a') ? c
		

5.1.3. Optional Arguments

There are several optional arguments that you can give cmsubmit to control it's behavior a bit more.

-c
Select a particular queue (preset by system administrators)
-d /path/to/somewhere
Change the directory to where the JOBID.stdout file will be placed. The default is the user's home directory.
-g group_name
If you are a member of multiple unix user groups, you can have the job executed under any of them. The default is your primary group.
-h
This will display the help for cmsubmit.
-I
This will request the nodes in Interactive mode. Any script, arguments, or directory information passed to cmsubmit will be ignored.
-q
This enables super-quiet mode. Cmsubmit will not print the job summary if this is enabled.
-w /path/to/somewhere
Change the initial working directory if you would like the job script started in a specific directory. The default is the current working directory that cmsubmit was executed from.
-y
Bypass the job submission confirmation question.

5.1.4. Placing arguments in the script

Cmsubmit will search your script for lines starting with the text marker # CM, and use the arguments in those lines just as if you would have typed them in on the command line. This allows you to just have one arguement to cmsubmit, without having to remember the parameters for that job. Also, when running lots of different jobs, only one script for each job needs to be maintained, not a script to submit the job, as well as a script that is the job.

Example

#!/bin/bash
# CM -p 32
# CM -y
# CM -t 123

# Echo variables from scheduler
/bin/echo "Starting job:$JOBID on $NUMNODES nodes:$NODES"

# now run bpsh on each node
for node in $NODES;
do
	bpsh $node hostname
done
		

You could now submit the above script by typing cmsubmit job.sh, with the args in the script would be the equivalent of doing cmsubmit -p 32 -y -t 123 job.sh on the command line.

Note: If you specify arguements on the command line and in the script, the arguments from the command line will override those in the script.

5.2. showq

Once you have submitted your job, the next command you will use is 'showq'. Showq will show you the running and idle queues in the scheduler.

Note: Your job may take up to one minute to appear in the queue after submission.

Here is an example output from showq, the ACTIVE jobs are jobs that are currently running, IDLE jobs are those waiting for free nodes on which to run, and BLOCKED jobs are those jobs failing to run.

Example

ACTIVE JOBS--------------------
		   JOBNAME USERNAME      STATE  PROC   REMAINING            STARTTIME

 1021213635.183226   feisha    Running     2     1:51:50  Mon Oct 21 21:36:46
 1021192537.570371    libin    Running     2  2:07:00:54  Mon Oct 21 19:25:50

	 2 Active Jobs       4 of   46 Processors Active (8.70%)
						 2 of   23 Nodes Active      (8.70%)

IDLE JOBS----------------------
		   JOBNAME USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


0 Idle Jobs

BLOCKED JOBS----------------
		   JOBNAME USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


Total Jobs: 2   Active Jobs: 2   Idle Jobs: 0   Blocked Jobs: 0
		
For more information on showq, see showq

5.3. canceljob

To cancel a job, you will use the 'canceljob' canceljob. Canceljob takes the jobid of a job as it's argument.

Example

[henken@beta henken]$ canceljob 0212155550.389526


job '0212155550.389526' cancelled

		

5.4. Additional Commands

There are a few more scheduler commands that you will find usefull. The links for each command are to the official Maui scheduler webpages.

  • canceljob cancel a job .

  • checkjob provide detailed status report for specified job.

  • showbf show backfill window - show resources available for immediate use.

  • showq show queued jobs.

  • showres show existing reservations.

  • showstart show estimates of when job can/will start.

6. Samples

Listed here are the scripts and programs that are used in this user guide.

Samples

Simple Batch Job
basic.sh
MPI job script
cpi.sh
MPI test cpi program
cpi.c