Eniac 2000 Beowulf Cluster - Job and Resource Management Package Authors:Kevin Milsom, Rahul Dave (kevin,rahul@grove.cis.upenn.edu) Abstract This article's intent is to describe the current status of the Eniac 2000 cluster with the drawbacks of its current subsystems for scheduling and job management. The article will then describe the new implementation of an updated scheduler, a new job management subsystem, and a new resource management subsystem. Background and Definitions The Eniac 2000 (E2K) Beowulf cluster project is a research in progress to combine the resources on many individual computers to be used for computation intense work in fields such as astro-physics, chemistry, computational biology, and radiology (more information is available at www.beowulf.org and reno.cis.upenn.edu). The current E2K system configuration involves approximately 50 machines and over 100 processors. Before starting the discssion of the project it is useful to become familar with some terms/acronyms: BPROC an acronym which stands for Beowulf Distributed Process Space which is a software package which has an API for creating a process on a remote node in the cluster Node - a machine with 1-4 processors. different profiles exist for nodes depending on their usage: fileserver, computational, application server, database, login server Job - a set of processes running on a subset of the nodes in a cluster Process - a single unit of the job that runs on a node and is assigned a processor Scheduler - responsible for assigning jobs to nodes and processors, where each processor is assigned only one process Master - node which runs the BPROC master process and also the job manager and the job scheduler Slave - computation node which runs the BPROC slave and a resource manager Resource_Manager A daemon process which runs on each node to gather information concerning the resource availability in the system and the resource consumption of processes Job_Manager A daemon process which runs on the master node which is a single entry point for the construction/destruction of jobs and also an API for gathering resource information Introduction The current E2K system configuration allows users to start and stop jobs which must terminate before a hard pre-defined deadline. This functionality is provided with a stand-alone scheduling subsystem (a combination of cron scripts and daemons) and a simple job management subsystem. In order to fully utilize the cluster, the current scheduling system maps each process of a job to an individual processor on a machine (ie a quad processor machine would have at full utilization, four processes running on it). Users specify the number of nodes a job is to use and the scheduler allocated processors using a backfill algorithm. This algorithm is, in essence, scheduling based on a first-in-first-out (FIFO)queue with the exception that it will schedule smaller jobs on unused processors if it does not affect the ordering of other jobs that we put on the queue earlier. This simple backfill scheduling subsystem allows for earlier completion times for smaller jobs while not starving larger jobs. However it does not take into account the types of jobs its schedules and it could create a higly undesirable situation by scheduling two communication bound jobs on the same node. Also it is not possible to dynamically detect such a situation and re-schedule the two jobs. The job management subsystem that the scheduler currently uses is simply rsh, which connects to the remote node, kicks off the remote processes, and returns upon completion of the process. However, it is possible to lose a handle on the process if the process forks or runs in the background. From the scheduler's perspective, this creates a runaway process which can no longer be monitored and can create scheduling problems with the assumption that the process completed. To get of the ghost processes, a simple killing the user's processes on a node does not suffice as on a smp machine, it is possible for a user to own processes belonging to other jobs. The new job management and resource management subsystems (along with an updated scheduler) look to alleviate the problems above. The new job management subsystem, as a replacement for rsh, looks to solve the ghost/running away process syndrome by creating a new API with a global process space. This global process space is a single component which strives to simulate the as a single image to the user. The user will eventually perceive the cluster as a single machine and the global process space is the first step towards this goal. The new resource management subsystem will monitor the resources on a node and process basis. Node's resources are useful to the scheduler when considering the performance guarantees that a particular job may have. On a process level, monitoring the resources can be especially powerful if it is possible to dynamically change behavior given the resources allocated or available. In other words, it is possible to re-schedule jobs when their requirements change during the phases of its execution, based on the resources that it consumes and the resources available on the system. The proposed package would tie in with the scheduler for optimal scheduling (given the job classification) and also provide resource guarantees to user's job requests. Users will be able to dynamically change the behavior of their processes via a user library which ties into the resource management subsystem. Given feedback from the current resources available, users can contruct their process resource consumption differently at various checkpoints in parallel code. To support the backend of the job management and resource management subsystems a database-type driven data structure and language was created to store information pertaining to jobs and resources. Job Management The job management subsystem comprises of a daemon process that runs on the scheduling node and a user library which is utilized in writing user applications. The job management subsystem , specifically the daemon, relies heavily on a Beowulf distributed PROCess space package (BPROC) developed by Eric Hendriks (information available at http://www.beowulf.org/software/bproc.html). BPROC represents a black box replacement for the rsh method of creating jobs on remote nodes. Beowulf Distributed Process Space (BPROC) The purpose of BPROC is create a single process identifier (pid) list on a master node which will be the single point of administration over processes spanning the entire cluster. This is achieved by running daemons on all nodes which communicate the creation of jobs and the forwarding of signals, such as SIGKILL, to jobs. The APIs provided create jobs in two fashions: first to create a job locally and then move it the assigned node or start the job directly on the remote node. The former method allows the jobs to be moved to remote nodes where all user libraries might not be installed for each process. As mentioned earlier, creating a single process id list which is global across the cluster, is the first step in simulating the cluster as a single machine to the user. BPROC Research and Investigation BPROC patches the kernel (latest patch is for 2.2.13), which requires a re-compile and a reboot of the machine. To be useful, BPROC needs to be installed on the scheduler node (also called the master node), and on all of the computation nodes (the same compiled kernel can be used for all nodes). Upon booting the new kernel, BPROC requires the spawing of the master node process on the scheduler node and subsequently slave processes on the nodes (which communicate with the master process). BPROC also requires several devices (which can be installed via make devs) and runtime loadable kernel modules (which can be installed via cd kernel; make mods). The BPROC package creates a local ghost process on the master node for each remote process on a computation node. This lightweight ghost process forwards all signals (such as SIGSTOP, SIGCONT,a nd SIGKILL) to the remote process. This gives the job manager daemon full administrative control over all job processes directly from the master node. Job Management Daemon The BPROC API provides methods for creating remote processes given a node number, a command, and its arguments. However, from a job perspective, a job represents many processes located on many nodes and the job manager daemon hides behind a single job identifier which is returned to the user/scheduler. The job manager receives requests by binding to a socket and spawns a thread to handle each request. The job manager supports five requests: CreateJob For a job creation request, a thread is spawned and waits for the ghost process to complete, when it will notify the scheduler and update the database. QueryJob This is used as an immediate check on the resources associated with either a node or a job. The job manager first asks the database for the bproc process id of the resource manager associated with the node that the job is running on and then forces the resource manager to immediately check the resources on the node. The resource manager updates the status of the node in the database and the job manager returns the newly accumulated data back to the requestor. This request is an extension of the user resource manager library which adds the ability to look at real time resource data. KillJob,StopJob,ContJob The job manager sends a kill(2) sigkill/sigstop/sigcont signal to the process id specified and updates the database. Job Management User Library The job managerment user library can be used to send the appropriate message to the job manager : src/job_mgmt/user_job_mgmt.h Sample User Code Below is three sets of tests which tests creation a simple job, tests creation of a bad job, and make a complicated test with multiple hosts: src/job_mgmt/test_job_mngr.c Resource Management The resource management subsystem is a wrapper around reading system status and job status information from /proc . The resource managers are run on each node and are created as bproc processes. As a bproc process, there will be a ghost process on the master node which will forward all signals to the remote resource manager. Signals will be the communication mechanism between the job manager and the resource managers for forcing the resource managers to send immediate updates to the database. Resource Management Daemon The resource managers use an API provided by glibtop to access system status information. However for process status information, it needs to identify which processes are asssociated with bproc. BPROC stores the associated pid lists on a system device. However, only bproc can access this information, so the slave code was modified to add a thread which reads pids from a socket and returns the associated bproc process on the master (if success) or -1 on failure. Therefore the resource managers update the database for only bproc processes. Future funtionality of the resource managers might include responsibility for setting hard limits for resource consumption on a process, based on the resources originally requested. Another component of the resource managers might be to contact the job manager when a process overconsumes its resources. (The job manager would respond to such situations by sending a SIGKILL to the ghost process.) Central to the handling of resource requests and retrievals is an e2k_resource structure. It is tied to a particular class, instance, and method like the communication library below but it also contains a hash of key/value pairs corresponding to resource attributes for a node or pid and their values. Users can access information from e2k_resources by using the get method with an attribute to lookup, a char* is returned and the user must convert to the appropriate datatype. Resource Management User Library The user resource management library provides the ability for check resources on nodes or process ids either from historical data in the database (asynchrono usly), an immedate update, using the job manager, (synchronously), or to setup an update feature for any time the data changes in the database (which needs to be removed eventually). The attrlist argument to some of the functions is the list of resource information one wishes to retrieve. Basically the attrlist is used to create an e2k_resource structure which is then returned with data. src/resource_mgmt/user_resource_mgmt.h Sample User Code Below is a simple piece of code that changes an applications behavoir based on the resource available on a machine: src/resource_mgmt/test_resource_mgmt.c Embedded Database and Communication Backend The database server is a python (asyncore module) server which saves data in database files with a separate module called metakit and also using a cache in memory for efficiency. To communicate the addition, removal, querying, and updating of information in the database server a custom procotol has been designed. (Note: The database server and protocol design originated with previous completed work on the cluster.) The database is exported via a simple python metaclass framework, which bootstraps by creating the databases and tables themselves via this framework. The database classes have built-in constraint and type checking along with the ability to specify some arguments as mandatory and others as optional (default values are set for optional arguments not specified). The protocol is simple and dynamic enough to accomodate the needs of all subsystems involved: namely the scheduler, job management, resource management, and finally the database backend. The syntax for the protocol is as follows: Class - This is the classification of the data that is transmitted which corresponds to a python class on the server Instance Name - This is the identifier of this instance in the database Operation/Method - This the action for the database server to take which include: add, remove, set, get, create, delete Optional Arguments - These arguments are space delimited and are key value pairs which are used to handle the operation specified before. A sample query to the database might be: jobid 1000 set owner=kevin name=/bin/complain pids=1,2,3 start_time=454327545 where the class is 'jobid', the name is '1000', the method is 'set', and the optional arguments are 'owner=kevin name=/bin/complain pids=1,2,3 start_tim e=454327545' Dependencies Each of these new subsystems has a set of dependencies in order for the system to run. This is a simple developer's perspective on these dependencies: Resource Management BPROC slave process needs to be running as the resource manager querys the slave process to identify bproc masqueraded processes glibtop is the tool used to monitor the resources of the system custom database server must be running Job Management BPROC master process needs to be running. custom database server must be running glib library