Submitting jobs to a cluster in Sun Grid Engine

by Forrest Sheng Bao http://fsbao.net

I am only talking about SERIAL jobs! Parallel jobs may not work!

Texas Tech deployed a new cluster, Grendel, the 289th fastest computer in the world, this October. Recently I have been doing research on boosting algorithms with SVMs, which requires a lot of cross-validation computation. I plan to train the weak classifiers independently on different cores, so I have a dozen serial jobs. Grendel's jobs are managed by Sun Grid Engine (SGE), which distributes jobs to nodes in the cluster. After reading the Sun Grid Engine documentation and with the help of some friends, I figured out how to submit jobs.

The job submission command is called qsub. You need to write a few lines like this (I prefer using a script, whose convenience will show later):

#!/bin/sh
# The "#$" lines below are SGE directives parsed by qsub:
# -V exports my environment variables to the job, -cwd runs it from the
# current directory, -S sets the job shell, -N names the job, -o and -e
# set the stdout/stderr files, -q picks the queue, and -pe requests 8
# slots in the "fill" parallel environment so the whole node is mine.
#$ -V
#$ -cwd
#$ -S /bin/bash
#$ -N nwchem
#$ -o $JOB_NAME.o$JOB_ID
#$ -e $JOB_NAME.e$JOB_ID
#$ -q normal
#$ -pe fill 8
# Launch the 8 svm-train runs in the background on the assigned node.
ssh $HOSTNAME "cd $PWD;/home/bao/libsvm/svm-train -v 14872 -c 2048 -g 0.5 1eq.scale.txt 1>1eq.out 2>&1" &
ssh $HOSTNAME "cd $PWD;/home/bao/libsvm/svm-train -v 14872 -c 128 -g 2 2eq.scale.txt 1>2eq.out 2>&1" &
ssh $HOSTNAME "cd $PWD;/home/bao/libsvm/svm-train -v 14872 -c 32 -g 2 3eq.scale.txt 1>3eq.out 2>&1" &
ssh $HOSTNAME "cd $PWD;/home/bao/libsvm/svm-train -v 14872 -c 32 -g 2 4eq.scale.txt 1>4eq.out 2>&1" &
ssh $HOSTNAME "cd $PWD;/home/bao/libsvm/svm-train -v 14872 -c 128 -g 2 5eq.scale.txt 1>5eq.out 2>&1" &
ssh $HOSTNAME "cd $PWD;/home/bao/libsvm/svm-train -v 14872 -c 2048 -g 0.5 6eq.scale.txt 1>6eq.out 2>&1" &
ssh $HOSTNAME "cd $PWD;/home/bao/libsvm/svm-train -v 14872 -c 32 -g 2 7eq.scale.txt 1>7eq.out 2>&1" &
ssh $HOSTNAME "cd $PWD;/home/bao/libsvm/svm-train -v 14872 -c 32 -g 2 8eq.scale.txt 1>8eq.out 2>&1" &
# Poll once a minute; keep this script (and thus the SGE job) alive
# until no ssh worker processes remain.
run=8
while [ "$run" -ge 1 ]
do
sleep 60
run=`ps ux | grep $HOSTNAME | grep -v grep | wc -l`
done

The eight ssh lines are the actual jobs; the loop at the end keeps the script alive until they all finish. Use qsub your_script to submit the job. Then you can see 8 processes running on that node:

top - 04:10:38 up 11 days, 12:39,  1 user,  load average: 8.13, 5.50, 2.47
Tasks: 256 total, 9 running, 247 sleeping, 0 stopped, 0 zombie
Cpu(s): 99.8%us, 0.2%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16431924k total, 2307148k used, 14124776k free, 219332k buffers
Swap: 1020116k total, 0k used, 1020116k free, 971428k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11210 bao 25 0 120m 107m 1044 R 100 0.7 0:37.68 svm-train
11218 bao 25 0 120m 108m 1044 R 100 0.7 0:37.68 svm-train
11245 bao 25 0 120m 108m 1044 R 100 0.7 0:37.46 svm-train
11255 bao 25 0 122m 110m 1044 R 100 0.7 0:37.52 svm-train
11221 bao 25 0 121m 108m 1044 R 100 0.7 0:37.68 svm-train
11237 bao 25 0 120m 108m 1044 R 100 0.7 0:37.60 svm-train
11240 bao 25 0 120m 108m 1044 R 100 0.7 0:37.35 svm-train
11217 bao 25 0 121m 109m 1044 R 98 0.7 0:37.17 svm-train
5603 sge 15 0 14608 2140 1544 S 2 0.0 0:10.77 sge_execd
11444 bao 15 0 12716 1176 792 R 0 0.0 0:00.15 top
1 root 15 0 10312 628 532 S 0 0.0 0:01.91 init
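Incidentally, since the ssh commands are backgrounded children of the submit script itself, the polling loop at the end could be replaced by the shell's built-in wait, which blocks until all background children exit. A minimal sketch, using sleep as a stand-in for the real svm-train jobs:

```shell
#!/bin/sh
# Stand-ins for the backgrounded ssh/svm-train workers:
sleep 1 &
sleep 2 &
sleep 1 &
# wait blocks until every background child of this shell has exited,
# so the SGE job does not end before the workers do.
wait
status="all workers finished"
echo "$status"
```

This avoids both the 60-second polling granularity and the fragile ps-and-grep process counting.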

There are some other necessary options, like -q (which queue to submit your job to). You can refer to Sun's documentation: http://wikis.sun.com/display/GridEngine/Grid+Engine
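The same directives can also be given on the qsub command line instead of being embedded in the script; command-line flags take precedence over the embedded "#$" lines. A sketch (the queue and job name here are just examples):

```shell
# Submit the script to the "normal" queue under the name "ten.1-8";
# these flags override any matching "#$" directives in the script.
qsub -q normal -N ten.1-8 your_script
```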

To see the status of your jobs, simply type qstat. If a job is running, you will see something like this:

grendel:bao:/libsvm/10fold$ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
508 0.56000 ten.1-8 bao r 12/29/2008 17:37:12 development@compute-7-3.local 8

By default you will see two files, like ten.1-8.o493 and ten.1-8.e493, which hold the standard output and standard error of the job. If you open the *.o* file, you can see what would have been printed to the shell. Of course, clusters use a shared filesystem like NFS, so don't worry about the files generated by your own program; they will be visible from any node.

To kill a job, just type qdel followed by your job ID, e.g., qdel 508.

There is a GUI called qmon, which lets you monitor all jobs in a graphical window. Awesome! I feel like the commander of an army. If a job stays in pending status for too long, you can click the "Why?" button to check what the problem is. You can even submit jobs through the GUI and save all the parameters to a file for later use.
