Submitting jobs to a cluster in Sun Grid Engine

by Forrest Sheng Bao http://fsbao.net

I am only talking about SERIAL jobs! Parallel jobs may not work!

Texas Tech deployed a new cluster, Grendel, the 289th fastest computer in the world, this October. Recently I have been doing research on boosting algorithms with SVMs, which requires a lot of cross-validation computation. I plan to train the weak classifiers independently on different cores, so I have a dozen serial jobs. Grendel's jobs are managed by Sun Grid Engine (SGE), which distributes jobs to nodes in the cluster. After reading the Sun Grid Engine documentation and with the help of some friends, I figured out how to submit jobs.

The job submission command is called qsub. You need to write a few lines like this (I prefer using a script, whose convenience will show later):

#!/bin/sh
# The "#$" lines below are SGE directives parsed by qsub:
# -V exports my environment variables to the job, -cwd runs it from the
# current directory, -S sets the job shell, -N names the job, -o and -e
# set the stdout/stderr files, -q picks the queue, and -pe requests 8
# slots in the "fill" parallel environment so the whole node is mine.
#$ -V
#$ -cwd
#$ -S /bin/bash
#$ -N nwchem
#$ -o $JOB_NAME.o$JOB_ID
#$ -e $JOB_NAME.e$JOB_ID
#$ -q normal
#$ -pe fill 8
# Launch the 8 svm-train runs in the background on the assigned node.
ssh $HOSTNAME "cd $PWD;/home/bao/libsvm/svm-train -v 14872 -c 2048 -g 0.5 1eq.scale.txt 1>1eq.out 2>&1" &
ssh $HOSTNAME "cd $PWD;/home/bao/libsvm/svm-train -v 14872 -c 128 -g 2 2eq.scale.txt 1>2eq.out 2>&1" &
ssh $HOSTNAME "cd $PWD;/home/bao/libsvm/svm-train -v 14872 -c 32 -g 2 3eq.scale.txt 1>3eq.out 2>&1" &
ssh $HOSTNAME "cd $PWD;/home/bao/libsvm/svm-train -v 14872 -c 32 -g 2 4eq.scale.txt 1>4eq.out 2>&1" &
ssh $HOSTNAME "cd $PWD;/home/bao/libsvm/svm-train -v 14872 -c 128 -g 2 5eq.scale.txt 1>5eq.out 2>&1" &
ssh $HOSTNAME "cd $PWD;/home/bao/libsvm/svm-train -v 14872 -c 2048 -g 0.5 6eq.scale.txt 1>6eq.out 2>&1" &
ssh $HOSTNAME "cd $PWD;/home/bao/libsvm/svm-train -v 14872 -c 32 -g 2 7eq.scale.txt 1>7eq.out 2>&1" &
ssh $HOSTNAME "cd $PWD;/home/bao/libsvm/svm-train -v 14872 -c 32 -g 2 8eq.scale.txt 1>8eq.out 2>&1" &
# Poll once a minute; keep this script (and thus the SGE job) alive
# until no ssh worker processes remain.
run=8
while [ "$run" -ge 1 ]
do
sleep 60
run=`ps ux | grep $HOSTNAME | grep -v grep | wc -l`
done

The eight ssh lines are the actual jobs; the loop at the end keeps the script alive until they all finish. Use qsub your_script to submit the job. Then you can see 8 processes running on that node:

top - 04:10:38 up 11 days, 12:39,  1 user,  load average: 8.13, 5.50, 2.47
Tasks: 256 total, 9 running, 247 sleeping, 0 stopped, 0 zombie
Cpu(s): 99.8%us, 0.2%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16431924k total, 2307148k used, 14124776k free, 219332k buffers
Swap: 1020116k total, 0k used, 1020116k free, 971428k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11210 bao 25 0 120m 107m 1044 R 100 0.7 0:37.68 svm-train
11218 bao 25 0 120m 108m 1044 R 100 0.7 0:37.68 svm-train
11245 bao 25 0 120m 108m 1044 R 100 0.7 0:37.46 svm-train
11255 bao 25 0 122m 110m 1044 R 100 0.7 0:37.52 svm-train
11221 bao 25 0 121m 108m 1044 R 100 0.7 0:37.68 svm-train
11237 bao 25 0 120m 108m 1044 R 100 0.7 0:37.60 svm-train
11240 bao 25 0 120m 108m 1044 R 100 0.7 0:37.35 svm-train
11217 bao 25 0 121m 109m 1044 R 98 0.7 0:37.17 svm-train
5603 sge 15 0 14608 2140 1544 S 2 0.0 0:10.77 sge_execd
11444 bao 15 0 12716 1176 792 R 0 0.0 0:00.15 top
1 root 15 0 10312 628 532 S 0 0.0 0:01.91 init
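Incidentally, since the ssh commands are backgrounded children of the submit script itself, the polling loop at the end could be replaced by the shell's built-in wait, which blocks until all background children exit. A minimal sketch, using sleep as a stand-in for the real svm-train jobs:

```shell
#!/bin/sh
# Stand-ins for the backgrounded ssh/svm-train workers:
sleep 1 &
sleep 2 &
sleep 1 &
# wait blocks until every background child of this shell has exited,
# so the SGE job does not end before the workers do.
wait
status="all workers finished"
echo "$status"
```

This avoids both the 60-second polling granularity and the fragile ps-and-grep process counting.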

There are some other necessary options, like -q (which queue to submit your job to). You can refer to Sun's documentation: http://wikis.sun.com/display/GridEngine/Grid+Engine
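The same directives can also be given on the qsub command line instead of being embedded in the script; command-line flags take precedence over the embedded "#$" lines. A sketch (the queue and job name here are just examples):

```shell
# Submit the script to the "normal" queue under the name "ten.1-8";
# these flags override any matching "#$" directives in the script.
qsub -q normal -N ten.1-8 your_script
```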

To see the status of your jobs, simply type qstat. If a job is running, you will see something like this:

grendel:bao:/libsvm/10fold$ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
508 0.56000 ten.1-8 bao r 12/29/2008 17:37:12 development@compute-7-3.local 8

By default you will see two files, like ten.1-8.o493 and ten.1-8.e493, which hold the standard output and standard error of the job. If you open the *.o* file, you can see what would have been printed to the shell. Of course, clusters use a shared filesystem like NFS, so don't worry about the files generated by your own program; they will be visible from any node.

To kill a job, just type qdel followed by your job ID, e.g., qdel 508.

There is a GUI called qmon, which lets you monitor all jobs in a graphical window. Awesome! I feel like the commander of an army. If a job stays in pending status for too long, you can click the "Why?" button to check what the problem is. You can even submit jobs through the GUI and save all the parameters to a file for later use.
