Torque

TORQUE provides control over batch jobs and distributed computing resources. It is an advanced open-source product based on the original PBS project and incorporates the best of both community and professional development. It incorporates significant advances in the areas of scalability, reliability, and functionality and is currently in use at tens of thousands of leading government, academic, and commercial sites throughout the world. TORQUE may be freely used, modified, and distributed under the constraints of the included license.

Prerequisite

To retrieve your output and error files from Torque, you need a password-less SSH connection between nodes. If you have not set this up yet, execute the following commands, which create a public/private key pair so that when a node wants to transfer a file to your home folder, it does not require a password. After connecting to the polyps, enter:

ssh-keygen -N ""

Then just press ENTER at every prompt. After that, type the following commands:

touch ~/.ssh/authorized_keys2
chmod 600 ~/.ssh/authorized_keys2
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys2

Now, you will get the error log and output log files for your jobs.
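
To verify that the password-less connection works, you can try running a command on another node over SSH (polyp2 here is just an example node name):

ssh polyp2 hostname

If this prints the node's hostname without asking for a password, the setup is complete.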

Hardware

We have 16 nodes:

Nodes            CPUs                                              Memory   Notes
polyp1–polyp15   16 × AMD Opteron™ Processor 6128                  32 GB
polyp30          24 × Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz    128 GB   2x K80 (4 GPUs)

Configured resources per node, as provided in the Maui scheduler (this is pulled from Torque):

                      PROCS: 16  
                      MEM: 31G  
                      SWAP: 63G  
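
You can also query the resources Torque knows about for a particular node with the pbsnodes command (shown here for polyp30; the exact fields printed depend on the Torque version):

pbsnodes polyp30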

Submitting Jobs

Jobs can be submitted either using a submission file or directly from the command line. First we explain how this is done, and then we discuss the options.

Using submission script

We will create a file test.pbs

test.pbs
#PBS -N JobName
#PBS -e /home/mat614/TEST.err
#PBS -o /home/mat614/TEST.out
#PBS -l nodes=1:ppn=4 
#PBS -l pmem=2GB,vmem=1GB
#PBS -q batch
 
cd /home/mat614/
./test_code
sleep 60

The first few lines contain settings for the job. These are followed by the commands that actually run it. The job can be submitted by running qsub test.pbs
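
On success, qsub prints the identifier assigned to the new job (its exact form depends on the server's hostname). You can use that identifier with the monitoring commands described below, e.g.:

qsub test.pbs
qstat JOB_ID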

Directly submitting job

You do not need to write a submission script. However, by default you can only submit a bash script this way. Let's create a file myscript.sh which contains the following:

myscript.sh
cd /home/mat614/
./test_code

If you do not want to write a submission script, you can submit the job just by calling

qsub -N JobName -q batch -l nodes=1:ppn=2  myscript.sh

This runs the same code, but now the job parameters are set on the command line using option flags (e.g. -N JobName).

Options

Option                    Description
-q <queue>                Set the queue. Often you will use the standard queue, so there is no need to set this.
-V                        Pass all environment variables to the job.
-v var[=value]            Pass the specific environment variable 'var' to the job.
-b y                      Allow the command to be a binary file instead of a script.
-w e                      Verify options and abort if there is an error.
-N <jobname>              Name of the job. This is what you will see when you use qstat to check the status of your jobs.
-l resource_list          Specify resources.
-l h_rt=<hh:mm:ss>        Specify the maximum (hard) run time (hours, minutes, and seconds).
-l s_rt=<hh:mm:ss>        Specify the soft run time limit (hours, minutes, and seconds). Remember to set both s_rt and h_rt.
-cwd                      Run in the current working directory.
-wd <dir>                 Set the working directory for this job to <dir>.
-o <output_logfile>       Name of the output log file.
-e <error_logfile>        Name of the error log file.
-m ea                     Send email when the job ends or aborts.
-P <projectName>          Set the job's project.
-M <emailaddress>         Email address to send notifications to.
-t <start>-<end>:<incr>   Submit a job array with the given start index, end index, and increment.

You can find detailed information here.
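
For example, several of these options can be combined in a single submission script (the paths and the email address below are placeholders; replace them with your own):

options.pbs
#PBS -N MyExperiment
#PBS -q medium
#PBS -l nodes=1:ppn=2
#PBS -o /home/<username>/experiment.out
#PBS -e /home/<username>/experiment.err
#PBS -m ea
#PBS -M <emailaddress>
 
cd /home/<username>/
./test_code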

You need to use the option -V to pass environment variables, which is required to run solvers such as CPLEX, Gurobi, and MOSEK. See here.

Monitoring and Removing jobs

To show the jobs, use qstat or qstat -a. You can also see more details using

qstat -f

To show the jobs of a particular user, use qstat -u "mat614". To remove a job, use

qdel JOB_ID

Moreover, you can use the following commands:

qstat -r : provides the list of the running jobs
qstat -i : provides the list of the jobs which are in queue
qstat -n : provides the polyps node(s) which are executing each job
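
These flags can also be combined with the user filter; for example, to see which nodes are running a particular user's jobs:

qstat -n -u mat614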

Queues

We have a few queues; you can list them with qstat -Q:

Queue              Max    Tot   Ena   Str   Que   Run   Hld   Wat   Trn   Ext T   Cpt
----------------   ---   ----    --    --   ---   ---   ---   ---   ---   --- -   ---
MOSEK               48      0   yes   yes     0     0     0     0     0     0 E     0
AMPL                 8      0   yes   yes     0     0     0     0     0     0 E     0
long                30      1   yes   yes     0     0     0     0     0     0 E     0
gpu                  4      0   yes   yes     0     0     0     0     0     0 E     0
verylong            20      0   yes   yes     0     0     0     0     0     0 E     0
medium             100      0   yes   yes     0     0     0     0     0     0 E     0
coraverylong         0      0    no    no     0     0     0     0     0     0 E     0
special             24      0   yes   yes     0     0     0     0     0     0 E     0
batch                0      1   yes   yes     0     0     0     0     0     0 E     0
short                0      0   yes   yes     0     0     0     0     0     0 E     0
urgent               0      0    no    no     0     0     0     0     0     0 E     0
background           0      0   yes   yes     0     0     0     0     0     0 E     0
mediumlong          60      0   yes   yes     0     0     0     0     0     0 E     0

If you want to use AMPL or MOSEK, you have to use the AMPL or MOSEK queue, because we have a limited number of licenses for them.
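
For example, a MOSEK job would be submitted to its dedicated queue like this (submitFile.pbs being your own submission script; -V is explained below):

qsub -q MOSEK -V submitFile.pbs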

You can see the limits using the command qstat -f -Q

Queue        Wall Time   Max Queueable   Max Running   Max User run   Max User Queuable   Notes
urgent                                                                                    high priority - upon request
batch        01:00:00
short        02:00:00
medium       04:00:00                    100           40             200
mediumlong   24:00:00    1200            60
long         72:00:00                    30            20             900
verylong     240:00:00                   20            10             600
special      72:00:00                    24
background   unlimited                                                                    low priority
gpu                                      4             1                                  GPU node is not in Torque
AMPL                     200             8             6
MOSEK                    50              48

Notes:

  • The urgent queue has no limits, and its jobs have higher priority than all other jobs in the queues. Please be respectful of others if using this queue to complete time-sensitive or critical jobs.
  • The background queue has no limits, and its jobs have lower priority than all other jobs in the queues.

Examples

Submitting a Small or Large Memory Job

You can use the option -l pmem=size,vmem=size to limit memory usage of your job.

limited.sh
qsub -l pmem=4gb,vmem=4gb test.pbs

Sometimes your job needs more memory. You can choose a larger memory size with the same option:

large.pbs
qsub  -l pmem=20gb  test.pbs

To see what resources have been assigned by the batch queuing system, run the ulimit command (bash) or the limit command (csh):

pbs job submission command
qsub -I -l nodes=1:ppn=1 -l pmem=30GB,vmem=4GB -q short -N test -e TEST.err -o TEST.out -w e
ulimit
user@polyp13:~$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) 31457280
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 128344
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) 31457280
open files                      (-n) 65536
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 128344
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

For more information on the ulimit command review this link.

Running MATLAB

You just have to create a submission script which looks like this:

submission.pbs
#PBS -N JobName
#PBS -e /home/mat614/TEST.err
#PBS -o /home/mat614/TEST.out
#PBS -l nodes=1:ppn=4 
#PBS -l pmem=2GB,vmem=1GB
#PBS -q batch
 
/usr/local/matlab/latest/bin/matlab -nosplash -nodesktop < MY_MATLAB_SCRIPT.m

Use the -singleCompThread option to restrict Matlab to a single thread. A similar option may be needed for the program/solver you're using.
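
For example, to run the same script on a single thread:

/usr/local/matlab/latest/bin/matlab -nosplash -nodesktop -singleCompThread < MY_MATLAB_SCRIPT.m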

Running Solvers

In order to run solvers (such as Gurobi/CPLEX/Mosek/AMPL/…), you need to use the "-V" option (it is upper case), i.e.:

qsub -V submitFile.pbs 

This flag enables the solver to find necessary authentication information.

Interactive Jobs

If you do not care where your job runs, just use -I and do not specify any script to run.

qsub -I

If you want to run your job on a particular node, use -l nodes=polyp15

qsub -l nodes=polyp15  -I

and you will be running an interactive session on polyp15.

Using GPUs

If you want to use a GPU, you have to request that resource:

qsub -I -l nodes=1:ppn=1:gpus=1:default

However, you first need permission to use the GPUs (granted by Prof. Takac) – this is just a formality to allow certain users to use the video driver on polyp30.

If you are using TensorFlow in Python, you can set the limit on amount of GPU memory using:

config_tf = tf.ConfigProto()
config_tf.gpu_options.per_process_gpu_memory_fraction = p
# create the session with this config so the limit takes effect
sess = tf.Session(config=config_tf)

in which p is the fraction of GPU memory to use (a number between zero and one).

Running MPI and Parallel Jobs

mpi.pbs
# declare a name for this job to be sample_job
#PBS -N my_parallel_job
# request a total of 4 processors for this job (2 nodes and 2 processors per node)
#PBS -l nodes=2:ppn=2
# request 4 minutes of cpu time
#PBS -l cput=0:04:00
# combine PBS standard output and error files
#PBS -j oe
# mail is sent to you when the job starts and when it terminates or aborts
#PBS -m bea
# specify your email address
#PBS -M John.Smith@dartmouth.edu
#change to the directory where you submitted the job
cd $PBS_O_WORKDIR
#include the full path to the name of your MPI program
mpirun -machinefile $PBS_NODEFILE -np 4 /path_to_executable/program_name
exit 0

Allocating more than one CPU under PBS can be done in a number of ways, using the -l flag and the following resource descriptions:

  • nodes - specifies the number of separate nodes that should be allocated
  • ncpus - how many cpus each allocated node must have
  • ppn - how many processes to allocate for each node

The allocation made by pbs will be reflected in the contents of the nodefile, which can be accessed via the $PBS_NODEFILE environment variable.

The difference between ncpus and ppn is a bit subtle. ppn is used when you actually want to allocate multiple processes per node. ncpus is used to qualify the sort of nodes you want, and only secondarily to allocate multiple slots on a node. Some examples should help.

qsub -lnodes=2

would allocate 2 nodes, one process each. The nodefile would have two entries. Note that on bulldogc, the two entries might actually be on the same physical node, since PBS knows that each physical node has two cpus. So the nodefile might look like:

c1
c1

or

c1
c2

qsub -lnodes=2:ppn=2

would allocate 2 nodes, two processes each. The nodefile would have 4 entries, with 2 nodes listed twice each:

c1
c1
c2
c2

Contrast this to:

qsub -lnodes=2:ncpus=2

which would allocate 2 nodes that have the property that they contain two cpus. The nodefile would have 2 entries:

c1
c2

Mass Operations

Submitting multiple jobs

An easy way to submit multiple jobs via PBS is using a batch script. Suppose we would like to pass every file with the .mps extension inside a folder to our solver. We can write a PBS script such as

submit.pbs
cd /home/sec312/
/usr/local/cplex/bin/x86-64_linux/cplex ${FILENAME}

and a BASH script:

bashloop.sh
for f in dataset/*.mps
do
    qsub -q batch -v FILENAME=$f submit.pbs
done

Here, the option -v passes the variables we define (FILENAME in our example) into the PBS file. You can pass several variables by separating them with commas. DON'T use spaces between them.
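
For instance, to pass two variables at once (TIMELIMIT is just a hypothetical second variable that your script might read):

qsub -q batch -v FILENAME=$f,TIMELIMIT=3600 submit.pbs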

After having these two files, simply calling

./bashloop.sh

will submit all jobs into Torque.

Cancelling all jobs

You can call

qselect -u <username> -s R | xargs qdel

to cancel all of your running jobs.

qselect -u <username> | xargs qdel

will cancel all of your jobs (both running and queued).
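
Similarly, -s Q selects only jobs that are still waiting in the queue, so you can clear those without touching running jobs:

qselect -u <username> -s Q | xargs qdel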

Advanced

The qsub command will pass certain environment variables in the Variable_List attribute of the job. These variables will be available to the job. The value for the following variables will be taken from the environment of the qsub command:

  • HOME (the path to your home directory)
  • LANG (which language you are using)
  • LOGNAME (the name that you logged in with)
  • PATH (standard path to executables)
  • MAIL (location of the user's mail file)
  • SHELL (command shell, e.g. bash, sh, zsh, csh, etc.)
  • TZ (time zone)

These values will be assigned to a new name, which is the original name prefixed with the string "PBS_O_". For example, the job will have access to an environment variable named PBS_O_HOME which has the value of the variable HOME in the qsub command environment. In addition to these standard environment variables, there are additional environment variables available to the job.

  • PBS_O_HOST (the name of the host upon which the qsub command is running)
  • PBS_SERVER (the hostname of the pbs_server which qsub submits the job to)
  • PBS_O_QUEUE (the name of the original queue to which the job was submitted)
  • PBS_O_WORKDIR (the absolute path of the current working directory of the qsub command)
  • PBS_ARRAYID (each member of a job array is assigned a unique identifier)
  • PBS_ENVIRONMENT (set to PBS_BATCH to indicate the job is a batch job, or to PBS_INTERACTIVE to indicate the job is a PBS interactive job)
  • PBS_JOBID (the job identifier assigned to the job by the batch system)
  • PBS_JOBNAME (the job name supplied by the user)
  • PBS_NODEFILE (the name of the file containing the list of nodes assigned to the job)
  • PBS_QUEUE (the name of the queue from which the job is executed)
  • PBS_WALLTIME (the walltime requested by the user or default walltime allotted by the scheduler)
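
A small sketch of a job script that prints some of these variables (the echo lines are purely illustrative; any of the variables above can be inspected the same way):

env_demo.pbs
#PBS -N env_demo
 
echo "Job $PBS_JOBID ($PBS_JOBNAME) was submitted from $PBS_O_HOST"
echo "Queue: $PBS_QUEUE, working directory: $PBS_O_WORKDIR"
echo "Nodes assigned to this job:"
cat $PBS_NODEFILE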

Tensorflow with GPU

To use TensorFlow with a specific GPU, say GPU 1, you can simply set

export CUDA_VISIBLE_DEVICES=1

and then schedule your jobs with Torque to perform experiments on GPU 1.
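
Putting this together, a GPU job script might look like the following sketch (my_tf_script.py is a placeholder for your own program):

gpu_job.pbs
#PBS -N gpu_job
#PBS -q gpu
#PBS -l nodes=1:ppn=1:gpus=1:default
 
cd $PBS_O_WORKDIR
export CUDA_VISIBLE_DEVICES=1
python my_tf_script.py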

MOAB Scheduler

PBS Torque is used to schedule and run jobs on our cluster. Two PBS processes are required to run jobs. On the PBS server, the pbs_server process runs to accept your job and add it to the queue. It will also dispatch the job to the nodes to run under the pbs_mom process.

Useful MOAB Commands

1. showq - Displays information about active, eligible, blocked, and/or recently completed jobs.

2. showstart - Displays the estimated start time of a job based on a number of analysis types.

3. checkjob - Allows end users to view the status of their own jobs.
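
Typical usage of these commands (JOB_ID as in the qdel example above):

showq -u <username>
showstart JOB_ID
checkjob JOB_ID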

Useful External Resources

MSU - Understand job scheduler and resource manager - Describes the batch queuing system and has some useful diagrams explaining the interrelationship between the scheduler and the server.

WVU - Job Submission (Torque and Moab) - Lists frequently used commands for Torque and Moab. Also includes information on Prologue and Epilogue scripts.

Moab-TORQUE/PBS Integration Guide - Guide for Administrators and integrators on the deployment and integration of PBS Torque and Moab into a computer system

Torque Notes - Information about the processes involved in using torque and debugging information.
