TORQUE provides control over batch jobs and distributed computing resources. It is an advanced open-source product based on the original PBS project and incorporates the best of both community and professional development. It incorporates significant advances in the areas of scalability, reliability, and functionality, and is currently in use at many leading government, academic, and commercial sites.

===== Prerequisite =====
In order to retrieve your output and error files in Torque, you need a password-less SSH connection between the nodes. If you have not set it up before, execute the following commands. These commands create a public and private key pair so that when a node wants to transfer a file to your home folder, it does not require your password.
After connecting to the polyps, enter:

<code bash>
ssh-keygen -N ""
</code>

Then just press ENTER at every prompt. After that, type the following commands:

<code bash>
touch ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
</code>

Now you will get the error log and output log files for your jobs.
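To confirm that the password-less connection works, try running a command on another node over SSH (''polyp5'' here is just an example node name); it should print the remote hostname without asking for a password:

<code bash>
ssh polyp5 hostname
</code>
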
===== Hardware =====

^ Node(s) ^ CPU ^ Memory ^ GPU ^
| polyp1--polyp15 | 16 cores | 31 GB | |
| polyp30 | 24x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz | 128 GB | 2x K80 (4 GPUs) |

Configured resources for the compute nodes as provided by the Maui scheduler (pulled from Torque):
<code>
PROCS: 16
MEM:   31G
SWAP:  63G
</code>

===== Submitting Jobs =====

A job is described by a submission script containing ''#PBS'' directives (the paths and program name below are placeholders):

<code bash submission.pbs>
#PBS -N JobName
#PBS -e /path/to/error.log
#PBS -o /path/to/output.log
#PBS -l nodes=1:ppn=1
#PBS -l pmem=2GB
#PBS -q batch

cd $PBS_O_WORKDIR
./your_program
</code>

and it is submitted with ''qsub submission.pbs''.
If you do not want to write a submission script, you can do it just by piping the commands to ''qsub'':
<code bash>
echo "./your_program" | qsub -N JobName -l nodes=1:ppn=1,pmem=2GB -q batch
</code>
This runs the same code, but sets the job parameters using ''qsub'' command-line options instead of ''#PBS'' directives; the most common options are listed below.
===== Options =====

^ Option ^ Description ^
| ''-N name'' | name of the job |
| ''-e path'' | path of the standard error file |
| ''-o path'' | path of the standard output file |
| ''-q queue'' | queue the job is submitted to |
| ''-l resource_list'' | resources required by the job, e.g. ''nodes'', ''ppn'', ''pmem'', ''walltime'' |
| ''-m abe'' | send mail when the job aborts (''a''), begins (''b''), or ends (''e'') |
| ''-M email'' | e-mail address(es) that receive job mail |
| ''-j oe'' | join standard output and standard error into one file |
| ''-t range'' | submit a job array, e.g. ''-t 1-10'' |
| ''-v list'' | export the listed environment variables to the job |
| ''-V'' | export all environment variables of your session to the job |
| ''-I'' | run the job interactively |
| ''-d path'' | working directory of the job |
| ''-W attributes'' | additional job attributes, e.g. ''x=GRES:gurobi'' |
| ''-p priority'' | priority of the job |
| ''-r y/n'' | whether the job is rerunnable |
| ''-S shell'' | shell that interprets the job script |

You can find detailed information about these options in the Torque ''qsub'' documentation.

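For example, several of the options above can be combined on a single command line (the job name and e-mail address are placeholders):

<code bash>
qsub -N test_run -q short -l nodes=1:ppn=2,walltime=01:00:00 -m abe -M you@example.edu submission.pbs
</code>
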
<note tip>You need to use option ''-W x=GRES:...'' when your job uses a licensed solver; see Running Solvers below.</note>
===== Monitoring and Removing Jobs =====

To show the jobs, use ''qstat'':
<code shell>
qstat
</code>
To show the jobs of a particular user, use ''qstat -u USERNAME''. To remove a job from the queue, use ''qdel'':
<code shell>
qdel JOB_ID
</code>

Moreover, you can use the following commands for more detail:
<code shell>
qstat -a         # all jobs, with node and time information
qstat -f JOB_ID  # full status information for a single job
showq            # the scheduler's view of running, idle, and blocked jobs
</code>
==== Queues ====

We have the following queues: ''MOSEK'', ''AMPL'', ''long'', ''gpu'', ''verylong'', ''medium'', ''coraverylong'', ''special'', ''batch'', ''short'', ''urgent'', ''background'', and ''mediumlong''. You can list them, together with the number of jobs currently in each, by running:
<code shell>
qstat -Q
</code>

If you want to use AMPL or MOSEK, you have to submit to the ''AMPL'' or ''MOSEK'' queue, respectively, because we have a limited number of licenses for them.

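For example, a MOSEK run can be submitted like this (''mosek_job.pbs'' is a hypothetical script name):

<code bash>
qsub -q MOSEK mosek_job.pbs
</code>
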
You can see the limits of each queue using the command ''qstat -q''. The main limits are:

^ Queue ^ Wall Time ^ Max Queueable ^ Max Running ^
| urgent | | | |
| batch | 01:00:00 | | |
| short | 02:00:00 | | |
| medium | | | |
| mediumlong | | | |
| long | 72:00:00 | | |
| verylong | | | |
| special | | | |
| background | | | |
| gpu | | | |
| AMPL | | 8 | 6 |
| MOSEK | | | |

Notes:
  * The ''urgent'' queue has no limits, and its jobs have higher priority than all other jobs in the queues. Please be respectful of others if using this queue to complete time-sensitive or critical jobs.
  * The ''background'' queue has no limits, and its jobs have lower priority than all other jobs in the queues.
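
For example, a job expected to run for two days does not fit within the ''batch'' or ''short'' limits, so it should be sent to a queue with a large enough wall-time limit (the 48-hour request here is illustrative):

<code bash>
qsub -q long -l walltime=48:00:00 submission.pbs
</code>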
===== Examples =====

==== Submitting a Small or Large Memory Job ====

You can use the option ''-l pmem=...'' to request a specific amount of memory per process. For example, to limit a job to 4 GB:

<code bash limited.sh>
qsub -l pmem=4gb,pvmem=4gb submission.pbs
</code>

Sometimes your job needs more memory. You can choose a larger memory size with the same option, e.g. inside the script:

<code bash large.pbs>
#PBS -l pmem=16GB
#PBS -l pvmem=16GB
</code>

To see what resources have been assigned by the batch queuing system, run the ''ulimit'' command (bash) or the ''limit'' command (csh) from within the job. For example, after submitting ''qsub -I -l pmem=4gb'', running ''ulimit -a'' prints something like:
<code>
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 127420
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) 4194304
open files                      (-n) 65536
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 127420
virtual memory          (kbytes, -v) 4194304
file locks                      (-x) unlimited
</code>
Here ''max memory size'' and ''virtual memory'' reflect the 4 GB (4194304 kB) request.
==== Running MATLAB ====

You just have to create a submission script that looks like this:
<code bash submission.pbs>
#PBS -N JobName
#PBS -e /path/to/error.log
#PBS -o /path/to/output.log
#PBS -l nodes=1:ppn=1
#PBS -l pmem=2GB
#PBS -q batch

matlab -nodisplay -singleCompThread -r "my_script; exit"
</code>

<note tip>Use **-singleCompThread** when starting MATLAB; otherwise MATLAB will try to use all of the cores on the node even though your job requested only one.</note>

==== Running Solvers ====

In order to run solvers (such as Gurobi or CPLEX) you should add the ''-W x=GRES:...'' flag, naming the solver, to your ''qsub'' command:

<code bash>
qsub -W x=GRES:gurobi submission.pbs
</code>

This flag enables the solver to find the necessary authentication information.

==== Interactive Jobs ====

If you do not care where you run your job, just use the ''-I'' option:
<code bash>
qsub -I
</code>

If you want to run your job on a particular node, just request that node in the resource list:
<code bash>
qsub -I -l nodes=polyp15
</code>
and you will be running an interactive session on polyp15.

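You can also combine the interactive flag with other resource requests; the core and memory numbers below are only an illustration:

<code bash>
qsub -I -q short -l nodes=polyp15:ppn=4,pmem=4gb
</code>
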
==== Using GPUs ====

If you want to use a GPU, you have to request that resource explicitly:
<code bash>
qsub -I -l nodes=polyp30:gpus=1
</code>

However, you first need permission to use the GPUs (given by Prof. Takac); this is just a formality that allows certain users to use the video driver on polyp30.

If you are using TensorFlow in Python, you can set a limit on the amount of GPU memory using:
<code python>
import tensorflow as tf
config_tf = tf.ConfigProto()
config_tf.gpu_options.per_process_gpu_memory_fraction = p
session = tf.Session(config=config_tf)
</code>
in which **//p//** is the fraction of GPU memory to use (a number between zero and one).

==== Running MPI and Parallel Jobs ====

<code bash mpi.pbs>
# declare a name for this job
#PBS -N my_parallel_job
# request a total of 4 processors for this job (2 nodes and 2 processors per node)
#PBS -l nodes=2:ppn=2
# request 4 hours of cpu time
#PBS -l cput=4:00:00
# combine PBS standard output and error files
#PBS -j oe
# mail is sent to you when the job starts and when it terminates or aborts
#PBS -m bea
# specify your email address
#PBS -M John.Smith@dartmouth.edu
# change to the directory where you submitted the job
cd $PBS_O_WORKDIR
# include the full path to the name of your MPI program
mpirun -machinefile $PBS_NODEFILE -np 4 /path/to/your_mpi_program
exit 0
</code>

Allocating more than one CPU under PBS can be done in a number of ways, using the following resource specifications (examples follow the list):

  * nodes - specifies the number of separate nodes that should be allocated
  * ncpus - how many cpus each allocated node must have
  * ppn - how many processes to allocate for each node

The allocation made by PBS will be reflected in the contents of the nodefile, which can be accessed inside the job via the ''$PBS_NODEFILE'' environment variable.

The difference between ncpus and ppn is a bit subtle. ppn is used when you actually want to allocate multiple processes per node. ncpus is used to qualify the sort of nodes you want, and only secondarily to allocate multiple slots on a node's cpus. Some examples should help.

<code bash>
qsub -lnodes=2 job.pbs
</code>
would allocate 2 nodes, one process each. The nodefile would have two entries. Note that the two entries might actually be on the same physical node, since PBS knows that each physical node has multiple cpus. So the nodefile might look like:
<code>
c1
c1
</code>
or
<code>
c1
c2
</code>
<code bash>
qsub -lnodes=2:ppn=2 job.pbs
</code>
would allocate 2 nodes, two processes each. The nodefile would have 4 entries, with the 2 nodes listed twice each:
<code>
c1
c1
c2
c2
</code>
Contrast this to:
<code bash>
qsub -lnodes=2:ncpus=2 job.pbs
</code>
which would allocate 2 nodes that have the property that they contain two cpus. The nodefile would have 2 entries:
<code>
c1
c2
</code>

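Inside a running job you can check what you actually received by inspecting the nodefile; this is a small diagnostic sketch, not part of the original examples:

<code bash>
# unique hosts and how many slots were allocated on each
sort $PBS_NODEFILE | uniq -c
# total number of slots (useful as the -np argument for mpirun)
wc -l < $PBS_NODEFILE
</code>
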
===== Mass Operations =====

==== Submitting Multiple Jobs ====
An easy way to submit multiple jobs via PBS is using a batch script. Suppose we would like to pass every file with the MPS extension inside a folder to our solver. We can write a PBS script such as
<code bash submit.pbs>
cd $PBS_O_WORKDIR
/path/to/solver $FILENAME
</code>
and a BASH script:
<code bash bashloop.sh>
for f in dataset/*.mps
do
    qsub -q batch -v FILENAME=$f submit.pbs
done
</code>
Here, option ''-v'' passes the variable ''FILENAME'' into the job's environment, so ''submit.pbs'' knows which file to solve.

After having these two files, simply calling
<code bash>
./bashloop.sh
</code>
will submit all jobs into Torque.
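
An alternative to the shell loop is a Torque job array: a single ''qsub'' creates many sub-jobs, and each one picks its input using ''$PBS_ARRAYID'' (see the Advanced section below). The range and file-selection logic here are a sketch, assuming the same ''dataset'' folder layout:

<code bash array.pbs>
#PBS -N mps_array
#PBS -q batch
#PBS -t 1-10

cd $PBS_O_WORKDIR
# pick the PBS_ARRAYID-th .mps file from the dataset folder
FILENAME=$(ls dataset/*.mps | sed -n "${PBS_ARRAYID}p")
/path/to/solver $FILENAME
</code>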

==== Cancelling All Jobs ====
You can call
<code bash>
qselect -u <username> -s R | xargs qdel
</code>
to cancel all of your running jobs.

<code bash>
qselect -u <username> | xargs qdel
</code>
will cancel all of your jobs (both running and queued).

===== Advanced =====

The qsub command will pass certain environment variables in the Variable_List attribute of the job. These variables will be available to the job. The value for the following variables will be taken from the environment of the qsub command:
  * **HOME** (the path to your home directory)
  * **LANG** (which language you are using)
  * **LOGNAME** (the name that you logged in with)
  * **PATH** (standard path to executables)
  * **MAIL** (location of the user's mail file)
  * **SHELL** (command shell, e.g. bash or csh)
  * **TZ** (time zone)
These values will be assigned to a new name, which is the current name prefixed with the string "PBS_O_". For example, the job will have access to an environment variable named ''PBS_O_HOME'' that holds the value of ''HOME'' from the qsub command environment. In addition, the following variables are available to the job:
  * **PBS_O_HOST** (the name of the host upon which the qsub command is running)
  * **PBS_SERVER** (the hostname of the pbs_server to which qsub submits the job)
  * **PBS_O_QUEUE** (the name of the original queue to which the job was submitted)
  * **PBS_O_WORKDIR** (the absolute path of the current working directory of the qsub command)
  * **PBS_ARRAYID** (each member of a job array is assigned a unique identifier)
  * **PBS_ENVIRONMENT** (set to PBS_BATCH to indicate the job is a batch job, or to PBS_INTERACTIVE to indicate the job is a PBS interactive job)
  * **PBS_JOBID** (the job identifier assigned to the job by the batch system)
  * **PBS_JOBNAME** (the job name supplied by the user)
  * **PBS_NODEFILE** (the name of the file containing the list of nodes assigned to the job)
  * **PBS_QUEUE** (the name of the queue from which the job is executed)
  * **PBS_WALLTIME** (the walltime requested by the user or the default walltime allotted by the scheduler)

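A quick way to see these variables in action is a short job that just echoes them; everything used here comes from the list above:

<code bash env_demo.pbs>
#PBS -N env_demo
#PBS -j oe

echo "Job $PBS_JOBID ($PBS_JOBNAME) was submitted from $PBS_O_HOST"
echo "Queue: $PBS_QUEUE, submission directory: $PBS_O_WORKDIR"
echo "Nodes assigned to this job:"
cat $PBS_NODEFILE
</code>
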
==== TensorFlow with GPU ====
To use TensorFlow with a specific GPU, say GPU 1, you can simply set
<code bash>
export CUDA_VISIBLE_DEVICES=1
</code>
and then schedule your jobs with Torque to perform experiments on GPU 1.

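To double-check which devices TensorFlow can actually see from within your job, you can list them; this one-liner uses the TF 1.x API that the memory-fraction example above assumes:

<code bash>
# prints only the GPU devices visible after CUDA_VISIBLE_DEVICES filtering
python -c "from tensorflow.python.client import device_lib; \
print([d.name for d in device_lib.list_local_devices() if d.device_type == 'GPU'])"
</code>
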
====== MOAB Scheduler ======
PBS Torque is used to schedule and run jobs on our cluster. Two PBS processes are required to run a job: on the PBS server, the pbs_server process accepts your job and adds it to the queue; it then dispatches the job to the nodes, where it runs under the pbs_mom process.

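You can check that the server side is up, and see a one-line summary of its state, with the standard ''qstat'' server query:

<code bash>
qstat -B
</code>
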
==== Useful MOAB Commands ====
1. ''showq'' displays information about active, eligible, and blocked jobs as seen by the scheduler.

2. ''checkjob'' displays detailed status information for a specified job.

3. ''showstart'' shows an estimate of when a queued job will start.