====== CONDOR ======

This page is about our retired job scheduling system **CONDOR**. Check [[tutorial:torque|Torque]] to schedule jobs on Polyps.

===== What is CONDOR =====

CONDOR is a job manager that schedules computational jobs. See [[https://researchcomputing.lehigh.edu/running-jobs/condor|this page]] for an introduction to CONDOR.

===== Using CONDOR =====

==== Submitting A Single Job ====

To submit a job via CONDOR, you need to create a .sub file. This .sub file must name the program that you will execute (e.g., matlab, cplex, etc.) along with the arguments for the program (such as your file to be executed). It is an automated way to run programs.

=== A case study: Matlab ===

Suppose that we want to run MATLAB code on Polyps. Here is an example .sub file that submits the MATLAB file 'test.m' to CONDOR and saves the output of the code to 'out.txt', while CONDOR errors and logs are stored in 'error.txt' and 'log.txt', respectively.

<code>
# Specify the executable software, e.g. mathematica, mosek, etc.
Executable = /usr/local/matlab/latest/bin/matlab
Universe = vanilla
getenv = true
# Specify argument file
arguments = -nosplash -nodesktop -logfile test.log -r test
#request_cpus = 16
#request_memory = 2
# name output file
output = ./out.txt
# name error file
error = ./error.txt
# name log file
log = ./log.txt
transfer_executable = false
# Submit to queue
Queue
</code>

After making sure all the files you specified exist in the correct directory, use

<code>
condor_submit myexp.sub
</code>

to submit the file to CONDOR.

You can find the "Executable" of a program by calling the ''which //program//'' command.

Frequently used executables on Polyps:
  * Matlab: /usr/local/matlab/latest/bin/matlab
  * Cplex: /usr/local/cplex/bin/x86-64_linux/cplex
  * Mosek: /usr/local/mosek/7.1/tools/platform/linux64x86/bin/mosek
  * Ampl: /usr/local/ampl/ampl

==== Submitting Multiple Jobs ====

There are multiple ways to submit a set of experiments (multiple jobs).
Here are two different ways to achieve the same result.

=== 1. Via Bash Script ===

A simple example to demonstrate the use of nested loops in multiple-job submission. In this example, the executable "test" takes two integers as its arguments (inputs), and "test" is in the same directory as the submit file. The goal is to run "test" with two parameters i and j in the following nested loop:

<code>
for(int i=0; i< ilimit; ++i)
{
    for(int j=0; j< jlimit; ++j)
    {
        test -i -j;
    }
}
</code>

The following example demonstrates a two-layer nested loop with "ilimit=5, jlimit=10". Nested loops with more than two layers can be achieved with the same logic.

<code>
getenv = TRUE
Universe = vanilla
## ilimit=5, jlimit=10
## N=(ilimit)*(jlimit)=50
## ilimit is implicitly included in the "N"
jlimit = 10
N = 50
I = $$([ $(Process) / $(jlimit) ])
J = $$([ $(Process) % $(jlimit) ])
Executable = test
arguments = "$(I) $(J)"
output = test$(Process).txt
Error = test.err
Log = test.log
queue $(N)
</code>

Output correspondence:

<code>
## test -i=0 -j=0 -> test0.txt
## test -i=0 -j=1 -> test1.txt
## ......
## test -i=4 -j=9 -> test49.txt
</code>

----

A simple example to demonstrate the use of variables in multiple-job submission. In this example, the executable "test" takes a single integer as its argument (input), and "test" is in the same directory as the submit file. The executable "test" will be run 5 times with inputs 0 to 4, respectively. The corresponding output files are ''test0.txt'' to ''test4.txt''. ''$(Process)'' is a macro that supplies the process ID, 0 to 4 in this case.
It can be used as an iteration counter.

<code>
getenv = TRUE
Universe = vanilla

Executable = test
arguments = $(Process)
output = test$(Process).txt
Error = test.err
Log = test.log
queue 5

## Executable "test" will be run with input 0 to 4
## A variable N is defined to specify the number of jobs
N = 5
Executable = test
arguments = $(Process)
output = test$(Process).txt
Error = test.err
Log = test.log
queue $(N)

## Executable "test" will be run with input 5 to 9
## The corresponding output files are "test0.txt" to "test4.txt"
## Variable I is defined based on the macro $(Process)
I = $$([ $(Process)+5 ])
Executable = test
arguments = $(I)
output = test$(Process).txt
Error = test.err
Log = test.log
queue 5
</code>

=== 2. Via Python (Script) ===

You can reuse the same executable, options, etc. and change only some of them to create new jobs. When you submit your file using ''condor_submit'', all of the jobs are queued at the same time. For your experiments, you can create a script to generate multiple jobs. Below is an example Python script that generates multiple experiments with a changing argument.
<code python>
# This create.py script creates a condor submission file (condor.sub)
# for the same problem with different arguments

# Settings shared by every job
common_command = (
    'Executable = ../test/portfolio\n'
    'Universe = vanilla\n'
    'getenv = true\n'
    'transfer_executable = false\n\n'
)

# Open the file and write the common part
with open('condor.sub', 'w') as cfile:
    cfile.write(common_command)

    # Loop over the values of the argument, give each job its own output file,
    # then put it in the queue (range works in Python 2 and 3; xrange is Python 2 only)
    for a in range(5, 8):
        run_command = (
            'arguments = -a %d\n'
            'output = out.%d.txt\n'
            'queue 1\n\n' % (a, a)
        )
        cfile.write(run_command)
</code>

This script will generate the following condor file:

<code>
Executable = ../test/portfolio
Universe = vanilla
getenv = true
transfer_executable = false

arguments = -a 5
output = out.5.txt
queue 1

arguments = -a 6
output = out.6.txt
queue 1

arguments = -a 7
output = out.7.txt
queue 1
</code>

Be sure to provide an output argument in your Condor submissions. Otherwise, you may not be able to see the results of your tasks.

==== Checking Jobs ====

To check job progress, use the following commands:

<code>
condor_q -global    # this checks all the jobs on condor
condor_q -run       # this checks all running jobs
condor_q userid     # this checks all jobs under a specific user name
</code>

If you think your jobs are not being processed, you can debug and see the reasons by calling the ''condor_q //userid// -analyze'' command.

==== Removing Jobs ====

First find the ID of the job you want to terminate:

<code>
condor_q userid
</code>

Then call:

<code>
condor_rm ID
</code>

Example: I call ''condor_q sec312'' to list all jobs that belong to my username. This gives a list similar to this:

<code>
-- Submitter: polyp1.ie.lehigh.edu : <128.180.35.200:50671> : polyp1.ie.lehigh.edu
 ID       OWNER    SUBMITTED     RUN_TIME   ST PRI SIZE CMD
42989.0   sec312   10/25 19:56   0+00:00:29 R  0   0.0  symphony -F air04.
42989.1   sec312   10/25 19:56   0+00:00:29 R  0   0.0  symphony -F air05.
42989.5   sec312   10/25 19:56   0+00:00:28 R  0   0.0  symphony -F dsbmip
</code>

Now let's say I want to terminate 42989.5. I call ''condor_rm 42989.5''.
CONDOR confirms by saying\\
''Job 42989.5 marked for removal''

You can remove all your jobs using the command ''condor_rm username''.

===== Frequently Used CONDOR Commands =====

A summary of frequently used commands in CONDOR:

^ Command ^ Action ^ Basic Usage ^ Example ^
| ''condor_submit'' | submit a job | condor_submit [submit file] | $ condor_submit job.condor |
| ''condor_q'' | show status of jobs | condor_q [cluster] | $ condor_q 1170 |
| ''condor_rm'' | remove jobs from the queue | condor_rm [cluster] | $ condor_rm 1170 |
| ''condor_rm //userid//'' | remove all jobs of user | | |

[[http://www.rcc.uh.edu/hpc-docs/134-basic-condor-commands.html|Source]]

===== Some other CONDOR commands =====

^ Command ^ Action ^ Info ^
| ''condor_userprio'' | show the user priority | condor_userprio |
| ''condor_status'' | show the current status of CONDOR nodes | |

===== Running MPI Jobs with Condor =====

FIXME To submit MPI jobs to our condor pool, see Dr. Takac's [[http://polyps.ie.lehigh.edu/mpi|MPI tutorial]].

===== Using AMPL with Condor =====

We have a limited AMPL license installed in the COR@L Lab. The license allows at most 10 simultaneous AMPL jobs. If you use AMPL in your experiments, you can let condor know about it, and condor will schedule all jobs that need AMPL with the license limit in mind. To do this, add the following line to your condor submission file.

''concurrency_limits = AMPL''

===== Condor Jobs Memory Usage =====

Please check the status of your condor jobs regularly, especially their memory usage. Each polyp node has 16 processors and 32 GB of memory. This means each process gets 2 GB of memory on average. When a polyp node runs out of memory, it starts using the hard drive (swap) as memory, but reading from and writing to hard drives is roughly 1000 times slower. So if your jobs use large amounts of memory and the polyp node processing them runs out of memory, do not expect your jobs to terminate.
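The memory arithmetic above can be sketched in a few lines of Python. This is only an illustration: the node figures (16 processors, 32 GB) come from the text, while the function names and the example job sizes are hypothetical, and the sketch ignores memory used by the operating system and by other users' jobs.

```python
# Node figures are taken from the text; everything else is illustrative.
NODE_CPUS = 16        # processors per polyp node
NODE_MEMORY_GB = 32   # memory per polyp node, in GB


def memory_per_process_gb(cpus=NODE_CPUS, memory_gb=NODE_MEMORY_GB):
    """Average memory available to each process when every processor is busy."""
    return memory_gb / cpus


def max_jobs_without_swap(job_memory_gb, cpus=NODE_CPUS, memory_gb=NODE_MEMORY_GB):
    """Largest number of equal-sized jobs that fit in a node's RAM,
    capped by the number of processors (hypothetical helper)."""
    if job_memory_gb <= 0:
        raise ValueError("job_memory_gb must be positive")
    return min(cpus, int(memory_gb // job_memory_gb))


if __name__ == "__main__":
    print(memory_per_process_gb())       # 2.0
    print(max_jobs_without_swap(2))      # 16
    print(max_jobs_without_swap(5))      # 6
```

For example, jobs that each need about 5 GB should not occupy more than 6 slots on one node, even though the node has 16 processors.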
Tips:
  * You can see the memory usage of your job using the ''condor_q'' command (the 7th column gives memory usage in MB).
  * You can check which node your job is running on using ''condor_q -run''.
  * You can check the memory status of a node using ''ssh polyp6 "vmstat -s"''.
  * For more memory checking commands see http://www.binarytides.com/linux-command-check-memory-usage/, or remember that Google is your friend.

**Your job might get killed if it is using swap. Do not waste your system administrators' time by forcing them to police the condor jobs. Just control your jobs and submit jobs that are reasonable.**
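As a companion to the tips above, here is a minimal Python sketch that reads the text printed by ''vmstat -s'' and reports how much swap is in use. It assumes the usual output layout of one ''<number> K <label>'' entry per line (for example ''total swap'' and ''used swap''). The function names are hypothetical, and the sample numbers in the script are made up for illustration.

```python
def parse_vmstat_s(text):
    """Map each '<number> K <label>' line of `vmstat -s` output to label -> KiB.

    Lines that do not follow this layout (e.g. CPU tick counters) are skipped.
    """
    stats = {}
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0].isdigit() and parts[1] == "K":
            stats[parts[2].strip()] = int(parts[0])
    return stats


def swap_used_fraction(stats):
    """Fraction of swap in use, or 0.0 when no swap is reported."""
    total = stats.get("total swap", 0)
    if total == 0:
        return 0.0
    return stats.get("used swap", 0) / total


if __name__ == "__main__":
    # Made-up sample in the assumed layout, not real output from a polyp node.
    sample = (
        "     32000000 K total memory\n"
        "      4000000 K total swap\n"
        "      1000000 K used swap\n"
    )
    stats = parse_vmstat_s(sample)
    print("swap in use: %.0f%%" % (100 * swap_used_fraction(stats)))
```

To use it on a real node, replace the sample string with the captured output of ''ssh polyp6 "vmstat -s"''. A swap fraction well above zero is a sign that jobs on that node are using more memory than it has.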