DAGMan
DAGMan is an HTCondor tool for organizing multiple jobs into workflows. A DAGMan workflow submits jobs automatically in a particular order, so that jobs which depend on others only start once those have completed.
Once you are familiar with creating, submitting, and monitoring HTCondor jobs, writing DAGMan workflows is relatively easy. The official documentation comprehensively describes the overall structure of the dag-file and the available scripting options.
A simple dag-file consists of a list of nodes (jobs plus optional pre- and post-processing scripts). The dependencies between nodes are declared with PARENT JobName CHILD JobName statements.
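For instance, a minimal dag-file chaining three nodes one after another could look like this (node and submit-file names here are made up for illustration):

```
# chain.dag (hypothetical): run A, then B, then C
JOB A step_a.submit
JOB B step_b.submit
JOB C step_c.submit

PARENT A CHILD B
PARENT B CHILD C
```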
Example
A simple use-case for DAGMan is running a set of jobs one after another, without having to submit each job manually once the previous one finishes (for example, when importing DICOMs using datalad hirni-import-dcm or preprocessing multiple subjects with fMRIPrep in sequence; at the time of writing, neither of these runs well in parallel).
To achieve this, two files are needed: a submit-file (e.g. my_job.submit) and a dag-file (e.g. my_workflow.dag).
The submit-file is a regular HTCondor submit-file, except that it may contain additional variables (macros) whose values are set by the dag-file on submission:
```
#### The submit file - my_job.submit

# The environment
universe       = vanilla
getenv         = True
request_cpus   = $(req_cpu)
request_memory = $(req_mem)

# Execution
initialdir = $ENV(HOME)
executable = $ENV(HOME)/my_nature_worthy_analysis.sh

# Job 1
# NOTE: arguments 2 and 3 are request_cpus and request_memory, respectively
arguments = "$(subject) $(req_cpu) $(req_mem)"

log    = $ENV(HOME)/logs/fortune_$(Cluster).$(Process).log
output = $ENV(HOME)/logs/fortune_$(Cluster).$(Process).out
error  = $ENV(HOME)/logs/fortune_$(Cluster).$(Process).err

Queue
```
The above submit-file uses three additional, non-standard variables: req_cpu, req_mem, and subject. The first two specify dynamically how many resources the job requests, and are also passed on to the analysis script via the job's arguments. The variable subject is only passed on to the analysis script.
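To illustrate how these arguments arrive on the script side, here is a minimal sketch, written as a shell function standing in for my_nature_worthy_analysis.sh (the echo is only a placeholder for the actual analysis):

```shell
#!/bin/bash
# Hypothetical sketch of what my_nature_worthy_analysis.sh might do with
# the three arguments defined in the submit-file.
run_analysis() {
    local subject="$1"   # node name from the dag-file, e.g. "001"
    local n_cpus="$2"    # same value as request_cpus
    local mem_mb="$3"    # same value as request_memory

    # Placeholder for the actual analysis:
    echo "subject=${subject} cpus=${n_cpus} mem=${mem_mb}"
}

run_analysis 001 2 1000
```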
In the associated dag-file, these variables can be set for all nodes at once (VARS ALL_NODES ..) or per node (VARS JobName ..). The example below sets the resource requirements for all jobs, while the node names (a.k.a. JobName: 001, 002, and 003) are used to set the subject numbers dynamically.
```
#### my_workflow.dag

JOB 001 my_job.submit
JOB 002 my_job.submit
JOB 003 my_job.submit

VARS ALL_NODES req_mem="1000"
VARS ALL_NODES req_cpu="2"
VARS ALL_NODES subject="$(JOB)"

CATEGORY ALL_NODES DummyCategory
MAXJOBS DummyCategory 1
```
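Alternatively, the values can be set per node via the JobName; a hypothetical variant of the same file, giving one subject more memory, could read:

```
VARS 001 req_mem="1000" req_cpu="2" subject="001"
VARS 002 req_mem="1000" req_cpu="2" subject="002"
VARS 003 req_mem="4000" req_cpu="2" subject="003"
```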
Finally, the number of concurrently running jobs can be limited per category of jobs. In this example, only one category (DummyCategory) is created and its MAXJOBS value is set to 1.
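With several categories, different job types can be throttled independently; a hypothetical extension with serial imports but two preprocessing jobs at a time might look like this (node and category names are made up):

```
CATEGORY import_01  ImportJobs
CATEGORY import_02  ImportJobs
CATEGORY preproc_01 PreprocJobs
CATEGORY preproc_02 PreprocJobs

MAXJOBS ImportJobs  1
MAXJOBS PreprocJobs 2
```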
Submitting the dag-file with condor_submit_dag my_workflow.dag adds the workflow to HTCondor's queue and runs the three jobs one after another, ensuring that only one of them is running at any given time:
```
-- Schedd: medusa.local : <10.0.0.100:9618?... @ 02/11/20 20:58:05
OWNER   BATCH_NAME             SUBMITTED   DONE   RUN   IDLE  TOTAL  JOB_IDS
pvavra  my_workflow.dag+1898   2/11 19:57     _     1      _      3  1898912.0

2 jobs; 0 completed, 0 removed, 0 idle, 2 running, 0 held, 0 suspended
```
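The running workflow can be inspected with the usual HTCondor tools; these commands (run on the submit host, against the example file names above) may be helpful:

```shell
# Show the individual node jobs instead of the collapsed batch line
condor_q -nobatch

# Follow DAGMan's own progress log, written next to the dag-file
tail -f my_workflow.dag.dagman.out
```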