Task, Job & Worker
- A task is a Linux command
- A number of tasks are grouped into a queue (see Stephen’s example)
- A job is a queue submitted as a batch job.
- Workers are the processes (“process” may not be the correct terminology) that run simultaneously within a job. The number of workers determines how many tasks can run at once.
To illustrate with an example:
We can type in the following qdo command on Edison: “qdo launch (2 jobs, 4 workers, 96 cores)”
What then happens on Edison is that the 2 jobs are submitted simultaneously through the Edison command “qsub [jobname]”, and the 96 cores are divided among the 4 workers so that each worker gets 24 cores. Each worker independently grabs a task from the pending list of the two jobs and performs “qdo do” on that task, using the 24 cores (a.k.a. one node) it has access to. Once a worker finishes a task, it goes back to the pending list of the queue and grabs another task to perform. This repeats until the queue is complete.
The reason we submit two jobs to run simultaneously, instead of one bigger job that combines them, is, among other technical considerations, that shorter jobs that require short walltime wait less in the NERSC queue.
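The worker behaviour described above can be sketched as a toy model in Python. This is only an illustration of the "grab a task, run it, go back for another" loop and is not how qdo itself is implemented: the function name run_workers, the thread-based workers and the in-memory pending list are all assumptions made for the example. The core counts are taken from the example above.

```python
import queue
import threading

TOTAL_CORES = 96
NUM_WORKERS = 4
# 96 cores split evenly among 4 workers: 24 cores each, i.e. one Edison node
CORES_PER_WORKER = TOTAL_CORES // NUM_WORKERS


def run_workers(commands, num_workers=NUM_WORKERS):
    """Toy model (hypothetical, not qdo code): workers drain a shared pending list."""
    pending = queue.Queue()
    for cmd in commands:
        pending.put(cmd)

    finished = []            # (worker_id, command) pairs, in completion order
    lock = threading.Lock()  # protects the shared 'finished' list

    def worker(worker_id):
        while True:
            try:
                cmd = pending.get_nowait()  # grab the next pending task
            except queue.Empty:
                return                      # pending list is empty: worker stops
            # A real worker would execute the command here on its own cores.
            with lock:
                finished.append((worker_id, cmd))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return finished


commands = [f"echo task {i}" for i in range(10)]
results = run_workers(commands)
print(f"{len(results)} tasks finished, {CORES_PER_WORKER} cores per worker")
```

Which worker ends up with which task is not fixed in advance. That is the point of the pull model: a worker that finishes early simply takes the next pending task.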
What Happens When A Task Fails?
If a task fails, the rest of the queue still finishes executing. To figure out which task has failed, run “qdo tasks queuename”, which lists all the task commands in the given queue and their status.
Then you can run “qdo retry queuename” to put all failed tasks in the queue back into pending.
Some Terminology Distinctions
- qdo retry: put all failed tasks back into pending
- qdo recover: in case a job crashes due to system reasons, put running tasks back into pending
- qdo rerun: put all tasks in a given queue back into pending. You can add flags to this command to specify a subset of tasks to rerun, e.g. “qdo rerun exitcode=0” would only put the “succeeded” tasks back into pending.
- Waiting: tasks whose dependencies have not been run yet
- Pending: tasks that have cleared all dependencies and are waiting for execution
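The difference between retry, recover and rerun is easiest to see as state transitions. Below is a minimal sketch in Python, assuming a simplified task model: the Task class, the state names “Running”, “Succeeded” and “Failed”, and the three functions are hypothetical stand-ins written for this example. They are not the qdo API, and dependency handling is ignored.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    command: str
    state: str                      # assumed states: Waiting, Pending, Running, Succeeded, Failed
    exitcode: Optional[int] = None  # None until the task has finished


def _to_pending(task: Task) -> None:
    task.state = "Pending"
    task.exitcode = None


def retry(tasks: List[Task]) -> None:
    """Model of 'retry': only failed tasks go back to pending."""
    for t in tasks:
        if t.state == "Failed":
            _to_pending(t)


def recover(tasks: List[Task]) -> None:
    """Model of 'recover': tasks left running by a crashed job go back to pending."""
    for t in tasks:
        if t.state == "Running":
            _to_pending(t)


def rerun(tasks: List[Task], exitcode: Optional[int] = None) -> None:
    """Model of 'rerun': all tasks, or only those with a matching exit code."""
    for t in tasks:
        # Simplification: dependencies are ignored in this toy model.
        if exitcode is None or t.exitcode == exitcode:
            _to_pending(t)


def make_queue() -> List[Task]:
    return [
        Task("echo a", "Succeeded", 0),
        Task("echo b", "Failed", 1),
        Task("echo c", "Running"),
        Task("echo d", "Waiting"),
    ]


def states(tasks: List[Task]) -> List[str]:
    return [t.state for t in tasks]


q = make_queue(); retry(q); after_retry = states(q)
q = make_queue(); recover(q); after_recover = states(q)
q = make_queue(); rerun(q, exitcode=0); after_rerun_ok = states(q)
q = make_queue(); rerun(q); after_rerun_all = states(q)

for label, result in [("retry", after_retry), ("recover", after_recover),
                      ("rerun exitcode=0", after_rerun_ok), ("rerun", after_rerun_all)]:
    print(label, result)
```

Each command is applied to a fresh copy of the same four-task queue, so the printed lines can be compared side by side to see which tasks each command touches.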