qdo

Background qdo Concepts

Task, Job & Worker

  • A task is a single Linux command.
  • A number of tasks are grouped into a queue (see Stephen’s example).
  • A job is a queue submitted as a batch job.
  • Workers are the processes (“process” may not be the correct terminology) that can run simultaneously within a job.

To illustrate with an example:

Suppose we type the following qdo command on Edison: “qdo launch (2 jobs, 4 workers, 96 cores)”.
What happens on Edison is that the 2 jobs are submitted simultaneously through the Edison command “qsub [jobname]”, and the 96 cores are divided among the 4 workers so that each worker gets 24 cores. Each worker independently grabs a task from the pending list shared by the two jobs and performs “qdo do” on that task, using the 24 cores (i.e. one node) it has access to. Once a worker finishes a task, it goes back to the pending list of the queue and grabs another task. This repeats until the queue is complete.
The reason we submit two jobs to run simultaneously, instead of one bigger job that combines them, is, among other technical considerations, that shorter jobs requiring less walltime spend less time waiting in the NERSC queue.
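
The worker behaviour described above can be sketched as a small simulation. This is only an illustration of the "grab a task, run it, come back for another" loop and of the core arithmetic; it is not qdo's actual implementation, and the names run_workers, TOTAL_CORES and NUM_WORKERS are made up for this sketch.

```python
import queue
import threading

TOTAL_CORES = 96
NUM_WORKERS = 4
# 96 cores split across 4 workers gives 24 cores each (one Edison node).
CORES_PER_WORKER = TOTAL_CORES // NUM_WORKERS


def run_workers(commands, num_workers=NUM_WORKERS):
    """Simulate workers draining a shared pending list.

    Returns a dict mapping worker id to the list of commands it handled.
    Hypothetical helper, not part of qdo.
    """
    pending = queue.Queue()
    for cmd in commands:
        pending.put(cmd)

    handled = {w: [] for w in range(num_workers)}

    def worker(worker_id):
        while True:
            try:
                cmd = pending.get_nowait()  # grab the next pending task
            except queue.Empty:
                return  # pending list is empty, so this worker stops
            # A real worker would execute the command here on its cores.
            handled[worker_id].append(cmd)

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return handled
```

Which worker ends up with which task is not deterministic; the only guarantee in this sketch is that every task is handled exactly once.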



What Happens When A Task Fails?

If a task fails, the rest of the queue still finishes executing. To find out which task failed, run “qdo tasks queuename”, which lists all the task commands in the given queue along with their status.
Then you can run “qdo retry queuename” to put all failed tasks in the queue back into pending.

Some Terminology Distinctions

  • qdo retry: put all failed tasks back into pending
  • qdo recover: in case a job crashes due to system reasons, put running tasks back into pending
  • qdo rerun: put all tasks in a given queue back into pending. You can add flags to this command to specify a subset of tasks to rerun, e.g. “qdo rerun exitcode=0” would put only the succeeded tasks back into pending.


  • Waiting: tasks whose dependencies haven’t been run yet
  • Pending: tasks that have cleared all dependencies and are waiting for execution
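
The distinctions above can be summarized as state transitions. The following is a minimal sketch that models them on a plain dictionary of task states; it mirrors the descriptions in these notes rather than qdo's real code, and the State enum, the function signatures and the "only" parameter (standing in for filter flags such as exitcode=0) are assumptions made for illustration.

```python
from enum import Enum


class State(Enum):
    WAITING = "waiting"      # dependencies not yet run
    PENDING = "pending"      # ready to be picked up by a worker
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


def _move_to_pending(tasks, should_move):
    """Return a copy of tasks with the selected ones set to PENDING."""
    return {
        name: State.PENDING if should_move(state) else state
        for name, state in tasks.items()
    }


def retry(tasks):
    """Failed tasks go back to pending."""
    return _move_to_pending(tasks, lambda s: s is State.FAILED)


def recover(tasks):
    """Tasks left 'running' by a crashed job go back to pending."""
    return _move_to_pending(tasks, lambda s: s is State.RUNNING)


def rerun(tasks, only=None):
    """All tasks go back to pending, or only those currently in state `only`.

    `only` is a hypothetical stand-in for rerun's filter flags.
    """
    return _move_to_pending(tasks, lambda s: only is None or s is only)
```

For example, given one failed, one running and one succeeded task, retry changes only the failed task, recover only the running one, and rerun with only=State.SUCCEEDED only the succeeded one.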

Feature Prioritization Research

Question 1: Verify whether a dashboard is needed

RESEARCH METHOD: INTERVIEW

  • How often do you monitor your queues?
  • How frequently should information on the UI be updated?
  • What’s the biggest value of the GUI to you, compared to command line?
  • How often do tasks fail?
  • What do you do when tasks fail?
  • Do you run qdo add and qdo launch on separate occasions, or together?
  • How many queues and how many tasks do you usually have?

Question 2 (if Question 1 validates dashboard)

RESEARCH METHOD: CARD SORTING

PROTOCOL

  1. Here are the potential items to display (see ITEMS TO SORT below). (This may work well as a graphical card sorting activity; if a certain job status does not trigger an action, then it should not be on the dashboard.) How often would you access each item, and so how many clicks away should it be (w/o clicks, w/ one click, w/ two clicks, w/ three clicks)? Remember: the fewer clicks an item takes, the more cluttered the screen will be.
  2. Once items have been sorted, ask (especially for the w/o click items):
    • Describe a situation when this information would lead you to do something OR Give me an example of the actual data that would appear and the action that you would take in response. 
  3. Once items have been pruned in (2), ask about the number-related items (marked with * in ITEMS TO SORT):
    • What are the useful comparisons (targets, standards, past data etc.) that will allow you to see these items of information in meaningful context?
  4. How would you group these items, in a way that could be used to organize the information on the dashboard?

 

ITEMS TO SORT

a.    task status in a queue

1)    Task list (list of individual tasks)
2)    Task status (list of the statuses of individual tasks without task detail)
3)    Failed task (list of failed tasks)
4)    Completed task (list of completed task)
5)    Waiting/pending/running/rerunning tasks (a list of the tasks in each status; rerunning is currently not feasible in the backend, ask Stephen if this would be useful)

b.    Queue status

1)    Queue status (the number of tasks in the queue that are waiting, pending, running, succeeded, failed)
2)    Queue failures (a simple warning that there is at least one task that failed in the queue)
3)    Queue completion (notification that a queue is completed)

c.    General stats (* marks the number items that need comparison context)

1)    *Number of workers for a queue
2)    *Queue running time list
3)    *Queue waiting time list
4)    *Number of tasks in a queue
5)    Queue status (running/paused/resumed/deleted; a list of each)
6)    Queue status (launched/unlaunched; a list of each)

d.    Any trends/history you want to see?
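
As a point of reference for the queue-level items under (b), the three pieces of information (per-state counts, a failure warning, a completion notice) could all be derived from a list of task states. This is a rough sketch under assumptions: summarize_queue is a hypothetical helper, not part of qdo, and "complete" is taken here to mean that every task has either succeeded or failed.

```python
from collections import Counter

# Task states as listed in item (b) 1.
STATES = ("waiting", "pending", "running", "succeeded", "failed")


def summarize_queue(task_states):
    """Return (counts per state, has_failures, is_complete) for one queue.

    Hypothetical helper for discussion purposes only.
    """
    counts = Counter(task_states)
    summary = {state: counts.get(state, 0) for state in STATES}
    total = sum(summary.values())
    # Queue failures: at least one task failed.
    has_failures = summary["failed"] > 0
    # Queue completion (assumption): nothing is waiting, pending or running.
    finished = summary["succeeded"] + summary["failed"]
    is_complete = total > 0 and finished == total
    return summary, has_failures, is_complete
```

A queue with two succeeded tasks and one failed task would therefore show both the failure warning and the completion notice, which matches the note above that a queue still finishes executing when a task fails.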

 

RANK ACTIONS

The card sorting activity may have given ideas about which actions are needed.
Rank the importance of these actions:

a.    Retry/rerun/recover
b.    Pause/Resume
c.    Delete
d.    View
e.    Edit (currently not feasible in backend)
f.    Other actions

Fill out each action verb with a subject and an object, e.g. “qdo reruns the queue”.