Modern scientific discoveries rely on the power of supercomputers. However, managing computational tasks is sometimes difficult on a supercomputer. During my internship at Lawrence Berkeley National Lab, the birthplace of 13 Nobel Prizes, I was excited to help increase scientific productivity by creating a supercomputer task management tool.
Summer 2015 | Internship Project
Megha Sandesh, Gonzalo Rodrigo Álvarez
User Research - Persona, User Journey, Card Sorting
Design & Prototyping - Balsamiq Mock-Up, Participatory Design
Development - AngularJS, Sass
The challenge was to design and develop a web UI that would help scientists monitor the status of their supercomputer tasks and perform actions on them. The web UI was a new addition to a supercomputer task management toolkit called qdo, which also offered a Python library, a suite of bash commands and an API.
I started the project with user research. Working with a scientist user, I created a persona and a user journey, and prioritized features through card sorting. I also considered how the web UI would work together with the other components of the qdo toolkit, which were good at performing batch actions, but poor at visualization. In the end, I decided the web UI’s primary function should be monitoring task status, using different forms of data visualization.
I iterated over five design prototypes, progressing from lo-fi to interactive. After the design was finalized, I developed the UI together with two colleagues using AngularJS. At the end of my internship, qdo had a fully functional web application available to users at the National Energy Research Scientific Computing Center.
High-performance computing (HPC) systems, more commonly known as supercomputers, are huge machines consisting of thousands or millions of interconnected processing cores that can perform vast amounts of computation in a fraction of the time required by a general-purpose computer. HPC systems are important enablers of modern scientific research, whose findings often result from millions of calculations, far too many for any general-purpose computer to handle.
Despite these major benefits, HPC makes task management harder for research that runs large volumes of small computation jobs. Imagine that you submitted 10,000 computation jobs to an HPC system: some have finished successfully, some have failed and need debugging, some are still running, some are waiting in line, and others are pending to get in line. Considering that your batch of jobs will take days to run, how do you keep track of their status and check whether anything needs your attention?
Qdo is designed to address this challenge. It is an open-source supercomputer task management tool that allows supercomputer users to easily monitor the status of their computational tasks and to perform actions on them. It offers a web application, a Python library, a suite of bash commands and an API, each of which can be used independently or in coordination.
I was tasked with the design and development of the web UI.
I started my work with user research, for which I set two goals:
- Get familiar with the technology.
- Explore how the web interface would work together with qdo's other components (python library, bash commands, and API) to deliver an integrated experience.
Learning the Technology
Since qdo was designed for users with special domain knowledge, my first task, before I could start any design work, was to learn the domain in which the UI would be used. The need to overcome the domain knowledge hurdle is a big difference between designing a technical product (e.g. a nuclear plant control panel) and designing a commonplace product (e.g. an email client). For Project qdo, this meant spending time learning about HPC, parallel computing, and scientific data analysis workflows. With help from reference books, colleagues and my CS knowledge, I soon gained enough understanding to begin the design process. Here are some of my tech notes.
Target Users & Recruiting
Qdo’s target users were scientists who used HPC to run large volumes of small computation jobs. Cosmologists were the primary supercomputer users with this characteristic. Since most scientists were very busy with research, the project gave me access to only one cosmologist, S, for user research. It was challenging to work with only one user, but I had to work within this constraint. Given that qdo’s target user group was specific and narrow, and that its users were experts in the experience I was designing for, I felt that doing in-depth user studies with S and engaging him in participatory design could still cover a lot of usability ground.
Jacob is a 40-year-old cosmologist. He received his physics PhD from Stanford in 2000, and has been working at Lawrence Berkeley National Lab since graduation.
He is the Principal Investigator of the Large Structure Cosmology group at the lab, and is currently in charge of four projects. In addition, he teaches COSMO 215 at UC Berkeley. He has five staff members and one postdoc working in his group. Many of his projects involve using telescope images to analyze cosmic changes. For example, in the Structure and Origin of New-Born Supernova project, he uses telescope images to discover new supernovas.
He likes computer programming, took CS classes as an undergrad and has written many small programs on his own for his projects. He is fluent in Linux, C, and Fortran, and has dabbled in web programming but hated it. He usually uses the command line for his research work. Nonetheless, he wants to explore the possibility of using a web interface for his projects.
Jacob uses the supercomputer to run telescope image processing jobs. Scheduling the processing workflow is tedious, so Jacob wants to make a tool that eases his pain while also benefiting other cosmologists.
Schedule and Lifestyle
Jacob is married with two children. His family lives in El Cerrito, a 25-minute drive from the lab. Jacob is busy with his work and family duties. He usually works at the lab or on campus, and works from home to take care of the kids about once every two weeks. In his very little spare time, Jacob likes hiking and biking in nature with his family.
When I joined the project, qdo already had a beta Python library, and some users, including my user S, had been using it for their work. I felt that understanding how users used the qdo Python library for task management would help me position the value of a web application within the integrated qdo toolkit: Python library, command line, API, and web app. I created a user journey to capture my findings.
Before running qdo
- Collect telescope images
- Divide images into 30 sub-images, each with 20 sub-regions; each sub-region takes about 30 minutes to process.
- Plan out the workflow: parallel processing of sub-regions → merge into sub-images → merge into image → split into sub-images for post processing
- Write execution script
- Use Python to translate the workflow into a sequence of auto-generated Linux commands
- Use qdo.add to load the commands into one qdo queue
- Specify how many workers are needed for running this queue as one job
- Use “qdo launch” to run the queue as one job
    import qdo

    q = qdo.connect(…)
    for image in range(30):
        for subregion in range(20):
            # one command per sub-region of each sub-image
            cmd = "blat {} {}".format(image, subregion)
            q.add(cmd)

    # then, from the shell:
    qdo launch
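To get a feel for the size of this workflow, the numbers in the journey above can be worked through in a few lines of Python. This is only a back-of-the-envelope sketch: it does not call qdo, the `blat` command string simply mirrors the placeholder used above, and the worker count of 100 is an assumed figure for illustration. It also covers only the parallel sub-region stage, not the merge steps.

```python
import math

N_IMAGES = 30            # sub-images per telescope image (from the journey above)
N_SUBREGIONS = 20        # sub-regions per sub-image
MINUTES_PER_TASK = 30    # approximate processing time per sub-region


def build_commands(n_images=N_IMAGES, n_subregions=N_SUBREGIONS):
    """Return one placeholder command string per sub-region."""
    return [
        "blat {} {}".format(image, subregion)
        for image in range(n_images)
        for subregion in range(n_subregions)
    ]


def estimate_wall_minutes(n_tasks, n_workers, minutes_per_task=MINUTES_PER_TASK):
    """Rough wall-clock estimate: tasks run in waves of n_workers at a time."""
    waves = math.ceil(n_tasks / n_workers)
    return waves * minutes_per_task


commands = build_commands()
serial_hours = len(commands) * MINUTES_PER_TASK / 60

print(len(commands), "tasks")
print(serial_hours, "hours if run one at a time")
# 100 workers is a hypothetical figure, not a number from the project.
print(estimate_wall_minutes(len(commands), n_workers=100), "minutes with 100 workers")
```

With 30 sub-images of 20 sub-regions each, a single image already produces 600 tasks and about 300 hours of serial processing, which is why both parallel workers and a way to keep track of them matter.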
While running qdo
Check queue status
- Queue completed with all tasks succeeded → We are good!
- Queue completed with some tasks failed → Inspect and fix
  - Use “qdo tasks” to inspect which tasks failed and what the task contents are
  - Retry directly with "qdo retry", or
  - Fix the tasks, then "qdo retry"
- Queue crashed due to system reasons → "qdo recover"
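The decision flow above can be written down as a small function. This is a hypothetical sketch rather than anything from the qdo code base: the state names and the shape of the `counts` dictionary are assumptions, and the returned strings only name the bash commands mentioned in this journey.

```python
def next_action(counts, crashed=False):
    """Suggest what to do for a queue, given task counts per state.

    counts: dict mapping a state name to a number of tasks. The state
    names used here ("succeeded", "failed", ...) are illustrative only.
    crashed: True if the queue stopped for system reasons.
    """
    if crashed:
        return "qdo recover"
    unfinished = sum(counts.get(s, 0) for s in ("pending", "waiting", "running"))
    if unfinished > 0:
        return "keep monitoring"
    if counts.get("failed", 0) > 0:
        return "inspect with qdo tasks, then qdo retry"
    return "done"


print(next_action({"succeeded": 590, "failed": 10}))
```

Writing it out this way makes it clear that every branch starts from the same question, "how many tasks are in each state?", which is the information a monitoring UI needs to show first.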
After running qdo
Analyze the processed data and gather insights. Hopefully new scientific discoveries!
Findings (Part 1)
Synthesizing the results from persona and user journey, I came to the following findings:
- The qdo web UI may be suitable for monitoring job status after a qdo job has been launched.
- It may also help with post-launch actions.
- It may not help much with launching jobs, as the Python library is more efficient at taking care of that.
- Something to explore is whether the UI can help with post-processing after running qdo.
Since I had decided the web UI would be used primarily for monitoring, a dashboard layout seemed like a good design option. I consulted the book Information Dashboard Design by Stephen Few to come up with a research method to:
- Verify whether a dashboard was needed
- Prioritize which items should be displayed
I thought an interview with S would help answer the first question, and a card sorting session would help address the second question.
You can read about my research plan here.
Findings (Part 2)
- The two most important UI features for S are:
  - When things are working, show how much has been completed
  - When things have failed, show what failed
- Adding tasks and adding queues are performed more efficiently via the command line, so these functions are low priority for the UI.
- Deleting multiple queues at once may be useful.
- A burn-down chart that visualizes the number of tasks in each status category over time can be useful for diagnosing problems.
- S may be running hundreds of queues at the same time, with up to a million tasks in each queue, and each task may be a command up to 600 characters long. I needed to consider this content volume when designing the interface layout.
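To make the first and last findings concrete, here is a minimal sketch of how a queue could be condensed for display: a percent-complete figure, a failure count, and a short preview of failed commands that are truncated to fit on screen. The function names, the `(command, state)` input format, the state names, and the 60-character limit are all assumptions made for illustration, not part of qdo.

```python
def truncate(command, max_len=60):
    """Shorten a long command for display, keeping the start of it."""
    if len(command) <= max_len:
        return command
    return command[: max_len - 3] + "..."


def summarize_queue(tasks, max_failed_shown=5):
    """Summarize a queue as percent complete plus a short list of failures.

    tasks: iterable of (command, state) pairs; state names are illustrative.
    Only one pass is made, so this also works on a generator of many tasks.
    """
    total = 0
    succeeded = 0
    failed_count = 0
    failed_preview = []
    for command, state in tasks:
        total += 1
        if state == "succeeded":
            succeeded += 1
        elif state == "failed":
            failed_count += 1
            if len(failed_preview) < max_failed_shown:
                failed_preview.append(truncate(command))
    percent = 100.0 * succeeded / total if total else 0.0
    return {
        "total": total,
        "percent_complete": percent,
        "failed_count": failed_count,
        "failed_preview": failed_preview,
    }


example = [
    ("x" * 600, "failed"),      # a 600-character command, as S described
    ("echo 1", "succeeded"),
    ("echo 2", "succeeded"),
    ("echo 3", "running"),
]
print(summarize_queue(example))
```

Showing counts and a capped preview, rather than every task, is one way to keep a page usable when a queue holds up to a million long commands.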
Card Sorting Results
S ranked qdo features in the following order (features are denoted by their respective bash commands):
- qdo list / qdo status
- qdo tasks
- qdo retry
- qdo recover
- qdo rerun
- qdo pause
- qdo resume
- qdo delete
- qdo launch
- qdo create
- qdo load
- qdo add
User research helped me get into the head of a scientist user. Gradually, the task of designing for a domain I previously knew little about no longer seemed daunting.
When I joined the project, the web application was at an early stage. It had implemented only basic functionality, and the UI was intended as a placeholder. After evaluating the existing UI, I decided to create a new design from scratch.
I iterated over five prototypes, progressing from wireframes to interactive prototypes. S gave me feedback after each prototype, which I incorporated into the next version.
Below I illustrate how four pages evolved over time:
User Homepage (v1 - v5)
Queue Detail Page (v1 - v5)
Queue By Status Page (v1 - v4)
Base Template (v1 - v3)
Once the interaction design was finalized, I experimented with colors and shapes to get ideas for the application theme. I made a style tile, which is a quick and effective way to get an overall look and feel for a design's aesthetics.
I developed the web application along with Megha Sandesh (front-end) and Gonzalo Rodrigo Álvarez (back-end). I used AngularJS, Sass, D3, and Google Material Design Template.
Queue Detail Page
Queue By Status Page
Task Visualization Demo
Here is an example of the interactive graphs, which are part of the queue detail pages. These graphs help users visualize changes in task status over time. I developed them using D3.
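The actual graphs were built with D3, and the Python below is not that code. It is only a sketch of the kind of data shaping a status-over-time graph needs: turning a stream of task state changes into one set of per-state counts per event. The event format `(timestamp, task_id, new_state)` and the state names are assumptions made for illustration.

```python
from collections import Counter


def status_counts_over_time(initial_states, events):
    """Build one {state: count} snapshot per state-change event.

    initial_states: dict mapping task_id -> state at the start.
    events: list of (timestamp, task_id, new_state), sorted by timestamp.
    Returns a list of (timestamp, counts) pairs, ready to plot as a
    stacked area or burn-down style chart.
    """
    current = dict(initial_states)
    counts = Counter(current.values())
    series = []
    for timestamp, task_id, new_state in events:
        old_state = current.get(task_id)
        if old_state is not None:
            counts[old_state] -= 1
        counts[new_state] += 1
        current[task_id] = new_state
        # Store a fresh dict so later updates do not alter earlier
        # snapshots, and leave out states that currently have no tasks.
        series.append((timestamp, {s: n for s, n in counts.items() if n > 0}))
    return series


initial = {1: "pending", 2: "pending", 3: "pending"}
events = [
    (10, 1, "running"),
    (20, 1, "succeeded"),
    (25, 2, "running"),
    (30, 2, "failed"),
]
for timestamp, counts in status_counts_over_time(initial, events):
    print(timestamp, counts)
```

Keeping a running count and updating it per event avoids rescanning every task at each point in time, which matters when a queue holds a very large number of tasks.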