One peculiar cluster and how I brought it down

Problem statement

Some time ago I worked on a project that required an HPC cluster; to my luck, my institution had recently created one.

It was a peculiar beast (or maybe the setup was standard for scientific clusters, I don't know), but to me it was peculiar.

It used the `Torque <http://www.adaptivecomputing.com/products/open-source/torque>`__ job submission system, which was pretty straightforward: a command-line tool let you start an executable on a defined number of CPUs.

What I wanted, though, was rather "the ability to call a function with different parameters on a defined number of CPUs".

Solution

After digging through some documentation, I found that Torque passes local environment variables to child processes.

So maybe, if I serialized the function call into an environment variable, and then created a script that would unpack this variable and execute it, I could achieve what I wanted.
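The mechanics can be sketched roughly like this (a minimal sketch: the names ``submit``, ``worker``, and ``SERIALIZED_CALL`` are mine, not the library's, and I use the standard ``pickle`` module here to keep the example self-contained; the real solution needed dill, as explained below):

```python
import base64
import os
import pickle


def double(x):
    # stand-in for the real work to run on the cluster
    return x * 2


def submit(fn, args):
    """Pack a function call into a copy of the environment."""
    payload = base64.b64encode(pickle.dumps((fn, args))).decode("ascii")
    # in the real setup this environment would be handed to qsub
    return dict(os.environ, SERIALIZED_CALL=payload)


def worker(env):
    """What the unpacking script does on the worker node."""
    fn, args = pickle.loads(base64.b64decode(env["SERIALIZED_CALL"]))
    return fn(*args)


print(worker(submit(double, (21,))))  # prints 42
```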

To pull this off I needed:

  1. The ability to serialize Python bytecode (this is not supported by pickle, but dill can do it: https://github.com/uqfoundation/dill).

  2. Confidence that the serialized code would fit in the variable. A Stack Overflow answer says that Linux supports environment variables over 1 MB: https://stackoverflow.com/q/1078031/7918 (which is enough).

  3. Some assurance that it was secure.

    And I think security wasn't that bad: you could make my script execute arbitrary code, but it would run with the privileges of your own user, which you could do anyway.
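To illustrate point 1: pickle serializes a function only by reference to its module and name, so a lambda cannot be round-tripped, while dill captures the bytecode itself. A quick demonstration (assuming dill is installed):

```python
import pickle

import dill  # pip install dill

f = lambda x: x + 1

try:
    pickle.dumps(f)
except (pickle.PicklingError, AttributeError) as e:
    # pickle cannot serialize a lambda by reference
    print("pickle refuses:", type(e).__name__)

blob = dill.dumps(f)   # dill serializes the function's bytecode
g = dill.loads(blob)
print(g(41))           # prints 42
```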

Well, and here is the library that does it: https://github.com/jbzdak/torque-submitter .

Aftermath

This worked well enough, up to the time when I accidentally took the whole cluster down.

Torque has something called "array jobs": a job that consists of many tasks, each of which gets a task id in an environment variable.
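Each task can then pick its own parameters based on that variable; if I recall correctly Torque exposes the index as ``PBS_ARRAYID`` (treat the variable name as an assumption, check your Torque version):

```python
import os

# PBS_ARRAYID is the index Torque sets for array-job tasks
# (assumption; defaults to 0 here so the sketch runs standalone)
task_id = int(os.environ.get("PBS_ARRAYID", "0"))

# each task picks its own parameter from a shared list
params = [0.1, 0.2, 0.5, 1.0]
print("running with parameter", params[task_id])
```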

Array jobs worked well for me, until I started a job with 400 tasks. Then the cluster exploded (not physically, but it stopped accepting any tasks, which irritated a lot of people).

It turned out that each job in the array got a copy of the whole environment, and I had serialized every function call into that environment separately, which meant memory requirements grew quadratically with the number of tasks; in the end an OOM error brought down the controller.
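With made-up but plausible numbers, the blow-up looks like this (the 100 KiB per-call payload size is an assumption for illustration):

```python
payload_kib = 100   # assumed size of one serialized call
tasks = 400

# every task's environment carried all 400 serialized calls...
env_per_task_kib = payload_kib * tasks
# ...and the controller held one such environment per task
total_kib = env_per_task_kib * tasks

print(total_kib // (1024 * 1024), "GiB, roughly")  # prints: 15 GiB, roughly
```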

After that, I decided to store the serialized bytecode in a file, and keep only the file name in the environment variable.
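A sketch of the fixed scheme (again with hypothetical names, and a shared filesystem between submit and worker nodes is assumed; ``pickle`` stands in for dill to keep the example self-contained):

```python
import os
import pickle
import tempfile


def double(x):
    # stand-in for the real work to run on the cluster
    return x * 2


def submit_via_file(fn, args):
    """Serialize the call to a file; only the short path goes into the env."""
    fd, path = tempfile.mkstemp(suffix=".call")
    with os.fdopen(fd, "wb") as f:
        pickle.dump((fn, args), f)
    # the environment now grows by one short string per task,
    # not one full payload per task
    return dict(os.environ, CALL_FILE=path)


def worker(env):
    """The unpacking script reads the file named in the environment."""
    with open(env["CALL_FILE"], "rb") as f:
        fn, args = pickle.load(f)
    return fn(*args)


print(worker(submit_via_file(double, (21,))))  # prints 42
```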