November 2, 2006

FSOSS: Curious George and the $100 Million Supercomputer

Curious George and the $100 Million Supercomputer by Phil Schwan

For the last four years I helped build storage systems for the largest government, corporate, and academic clusters in the world. This morning we'll do a whirlwind tour of some of the corner cases nobody talks about: not just how to store petabytes of data and transfer tens of gigabytes per second (that part is easy), but our struggles with Linux kernel disasters, physicists who think they're software developers, POSIX-induced migraines, and debugging a 10,000-node state machine on hardware nobody else has from the wrong side of a classified network.
This was the most gloriously techie of all the sessions I attended, so much so that quite a bit of the hard-core Unix/Linux material was beyond me. In any event, it was still very interesting. Schwan co-founded the company behind Lustre, a file system for very large cluster computing systems. The talk was about the challenges of building these highly parallel storage clusters.

The aim of the Lustre project he discussed was to create a petabyte-scale storage system that was also very fast, supporting 10 GB/sec transfer rates. Other goals were to support the largest supercomputers (including Top 500 cluster computing sites), to provide POSIX file-system semantics, and to achieve a 100% recovery rate.

There were a couple of uncomfortable lessons from this effort. First: don't bother writing your own operating system for a project like this, as it will only cause delays. Second: the API of the Linux kernel is far too unstable to build a high-profile project on (this part of the talk was especially technical). One good thing they learned: choosing POSIX semantics for the file system was the right call, because it made the whole cluster behave like one machine.
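That "behaves like one machine" point can be made concrete. The sketch below is illustrative and not Lustre-specific: because a POSIX-compliant cluster file system honors the same guarantees as a local disk, ordinary idioms like atomic `rename()` publishing and advisory `fcntl` locks work unchanged whether the path points at local storage or a parallel file system mount (the `MOUNT` directory here is a hypothetical stand-in for something like `/mnt/lustre`).

```python
# Sketch: POSIX semantics let the same code work on a local disk or a
# cluster mount. MOUNT is a stand-in for a parallel file system mount point.
import fcntl
import os
import tempfile

MOUNT = tempfile.mkdtemp()  # hypothetical stand-in for e.g. /mnt/lustre

def atomic_publish(path, data):
    """Write to a temp file, then rename into place. POSIX guarantees
    rename() is atomic, so a reader on any node sees either the old
    contents or the new -- never a torn, half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # force data to stable storage before rename
    os.rename(tmp, path)

def with_lock(lock_path, fn):
    """Serialize writers with an advisory fcntl lock; on a POSIX-compliant
    cluster file system this coordinates processes on different nodes
    exactly as it would coordinate processes on one machine."""
    with open(lock_path, "a+") as f:
        fcntl.lockf(f, fcntl.LOCK_EX)
        try:
            return fn()
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)

result = os.path.join(MOUNT, "result.txt")
with_lock(os.path.join(MOUNT, "result.lock"),
          lambda: atomic_publish(result, "42\n"))
print(open(result).read().strip())
```

Nothing in this code knows it might be running against a cluster; that transparency is exactly what makes POSIX semantics attractive, even if (as the talk's "POSIX-induced migraines" suggests) implementing them across 10,000 nodes is anything but easy.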

The next part of the talk was called "Software Sucks," basically about the perils of very large software projects. A bit naive about such things, they learned many lessons the hard way (perhaps they should have read the Fred Brooks book mentioned above... or perhaps any book on software engineering?). First of all, the software industry is a disaster, tolerating error rates no other industry would accept. The typical comparison: what if bridges or planes failed at the same rate as software? They did discover the Personal Software Process methodology, which saved their bacon, but I suspect virtually any methodology would have done the same.

(Update: TOC of my FSOSS posts, FSOSS agenda, video recordings of sessions)
