You've probably seen or heard of Project ORCA by now - the Romney Campaign's massive, high-tech Get Out The Vote effort that failed spectacularly. As John Ekdahl put it:
"By 2PM, I had completely given up. I finally got ahold of someone at around 1PM and I never heard back. From what I understand, the entire system crashed at around 4PM."
From all appearances, it was a combination of lack of robustness, lack of scalability, and a pernicious consultant culture that doomed it.
This isn't the first time I've seen problems of this type, and it almost certainly won't be the last. The underlying issue is pretty standard, viz.,
"Robust" takes on a whole new meaning when failure is not an option.
What do I mean by that?
Well, most software/IT projects that we are involved in tend to be somewhat bug-tolerant: they don't really have to work perfectly the first time around, the second time around, or heck, ever. You've probably all heard statements like
- "We'll fix that in the next release"
- "That feature hasn't been implemented yet"
- "Log that as a bug, and we'll get around to it when we get a chance"
- (and my favourite) "Well, just don't do that!"

There is nothing inherently surprising here - we're accustomed to a world in which everyone understands that requirements are fluid, systems are ill-defined, and humans are fallible. We just get on with our lives under the mantra that Good Enough is good enough.
* Email bounce? Send it again.
* '503 Service Unavailable'? Wait a few and try again.
* Phone lights go orange? Give it a few minutes, they'll probably go green again.
* Computer all slow and laggy? That's just the way computers are...
It is basically eventual functionality, i.e., it'll eventually function the way you want it to, and that is good enough :-)
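If you wanted to write "eventual functionality" down as code, it might look something like the sketch below: call the thing, and if it fails, wait a bit and call it again. The name `eventually`, its parameters, and the doubling delay are all my own assumptions for illustration, not anything from a particular system.

```python
import time


def eventually(operation, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out.

    Hypothetical helper: the name, the defaults, and the doubling
    delay are illustrative assumptions.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as error:
            last_error = error
            # Wait a few and try again, doubling the wait each time.
            # No point sleeping after the final failed attempt.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # Out of attempts: give up and surface the last failure.
    raise last_error
```

Note what this quietly assumes: that failing a few times costs nothing but a little patience. That assumption is exactly what goes away in the next category of systems.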
In opposition to this, we have those systems that have to work perfectly the first time around (hence, "failure is not an option"). You certainly don't want to hear "Yeah, the control rods didn't work, we'll fix that in the next iteration of the Reactor", or "Hmmm, seems to be a bug, the Mars Rover just turned itself off and we can't turn it on again".
(Note: For a classic writeup on Robust Software, The Mars Rover, and The Erlang Way, read "Getting 25 Megalines of Code to Behave" by Jesper Andersen...)
Most people do not understand this difference or, worse, are not even aware that it exists. Heck, they still think that they can build Robust systems by, well, Testing More and Following The Rules (or some such other idiocy).
Multiple redundant systems, designed and developed by different groups, obsessive levels of fault-tolerance, loose coupling everywhere - these aren't just "nice to haves" or "good ideas", they are impossible to not have. (Note: A lot of this might seem old hat to Erlang types, and that should only serve to hammer home the point...)
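To make "multiple redundant systems, designed and developed by different groups" a little more concrete, here is a toy sketch of the voting idea: run several independently written implementations of the same function and only accept an answer that a strict majority agrees on. Everything here (`vote`, `team_a`, `team_b`, `team_c`) is hypothetical and made up for illustration; real fault-tolerant systems involve far more than this.

```python
from collections import Counter


def vote(implementations, *args):
    """Run every implementation and return the strict-majority answer.

    Hypothetical helper for illustration. Results must be hashable
    so that they can be counted.
    """
    results = []
    for implementation in implementations:
        try:
            results.append(implementation(*args))
        except Exception:
            # A crashed implementation simply loses its vote.
            continue
    if not results:
        raise RuntimeError("every implementation failed")
    answer, count = Counter(results).most_common(1)[0]
    # Require more than half of ALL implementations, crashed ones included.
    if count * 2 <= len(implementations):
        raise RuntimeError("no majority among implementations")
    return answer


# Three made-up "teams" computing 1 + 2 + ... + n in different ways.
def team_a(n):
    return n * (n + 1) // 2


def team_b(n):
    return sum(range(n + 1))


def team_c(n):
    return sum(range(n))  # deliberate off-by-one bug


if __name__ == "__main__":
    print(vote([team_a, team_b, team_c], 10))
```

The buggy `team_c` gets outvoted, and a team whose code crashes just loses its vote. The design choice that matters is refusing to answer at all when there is no majority, rather than guessing.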
Things just get worse from there - we are all inherently believers in Survivorship Bias, where we look at successful large-scale IT projects and say "Hey! This can be done! I can do it too!". People don't realize that for every successful IT project, there are a whole bunch that failed, sometimes catastrophically. (Metrics vary, but pretty much everybody agrees that somewhere from 65% to 80% of IT projects fail. And that doesn't include the ones that don't quite work the way they are supposed to!).
And it's not just IT projects - 75% of all startups fail. So, those massively scalable cloud thingies out there - Pinterest, Instagram, etc. - are just the ones that succeeded, with the iceberg of failure having taken out all the others (for many many reasons, but trust me, "not working the way it should" is up there...)
To recap,
- Most people don't know from Robust Systems
- Most people don't really, really grok scalability.
Put the two together, and you're almost destined to end up with something like Project ORCA.
Just because you worked at Facebook, and/or read High Scalability Dot Com, does not, repeat NOT, mean that you are now capable of building the next massively scalable system, let alone a system that Must Work Perfectly The First Time Out.
Romney is not the first person to discover the above, and he certainly will not be the last...
BTW, for an excellent analysis of the meltdown, check out Ars Technica's coverage...