Robust Software - Really, *REALLY* Hard To Do...
By 2PM, I had completely given up. I finally got ahold of someone at around 1PM, but I never heard back. From what I understand, the entire system crashed at around 4PM... From all appearances, it was a combination of lack of robustness, lack of scalability, and a pernicious consultant culture that doomed it.
So, the end result was that 30,000+ of the most active and fired-up volunteers were wandering around confused and frustrated when they could have been doing anything else to help.
This wasn't the first time I'd seen problems of this type, and it almost certainly won't be the last. The underlying issue was pretty standard - basically, the word "Robust" takes on a whole new meaning when failure is not an option.
This is really important, and worth repeating loudly -
"Robust" takes on a whole new meaning when failure is not an option.
So, what exactly do I mean by that?
Well, most software/IT projects that we are involved in tend to be somewhat bug-tolerant - it doesn't really have to work perfectly the first time around, the second time around, or heck, ever.
Come to think of it, you've probably all heard statements like
- "We'll fix that in the next release"
- "That feature hasn't been implemented yet"
- "Log that as a bug, and we'll get around to it when we get a chance"
- "Well, just don't do that!" <--- and my favourite
There is nothing inherently surprising here - we're accustomed to a world in which everyone understands that requirements are fluid, systems are ill-defined, and humans are fallible. We just get along with our lives in the face of that mantra - Good Enough is good enough.
• Email bounce? Send it again.
• '503 Service Unavailable'? Wait a few and try again.
• The blinky lights go orange? Give it a few minutes, they'll probably go green again.
• Computer all slow and laggy? That's just the way computers are...
• etc.
This is what I call eventual functionality, i.e., it'll eventually function the way you want it to, and that is good enough 😆
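In code, eventual functionality usually looks like a retry loop. Here is a minimal sketch in Python, assuming a transient failure shows up as an exception; `ServiceUnavailable`, `eventually`, and `make_flaky` are made-up names for illustration, not anything from a real library.

```python
import time


class ServiceUnavailable(Exception):
    """Hypothetical stand-in for a transient failure such as an HTTP 503."""


def eventually(operation, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        try:
            return operation()
        except ServiceUnavailable:
            if attempt == attempts - 1:
                raise  # out of retries, so the failure finally surfaces
            # "Wait a few and try again", waiting twice as long each time.
            sleep(base_delay * 2 ** attempt)


def make_flaky(failures):
    """Build a fake service that fails `failures` times, then answers "ok"."""
    remaining = [failures]

    def call():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ServiceUnavailable("503 Service Unavailable")
        return "ok"

    return call


if __name__ == "__main__":
    # Fails twice, then works: good enough, eventually.
    print(eventually(make_flaky(2), base_delay=0.01))
```

Note what the loop quietly assumes: that somebody is around to wait, and that a late answer is still a useful answer. Neither holds when the thing has to work the first time.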
In opposition to this, we have systems that have to work perfectly the first time around. (Hence, "failure is not an option".)
You certainly don't want to hear things like
• "Yeah, the control rods didn't work, we'll fix that in the next iteration of the Reactor", or
• "Hmmm, seems to be a bug, the Mars Rover just turned itself off and we can't turn it on again"
(Note: For a classic writeup on Robust Software, The Mars Rover, and The Erlang Way, read Getting 25 Megalines of Code to Behave by Jesper Andersen...)
Most people do not understand this difference, or worse, are not even aware that it exists. Heck, they still think that they can build Robust systems by, well, MOAR TESTING, FOLLOW THE RULES (or some such other idiocy).
Multiple redundant systems, designed and developed by different groups, with obsessive levels of fault-tolerance, loose coupling everywhere - these aren't just "nice to haves" or "good ideas", they are impossible to do without.
(Note: A lot of this might seem old hat to Erlang types, and that should only serve to hammer home the point...)
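To make the "multiple redundant systems, developed by different groups" idea a bit more concrete, here is a rough sketch of majority voting across independent implementations, again in Python. Everything in it (`vote`, `team_a`, `team_b`, `team_c`) is hypothetical and only meant to show the shape of the idea; it is nowhere near what a real safety-critical system would need.

```python
from collections import Counter


def vote(implementations, *args):
    """Run every implementation and return the answer a strict majority gives.

    Assumes the answers are hashable and directly comparable.
    """
    results = []
    for impl in implementations:
        try:
            results.append(impl(*args))
        except Exception:
            # A crashed implementation simply loses its vote.
            continue
    if not results:
        raise RuntimeError("all implementations failed")
    answer, count = Counter(results).most_common(1)[0]
    # Majority is measured against ALL implementations, crashed ones included.
    if count * 2 <= len(implementations):
        raise RuntimeError("no majority among implementations")
    return answer


# Three hypothetical, independently written versions of "square a number".
def team_a(x):
    return x * x


def team_b(x):
    return x ** 2


def team_c(x):
    return x + x  # buggy: only agrees with the others when x is 0 or 2


if __name__ == "__main__":
    print(vote([team_a, team_b, team_c], 3))  # the buggy version is outvoted
```

The point of having different groups write each version is that they are less likely to share the same bug. If all three were copies of the same code, the vote would buy you nothing.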
Things just get worse from there - we are all inherently prone to Survivorship Bias, where we look at successful large-scale IT projects and say "Hey! This can be done! I can do it too!". People don't realize that for every successful IT project, there are a whole bunch that failed, sometimes catastrophically. (Metrics vary, but pretty much everybody agrees that somewhere from 65% to 80% of IT projects fail. And that doesn't include the ones that don't quite work the way they are supposed to!)
And it's not just IT projects - 75% of all startups fail. So, those massively scalable cloud thingies out there - Pinterest, Instagram, etc. - are just the ones that succeeded, with the iceberg of failure having taken out all the others (for many many reasons, but trust me, "not working the way it should" is up there...)
To recap,
- Most people don't know from Robust Systems
- Most people really, really don't grok Scalability.
Just because you worked at Facebook, and/or read High Scalability Dot Com, does not, repeat NOT, mean that you are now capable of building the next massively scalable system, let alone a system that Must Work Perfectly The First Time Out.
Romney was not the first person to discover the above, and he certainly will not be the last...
(Note: For an excellent analysis of the meltdown, check out Ars Technica's coverage...)