Test Coverage Applies To *All* Your Code!

Way back in 2014, Yuan et al. conducted an analysis of production failures that brought down distributed systems (the paper is "Simple Testing Can Prevent Most Critical Failures"). They went through 198 randomly sampled failures in well-known distributed systems like Cassandra, Hadoop, and whatnot, and, well, the results were remarkably depressing.
Basically, they found that for most of the catastrophic failures — where the entire system went down — the root cause was ridiculously simple.
How simple, you ask? Well, it turns out that around 90% of them boiled down to non-fatal errors that weren’t handled correctly.
Yeah, I’ll let that sink in for a bit. Bad error handling. That’s it.
Ok, now that it’s sunk in, let’s make it worse. 35% of the catastrophic failures boiled down to one of the following three scenarios:
  1. The error handler catches an overly general exception (e.g. an Exception or Throwable in Java), and then shuts the whole damn thing down. BOOM.
  2. Or, better yet, the error handler actually had tells like // FIXME or // TODO in its comments. (Yeah, you really should scan for these before release!)
  3. Best of all, the error handler was empty! Or just a simple log statement!
You really have to love this, y’know?
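If you want to hunt for those three tells in your own code, here's a rough sketch of what a scanner could look like. It's in Python rather than the Java the paper looked at, and everything in it (`audit_handlers`, `BROAD`, `MARKERS`, the `SAMPLE` snippet, the list of logger names) is made up for illustration. It's also cruder than the paper's first scenario: it flags any overly general catch, not just the ones that go on to kill the process.

```python
import ast

# Assumptions for this sketch: which exception types count as "overly general",
# which comment markers count as tells, and which names look like loggers.
BROAD = {"Exception", "BaseException"}
MARKERS = ("TODO", "FIXME")
LOGGER_NAMES = {"log", "logger", "logging"}


def _is_log_call(stmt):
    """True for a bare call such as logging.error(...), log.warning(...) or print(...)."""
    if not (isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Call)):
        return False
    func = stmt.value.func
    if isinstance(func, ast.Name):
        return func.id == "print"
    if isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
        return func.value.id in LOGGER_NAMES
    return False


def audit_handlers(source):
    """Return sorted (line, problem) pairs for suspicious except blocks in `source`."""
    lines = source.splitlines()
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        # Tell 1: bare `except:` or an overly general exception type.
        if node.type is None or (
            isinstance(node.type, ast.Name) and node.type.id in BROAD
        ):
            findings.append((node.lineno, "overly general exception"))
        # Tell 2: TODO/FIXME anywhere in the handler's source lines.
        handler_text = "\n".join(lines[node.lineno - 1:node.end_lineno])
        if any(marker in handler_text for marker in MARKERS):
            findings.append((node.lineno, "TODO/FIXME in handler"))
        # Tell 3: the handler body is empty (`pass`) or does nothing but log.
        if all(isinstance(s, ast.Pass) or _is_log_call(s) for s in node.body):
            findings.append((node.lineno, "empty or log-only handler"))
    return sorted(findings)


# Hypothetical code under review; it is only parsed, never executed.
SAMPLE = '''
import logging

def replicate(block):
    try:
        send(block)
    except Exception:
        # TODO: figure out what to do here
        pass

def flush(buffer):
    try:
        write(buffer)
    except OSError as err:
        logging.error("flush failed: %s", err)
'''

if __name__ == "__main__":
    for line, problem in audit_handlers(SAMPLE):
        print(f"line {line}: {problem}")
```

A check like this is cheap enough to run before every release, though it will happily flag handlers that are fine, so treat the output as a reading list rather than a verdict.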
By the way, it turned out that another 23% of the catastrophic failures, while specific to the system being tested, were actually really easy to root out. And not in a “Well, now that you know where the bug is” kind of way. Remember, we’re still talking about error handling! It turns out that if there had been just decent code coverage on the error handling, the tests would have triggered the bug!
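So what does "coverage on the error handling" look like in practice? Mostly it means forcing the error to happen in a test so that the handler actually runs. Here's a minimal sketch in Python, assuming a made-up `write_with_fallback` function and using `unittest.mock` to inject the failure; none of these names come from the paper.

```python
from unittest import mock


def write_with_fallback(record, primary, spare):
    """Hypothetical writer: try the primary store, fall back to the spare on I/O errors."""
    try:
        primary.append(record)
        return "primary"
    except OSError:
        # This branch rarely runs in production, which is exactly why
        # a test has to run it on purpose.
        spare.append(record)
        return "spare"


def test_primary_failure_falls_back_to_spare():
    # Inject the non-fatal error: every append() on the primary raises OSError.
    primary = mock.Mock()
    primary.append.side_effect = OSError("disk full")
    spare = []

    assert write_with_fallback("r1", primary, spare) == "spare"
    assert spare == ["r1"]
    primary.append.assert_called_once_with("r1")


if __name__ == "__main__":
    test_primary_failure_falls_back_to_spare()
    print("error path exercised")
```

If the handler had a typo in it, or quietly dropped the record, this test would blow up the first time it ran. A coverage tool (coverage.py, for instance) can then tell you which `except` blocks no test ever reached.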
Anyhow, go read the rest of the paper, it’s a fun read. But while reading, remember that around 58% of the catastrophic failures (the 35% and the 23% above) were due to poor (or no!) error handling in the code.
Just remember, boys and girls: test coverage applies to all your code!
