All The Post-Mortems
via http://www.bugbash.net/comic/113.html
There has been a ton of stuff written about Post-Mortems (•), and I’m not going to bother recapping the field. That said, the things to keep in mind, and internalize, are:
- 1. This is about fault-tolerance.
  Your mantra during the process should be “How can we prevent this from happening again?”
- 2. S**t will happen.
  Failures aren’t “if”, they are “when”. So, keep asking yourself, “When it happens again, how do we recover?”
- 3. If you end up with human error as the root cause, you didn’t do it correctly.
  Mind you, “it” could be the post-mortem, or the system, because if you rely on humans being infallible, well, you’ve got a surprise in your future (and an unpleasant one at that).
- 4. It is not about blame.
  Seriously. I know we all say that, but this really is not about blame. Go back and read -1- through -3- above, with emphasis this time.
The best way to get a handle on post-mortems (and, for that matter, your system’s failure modes!) is to go through a bunch of them. And, on that note, danluu is doing yeoman’s work in collecting and categorizing post-mortems. They’re broadly categorized by Config Errors, Hardware/Power Failures, Conflicts, and Time. There is also a catch-all bucket for uncategorized stuff, and the beginnings of a section on analysis.
They’re all across the board, and across all companies, and make for fun reading (if only from a “There, but for the grace of god, go I” perspective). Much more importantly, if you’ve got one to add, send in a PR!
Anyhow, it’s good stuff, go check the list out!
(•) For a good start, go check out this writeup at ServerDensity.