All The Post-Mortems

- April 02, 2018

/via http://www.bugbash.net/comic/113.html

There has been a ton of stuff written about Post-Mortems (•), and I’m not going to bother recapping the field. That said, the things to keep in mind, and internalize, are:

1. This is about fault-tolerance.
Your mantra during the process should be “How can we prevent this from happening again?”
2. S**t will happen.
Failures aren’t “if”, they are “when”. So, keep asking yourself, “When it happens again, how do we recover?”
3. If you end up with human error as the root cause, you didn’t do it correctly.
Mind you, “it” could be the post-mortem or the system, because if you rely on humans being infallible, well, you’ve got a surprise in your future (and an unpleasant one at that).
4. It is not about blame.
Seriously. I know we all say that, but this really is not about blame. Go back and read 1 through 3 above, with emphasis this time.

The best way to get a handle on post-mortems (and, for that matter, your system’s failure modes!) is to go through a bunch of them. And, on that note, danluu is doing yeoman’s work in collecting and categorizing post-mortems. They’re broadly categorized by Config Errors, Hardware/Power Failures, Conflicts, and Time. There is also a catch-all bucket for uncategorized stuff, and the beginnings of a section on analysis.

They’re all across the board, from all sorts of companies, and make for fun reading (if only from a “There, but for the grace of god, go I” perspective). Much more importantly, if you’ve got one to add, send in a PR!

Anyhow, it’s good stuff, go check the list out!

(•) For a good start, go check out this writeup at ServerDensity.
