The fault is not in our stars

Netsplits are rare, so I don’t think about them
 — #CowboyDeveloper
The thing about the above statement is that even if you aren’t a #CowboyDeveloper, it’s not necessarily wrong. That’s for a given value of “rare”, mind you, and you ignore it at your own risk.
The question you should ask yourself before making the above statement is “What is the risk associated with a netsplit?”, and the answer to that should inform your engineering decisions (•).
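One way to make that question concrete is the classic likelihood-times-impact calculation. Here's a minimal sketch with entirely made-up numbers (the probabilities and dollar figures below are hypothetical, for illustration only):

```python
# Sketch: the risk question as likelihood x impact, with made-up numbers.

def expected_annual_loss(p_incident_per_year, cost_per_incident):
    """Classic risk formula: expected loss = likelihood * impact."""
    return p_incident_per_year * cost_per_incident

# Hypothetical: a netsplit once every 5 years, costing $50k to clean up.
loss = expected_annual_loss(0.2, 50_000)
print(loss)  # 10000.0 per year
```

If mitigating the netsplit costs you less than that per year, the #CowboyDeveloper is wrong; if it costs you more, maybe not. The point is that you actually ran the numbers.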
And that brings us to the main point here — this is not just engineering decisions about network partitions — it’s all your engineering decisions!
So great — you assess the risks, plan / design accordingly, and it’s all copacetic, right?
Well, no, and that’s because “assess the risks” is carrying a lot of water. The issue here is that the people implementing the systems are, well, human, and as humans, they are quite likely to end up with some combination of
  1. “If I don’t know about it, it doesn’t exist”. Quite common at a junior level, since they literally don’t know what they don’t know. Surprisingly common at a senior level too, especially in #TechBro / #CowboyDeveloper culture.
  2. “I don’t want to report this issue because…reasons?”. aka Being Human, with the reasons being pride / fear / sloth / vanity / shame / …
  3. “Meh, it’s a once-in-a-lifetime thing”. This is, sadly, the worst of the lot. It doesn’t matter how many times we measure before we cut; the actual carpenter is quite likely to take shortcuts based on their own work patterns. In short, they superimpose their assessment of risk-profiles on yours.
Put it all together, and you end up in a situation where Shit Will Happen. What’s worse is that when it happens, you’ll have known the issue all along, and could have dealt with it fairly early.
There is a nifty NASA study that goes into real-world examples of faults: expensive faults, with serious consequences (plane crashes, fleets being grounded, etc.). These include bad-craziness like
• Self-inflicted shrapnel taking out a triply-redundant data-cage
• Different parts of the same metal frame being out of phase with each other (eddy currents)
• Software that spontaneously “evaporated”
• A cooling system that turned itself into an air-conditioner
The not-so-funny bit is that every single one of these involved known issues with, and this is key, known solutions. They ended up happening because, well, we’re human.
Being Agile and Fault Tolerant helps, but only if you are in the lucky enough position of not having dire consequences in case of failure. OTOH, if you are providing a life-line service of some kind, then you need to incorporate risk analysis everywhere.
And that includes the humans involved. The “soft risks” associated with Being Human are the hardest ones to measure; by contrast, the risks associated with your project design are a breeze. We, naturally, flock to the latter, evaluate and design the heck out of the system, and, in passing, wave at the human factors.
Don’t shirk these risks. Think about them. Spend as much time as you can on all aspects thereof — you may not get it right, but even a half-assed take on it is better than doing nothing!
(•) Do note that you really need to think about the question, since network partitions come in all forms, from a simple two-way split to something horrific where your 4 nodes end up in a 1–1–1–1 split.
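To see why the 1–1–1–1 case is horrific, consider a system that uses majority quorum (a common approach; the function below is my own illustrative sketch, not from any particular library): a partition may only make progress if it holds a strict majority of all nodes.

```python
# Sketch: which partitions survive under majority quorum?
# A partition may act only if it holds a strict majority of all nodes.

def live_partitions(partition_sizes, total_nodes):
    """Return the sizes of partitions that hold a strict majority."""
    return [size for size in partition_sizes if size > total_nodes / 2]

# A simple two-way split of 4 nodes: the 3-node side keeps quorum.
print(live_partitions([3, 1], 4))        # [3]

# An even 2-2 split: no side has a majority.
print(live_partitions([2, 2], 4))        # []

# The horrific 1-1-1-1 split: nobody can make progress at all.
print(live_partitions([1, 1, 1, 1], 4))  # []
```

Note that in the 2–2 and 1–1–1–1 cases your cluster is fully stalled, which is exactly the kind of outcome your risk analysis needs to price in.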
