Software Resiliency, and Bulkheading

You know what a bulkhead is, right? You know — in ships, where they create watertight compartments so that in case there is a leak, it is isolated to just one part of the ship, instead of the whole thing pulling a Titanic?
Well, pretty much the same applies to software: “bulkheading” means isolating parts of the system so that if one part of it barfs, it doesn’t infect and/or clobber the other parts. It’s one of the core precepts of building fault-tolerant systems, viz., Error Encapsulation.
From a “big-picture” architectural perspective, yes, this does mean all the usual stuff about “loose coupling”, “component-driven architecture”, “microservices”, and pretty much any other buzzword in this vein that you can think of. What matters, though, is how this actually plays out in practice, so let’s look at a specific example.
You’ve got a connection pool serving two sets of back-end services which are (very imaginatively!) called “A” and “B”. Life is good, everything is working well, customers are happy, and the bucks are rolling in. And then…
And then, tragically, something bad happens in “B”, and it craps out. At which point, all the connections to “B” start backing up, which exhausts all the connections in the Pool. The issue here is that “B”’s crapping out also causes the Pool to deny any access to “A”, even though “A” was working perfectly fine.
Mind you, if there were a dependency ‘tween the two (“A should only work if B is working”) then this is a good thing, but if there isn’t, well, the system unnecessarily took down “A”, right?
The solution is somewhat straightforward: you partition the Pool into specific sets for “A” and “B”.
Now, when “B” horks, all that happens is that the “sub-pool” for “B” fills up, leaving “A” happily functioning and accessible.
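
To make that concrete, here is a minimal sketch of such a partitioned pool (in Java, with a Semaphore standing in for the real connection-checkout machinery, and entirely made-up names and sizes). The point is simply that “A” and “B” each draw from their own allotment, so exhausting one can’t starve the other:

    import java.util.Map;
    import java.util.concurrent.Semaphore;
    import java.util.concurrent.TimeUnit;

    // A partitioned ("bulkheaded") pool: each back end gets its own fixed
    // allotment of permits, so exhausting "B" cannot starve "A".
    public class PartitionedPool {

        // Hypothetical sub-pool sizes; tune these for your actual workload.
        private final Map<String, Semaphore> partitions = Map.of(
                "A", new Semaphore(10),
                "B", new Semaphore(10));

        // Check out a slot for the given back end, failing fast instead of
        // blocking forever when that particular partition is exhausted.
        public boolean acquire(String backend, long timeoutMillis) throws InterruptedException {
            Semaphore partition = partitions.get(backend);
            if (partition == null) {
                throw new IllegalArgumentException("Unknown backend: " + backend);
            }
            return partition.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        }

        // Return the slot to its own partition once the call completes.
        public void release(String backend) {
            partitions.get(backend).release();
        }
    }
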
The example here is for a simple connection pool, but it actually applies at multiple levels of your application. For example:
  • Are critical components isolated from non-critical ones?
  • Are your components in different containers?
  • Are these containers on different (virtual) nodes?
  • Are these (virtual) nodes on different (physical) nodes?
  • Are async requests isolated from each other based on source and/or target?
  • Are these requests isolated using queues?
  • Is the queueing subsystem itself fault-tolerant?
The list goes on, but this should give you a good idea.
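
To pick on just one of those, the queue-based isolation of async requests, a minimal sketch (in Java, with made-up targets and capacities) could look something like this:

    import java.util.Map;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Each target gets its own bounded queue, so a stalled consumer only backs
    // up its own queue; producers see back-pressure for that target alone.
    public class IsolatedQueues {

        // Hypothetical targets and capacities, purely for illustration.
        private final Map<String, BlockingQueue<Runnable>> queues = Map.of(
                "A", new ArrayBlockingQueue<>(100),
                "B", new ArrayBlockingQueue<>(100));

        // Enqueue work for a specific target; a 'false' return means that
        // target's queue is full, without affecting requests bound elsewhere.
        public boolean submit(String target, Runnable work) {
            BlockingQueue<Runnable> queue = queues.get(target);
            if (queue == null) {
                throw new IllegalArgumentException("Unknown target: " + target);
            }
            return queue.offer(work);
        }
    }
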
Mind you, there is a certain amount of “good news bad news” here. The good news, of course, is that you do get resiliency. The bad news is that you now have more stuff to instrument and monitor, more complexity in your system, and more emergent behaviors that can lead to strange and infuriating issues.
As with most things, you should start worrying about this when you are leveling up your system, i.e., when fault-tolerance is actually a thing that you care about (and it’s not just a prototype / PoC that you are dealing with!).
For libraries that already implement this, and excellently at that, take a look at Hystrix and Polly.
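
As a rough illustration of how this looks with Hystrix (the command and key names below are made up, and this is a sketch of the typical usage pattern rather than production code), bulkheading falls out of giving each command group its own thread pool:

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;
    import com.netflix.hystrix.HystrixThreadPoolKey;

    // Calls to "B" run on their own thread pool; when "B" goes sideways, only
    // that pool saturates, and commands keyed to other pools keep working.
    public class CallServiceB extends HystrixCommand<String> {

        public CallServiceB() {
            super(Setter
                    .withGroupKey(HystrixCommandGroupKey.Factory.asKey("ServiceB"))
                    .andThreadPoolKey(HystrixThreadPoolKey.Factory.asKey("ServiceB-Pool")));
        }

        @Override
        protected String run() throws Exception {
            // The actual call to "B" would go here.
            return "response from B";
        }

        @Override
        protected String getFallback() {
            // Degrade gracefully when "B"'s pool is exhausted or the call fails.
            return "fallback response";
        }
    }
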
