Distributed Systems, and Coding for Failure

The world of Distributed Systems is as much about philosophy as it is about implementation — the specific context, architecture, tooling, and developers all come together to make your way different from everybody else’s.
And yet, there are overarching themes and patterns that hold true, that you ignore at your own peril. The most important of these is the necessity to architect for failure: everything springs from this. (°)
Complexity grows geometrically with the number of components, and no amount of fakes, strict contracts, mocks, and the like can reduce this. If anything, poor practices will only increase this complexity.
The key is to accept this, and align the risk associated with a given service/component with the risk the business is willing to take. In short, make the component reliable, but no more reliable than necessary.
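To put a number on "no more reliable than necessary", here is a minimal sketch. It assumes the business states its risk tolerance as an availability target over a 30-day window; that framing, and the name `allowed_downtime_minutes`, are illustrative assumptions rather than anything this post prescribes.

```python
def allowed_downtime_minutes(availability_target: float, window_days: int = 30) -> float:
    """Minutes of downtime a given availability target tolerates over the window.

    Both parameters are illustrative assumptions: the target is a fraction
    (0.999 means "three nines"), and the window defaults to 30 days.
    """
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - availability_target) * window_minutes


# Each extra "nine" shrinks the downtime you are allowed by a factor of ten.
for target in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"target {target}: about {minutes:.1f} minutes of downtime per 30 days")
```

If the business is comfortable with the downtime a lower target implies, then engineering the component to a higher one is effort spent on risk nobody asked you to remove.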
As a developer, what this means is that you must:
  • Understand the operational semantics of the system as a whole
  • Internalize the characteristics of the dependencies, especially for your component (on both sides! Things it depends on, and things that depend on it!)
  • Make Observability your mantra (°°)
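As a rough sketch of what the last two points can look like in code, the wrapper below treats a dependency as something that will fail, and records every outcome so the failure is visible. Everything here is hypothetical: `call_dependency`, the `outcomes` counter, and the `inventory` dependency are illustrative names, and the in-process counter stands in for whatever metrics system you actually use.

```python
import logging
import time
from collections import Counter
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("dependency")

# Hypothetical in-process metrics store; a real system would export these somewhere.
outcomes: Counter = Counter()


def call_dependency(name: str, fn: Callable[[], T], fallback: T, retries: int = 2) -> T:
    """Call fn, assuming it will sometimes fail, and record what happened either way."""
    for attempt in range(1, retries + 2):  # one initial try plus `retries` retries
        started = time.monotonic()
        try:
            result = fn()
        except Exception as exc:  # broad on purpose: any failure of the dependency counts
            elapsed = time.monotonic() - started
            outcomes[(name, "error")] += 1
            log.warning("%s failed on attempt %d after %.3fs: %r", name, attempt, elapsed, exc)
        else:
            outcomes[(name, "ok")] += 1
            return result
    # Every attempt failed: degrade to the fallback instead of taking the caller down.
    outcomes[(name, "fallback")] += 1
    return fallback


def flaky_inventory() -> list:
    """Stand-in for a dependency that is currently down."""
    raise TimeoutError("upstream did not answer")


items = call_dependency("inventory", flaky_inventory, fallback=[])
print(items, dict(outcomes))
```

This sketch retries immediately with no backoff, which is rarely what you want against a struggling dependency. Whether to retry, back off, fall back, or simply crash and let a supervisor restart you depends on the operational semantics of your particular system.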
Live this, learn from this, pass it on…
(°) Yes #Erlang peoples, shades of “Let It Crash” here. #FaultTolerance is the name of the game when it comes to building out distributed systems, and, to date, there is no better environment to do this in than Erlang…
(°°) #Instrumentation =/= #Observability. It’s a subset thereof. There is also debugging, exploring, logging, metrics, and a s**t-ton more. Go Google this…
