Posts

Showing posts with the label Reliability

DevOps is *not* The Hero’s Journey

“Ascribe meaning to random events” ← this pretty much sums up the human condition. We hear ghosts in White Noise, see patterns in random numbers, and a whole bunch more. There’s even a term for it — apophenia — “the tendency to mistakenly perceive connections and meaning between unrelated things.” The tricky part here, though, is that this is part of the human condition. It’s so deeply embedded in our fibre that we don’t even notice it happening, no matter how rational we are. Take The Hero’s Journey — the saga is so endemic in popular culture (Bill & Ted, Groundhog Day, Harry Potter, Lord of the Rings, Matrix, Star Wars, I could go on forever…) that we tend to find it surprising when the narrative doesn’t fit the mold. The problem, of course, being that we expect our lives to follow this narrative too, while the universe just doesn’t give a s**t about our expectations. I see this, in particular, all t...

What Price Resilience

So you’re going to revolutionize ice-cream delivery, through the powers of your iOS/Android app. (Work with me here. Pretend that you’ve actually invented a better mouse-trap.) Your back-end runs on an AWS/EC2 instance, which is fine while you’re developing, but as you start thinking about real live customers, reliability comes to mind, and you split the back-end up so that it’s running on TWO AWS/EC2 instances. Great, right? If one goes down, the other takes over the load, and everything is hunky dory (as long as the combined load is less than the capacity of any one of the servers, but let’s ignore that for now). Except, what if AWS itself goes down? As June 29, 2012 taught a lot of us, the unimaginable does happen every now and then. The solution, of course, is to deploy your instances across multiple availability zones, or maybe even multiple regions, or heck, while you’re at it, across multiple cloud providers....
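Purely by way of illustration, here’s a minimal boto3 sketch of the “spread it across availability zones” step. The AMI ID, instance type, zone names, and tags are placeholder assumptions, not a claim about how the actual back-end was (or should be) deployed.

```python
# A minimal sketch: don't put both instances in the same blast radius.
# The AMI ID, instance type, and zone names are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ZONES = ["us-east-1a", "us-east-1b"]  # two availability zones, not one

for zone in ZONES:
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder AMI
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "role", "Value": "icecream-backend"}],
        }],
    )
```

Going multi-region or multi-provider is the same idea with more moving parts (separate endpoints, credentials, and some form of DNS or load-balancer failover in front), which is exactly where the price question in the title starts to bite.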

Resilience and Restaurants

A friend of mine has a restaurant in the East Village (Grape & Grain — you should go!). It’s a small thing, seating maybe two dozen people. The fascinating part is how the chef — Adam Rule — generates consistently great dishes despite his “kitchen” consisting of 3 induction plates and a convection oven, all tucked into a corner. I’ve seen him operating when
• A group books the entire restaurant for dinner, and pre-selects the menu (“Could you serve everyone clams for starters, then pasta, the sea-bass, and pavlova for dessert?”)
• There are never more than a dozen or so in the restaurant, but as one couple leaves another shows up. This from 6pm to midnight
• A dozen people suddenly show up, and proceed to order everything off the menu.
And in all these cases, the food always shows up like clockwork, the entire party gets their pasta at the same time, newcomers don’t sit around waiting, etc. You get the point, the service is exactly what you would expect at...

Test Coverage Applies To *All* Your Code!

Way back in 2014, Yuan et al. conducted an analysis of production failures that brought down distributed systems (•). They went through 198 random failures in well-known distributed systems like Cassandra, Hadoop, and whatnot, and, well, the results were remarkably depressing. Basically, they found that for most of the catastrophic failures — where the entire system went down — the root cause was ridiculously simple. How simple, you ask? Well, it turns out that around 90% of them boiled down to non-fatal errors that weren’t handled correctly. Yeah, I’ll let that sink in for a bit. Bad error handling. That’s it. Ok, now that it’s sunk in, let’s make it worse. 35% of the catastrophic failures boiled down to one of the following three scenarios
1. The error-handler catches an overly general exception (e.g. an Exception or Throwable in Java), and then shuts the whole damn thing down. BOOM.
2. Or, better yet, the error-handler a...
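For the flavor of scenario 1, here’s a minimal sketch in Python (the study’s examples are Java, but the anti-pattern translates directly). The process function and handler names are made up for illustration.

```python
import logging
import sys

log = logging.getLogger("worker")

def process(request):
    """Stand-in for real business logic (hypothetical)."""
    if not request:
        raise ValueError("empty request")

# The anti-pattern: a catch-all handler that turns a recoverable hiccup
# into a full shutdown.
def handle_request_badly(request):
    try:
        process(request)
    except Exception:  # overly general catch...
        sys.exit(1)    # ...and the whole process goes down. BOOM.

# The boring-but-correct version: handle the errors you expect, log and
# degrade for the rest, and save the shutdown for genuinely fatal conditions.
def handle_request(request):
    try:
        process(request)
    except TimeoutError:
        log.warning("backend timed out, will retry later: %r", request)
    except ValueError:
        log.error("malformed request dropped: %r", request)
```

The point isn’t the specific exceptions, it’s that the catch-all-then-die handler is exactly the kind of trivially simple error handling the study found behind so many catastrophic failures.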

Life on the Edge

So yeah, this was years ago, before AWS was a thing. We provided business phone services (•), and had all our servers at a hosting facility (With cages. Where we had our own servers. Remember, this was before AWS!). This was a real hosting facility, with batteries, and a generator to back the whole thing up. And we were a real phone company, with tens of thousands of customers. All good fun — for a given definition of fun, mind you. Until the one day when there was a massive snow-storm, and a huge percent of D.C. lost power. Including our hosting facility. Which was fine, because the batteries took over immediately. And the generator kicked in, because the batteries were only good for, like, a minute or two of power. And after two minutes, all our servers crashed. Because, as it turned out, during routine maintenance, some dude had disconnected the generator from the circuits, and never reconnected them. The good news was that a decent chunk ...

The fault is not in our stars

“Netsplits are rare, so I don’t think about them” — #CowboyDeveloper. The thing about the above statement is that even if you aren’t a #CowboyDeveloper, it’s not necessarily wrong. That’s for a given value of “rare”, mind you, and you ignore it at your own risk. The question you should ask yourself before making the above statement is “What is the risk associated with a netsplit?”, and the answer to that should inform your engineering decisions (•). And that brings us to the main point here — this is not just about engineering decisions around network partitions — it’s all your engineering decisions! So great — you assess the risks, plan / design accordingly, and it’s all copacetic, right? Well, no, and that’s because “assess the risks” is carrying a lot of water. The issue here is that the people implementing the systems are, well, human, and as humans, they are quite likely to end up with some combination of
1. “If I don’t kn...
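As one deliberately simple illustration of letting that risk assessment drive the design, here’s a hedged Python sketch: if a partition only costs you freshness, a timeout plus a degraded fallback may be all the engineering you need; if it costs you correctness or money, you reach for heavier machinery. The service URL, function name, and fallback value are all made up.

```python
import socket
from urllib.request import urlopen
from urllib.error import URLError

# Hypothetical pricing service; the URL and fallback value are made up.
PRICING_URL = "http://pricing.internal/flavor/vanilla"
FALLBACK_PRICE_CENTS = 499  # last-known-good price, acceptable to serve stale

def current_price_cents(timeout_seconds=0.5):
    """Fetch the live price, but degrade gracefully if the network is split."""
    try:
        with urlopen(PRICING_URL, timeout=timeout_seconds) as response:
            return int(response.read())
    except (URLError, socket.timeout, ValueError):
        # Here a netsplit costs us price freshness, not an outage; that is
        # the trade-off the risk assessment said we could live with.
        return FALLBACK_PRICE_CENTS
```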

Processes That Rely On “People Doing The Right Thing” Will Fail

I know, this is belaboring the obvious, but good gods, this particular brand of obvious so, so needs to be belabored! Why? Because we, as humans, are kinda hardwired into believing that we’re better at things than we are (and when it comes to TechBros, fuggedaboutit — they’re the worst). Oh, we may think that we have our s**t together, but it’s an illusion — it turns out that we can focus on no more than 3 or 4 things at a time. Any more than that, and we have to start resorting to “memory tricks”, things like removing distractions, grouping items, repetition, and so forth, and even then we’re only up to 7 or 8. Coding, for sure, is a place where this applies. There is a reason why your system has easy-to-digest modules, each of which contains wee tiny functions/code-blocks, and all of which are assembled using bog-standard architectural patterns. It’s because, that way, you don’t need to actually keep track of every single thing that’s going on whenever...

Latency, Outliers, and Percentiles

The good news around metrics is that it is increasingly easy to instrument and monitor your code, with plenty of usable software out there like Prometheus, TICK, OpenTracing, and whatnot. The bad news is, well, pretty much the same — it’s so easy to instrument and monitor your code that the sheer amount of stuff that you can capture is mind-boggling. It is so easy to get caught up in all the charts and graphs that you will find yourself constantly losing sight of the forest for the trees. All that data ticking around on the screen can, counter-intuitively, mean that it is actually harder to get any information (let alone knowledge!) from the metrics.
Latency
Latency is a particularly good example of this. Most everyone out there measures this as an average, and while that’s better than nothing, it can easily hide issues. The fault, as always, lies in us — in that we tend to think about our systems in terms of steady-...
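To make the average-versus-percentile point concrete, here’s a small Python sketch with made-up response times: a handful of slow outliers barely nudge the mean, but they dominate the 99th percentile, which is where your unhappiest users live.

```python
import statistics

# Made-up response times in milliseconds: mostly snappy, a few awful outliers.
samples_ms = [12, 11, 13, 12, 14, 11, 12, 13, 12, 11] * 98 + [2500] * 20

def percentile(values, pct):
    """Nearest-rank percentile; good enough for illustration."""
    ordered = sorted(values)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

print(f"mean: {statistics.mean(samples_ms):.1f} ms")  # ~61.9 ms, looks "fine"
print(f"p50:  {percentile(samples_ms, 50)} ms")       # 12 ms
print(f"p99:  {percentile(samples_ms, 99)} ms")       # 2500 ms, the real story
```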