Showing posts with the label SRE

We’ve All Been There…

/via http://www.commitstrip.com/en/2013/10/09/la-pire-sensation-du-codeur/

It was a glorious Friday night in Hoboken, back in 2008. OK, this is Hoboken we’re talking about, so it wasn’t all that glorious, but we’d been out for dinner with friends, friends who appreciated wine, and, thanks to the wine (and friends!), it was a glorious night. Our taxi had just pulled up by our house, and everyone was trooping up for MOAR WINE YES LETS SEE WHAT WINES THERE ARE YES, when I got a call from one of the SysAdmins.

Him: “Uh, dude, I think I did something”
Me: “????”
Him: “I may have shut down our primary DB”
Me: “!!!!”
Him: “I shut down the DB on my desktop when I left, but now that I think of it, I may have done it in the wrong window”
Me: “!!!!”
Him: “And I’m on the freeway, which is backed up seven ways to Sunday…”
Me: “!!!!”
Him: “And I don’t think the secondary took over, because Customer Support is getting slammed”
Me: “!...

Connectivity Is Everything

/via http://www.commitstrip.com/en/2012/03/14/whatever-it-takes/

I was supposed to be on vacation. Well, technically I was on vacation, but the point here is that I should have been doing the stuff one does on vacation, like eating (and drinking!) too much, hiking, waking up late, and so on. Instead, what I actually was doing was trying to figure out how to bring our servers back up from where I was, in the middle of nowhere.

Let’s rewind this story a little, and put some context out there. At the time, I was the CTO of a phone company¹. Not just any phone company: we were one of the CoolAndInnovative™ ones, so cool that I had been named one of the “Top Influencers in VoIP” (which basically meant that people laughed openly at me, instead of behind my back). Anyhow, we had a bajillion VoIP phones scattered around the country, each of which connected to our data-center over the InterTubes. Our data-center, mind you, as this was be...

DevOps is *not* The Hero’s Journey

“Ascribe meaning to random events” ← this pretty much sums up the human condition. We hear ghosts in White Noise, see patterns in random numbers, and a whole bunch more. There’s even a term for it, apophenia: “the tendency to mistakenly perceive connections and meaning between unrelated things”.

The tricky part here, though, is that this is part of the human condition. It’s so deeply embedded in our fibre that we don’t even notice it happening, no matter how rational we are. Take The Hero’s Journey: the saga is so endemic in popular culture (Bill & Ted, Groundhog Day, Harry Potter, Lord of the Rings, Matrix, Star Wars, I could go on forever…) that we tend to find it surprising when the narrative doesn’t fit the mold. The problem, of course, is that we expect our lives to follow this narrative too, while the universe just doesn’t give a s**t about our expectations. I see this, in particular, all t...

What Price Resilience

So you’re going to revolutionize ice-cream delivery, through the powers of your iOS/Android app. (Work with me here. Pretend that you’ve actually invented a better mouse-trap.) Your back-end runs on an AWS/EC2 instance, which is fine while you’re developing, but as you start thinking about real live customers, reliability comes to mind, and you split the back-end up so that it’s running on TWO AWS/EC2 instances. Great, right? If one goes down, the other takes over the load, and everything is hunky-dory (as long as the combined load is less than the capacity of any one of the servers, but let’s ignore that for now).

Except, what if AWS itself goes down? As June 29, 2012 taught a lot of us, the unimaginable does happen every now and then. The solution, of course, is to deploy your instances across multiple availability zones, or maybe even multiple regions, or heck, while you’re at it, across multiple cloud providers....
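That parenthetical about combined load is easy to wave away, so here is a minimal sketch of the check in Python. The function name, the units (requests per second), and the numbers are all hypothetical, picked purely for illustration. It assumes the worst case, where the biggest servers are the ones that go down.

```python
def survives_failures(total_load, capacities, failures=1):
    """Return True if the surviving servers can still carry total_load
    after the `failures` largest servers go down (the worst case).

    total_load and capacities share one unit, e.g. requests per second.
    """
    if failures < 0:
        raise ValueError("failures must be non-negative")
    if failures >= len(capacities):
        # Nothing is left standing to take over the load.
        return False
    # Sort ascending and drop the largest `failures` servers.
    surviving = sorted(capacities)[: len(capacities) - failures]
    return total_load <= sum(surviving)


if __name__ == "__main__":
    # Hypothetical numbers: two servers, each good for 100 requests/second.
    print(survives_failures(80, [100, 100]))        # one server can absorb 80
    print(survives_failures(120, [100, 100]))       # one server cannot absorb 120
    print(survives_failures(120, [100, 100, 100]))  # a third server buys headroom
```

With two identical servers, this boils down to never running the pair above 50% combined utilization, which is roughly the price of that first step toward resilience.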

Resilience and Restaurants

A friend of mine has a restaurant in the East Village (Grape & Grain, you should go!). It’s a small thing, seating maybe two dozen people. The fascinating part is how the chef, Adam Rule, generates consistently great dishes despite his “kitchen” consisting of 3 induction plates and a convection oven, all tucked into a corner. I’ve seen him operating when
• A group books the entire restaurant for dinner, and pre-selects the menu (“Could you serve everyone clams for starters, then pasta, the sea bass, and pavlova for dessert?”)
• There are never more than a dozen or so in the restaurant, but as one couple leaves another shows up. This from 6pm to midnight
• A dozen people suddenly show up, and proceed to order everything off the menu.

And in all these cases, the food always shows up like clockwork, the entire party gets their pasta at the same time, newcomers don’t sit around waiting, etc. You get the point, the service is exactly what you would expect at...

Percentiles and Distributions

/via https://flowingdata.com/2017/05/02/summary-stat/

If you don’t pay attention to your data’s distribution, you run the risk of focusing on the wrong thing entirely. The perfect examples here are the datasets generated by Justin Matejka & George Fitzmaurice (•), where, though wildly different, each dataset has the same summary statistics (mean, standard deviation, and Pearson’s correlation) to 2 decimal places!

To really belabor this point, absent knowledge of the actual distribution, you shouldn’t rely on baseline summary statistics. Take averages for example:
1. Outliers can (wildly!) skew your averages. You walk to work 200 days a year, and fly cross-country to the HQ in Los Angeles once a year. What’s the average distance to work?
2. On the other hand, your averages can (totally) hide your outliers. Your DMV processes 290 out of 300 applications in less than 15 minutes, but the other ten take greater than 2 hours....
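The commute and DMV examples above can be sketched in a few lines of Python. The specific values are assumptions made up for illustration (10 and 150 minutes for the DMV, 1 km walks and a 4000 km flight for the commute), and the `percentile` helper is a simple nearest-rank version written for this sketch, not a library function.

```python
from statistics import mean, median


def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100: the smallest
    value that has at least p% of the data at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) in integer math
    return ordered[rank - 1]


# Assumed processing times in minutes (illustrative, not real DMV data):
# 290 applications at 10 minutes each, 10 applications at 150 minutes each.
dmv_minutes = [10] * 290 + [150] * 10

# Assumed commute distances in km: 200 one-km walks plus a single 4000 km flight.
commute_km = [1] * 200 + [4000]

if __name__ == "__main__":
    # The mean hides the slow applications; the high percentiles expose them.
    print(f"DMV mean:   {mean(dmv_minutes):.1f} min")
    print(f"DMV median: {median(dmv_minutes)} min")
    print(f"DMV p95:    {percentile(dmv_minutes, 95)} min")
    print(f"DMV p99:    {percentile(dmv_minutes, 99)} min")
    # One outlier drags the mean far away from the typical commute.
    print(f"Commute mean:   {mean(commute_km):.1f} km")
    print(f"Commute median: {median(commute_km)} km")
```

With these made-up numbers the DMV’s mean stays under 15 minutes while the 99th percentile lands on the 150-minute applications, and the commute’s mean is about twenty times its median.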

It's the Simple Things

/via http://www.commitstrip.com/en/2014/06/03/the-problem-is-not-the-tool-itself/

I actually have a Post-it on my monitor that sez, “Have you thought of the Obvious?” I mean, at the first cry of “OHSHITOHSHITOHSHIT WERE DOWN WHATTF IS GOING ON”, it’s really worth asking yourself some fairly basic stuff like
• Is GitHub Down?
• Are our bills paid up?
• Did an upstream package change?
• Is somebody DDoS-ing us?
and even “Is AWS having problems?”

And yeah, before you go there, we could automate/work around the above stuff. We have, too, up to the point where the intermittent pain just about balances out the effort involved.

Process Is A Good Thing

Image
(Yeah, the above image is about The Landing on the Hudson. We’ll get to that in a bit.)

How many times have you heard people complain about Process, about how it “stifles my creativity”, “it’s always getting in the way” and, “if it weren’t for the process I’d be Getting Things Done”? Don’t get me wrong: I agree that if it’s a bad process, and if it’s implemented poorly, and if the reasons are long since dust in the wind, then, yeah, it’s a Bad Thing. But, then again, there are a lot of “if”s in that statement. And the reality is that most of the time, the thing that annoys people isn’t The Process in general, but the specifics of the process as it applies to their very unique needs.
• Alice doesn’t like having the standup at 8:30am because, well, who is actually sentient at 8:30am?
• Bob is fine with the standup, but doesn’t like having to tag every commit with a ticket no.
• Caro...

AWS, Data Transfer Costs, and Options

I’ve written about Data Transfer Costs in the past, and how they’re probably the premier source of unpleasant surprises when the AWS bill shows up. There are many, many sources for these, but, in my experience, the vast majority of them come down to
• Traffic to public IPs, and forgetting that any traffic to them, even from your own instances, counts
• Static assets on EC2 which you haven’t moved to Cloudflare. Especially when you’ve got videos (ouch!)
• Multi-AZ deployments, and in particular, multi-AZ RDS, where every damn write involves data-transfer costs.

So fine, you know the above, what’re you going to do about it? You could stay on top of it with Cloudability or CloudCheckr or some such, but this really requires you to seriously stay on top of it. And the issue with this is that relying on always doing the right thing to keep your costs down will end up ’sploding in your face, and pr...