Connectivity Is Everything
Image via http://www.commitstrip.com/en/2012/03/14/whatever-it-takes/
I was supposed to be on vacation. Well, technically I was on vacation, but the point here is that I should have been doing the stuff one does on vacation, like eating (and drinking!) too much, hiking, waking up late, and so on. Instead, what I was actually doing was trying to figure out how to bring our servers back up from the middle of nowhere.
Let’s rewind this story a little, and put some context out there. At the time, I was the CTO of a phone company¹. Not just any phone company, we were one of the CoolAndInnovative™ ones — so cool that I had been named one of the “Top Influencers in VoIP” (which basically meant that people laughed openly at me, instead of behind my back).
Anyhow, we had a bajillion VoIP phones scattered around the country, each of which connected to our data-center over the InterTubes. Our data-center, mind you, as this was before AWS was much of a thing, which meant that we were responsible for the care and feeding of everything, from the network to the servers to the cooling, power supplies, and whatnot. Great fun it was, too.
So vacation rolls around, and I decide to head to Ladakh, a fairly remote part of the world, where the Tibetan plateau makes its way into India. It’s pretty high up (it starts at around 13K feet, and goes up from there), and pretty close to the Pakistani border, which means that there are Indian Army outposts all over the place. The best part, of course, is that there really wasn’t much in the way of network access anywhere there, which meant that I was going to be actually disconnected from work for a week or so.
Mind you, I’d been on vacations before, but never one where I was genuinely disconnected. And yeah, we had all sorts of policies and procedures and backstops around “Hit By A Bus” syndrome, but, well, we’d never actually tested them in extremis. Or more to the point, all the Resiliency Testing that we’d done involved things that we could dream up, which, by definition, is not complete. But we (I!) had enough confidence in our systems and our people that I went ahead on the vaccy.
Fast forward around 7K miles over and 18K ft. up: we’re hanging out at Khardungla Pass, which really is way up there. It’s amazingly beautiful, and (at least back then) amazingly remote. But, through some freak of weather, I get a call on my phone (which was on. Why? Damned if I know) from work saying “Emergency. We’re down.”
In the extremely terse conversation that followed (remember, I had no idea how long the call would stay up!) it turned out that the service had successfully black-swanned. The details aren’t relevant, but it was one of those cases where something had crapped out, and couldn’t be brought back up because the SysAdmin was out with the flu, the backup was out of pocket, the dev lead had accidentally turned his phone off, the ops guy was…, you get the point. One of those genuinely bizarre combinations of events that, well, we just hadn’t predicted. In desperation, they had called me too, and, wonder of wonders, I’d answered.
And, of course, the call dropped right about then, so I couldn’t get back on to walk someone through debugging. Even better, I have no idea what I can do: the hotel is about 2 hours away, and doesn’t have network access anyhow. Hoping against hope that I could find an Internet Cafe or some such someplace in town, we start driving back when, wonder of wonders, I find an open WiFi signal half-way down the mountain!
(I, genuinely, have no clue who or what it was. Mind you, there was an Indian Army outpost nearby, but why on earth would they have an unprotected WiFi AP? Then again, who am I to second-guess network security, or lack thereof for them…)
I pull up `ssh` on my Treo (yes, you really had to have been around back then. It was…weird), VPN in to our datacenter, and, mirabile dictu, manage to get stuff back up and running. Again, with around 5 minutes to spare before the WiFi signal vanishes, which I’m guessing was because somebody was going “WhyTF is this not protected and who is using it shut it down SHUT IT DOWN”.
The fun part to all of this is that, in its own weird way, things actually worked out! Our system had been architected to the point where I managed to track down the problem — an obscure configuration bug — quite rapidly based on all the information at hand (Observability!), deploy a patch and reload the systems (Continuous Deployment!), all this well before the days that these terms were a thing.
Could it all have been done better? Well yes, of course! That said, we were inventing a lot of the tooling around Resiliency, Robustness, CI/CD, Observability, and whatnot in real time back then, and within those parameters, I’d say we did OK.
Mind you, I’d rather not have to go through that again.
- 1. If anybody suggests that you get into the phone business, back away without making contact, and at the first available opportunity, RUN.