The Seven Rules of Highly Successful (Cloud) Services
So you've got a cloud-based service. Good for you! I bet it's brilliant, is brilliantly fault-tolerant 'cos of Erlang/OTP, and is going to make you Pots o' Money! That is, of course, assuming that you have prepared yourself for the inevitable "Oh S**t!" moment, when It Goes Down. Herewith a few pointers to help you prepare for everything not working exactly the way it is supposed to (Servers die? Really?)…
1) Make sure that you understand your entire system. Done? Good, now make sure that all of your developers understand the entire system. Specialization might be good (and don't get me wrong, weed-whackers are *very* useful little tools, especially when you get into the weeds :-) ), but if they don't grok the big picture, well, you're hosed. Capturing vast quantities of scoring data could be really useful for GUI updates, and might be easy to implement from an API perspective, but if your persistence layer can't handle the load, well, you're hosed.
Speaking of the developers, be sure to
2) Involve the developers - contrary to what you may think, their aim in life is *not* to bring your servers down, and they *don't* think Quality is a four-letter word (Really). In fact, they built the service that is currently paying your salary, and they probably know more about failure modes, choke points, and load-related issues than they are letting on :-)
That said, you may run into perfectionists, who will actually over-engineer the system, so do try to remember that
3) Good enough is good enough. Really. I *know* you want to package up your goods as soon as they are ready, but the FedEx guy is only going to show up at 8pm, so you might just want to do it once an hour, eh? Seriously, there are many, many cases where your business processes (or heck, your GUI!) will more than cover for a problem. For example, "We are still processing this data" is a better message to display than "NullPointerException", especially if the service is back up in a few minutes. You know, and I know, that it was a null pointer exception, but unless it causes actual pain and suffering to the client, It Just Doesn't Matter.
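To make the "better message" idea concrete, here is a minimal sketch in Python (not the stack mentioned above, just something short to read). The names `friendly`, `latest_scores`, and the shape of the response dict are all made up for illustration; the only thing taken from the rule itself is the message text. The idea: log the ugly details for the developers, and hand the client something calm.

```python
import functools
import logging

logger = logging.getLogger("service")  # hypothetical logger name

FRIENDLY_MESSAGE = "We are still processing this data"


def friendly(handler):
    """Wrap a handler so unexpected errors are logged, not shown to the client."""
    @functools.wraps(handler)
    def wrapper(*args, **kwargs):
        try:
            return {"ok": True, "body": handler(*args, **kwargs)}
        except Exception:
            # The full traceback goes to the logs, where the developers can see it.
            logger.exception("handler %s failed", handler.__name__)
            # The client only ever sees the friendly message.
            return {"ok": False, "body": FRIENDLY_MESSAGE}
    return wrapper


@friendly
def latest_scores(store, user_id):
    # Hypothetical lookup; blows up if the store is missing or the user is unknown.
    return store[user_id]["scores"]
```

Calling `latest_scores(None, "u1")` returns the friendly message instead of a stack trace, while a healthy store returns the scores as usual. Whether a blanket `except Exception` is acceptable depends on your service; this is a sketch, not a recommendation to swallow everything.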
Then again, sometimes it's not your fault; your cloud provider is causing you agita.
4) Just Make Do. Seriously, sometimes, you just have to make do. Everybody knows (or should know!) about Amazon's issues w/ EBS, but really, everybody has issues of some kind or the other. In the end, sometimes, you just have to Make A Decision, and odds are that it is just not going to be cost-effective to split your service across multiple cloud providers. If that's you (and it probably is), then suck it up, and engineer your system to take the deficiencies into account.
Which leads right into
5) Don't forget the lowest common denominator in the system. If you're trying to build a system with end-to-end reliability, and your transport is the Internet, you have some serious designing to do! Depending on the specifics of your service, you may have to do anything from tweaking your GUI to caching data, modifying your application logic, etc. It just depends.
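As one illustration of the "caching data" option, here is a rough Python sketch of a fetcher that retries a flaky transport a few times and then falls back to the last value it saw. Everything here (`CachedFetcher`, the `fetch` callable, the "fresh"/"stale" labels, the retry counts) is an assumption for the sake of the example, not a description of any particular service.

```python
import time


class CachedFetcher:
    """Retry a flaky fetch, then fall back to the last known good value."""

    def __init__(self, fetch, retries=3, delay=0.0):
        self.fetch = fetch      # hypothetical callable: key -> value, may raise OSError
        self.retries = retries  # how many attempts before giving up
        self.delay = delay      # base pause between attempts, in seconds
        self.cache = {}         # last good value per key

    def get(self, key):
        for attempt in range(self.retries):
            try:
                value = self.fetch(key)
                self.cache[key] = value
                return value, "fresh"
            except OSError:
                # ConnectionError and TimeoutError are both subclasses of OSError.
                time.sleep(self.delay * (attempt + 1))
        if key in self.cache:
            # The network is down, but we can still show something.
            return self.cache[key], "stale"
        raise LookupError(f"no fresh or cached value for {key!r}")
```

The "stale" label is there so the GUI can say so ("last updated a while ago") rather than pretending everything is fine. How stale is too stale is a business decision, not a coding one.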
Mind you, it's not just the Tubes that might go down.
6) Assume that something you control is going to die - you're not going to be wrong, because it probably will, and at the single most inconvenient possible time at that. If you really, really want to have that Christmas Eve dinner with the family while off the grid in Idaho, guess when your primary database server is going to screw the pooch? To that end, build your service so that you are not reliant on this one server (or any single point of failure, for that matter). In fact, train up your Chaos Monkey to randomly disable servers, switches, routers, and come to think of it, randomly kill processes on your servers at that. You'll be thankful…
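A home-grown monkey doesn't have to be fancy. Here is a toy Python sketch of the idea: given a set of things that can be disabled, occasionally pick one at random and disable it. The function name `unleash_monkey`, the `targets` mapping, and the `probability` knob are all invented for this example; the real disabling (stopping an instance, killing a process) is whatever callable you plug in, and this sketch deliberately doesn't touch any real infrastructure.

```python
import random


def unleash_monkey(targets, probability=0.5, rng=None):
    """Maybe disable one randomly chosen target.

    targets: mapping of name -> zero-argument callable that disables that
             component (hypothetical; supply your own).
    Returns the name of the victim, or None if the monkey stayed home.
    """
    rng = rng or random.Random()
    if not targets or rng.random() >= probability:
        return None
    victim = rng.choice(sorted(targets))  # sorted for a stable candidate order
    targets[victim]()                     # do the actual damage
    return victim
```

Run it on a schedule during working hours, when people are around to watch what breaks, and start with a low `probability` until you trust your failover.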
7) Have a process. Regardless of which of the above you do, make sure that you have a process that you follow for every Service Incident. For reallies. Test it. Do a few dry-runs. Tweak it so that it seems to work. Then, after every service incident (trust me, you'll have one), go back and take a look at what else needs to be tweaked in the process - Did you forget to tweet updates? What about telling your clients that they can use the backup service? etc. Remember, you are never going to get it Just Right. What you are trying to do is get it close enough. And for heaven's sake, once you've got the process, write it down! Make a list!
The bottom line? It doesn't matter how perfect your service is; this is most certainly an imperfect world - and you have to live in it...


