Latency, Outliers, and Percentiles

The good news around metrics is that it is increasingly easy to instrument and monitor your code, with plenty of usable software out there like Prometheus, TICK, OpenTracing, and whatnot.
The bad news is, well, pretty much the same — it’s so easy to instrument and monitor your code that the sheer amount of stuff you can capture is mind-boggling. It is so easy to get caught up in all the charts and graphs that you will find yourself constantly losing sight of the forest for the trees. All that data ticking around on the screen can, counter-intuitively, make it harder to actually get any information (let alone knowledge!) from the metrics.
Latency
Latency is a particularly good example of this. Most everyone out there measures this as an average, and while that’s better than nothing, it can easily hide issues. The fault, as always, lies in us — in that we tend to think about our systems in terms of steady-state performance, with statements like “on average, …” being pretty much par for the course.
Consider outliers, which almost always end up getting masked in the average. With latencies, that occasional 4 second response will get completely swamped by the thousands of sub-millisecond responses. Now, do you care about this?
• If you do (and if you don’t, are you sure you don’t?) then simply looking at the averages tells you nothing about these outliers.
• And, if you don’t (I say again, are you sure?) then you still need to make sure that your outliers aren’t skewing your averages. And the only way you know that is by measuring them…
The bottom line is, you need to actually know what your outliers are, regardless of whether you care about them or not. Always remember, “Averages Hide Outliers, Outliers Skew Averages”.
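To make that concrete, here is a toy illustration (the numbers are mine, not from any real system): one 4-second outlier among ten thousand sub-millisecond responses is invisible in the average — and yet it still skews it.

```python
# Toy data: 10,000 sub-millisecond responses plus a single 4-second outlier.
from statistics import mean

latencies = [0.0008] * 10_000 + [4.0]  # seconds; made-up numbers

avg = mean(latencies)
worst = max(latencies)

# The average creeps from 0.8 ms to ~1.2 ms: the outlier is hidden,
# but it has still skewed the number you are staring at.
print(f"average: {avg * 1000:.2f} ms")
print(f"worst:   {worst:.1f} s")
```

Nothing in that average tells you a 4-second response ever happened — which is exactly the “Averages Hide Outliers, Outliers Skew Averages” point.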
Percentiles
When you level-up on your system design, you move beyond low average latencies, and start thinking about low and/or predictable latencies at the tail end of the distribution.
Why? Because issues here — which hitherto were just “outliers” — are the ones that will cause you the most trouble.
Happy people rarely compliment you, but unhappy ones tend to complain long and loud (squeaky wheels, etc.). They will be the ones that overwhelm your customer support operations, post negative reviews about you, tell their friends not to use you, etc. etc. Keeping them happy can dramatically simplify your life, reduce your customer support costs, increase your reputation in the biz., etc. etc.
Mind you, the way you keep your customers happy is by making sure that your systems don’t barf at the tail-end of your latency distributions. Current trends in system-design — micro-services / lambdas / … — have dramatically changed the impact of latencies. When a single “big” task spawns hundreds/thousands of individual services, each of which is chatty and exchanging messages — well, in this world, the low probability outlier is now a high probability event for the “big” task, and is also likely to swamp your performance.
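A quick back-of-the-envelope calculation shows why (the 1% tail probability here is my assumption, purely for illustration): if each backend call independently has a 1% chance of landing in the slow tail, the odds that a fanned-out task hits that tail at least once grow rapidly with fan-out.

```python
# P(at least one slow call) = 1 - P(every call is fast) = 1 - (1 - p)^n
p_tail = 0.01  # assumed probability that any single call is slow

for n in (1, 10, 100, 1000):
    p_any_slow = 1 - (1 - p_tail) ** n
    print(f"fan-out {n:>4}: P(at least one slow call) = {p_any_slow:.1%}")
```

At a fan-out of 100, the “1-in-100 outlier” hits roughly 63% of your “big” tasks — the rare event has become the common case.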
Note that the naive approach to dealing with the “behind-the-scenes” issue is to Hedge Your Bets by throwing the request at multiple backends. This can actually end up being a seriously bad idea, especially if your long-tail latencies are application induced! The additional traffic, along with the extra scatter/gather involved in orchestrating the “big” task, can easily saturate other resources and create emergent issues.
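The hedging idea can be sketched roughly as follows (a minimal asyncio sketch; `backend()`, the timings, and the hedge delay are all invented for illustration). Note how the slow path issues a second request — which is precisely the extra traffic that can bite you:

```python
# Minimal request-hedging sketch: if the first attempt is slow,
# fire a second one and take whichever finishes first.
import asyncio
import random

async def backend(request_id: int) -> str:
    # Simulated backend: mostly fast, occasionally in the slow tail.
    delay = 4.0 if random.random() < 0.01 else 0.001
    await asyncio.sleep(delay)
    return f"response-{request_id}"

async def hedged_request(request_id: int, hedge_after: float = 0.01) -> str:
    first = asyncio.ensure_future(backend(request_id))
    done, _ = await asyncio.wait({first}, timeout=hedge_after)
    if done:
        return first.result()
    # First attempt is slow: hedge with a second attempt, take the winner.
    # (This is the added load that can saturate an already-struggling backend.)
    second = asyncio.ensure_future(backend(request_id))
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    return done.pop().result()

print(asyncio.run(hedged_request(42)))
```

If the tail is caused by the application itself (say, lock contention or GC pauses), the hedged copy is just as likely to stall — you have doubled the load for nothing.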
The point is that these outliers exist for a reason — and without understanding the root-cause for these outliers, you’re not easily going to be able to deal with them.
Which brings us right back to metrics. And not just metrics, but the ones that allow you to capture latencies at the tail end of your distribution — say, the 95th and 99th percentiles (or whatever makes sense for your specific case). You are capturing, and using, these, right?
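If you are sitting on raw latency samples, pulling out p95/p99 takes nothing but the standard library (the sample data below is a deterministic toy distribution I made up — 9,900 one-millisecond responses plus a 1% tail of 4-second ones):

```python
# Extract p95/p99 cut points from raw latency samples.
from statistics import quantiles

samples = [0.001] * 9_900 + [4.0] * 100  # toy data: 1% slow tail

cuts = quantiles(samples, n=100)  # cut points p1 .. p99
p95, p99 = cuts[94], cuts[98]

print(f"p95 = {p95:.3f} s")  # still looks perfectly healthy
print(f"p99 = {p99:.3f} s")  # the tail shows up here
```

The average of this data set would look almost as healthy as the p95 — only the high percentiles reveal that 1 in 100 of your users is waiting four seconds.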
Note: For more on this, read Richard Hsu’s “Who moved my 99th percentile latency?”.
