Percentiles and Distributions

/via https://flowingdata.com/2017/05/02/summary-stat/
If you don’t pay attention to your data’s distribution, you run the risk of focusing on the wrong thing entirely. The perfect examples here are the datasets generated by Justin Matejka & George Fitzmaurice, where, though wildly different, each dataset has the same summary statistics (mean, standard deviation, and Pearson’s correlation) to 2 decimal places!
To really belabor this point, absent knowledge of the actual distribution, you shouldn’t rely on baseline summary statistics. Take averages for example:
  1. Outliers can — wildly! — skew your averages. You walk to work 200 days a year, and fly cross-country to the HQ in Los Angeles once a year. What’s the average distance to work? (There’s a quick sketch of this after the list.)
  2. On the other hand, your averages can — totally — hide your outliers. Your DMV processes 290 out of 300 applications in less than 15 minutes, but the other ten take greater than 2 hours. How satisfied are those last 10 people?
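To make the commute example concrete, here’s a minimal sketch in Python; the 2 km walk and the 4,000 km flight are invented numbers, purely for illustration:

```python
# A minimal sketch of how a single outlier skews an average.
# The 2 km walk and the 4,000 km flight are made-up numbers.
walking_commutes_km = [2.0] * 200   # ~2 km walk to work, 200 days a year
flight_to_hq_km = [4000.0]          # one cross-country trip to HQ

distances = walking_commutes_km + flight_to_hq_km

mean_km = sum(distances) / len(distances)
median_km = sorted(distances)[len(distances) // 2]

print(f"mean:   {mean_km:.1f} km")    # ~21.9 km -- nothing like a typical day
print(f"median: {median_km:.1f} km")  # 2.0 km   -- the actual typical day
```

One outlier in 201 trips is enough to drag the mean an order of magnitude away from what any given day actually looks like.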
I could go on about this, but I’ve already done so before — go read “Latency, Outliers, and Percentiles”; it shouldn’t take more than a few minutes. The TL;DR from it is that
Happy people rarely compliment you, but unhappy ones tend to complain long and loud (squeaky wheels, etc.). They will be the ones that overwhelm your customer support operations, post negative reviews about you, tell their friends to not use you, etc. etc. Keeping them happy can dramatically simplify your life, reduce your customer support costs, increase your reputation in the biz., etc.
…the way you keep your customers happy is by making sure that your systems don’t barf at the tail-end of your latency distributions.
So what’s the problem here?
Well, the system that you’re using to capture your metrics, and take a look at those all-important percentiles — does it store the original metrics? Or is it aggregating the data right off the bat? The thing is, you’re probably storing the raw data early on, but data can get expensive (yes, even data!), and the odds are that after some period — days, weeks, whatever — you end up processing/aggregating it at a couple of different resolutions.
Which is all well and good, right up to the point when you want to zoom in or out on your percentiles (go from 99.5% to 99.9%, calculate latencies across a few metrics, etc.). And this is where all hell breaks loose, because averaging/recalculating percentiles requires the original dataset!
Lacking the original dataset, you’re just doing bad math, and the results you will get could mean, well, anything, and depend utterly on the underlying distribution. Mind you, if you know what the underlying distribution is, you can make reasoned judgements about the percentiles, but that will require math. Otherwise, the approach of “Just zoom in to the historical data to look at the latencies at higher resolutions” is pretty much pointless if the data has already been aggregated!
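To see what “bad math” looks like in practice, here’s a rough sketch with made-up latency numbers: it averages two windows’ p99s (all a naive dashboard can do once only the aggregates survive) and compares that to the p99 computed over the combined raw data:

```python
# A rough sketch (with invented latencies) of why you can't average percentiles.
import random
import numpy as np

random.seed(42)

# Two one-minute windows of request latencies (ms): one quiet, one with a spike.
window_a = [random.gauss(20, 5) for _ in range(1000)]
window_b = ([random.gauss(20, 5) for _ in range(900)]
            + [random.uniform(500, 2000) for _ in range(100)])

p99_a = np.percentile(window_a, 99)
p99_b = np.percentile(window_b, 99)

# What you're stuck doing once only the per-window aggregates survive:
averaged_p99 = (p99_a + p99_b) / 2

# What the p99 actually is, computed from the raw samples:
true_p99 = np.percentile(window_a + window_b, 99)

print(f"average of per-window p99s: {averaged_p99:7.1f} ms")
print(f"p99 of the combined data:   {true_p99:7.1f} ms")
```

The two numbers can differ wildly, and in which direction depends entirely on how the underlying distribution is shaped — which is exactly the point.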
All is not lost, however. There are two approaches that you can use to deal with this:
  1. Math: As mentioned earlier, if you already know the distributions, and/or have done some level of histogram binning for the old data, you can extract a meaningful signal from the noise at higher resolutions (there’s a rough sketch of this after the list). Do your homework on this though!
  2. Don’t Aggregate: Data is expensive only if you’re storing it “live” (e.g., with Prometheus, it usually lives on expensive SSD). However, if you can archive historical data into S3, and use something like Thanos to get at the original resolution when necessary, well, you’re home free!
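As a flavor of what the “Math” option looks like, here’s a minimal sketch that estimates a percentile from a latency histogram by interpolating within the bin that contains the target rank. The bin edges, the counts, and the uniform-within-bin assumption are all invented for illustration; real binning schemes (and their error bars) will differ:

```python
# A minimal sketch of estimating percentiles from binned (histogram) data.
# Assumes samples are spread uniformly within each bin -- the usual (lossy)
# assumption once the raw data is gone.
from bisect import bisect_left
from itertools import accumulate

def percentile_from_histogram(bin_edges, counts, q):
    """Estimate the q-th percentile (0-100) from histogram bins."""
    total = sum(counts)
    target_rank = q / 100 * total
    cumulative = list(accumulate(counts))
    i = bisect_left(cumulative, target_rank)         # bin holding the target rank
    prev_cum = cumulative[i - 1] if i > 0 else 0
    fraction = (target_rank - prev_cum) / counts[i]  # position inside that bin
    lo, hi = bin_edges[i], bin_edges[i + 1]
    return lo + fraction * (hi - lo)

# Latency histogram: bin edges in ms, counts per bin (made-up numbers).
edges = [0, 10, 25, 50, 100, 250, 1000]
counts = [400, 350, 150, 60, 32, 8]

print(percentile_from_histogram(edges, counts, 99.0))   # ~241 ms (inside the 100-250 ms bin)
print(percentile_from_histogram(edges, counts, 99.9))   # ~906 ms (inside the 250-1000 ms bin)
```

The wider the top bin, the mushier the estimate gets, which is why the “do your homework” caveat above matters.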
The bottom line here is that — one way or the other — you need to care about your data distributions…
