“Trust But Verify” Your Metrics

Let’s take for granted that you’ve done the right thing: you’ve generously instrumented your system, and you are actually paying attention to the metrics that you’ve generated (•). The question on the table is, “Do you actually trust the metrics that you are generating?” (Hint: you shouldn’t.)
Let’s look at something fairly straightforward: the request/response path shown below.
You would think that ResponseTime would be the sum of the processing stages, right? That is, ResponseTime = 10 + 1 + 20 + 1 + 5 = 37ms?
But, since you shouldn’t trust your metrics, you
1) Also measured ResponseTime directly, and
2) Compared it against what it should be, and
3) Charted/alerted on deviations
and you found that the actual ResponseTime was, say, 52ms.
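The three steps above can be sketched in a few lines. This is only a minimal illustration, not a real monitoring setup: the stage names, the `check_response_time` helper, and the 10% tolerance are all assumptions made up for the example, and only the numbers (10 + 1 + 20 + 1 + 5, and the measured 52ms) come from the scenario here.

```python
# Hypothetical stage names; the per-stage timings (ms) mirror 10 + 1 + 20 + 1 + 5.
STAGE_MS = {"receive": 10, "hop_in": 1, "analyze": 20, "hop_out": 1, "respond": 5}


def check_response_time(stage_ms, measured_total_ms, tolerance=0.10):
    """Compare a directly measured total against the sum of the stage timings.

    `tolerance` is an assumed threshold: the fraction of the expected total
    by which the measured total may deviate before we raise an alert.
    """
    expected_ms = sum(stage_ms.values())
    unaccounted_ms = measured_total_ms - expected_ms
    # Guard against an empty or all-zero set of stages.
    deviation = abs(unaccounted_ms) / expected_ms if expected_ms else float("inf")
    return {
        "expected_ms": expected_ms,
        "measured_ms": measured_total_ms,
        "unaccounted_ms": unaccounted_ms,
        "alert": deviation > tolerance,
    }


report = check_response_time(STAGE_MS, measured_total_ms=52)
print(report)
```

With the numbers above, `unaccounted_ms` comes out to 15, which is the part of the response time that none of your stage metrics explain, and the thing you would chart and alert on.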
That’s quite a difference, no? As to why it was 52ms, let’s look at a bunch of possible issues:
  1. Measuring the Wrong Thing: You actually instrumented something completely different. I know, that sounds goofy, but it happens all the time, e.g., you’re measuring the validate_user interval instead of validate_users (spelling issues with APIs. yay.)
  2. Incomplete Instrumentation: There’s a queue in front of the Analyze component that you haven’t instrumented, and you’re not measuring the latency there.
  3. System Issues: Oops. A garbage collection pause. Or a failover. Or a restart. Or whatever.
  4. Unexpected Code Paths: Your code has a bunch of paths in it to deal with edge cases (e.g., “strip semi-colons from the input”), and some of these trigger additional steps that you had forgotten about.
  5. Time Issues: You just plain screwed up by making one of the (infinitely many) assumptions about time, such as that it increases monotonically, or that everything is GMT, or whatever.
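To make the last item concrete, here is a small sketch of how a clock assumption can bite. The `timed` helper and the `stepping_wall_clock` fake are hypothetical names invented for this example; the real calls are Python’s `time.monotonic()`, which is documented as never going backwards, and `contextlib.contextmanager`. The fake clock simulates a wall clock being set back by half a second mid-measurement, which is one of the ways a wall-clock-based timer can go wrong.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(results, name, clock=time.monotonic):
    """Record the elapsed milliseconds of the enclosed block under `name`."""
    start = clock()
    try:
        yield
    finally:
        results[name] = (clock() - start) * 1000.0


def stepping_wall_clock():
    """Hypothetical wall clock that gets set back 0.5s between two reads."""
    readings = iter([100.0, 99.5])
    return lambda: next(readings)


results = {}

# Measured with the monotonic clock: the interval cannot be negative.
with timed(results, "analyze"):
    pass

# Measured with the simulated wall clock: the "latency" comes out negative.
with timed(results, "analyze_wall", clock=stepping_wall_clock()):
    pass

print(results)
```

The second measurement lands at -500ms, which is exactly the kind of number that quietly poisons an average if nobody is cross-checking it.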
And this is just when it comes to measuring time. The point is that you should be validating all of your metrics through multiple means. In fact, if you are already doing this, and all the numbers line up, you should be very, very worried: you’re probably missing something!
So yeah, trust your metrics, after you’ve verified them…
(•) You’d be surprised how often I see this missed.
“Did you instrument your code?” - “D-uh, of course!”
“Grafana?” - “Dude, come on, what d’you think I am?”
“When was the last time you looked at it?” - “Uhhhhh”
