Continuous Testing — Some Semantics

If you’re doing any form of Continuous Testing (or heck, even just automating your tests), you’ve probably already run into the Semantics Gap, where what you mean by XXX isn’t what others mean by it.
This can be quite the killer, in ways both subtle and gross. I recall a post-mortem that boiled down to the QA and Release teams having different assumptions about what “The Smoke-tests passed” meant. The resulting chaos — both between the teams, and for the customer — was epic, and something that still makes me shudder reflexively when I look back at it.
And that, my friends, is just about when I put together the following terminology.
Mind you, far be it from me to tell you to adopt this terminology. Heck, you may very well vehemently disagree with it — and that’s ok. The thing is, whatever terminology you use needs to be agreed upon by everybody! (And you’ve probably got all the same stuff below, just broken up somewhat differently.)
So, take whatever you’ve got, and make sure that
1. Everybody agrees with the way you’ve broken it up, and
2. Everybody agrees on the nomenclature!
The key here is the “Everybody agrees” bit, and for that to really take, make sure that you document this. Without the agreement (and proactive acknowledgement from every party), you’ll have the equivalent of this FarSide cartoon, where what you say, and what they hear, may not necessarily be the same.
And that way lies disaster.
So yeah, anyhow, this is the way I’ve broken it up, and what I call it…
  • Smoke Tests: Here, you verify as many of the most common paths of functionality as you can, within a time-boxed boundary (say, 10 minutes). This should, at the very least, include all the unit tests, and should always be run by the developer before a PR.
    The key here is that this is a Quality Gate — if the tests don’t pass, the PR doesn’t merge.
    (Note that this doesn’t imply that only developers do smoke tests — this is more of a minimum condition!)
  • Regression Tests: These cover testing scenarios that take “longer”, and include things like corner cases, integration tests, upgrade tests, and whatnot. They are frequently run as “nightlies”.
    The key here is that this is a Functionality Gate — you use these to make sure that you haven’t dropped — or changed! — any business functionality (“Whoops! What happened to the ability to change teeth color?”).
    Note that this has a huge workflow impact, as you need to be able to deal with failures. What happens then? Do you freeze commits? Or revert the offender? The specifics will vary with your unique use cases…
  • Deep Regression Tests: This is when you run tests that thoroughly exercise the code-base, and have effectively unbounded time requirements. It’s the province of property testing and such-like, and is where you usually end up finding memory leaks, deadlocks, recovery failures, and whatnot.
    The key here is that this is a Correctness Gate — you use it to make sure that the system is not misbehaving in the long run.
    We usually do this by running multiple deep regressions as a pipeline. For example, at any given point in time there are 12 “long running” tests going, with the oldest running for, oh, 25 days. You can further segregate these into a generational system (“if a test survives more than 5 days in the ShortQueue, it moves into the LongQueue”). The details, again, will depend on what you do. (There’s a small sketch of how this tiering might look in code right after this list.)
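
To make the tiering concrete, here is a minimal sketch of how the smoke/regression/deep split might look, assuming a Python codebase that uses pytest and Hypothesis (the marker names, example counts, and time budgets are illustrative choices, not a standard):

```python
# conftest.py -- register the test-tier markers (names are illustrative)
import pytest

def pytest_configure(config):
    config.addinivalue_line("markers", "smoke: fast, always-run PR quality gate")
    config.addinivalue_line("markers", "regression: nightly functionality gate")
    config.addinivalue_line("markers", "deep: long-running correctness gate")
```

```python
# test_round_trip.py -- a deep-regression style property test (Hypothesis)
import json

import pytest
from hypothesis import given, settings, strategies as st

@pytest.mark.deep
@settings(max_examples=5_000, deadline=None)  # deliberately slow and thorough
@given(st.dictionaries(st.text(), st.integers()))
def test_settings_survive_a_round_trip(record):
    # Whatever we serialize, we should be able to read back unchanged.
    assert json.loads(json.dumps(record)) == record
```

The PR gate then boils down to something like `pytest -m smoke` run under a hard timeout, the nightly picks up `pytest -m "smoke or regression"`, and the deep tier runs in its own long-lived pipeline.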
A few other terms are relevant here: Stress Tests, Soak Tests, and Resiliency Tests.
  • Stress Tests: These are tests that we use to explicitly test the limits of the system. We actually do them in two different ways:
    1) We embed them into our Regression tests, to make sure that the existing “safe operating limits” still hold.
    2) We, orthogonally, check to find out what the upper limits are (otherwise, how do you know what your “safe operating limits” are?). Mind you, sometimes the upper limit is more of a “it’s over 500 tps, f**k it, that’s 100x anything that we care about” situation, and that’s ok too. (A small sketch of the first flavor appears after this list.)
  • Soak Tests: Here, we check our system against production loads, so that we can validate behavior under “real world” conditions. This can include not just peak loads, but bursty traffic and whatnot (using replayed traffic, mock users, etc.).
    Technically speaking, this is part of a Regression Test. It’s just that sometimes soak tests can be hard to replicate (especially if you do them via canary testing and/or experimentation, or the traffic is truly random, etc.). I find it useful to separate Soak Tests out into their own thing under these circumstances.
  • Resiliency Tests: Tests designed explicitly to validate the resiliency of the system under conditions of stress. These are orthogonal to the Stress Tests above, and are explicitly done as experiments (“What happens when we drop a node? Two nodes? Three nodes?…”, “What happens when we have grey-failure scenarios, like CPU pegging? OOM?” etc.).
    The really important thing here is that unlike all the earlier tests, we run these in production too! There is only so much that we can test based on simulation — real users end up generating bizarre scenarios in ways that we can’t even imagine…
    (Note: Start these only when you have all the previous tests nailed down. You need serious levels of observability, experiment isolation, impact analyzability, circuit-breakers, and whatnot waaaaay before you start this)
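
As an illustration of the first flavor of stress test (the “does the documented limit still hold?” check that gets folded into regression), here is a minimal sketch. Everything in it is hypothetical: the 500 tps figure, the thread-pool driver, and the fake_transaction stub standing in for a real call into the system under test.

```python
# test_safe_operating_limit.py -- sketch of a "limits still hold" stress check
import time
from concurrent.futures import ThreadPoolExecutor

SAFE_OPERATING_LIMIT_TPS = 500   # the limit we claim to support (hypothetical)

def fake_transaction(i: int) -> bool:
    # Stand-in for a real call into the system under test.
    time.sleep(0.001)
    return True

def test_holds_at_safe_operating_limit():
    n = SAFE_OPERATING_LIMIT_TPS * 10            # ten seconds' worth of work
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=50) as pool:
        results = list(pool.map(fake_transaction, range(n)))
    elapsed = time.monotonic() - start
    achieved_tps = n / elapsed

    assert all(results), "transactions failed under load"
    assert achieved_tps >= SAFE_OPERATING_LIMIT_TPS, (
        f"only sustained {achieved_tps:.0f} tps, below the documented limit"
    )
```

The orthogonal “find the actual ceiling” exercise is usually better served by a dedicated load-generation tool than by a test like this; the point here is just that the documented limit gets re-verified on every regression run.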
And there you have it. Your specific levels may vary, as might your vernacular, but as long as you have the levels and terminology made explicit, you’ll be safe!
