Flaky Tests — The Bane of Existence

We’ve all had to deal with flaky tests — tests that don’t consistently pass or fail but are, instead, nondeterministic. They’re deeply annoying, a huge time-suck, and inevitably end up occupying the bulk of your “productive” time.
There are many, many reasons for flakiness, but in my experience, the vast majority of them can be boiled down to some combination of the following:
  • 1. External Components: When the code relies on something that isn’t under its control, and makes assumptions about it. I’ve seen people validate internet access by retrieving http://google.com (“because Google is always up”, conveniently ignoring the path from the test environment to Google), assume that there is a GPU present (“because it’s Bob’s code, and he always runs it on his desktop”), and so forth. The thing is, these assumptions get made even after stubbing — our assumptions about the environment we’re in can frequently lead us to places where we don’t even realize that we are making assumptions! (There’s a small sketch of stubbing out this kind of call after this list.)
  • 2. Infrastructure: This is a variation of External Components above, and is particularly relevant in our CI/CD/CT world. There is always something barfing out there in the pipeline:
    • CircleCI is down,
    • or it’s up but they’ve run up against a resource limit,
    • or they upgraded their default version of Docker,
    oh, the fun is endless!
    Mind you, all is not lost — you learn to work with the vagaries of your pipeline, e.g., over-specification of package versions, installing your own Docker environment on each test run, etc. But it is in getting to this place that the flakiness lies.
  • 3. Setup/Teardown: Theoretically, every test sets up its environment, and then tears it down. Until one of your resident #CowboyDeveloper decides that teardown is somebody else’s problem, which comes into violent conflict with the other #CowboyDeveloper who firmly believes that setup is somebody else’s problem!
    And no, you don’t actually need the latter. The reason you have tests do their own teardown is to reduce the bug surface — yes, if everybody did their setup correctly, then teardown wouldn’t be necessary, but then again, if everybody wrote their code perfectly, the tests wouldn’t be necessary either 😆. (A minimal setup-plus-teardown fixture sketch follows this list.)
  • 4. Complexity: This, in many ways, is a catch-all. The more complicated your test, the more likely it is that you really haven’t thought through all the ramifications of what it does. Strictly speaking, the test isn’t flaky — it’s just way too large to reason about well.
    Measuring complexity is inherently subjective, from the Potter Stewart approach (“I know it when I see it”) to lines of code (“it’s more than a screen-full”). The one I like a lot is the Google approach, where they identified a direct correlation between test size and flakiness. The point being that you should look at how much RAM/CPU the test uses — the more it uses, the more likely it is to be flaky (thus conveniently avoiding the subjective measures of complexity!).
    The point, of course, being that you should try to keep the tests as small/low-usage as possible… (a crude sketch of enforcing this as a resource budget follows below).
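
To make the External Components point concrete, here’s a minimal sketch in Python (pytest plus unittest.mock) of stubbing out the network call instead of actually retrieving http://google.com. The has_internet() function, the URL default, and the timeout are all illustrative stand-ins, not anything from a real codebase:

```python
# A minimal sketch of stubbing an external dependency instead of assuming it's there.
# Everything here (has_internet, the URL, the timeout) is illustrative.
from unittest import mock

import requests


def has_internet(url: str = "http://google.com") -> bool:
    """Hypothetical production code that naively 'checks the internet'."""
    try:
        return requests.get(url, timeout=2).status_code == 200
    except requests.RequestException:
        return False


def test_has_internet_without_touching_the_network():
    # Stub requests.get so the test makes no assumptions about connectivity,
    # DNS, proxies, or Google being reachable from the CI box.
    fake_response = mock.Mock(status_code=200)
    with mock.patch.object(requests, "get", return_value=fake_response):
        assert has_internet() is True
```

The test now exercises the behaviour of has_internet() itself, not the path between your test environment and Google.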
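For Setup/Teardown, here’s a rough sketch of keeping both halves in one place, written as a pytest fixture; the scratch-directory resource is purely illustrative:

```python
# A rough sketch of pairing setup and teardown in one place (a pytest fixture),
# so neither half becomes "somebody else's problem". The scratch-directory
# resource is purely illustrative.
import shutil
import tempfile
from pathlib import Path

import pytest


@pytest.fixture
def scratch_dir():
    # Setup: create a private, empty workspace for THIS test only.
    path = Path(tempfile.mkdtemp(prefix="flaky-demo-"))
    yield path
    # Teardown: runs even if the test fails, so the next test
    # never inherits leftover state.
    shutil.rmtree(path, ignore_errors=True)


def test_writes_report(scratch_dir):
    report = scratch_dir / "report.txt"
    report.write_text("ok")
    assert report.read_text() == "ok"
```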
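And for Complexity, one crude way to act on the “smaller is better” heuristic is to give a test an explicit allocation budget. This sketch uses Python’s tracemalloc, which only sees Python-level allocations, and the 50 MiB limit is an arbitrary number I made up; it’s a stand-in for the idea, not for Google’s actual tooling:

```python
# A crude sketch of treating test size as a budget: fail the test if the code
# under it allocates more than a (made-up) limit.
import tracemalloc

MAX_PEAK_BYTES = 50 * 1024 * 1024  # 50 MiB -- an arbitrary, illustrative budget


def build_index(n):
    """Hypothetical code under test."""
    return {i: i * i for i in range(n)}


def test_build_index_stays_small():
    tracemalloc.start()
    try:
        index = build_index(100_000)
        assert len(index) == 100_000
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    assert peak < MAX_PEAK_BYTES, f"test grew too big: peak {peak} bytes"
```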
The bottom line here is that flaky tests are here to stay — the above issues are more about identifying the deficiencies in your code/process, and correcting for them. For example, as you clean up your Infrastructure issues, you’ll start running into issues with Setup/Teardown. Fix those, and you’re now dealing with Complexity. Flakiness never goes away, it just moves up the food-chain…
Coda: There is an entire universe of pain associated with flaky testing on distributed systems, involving assumptions about consensus, ordering, duplicate messages, transactionality, and whatnot. If you don’t have flaky tests here, then you’ve probably screwed something up 😇
