Functional tests and flakiness

2016-05-25

I just stumbled on a nice article that Martin Fowler has had on his website for a few years about non-deterministic tests. It’s a good read and it addresses something that I have encountered in multiple projects. Flaky tests are indeed a problem in many places and I’ve had the ‘pleasure’ of dealing with such tests myself on a couple of occasions (often of my own making, even).

Martin Fowler lists a few ways to mitigate this problem and his suggestions are excellent and well worth reading. But I have a few things to add that are not covered in that article.

There are only two types of functional tests

There are only two types of functional tests that are worth having: unit tests and integration tests. Anything in between is technically a waste of time and resources since it neither succeeds in testing all code paths, which is what unit tests are for, nor succeeds in emulating realistic scenarios, which is what integration tests are for. So, shoot for realism or coverage but not both at the same time. If you stick to this, you’ll end up with a more effective suite of tests.

Unit tests are intended to test a function in pure isolation from anything else. The whole point of a unit test is coverage of all relevant code paths. Unmocked/non-fake dependencies are not allowed here because they add unneeded complexity. The goal is to test all possible code paths through your unit with as many parameter variations as you can get away with. By definition this gets progressively easier on smaller units, in a non-linear way. The smaller your unit, the easier it is to cover it with tests.
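
To make this concrete, here is a minimal sketch of the kind of unit test I mean (JUnit style; the class and interface names are made up for illustration). The dependency is a hand-rolled fake, and the three tests together cover every branch of the unit.

```java
import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class DiscountCalculatorTest {
    // Illustrative dependency of the unit under test.
    interface PricingService {
        double basePrice(String sku);
    }

    // Hand-rolled fake instead of the real pricing service: no database, no network.
    static class FakePricingService implements PricingService {
        @Override
        public double basePrice(String sku) {
            return 100.0;
        }
    }

    // The unit under test: small enough that every branch is trivial to cover.
    static class DiscountCalculator {
        private final PricingService pricing;
        DiscountCalculator(PricingService pricing) { this.pricing = pricing; }

        double priceFor(String sku, int quantity) {
            if (quantity <= 0) {
                throw new IllegalArgumentException("quantity must be positive");
            }
            double total = pricing.basePrice(sku) * quantity;
            return quantity >= 10 ? total * 0.9 : total; // bulk discount branch
        }
    }

    private final DiscountCalculator calculator = new DiscountCalculator(new FakePricingService());

    @Test
    public void noDiscountBelowThreshold() {
        assertEquals(500.0, calculator.priceFor("sku-1", 5), 0.001);
    }

    @Test
    public void bulkDiscountAtThreshold() {
        assertEquals(900.0, calculator.priceFor("sku-1", 10), 0.001);
    }

    @Test(expected = IllegalArgumentException.class)
    public void rejectsNonPositiveQuantity() {
        calculator.priceFor("sku-1", 0);
    }
}
```

Everything this test needs lives in memory and runs in milliseconds, which is exactly what makes covering every path feasible.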

Any time you have a unit test that needs a database or half of your system initialized, it just stopped being a unit test and became an integration test instead. Probably a pretty bad one too. If you are going to write an integration test, do it properly.

The goal of integration tests is to scenario test a larger piece of code in a way that is as close to what might happen in a production environment as possible. By definition your code base is too large to test all code paths in integration tests (combinatorial explosion of paths and tests), so you focus on testing ‘good’ scenarios that are realistic and meaningful instead. Complete coverage of all relevant code paths is mathematically impossible with integration tests: even a modest ten components with five independent branches each already gives you close to ten million combined paths. So, you need to make the most of the ones you can afford to write.

With integration tests, you are testing features, not code paths. A good test scenario in this context is one that is as close as possible to what your users will actually be doing with the features of your product, and that exercises as many aspects of your system as possible.

Try not to fake unless you have to

Martin Fowler makes the point about faking external systems. I’m on the fence here since fakes often behave differently in subtle ways, and those differences matter when your production environment falls over because your software never ran against the real thing before it went live. But the reality is that having external systems integrated as part of your build also adds complexity, flakiness, and potentially hard to debug issues. I’ve done both. So, it depends; sometimes you get away with it and you are better off. Sometimes it’s just too much hassle.

A strong argument for not faking things in integration tests is that this increases the value of your tests. Faking is a missed opportunity to broaden the scope and reach of the integration test. With integration tests you are trying to find the weird kind of failures your unit tests can’t catch, the ones that happen when your software is running in production. If you strip away things that are known to be flaky, you’ll find fewer bugs in your tests but you are still shipping those bugs.

Integration tests are more complicated to write and maintain and only test a fraction of the possible code paths through your system. So maximizing the number of things a test actually touches in your architecture is a good thing, since you increase the chances that you actually find something that needs fixing. I regularly find and fix bugs because an integration test completely unrelated to the bug unexpectedly fails because it happens to interact with that part of the system as well. Finding such unexpected issues is exactly why you have integration tests.

Another thing with integration tests is that you should run them in a system configuration that resembles your production deployment architecture. This way you test your code and the deployment architecture in a more realistic way, meaning that the kinds of failures you will see in production are likely to also happen in your tests. If you are faking half of your architecture, this is less likely and you might miss bugs that your test code would otherwise catch.

No whitebox integration tests allowed

There is a notion of blackbox vs. whitebox tests. A whitebox test is any test where you poke around in your system’s internals to set things up or to assert stuff. A blackbox test, on the other hand, approaches the system from the outside and asserts stuff on the things it returns. In any system I maintain, whitebox integration tests are strictly forbidden. The reason is very simple: a blackbox test does a much better job of being a realistic test and exercises more of the system. The effort to write, run, and maintain them is typically similar. Therefore blackbox tests are preferable, always. Don’t poke around in your database if you have a perfectly good API that achieves the same thing in a way your users would actually care about.

Whitebox tests are technically a missed opportunity to write a better blackbox test. They are often introduced because it seems easier or cheaper to just hack around part of your system (and thus not test that part), but the end result is that you end up with unrealistic tests that require detailed knowledge of system internals. They are tightly coupled to the implementation (like a unit test) and have worse test coverage than either a blackbox integration test or a proper unit test. Any time you refactor, they break. Either way, it’s a missed opportunity to have a proper end to end test. A good blackbox test doesn’t have these problems and stays relevant even if you radically change the implementation. A good blackbox test is essentially a BDD style test and can be used as an acceptance test as well.
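
To sketch the difference, here is roughly what a blackbox integration test looks like (JUnit style, using the JDK HTTP client; the endpoints and the running test instance at localhost are assumptions for the sake of the example, not a real API). It only talks to the system the way a client would and only asserts on what comes back.

```java
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import org.junit.Test;

public class ContactApiBlackboxTest {
    // Assumed: a test instance of the system is running at this base URL.
    private static final String BASE_URL = "http://localhost:8080";
    private final HttpClient http = HttpClient.newHttpClient();

    @Test
    public void createdContactShowsUpInSearch() throws Exception {
        // given: a contact created through the public API, not via an INSERT statement
        HttpRequest create = HttpRequest.newBuilder(URI.create(BASE_URL + "/contacts"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString("{\"name\":\"Jane Tester\"}"))
            .build();
        HttpResponse<String> created = http.send(create, HttpResponse.BodyHandlers.ofString());
        assertEquals(201, created.statusCode());

        // when: we search the way a real user (or API client) would
        HttpRequest search = HttpRequest.newBuilder(URI.create(BASE_URL + "/contacts?q=Jane+Tester"))
            .GET()
            .build();
        HttpResponse<String> found = http.send(search, HttpResponse.BodyHandlers.ofString());

        // then: assert only on what comes back over the API
        assertEquals(200, found.statusCode());
        assertTrue(found.body().contains("Jane Tester"));
    }
}
```

Nothing in this test knows about tables, caches, or any other internals, so it survives any refactoring that keeps the API contract intact.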

Run tests concurrently

Running tests concurrently has many advantages. For starters, it is faster, which is good because it reduces the time between you making a mistake and finding out about it. But what is even better is that it will increase the likelihood that a flaky test will actually fail. This is a double-edged sword if you have a lot of these. But if you fix your tests to not be flaky, you end up with a faster running and more robust test suite. Finally, you are likely testing a multi user system and it’s a bad thing if production is the first place where the system gets tested with multiple users doing things concurrently. Running tests concurrently is the closest you are going to get to having multiple users using your system at the same time, unless you invest in dedicated performance tests. The latter is something people seem to have limited time for; on a lot of projects it is more of an aspirational thing than something that actually gets done. I’ve been on more than a few projects where people talked about doing load and performance tests instead of doing them.

We run all our tests (unit and integration) with 50 threads. Using way more threads than CPU cores is normally a bad idea. But when testing it is nice to be able to induce flakiness by simulating load and a little resource starvation. Also, this ensures that any tests blocked on IO release the CPU to other tests. Testing concurrently has helped us catch really obscure concurrency-related issues, cache coherence issues, overly broad queries, weird side effects, etc. before they happened in production. The fact that I can run 700 API integration tests in under 3 minutes on my laptop is nice as well. Finally, concurrent tests are a great way to identify real performance bottlenecks in your system without actually investing a lot in profiling and performance testing. Any bottleneck in your system will become more obvious when you have 50 tests running at once, and it is nice to know your system still functions when there are multiple people active.
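
How you get your test framework to do this depends on your build setup, so the sketch below just shows the shape of the idea with a plain thread pool; the scenario count, the thread count, and runScenario are placeholders rather than our actual harness.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class ConcurrentScenarioRunner {
    public static void main(String[] args) throws Exception {
        // Deliberately more threads than CPU cores: tests blocked on IO yield the CPU,
        // and the extra contention shakes out flakiness before production does.
        ExecutorService pool = Executors.newFixedThreadPool(50);
        List<Future<?>> results = new ArrayList<>();

        for (int i = 0; i < 700; i++) {
            final int scenario = i;
            results.add(pool.submit((Callable<Void>) () -> {
                runScenario(scenario); // placeholder for an actual API test scenario
                return null;
            }));
        }

        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);

        // Future.get() rethrows any assertion error or exception from a scenario.
        for (Future<?> result : results) {
            result.get();
        }
        System.out.println("all scenarios passed");
    }

    private static void runScenario(int scenario) {
        // In a real suite this would exercise the system through its public API.
    }
}
```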

Never clean up; isolate by randomizing instead

You might at this point wonder about test isolation. This is essential when you run stuff concurrently because things can and will interact with each other, just like in your production system. Martin Fowler mentions two strategies for isolating tests: 1) rebuild state from scratch with every test, 2) clean up after each test. Number 2 is impossible if you are running your tests concurrently because you will end up removing data that is needed by other tests. Besides, cleanup is expensive. I’ve dealt with test suites that spent 99% of their time cleaning up the database instead of testing stuff. Avoid this at all costs. Number 1 in combination with running things concurrently requires that you can have multiple tests initializing at the same time and that they don’t rely on the same data. So, you need to guarantee that conflicts between tests are impossible. The easiest way to ensure that is simple: randomize all relevant data. This ensures that the probability that two tests use the same data is minimal. Not cleaning up adds a touch of realism to your tests as well: your production system will not be empty either.
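
As a tiny illustration (the helper is made up, not from our code base): if every test mints its own identifiers like this, concurrent tests can’t step on each other’s data and there is never anything to clean up.

```java
import java.util.UUID;

public class TestData {
    // Each test creates its own entities with a random suffix; collisions between
    // concurrently running tests become practically impossible, so no cleanup is needed.
    public static String uniqueEmail(String label) {
        return label + "-" + UUID.randomUUID() + "@test.example.com";
    }

    public static String uniqueName(String label) {
        return label + "-" + UUID.randomUUID();
    }
}
```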

There’s more to randomization than simply using a random number generator. Data tends to have meaning and structure. Also, what happens if your test fails only for particular random data, and how do you reproduce that? The way we do this at Inbot is very simple: 1) we calculate a test seed before any tests run and use that to bootstrap our random number generator. If a test fails, we can grab the seed from the log and rerun the test with the same data. Note that there are issues with this in combination with concurrent runs, of course. 2) We try to generate human readable data and we use builders for common objects. I actually have a small testfixtures project on GitHub for things like company and person names that supports this. I can spawn users, contacts, teams, etc. with human readable names in a few lines of code. 3) We log aggressively when things go wrong. A test that fails without data to show what failed is useless.
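
The mechanics look roughly like this; this is a simplified sketch rather than the actual testfixtures project, and the system property name is just an example.

```java
import java.util.Random;

public class RandomizedFixtures {
    // The seed is computed once per test run; rerunning with -Dtest.seed=<value>
    // reproduces the exact same "random" data.
    private static final long SEED =
        Long.getLong("test.seed", System.currentTimeMillis());
    private static final Random RANDOM = new Random(SEED);

    private static final String[] FIRST_NAMES = {"Jane", "Omar", "Mia", "Tariq", "Lena"};
    private static final String[] LAST_NAMES = {"Tester", "Ngata", "Berg", "Okafor", "Virtanen"};

    static {
        // Log the seed up front so a failing run can always be reproduced.
        System.out.println("test seed: " + SEED);
    }

    // Human readable rather than raw gibberish: much easier to debug a failure.
    public static String personName() {
        return FIRST_NAMES[RANDOM.nextInt(FIRST_NAMES.length)] + " "
            + LAST_NAMES[RANDOM.nextInt(LAST_NAMES.length)] + " "
            + RANDOM.nextInt(100000); // numeric suffix keeps names unique enough
    }
}
```

When a test fails, the seed is right there in the log, and rerunning with the same seed produces the same people, companies, and teams (modulo the concurrency caveat above).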

Keep things fast: don’t sleep, poll instead

The easiest way to fix a flaky test against some asynchronous thing is to add some sleeps in your test. The problem with that is that it makes the test slow, and people tend to overcompensate when things are flaky. Also, sleeps never seem to get removed from tests. Before you know it you’ll have dozens of tests sleeping for most of their execution for no good reason. It adds up. A better strategy is to implement polling for things that are supposed to happen asynchronously, and you always want a timeout with that so that you don’t poll forever. Polling ensures that you don’t waste more time waiting than strictly needed. We use this in several places and it allows us to integration test functionality that is fundamentally asynchronous without having to fake our own backend services. Even though quite a few of our tests take more than a second to run, we still manage to run all of them concurrently in under 3 minutes. Reason: polling tests allow other tests to run while they poll. We typically max out all CPU cores throughout the test run.
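
Our polling helpers boil down to something like this (a simplified sketch rather than our exact code): block until a condition holds or a deadline passes, and fail loudly on timeout.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class Poller {
    // Poll until the condition holds or the timeout expires; no fixed sleeps.
    // The test proceeds the moment the asynchronous thing has actually happened.
    public static void waitFor(Supplier<Boolean> condition, long timeout, TimeUnit unit) {
        long deadline = System.currentTimeMillis() + unit.toMillis(timeout);
        while (System.currentTimeMillis() < deadline) {
            if (condition.get()) {
                return;
            }
            try {
                Thread.sleep(100); // short pause between polls; releases the CPU to other tests
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while polling", e);
            }
        }
        throw new AssertionError("condition not met within " + timeout + " " + unit);
    }
}
```

A test then waits with something like Poller.waitFor(() -> documentShowsUpInSearch(), 30, TimeUnit.SECONDS), where the condition is of course a placeholder for whatever the API should eventually return.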

Invest in test support code and keep it DRY

This is a lesson I learned a long time ago. The easiest way to create a new test is to copy an existing one and slightly tweak it. However, this creates code duplication and can quickly turn into a test suite maintenance headache. The DRY principle dictates that you should try not to repeat yourself when coding (including by copy-paste). This does not stop being true when you are writing tests. So anything that looks like boilerplate or copy-paste work in a test is a good candidate for automating with some shared library.

Essentially, if you write BDD style tests, anything that is a given or a when is something you want to make effortless. In our system most tests need users in some kind of team. So we have a TeamApiClient.create(2) call in a lot of places that automates the creation of a team and its users, and exposes UserApiClients with all the right credentials and IDs already set up. This allows us to quickly set up all sorts of scenarios with typically only a few lines of code. Likewise, we typically want to assert stuff on what comes back, so we have support for that as well.
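
The real thing is tied to our API, but stripped down to its shape it looks roughly like this (the internals here are placeholders for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Illustrative sketch of the kind of helper meant here: one call sets up a team
// with n users, each with working credentials, so a test scenario starts in one line.
public class TeamApiClient {
    private final String teamId;
    private final List<UserApiClient> members;

    private TeamApiClient(String teamId, List<UserApiClient> members) {
        this.teamId = teamId;
        this.members = members;
    }

    public static TeamApiClient create(int numberOfUsers) {
        String teamId = UUID.randomUUID().toString();
        List<UserApiClient> members = new ArrayList<>();
        for (int i = 0; i < numberOfUsers; i++) {
            // In the real helper this signs the user up through the public API and
            // captures the returned credentials; here it is just a placeholder.
            members.add(new UserApiClient(teamId, "user-" + i + "-" + UUID.randomUUID()));
        }
        return new TeamApiClient(teamId, members);
    }

    public String teamId() { return teamId; }
    public UserApiClient user(int index) { return members.get(index); }

    public static class UserApiClient {
        private final String teamId;
        private final String userId;

        UserApiClient(String teamId, String userId) {
            this.teamId = teamId;
            this.userId = userId;
        }

        public String teamId() { return teamId; }
        public String userId() { return userId; }
    }
}
```

A scenario then starts with TeamApiClient team = TeamApiClient.create(2); and immediately has a team and two users to play with.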

In my experience, time spent making testing easier is always time well spent because you end up with more, better, and easier to write and maintain tests. This improves team productivity. Writing tests should be easy, and you should remove any and all obstacles to writing tests, no matter how small.

Write testable code: high cohesiveness, low coupling, low number of dependencies

This may seem obvious to veteran engineers, but it’s a common mistake to write code that is not unit testable, hard to fake, and expensive to integration test. Half of the success is writing code that is actually easy to test. Learning how to do that is a good skill to have. When you see a project that has more integration tests than unit tests because they are easier to write, you know there’s a testability problem. Unit tests should be a no-brainer; they only become tedious when the code is untestable. Good or bad code is partially a matter of taste but mostly a matter of metrics. Understanding complexity, cohesiveness, and coupling metrics means that you’ll be able to articulate why something is good or bad. Something with 20 dependencies probably has high complexity because it needs a lot of stuff to do whatever it is doing, low cohesiveness because it is apparently doing a lot of stuff that is not strongly related, and tight coupling with other parts of your system because there are probably a lot of public entry points into the thing and it is using a lot of things itself.

Anything with bad metrics like this is hard to deal with in tests and probably something you should be fixing. That’s not an opinion or my personal taste but the cold hard reality of a strong correlation between maintenance cost and poor scores on metrics like these. You don’t need tools to tell you this: you can pretty much judge the quality of a piece of code by the length of its list of imports and the number of lines of code. The number of commits affecting a file is also a great metric (high is bad). Something with 5K lines of code is going to have shit cohesiveness and loads of dependencies. It’s probably quite complicated and chances are world plus dog is also depending on it. That’s what is commonly referred to as a spaghetti ball, and those are a testing nightmare.

Small units are easy to test because they are highly cohesive, have a low number of dependencies, and are therefore not tightly coupled with anything else. If something is hard to test, your problem is likely one or all of these three metrics. This is true at the micro level (functions, classes) and the macro level (services, components, APIs). The complexity of your unit tests is directly correlated with how good your code is on these metrics. A low number of dependencies means you don’t have to mock/fake a lot of things, low complexity means you only have a handful of code paths to worry about, and a low number of lines of code means the unit you are testing is small and the amount of test code you need to test it is equally small.

Having good unit tests means you don’t have to repeat what they are testing in your integration tests. This frees you up to write better integration tests that test stuff that isn’t easy to unit test. So having an integration test that asserts an algorithm is doing what it should be doing is a waste of time. Algorithms are much easier to test with unit tests.

Test performance matters

Speed matters. You don’t want to wait any longer than strictly necessary to find out that you broke something. I’m really happy with test performance at Inbot, but it wasn’t always this way. We can introduce a change and have it in production in 15 minutes. During this time it goes through full CI and deploys to our testing environment; then we do the pull request to our production branch, which is picked up by our CI again and deployed. About half of that time is spent polling Amazon. But the best part is that we find out in under 3 minutes whether any tests are broken. So, the probability of finding issues drops off rapidly throughout the process. We trust our CI and deployment well enough that we do this multiple times per day, even for small changes. Time to market matters, and waiting for tests should not be dominating the time it takes.

I’ve been on projects with flaky, slow test suites. Inevitably that becomes the main problem to fix before you can make meaningful progress. A slow, flaky test suite can really suck the joy out of a team. Fighting a whole afternoon to get tests to go green after they go red for no good reason is neither fun nor productive. If that’s a regular thing in your life, consider the above as some friendly advice on fixing that.