
Tear Down the Test Pyramid


When somebody starts talking to you about a pyramid, be suspicious. Pyramids are notorious for being straight-up scams (a.k.a. pyramid schemes), or just simplistic and unreliable metaphors, like Maslow's hierarchy of needs, the food pyramid, or the Test Pyramid.

Here’s the thing. Tests serve a purpose, but they also come at a cost. They help you build confidence in the correctness of your solution. But time spent writing tests is time not spent making your solution better. Your users don’t use tests - they use features - so the time spent on testing should be worth it.

Let’s talk about building confidence. There are two facets to it. First, you want to make sure the code you write is correct, and does what it’s supposed to. You do this by writing tests that validate your code. Second, you want to make sure that changes you make to your code don’t break any existing behavior, or cause any unintended side effects. You do this by checking for regressions. If your validation tests [1] are good, and you keep them around, they can become regression tests. If written with intention, tests can also serve as documentation for the usage patterns, assumptions, and requirements of your solution.

Traditional approaches to testing focus on confidence, treating documentation as a potential by-product - often overlooked and unrefined. Approaches like TDD (Test-Driven Development) effectively put the documentation aspect first, letting it drive the development.

purpose of testing diagram: documentation (usage, assumptions), confidence to use (validation), confidence to change (regression).

What makes a test good? First and foremost, it needs to prove the correctness of your code. It must pass if and only if your code does what it’s expected to do (the hypothesis), and fail in any other circumstance [2]. It therefore needs to be:

  • specific: targeted at a particular behavior.
  • isolated: controlling for other variables, and not affecting other processes or tests that share its environment (e.g. a failed test should not leave the system in an invalid state).
  • repeatable: so you can reuse it to test different variables, e.g. different environments, times, and code changes.
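A minimal sketch of what these three properties look like in practice, using pytest conventions. The `slugify` function here is a hypothetical example, not something from this post:

```python
import re

def slugify(title: str) -> str:
    # Code under test: lowercase, replace runs of non-alphanumerics
    # with a single hyphen, and trim hyphens from the ends.
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def test_slugify_collapses_punctuation():
    # Specific: targets one behavior (punctuation handling), nothing else.
    assert slugify("Hello, World!") == "hello-world"
    # Isolated: no shared state, no I/O - a failure leaves nothing behind.
    # Repeatable: same input, same result, in any environment or run order.
    assert slugify("Hello, World!") == "hello-world"
```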

All tests must prove correctness, be specific, isolated, and repeatable. But to be practical, they also need to be relevant, provide fast feedback, and be low-maintenance. These last three are often in tension with each other, and you’ll need to find a good balance between them.

A test is more relevant the closer it is to reality. The more similar a test of a behavior is to how users are going to experience that behavior, the better.

The feedback loop consists of two aspects: execution (the time from making a code change to having actionable test results) and authoring (the time it takes to create the test in the first place). The quicker, the better.

The maintenance cost is the amount of change to test code caused by changes to the product code. Here too, there are two indicators: changes to tests that reflect a change in behavior, and changes to tests that don’t correspond to any change in observed behavior. Both should be minimized, but while the former is inevitable, the latter is often just a burden.

diagram showing good tests at the center, and arrows pointing out to low-maintenance, high-relevance, and fast-feedback, in different directions

Balancing these tensions is often done by controlling the scope of the test. End-to-end tests are the most relevant, but are difficult to control for specificity and isolation, and often harder and slower to set up and execute, so you make tradeoffs:

  • Replace parts of the environment with more finely controlled components (doubles).
  • Extract components and pieces of logic to achieve finer control of the component itself, and compensate by testing the seams separately.
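The first tradeoff can be sketched like this: a hypothetical in-memory fake replaces a real notification gateway at a seam, trading some relevance for fine control. All names here are illustrative:

```python
from typing import Protocol

class Notifier(Protocol):
    # The seam: code under test depends on this interface,
    # not on any concrete gateway.
    def send(self, user: str, message: str) -> None: ...

class FakeNotifier:
    # The double: a finely controlled stand-in that records
    # calls so the test can inspect them.
    def __init__(self) -> None:
        self.sent: list[tuple[str, str]] = []

    def send(self, user: str, message: str) -> None:
        self.sent.append((user, message))

def greet_new_user(user: str, notifier: Notifier) -> None:
    # Code under test, written against the seam.
    notifier.send(user, f"Welcome, {user}!")

def test_greeting_is_sent():
    fake = FakeNotifier()
    greet_new_user("ada", fake)
    assert fake.sent == [("ada", "Welcome, ada!")]
```

Note the compensation mentioned above: this test says nothing about whether the real gateway works, so the seam itself still needs to be tested separately.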

But these tradeoffs mean higher maintenance and lower relevance. Tests that are too granular, or that rely heavily on doubles, are tightly coupled to the implementation and become brittle: any change to the code has a ripple effect throughout the entire test suite, making the tests harder to maintain, reducing confidence in their quality, and diminishing their relevance. They might all pass in isolation - while the system as a whole is broken.

diagram showing a two-sided arrow, one side narrow pointing to implementation, and one side wider pointing to behavior

To achieve sustainable quality, aim for the minimal acceptable level of confidence: the most relevance within a reasonable feedback-loop speed and maintenance cost, and prioritize testing behaviors over implementations.
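To make the behavior-over-implementation distinction concrete, here is a sketch with a hypothetical `Cart` class (not from this post). The test asserts only on observable behavior, so it survives refactors of the internals:

```python
class Cart:
    def __init__(self) -> None:
        self._items: dict[str, tuple[float, int]] = {}  # internal detail

    def add(self, name: str, price: float, qty: int = 1) -> None:
        _, old_qty = self._items.get(name, (price, 0))
        self._items[name] = (price, old_qty + qty)

    def total(self) -> float:
        return sum(price * qty for price, qty in self._items.values())

def test_total_reflects_added_items():
    # Behavior-focused: only the public contract is asserted on.
    cart = Cart()
    cart.add("tea", 3.0, qty=2)
    cart.add("mug", 5.0)
    assert cart.total() == 11.0

# A brittle alternative would assert on cart._items directly - it would
# break the moment the dict is swapped for a list, even though no
# observable behavior changed.
```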

It’s not a pyramid. There’s no hierarchy. It’s tradeoffs all the way down. A mix of levels of isolation and control to achieve just enough confidence to allow you to continue to ship and to evolve a useful product.

Observations:

  • With modern tools like containerization, it’s relatively simple to achieve good levels of speed and specificity using “real” dependencies, like databases and web servers, without needing to “mock” repositories or web clients, or to use semi-functional in-memory databases. This allows for a much more robust test suite focused on what would traditionally be categorized as “integration tests”, making many traditional “unit tests” unnecessary.

  • The “unit” in “unit test” is often (wrongly) interpreted as “class” or “function”, which leads to over-granular, brittle tests that are mostly useless and hard to author and maintain.
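As one illustration of the first observation - exercising a real dependency instead of mocking a web client - here is a self-contained sketch that spins up an actual local HTTP server for the test. The endpoint and function names are invented for this example:

```python
import http.server
import json
import threading
import urllib.request

class Handler(http.server.BaseHTTPRequestHandler):
    # A real (if tiny) HTTP server standing in for the dependency.
    def do_GET(self):
        body = json.dumps({"status": "ok"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep test output quiet
        pass

def fetch_status(base_url: str) -> str:
    # Code under test: a plain HTTP client call - no mocked client needed.
    with urllib.request.urlopen(f"{base_url}/health") as resp:
        return json.load(resp)["status"]

def test_health_endpoint():
    server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        assert fetch_status(f"http://127.0.0.1:{port}") == "ok"
    finally:
        server.shutdown()
```

The same pattern scales up: tools like Testcontainers apply it to real databases and message brokers, keeping the test relevant without sacrificing isolation.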

More raw thoughts in Musing on Testing


  1. Or acceptance tests, as they often validate the acceptance criteria your solution needs to meet to be considered “correct”. ↩︎

  2. It follows, therefore, that tests must fail by default. Think about what that means for your test harness of choice. ↩︎