Flaky Tests: Why They Happen and How to Fix Them
There isn’t a single actively maintained automated test suite on the planet that hasn’t suffered the headache of flaky tests at least a couple of times. These are the test cases that pass for days or weeks without a hitch, only to fail for some unknown reason on one test run. Then when someone looks into the failure, the test miraculously begins to behave and works again without any intervention.
For most teams, flaky tests are just an annoyance. Because they don't fail consistently and clear up on a re-run, it's hard to tell whether a failure is a real bug or a temporary glitch. Since we don't know the cause or how long a fix would take, we live with it. That's how software and QA engineers end up carrying these issues for months or even years.
However, they’re more than a temporary irritation. I’d argue that flaky tests are worse than a test case that’s obviously broken for a reason. A test case that fails due to a change in the underlying system, like the code or infrastructure, has a clear indicator about why the failure occurred and gives us a lead on how to resolve the issue. On the other hand, a flaky test doesn't point to any change at all. There's nothing obvious to investigate, and re-running makes the failure vanish before you can dig into it, so you're left with no lead and no failure to follow.
The Real Problem Behind Flakiness
The real problem behind flaky automated tests is the typically unnoticed change in behavior they cause in development teams. Flakiness trains developers and testers to ignore the test suite when failures occur, automatically re-running failed test runs instead of checking whether there’s a real failure. Our brains go into autopilot, and we opt to start again from a fresh state, potentially leaving a ticking time bomb hidden for a long time.
You may not think this often happens, but it’s surprising how we can fall into this trap. I once saw a developer re-run a broken test suite five times in a row before realizing their latest code commit introduced a regression that the test run was alerting them about. The test suite was prone to flakiness every couple of days, so the team was used to clicking on the “Re-run” button on CI without giving it a second thought. If the developer had taken a look at the test run results the first time it failed, they'd have caught their regression on the first failure instead of the fifth.
Even worse than mindlessly re-running tests is that flaky tests erode trust in the test suite over time. Every time a failure turns out to be a flake, the team slides a little further into the bad habit of ignoring the CI results. For instance, I once worked on a project with flaky end-to-end tests. When we neared a deadline and were caught in a time crunch, a developer committed a bug that surfaced as a test failure on CI but was ignored due to past flakiness. With the deadline looming, the update was pushed to production, where it affected the application’s customers almost immediately and had to be rolled back in the middle of our night, which is never fun.
What Causes Test Flakiness and How You Can Address It
These problems should be stamped out as soon as they show up, before the workflow starts to suffer. The following are the causes I run into most often with a flaky test suite, along with the tips I use to correct them.
Timing issues
Some automated tests depend on scenarios whose duration can vary wildly from one run to the next. For example, you may have integration tests that need to communicate with external systems and services that aren’t under your direct control and can take longer to respond than you’d like. Another example is the animations and transitions used in modern web applications, which can trip up a test that expects an element to be on the screen before it has finished loading.
There are a few ways to minimize flakiness for these timing-based scenarios. Many test frameworks have built-in functionality to poll for a condition and proceed as soon as it's met, up to some timeout. This is far more reliable than a fixed sleep, which forces you to guess a duration that's either too short (leading to flakiness) or too long (leading to slow tests). Another potential solution is to mock external interfaces instead of hitting real endpoints, which needs careful consideration as it can introduce other problems (like tests that keep passing after the real interface has changed). The key to resolving timing issues is to make your tests check deterministic conditions instead of assuming and praying that time will always cooperate with you.
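To make the difference between polling and a fixed sleep concrete, here is a minimal sketch in Python. The `wait_until` helper and its parameter names are my own illustration, not the API of any particular framework; if your framework ships a built-in wait, prefer that.

```python
import time
from typing import Callable


def wait_until(
    condition: Callable[[], bool],
    timeout: float = 5.0,
    interval: float = 0.05,
) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds pass.

    Hypothetical helper for illustration. Returns True as soon as the
    condition holds, or False if the deadline passes first.
    """
    # monotonic() is not affected by system clock adjustments.
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A test would then replace something like `time.sleep(3)` with `assert wait_until(lambda: page_has_loaded())`, where `page_has_loaded` stands in for whatever check your test needs. The test continues the moment the condition is met and only pays the full timeout when something is actually wrong.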
Shared state between tests
Many automated tests require a starting point in the form of data, whether it’s from a database, a file, or from memory. That initial state gets modified as the test executes and runs its validations. However, if you’re not careful, the modified state from one test case can bleed into another one, causing the second test to fail because its data was not what it expected. This tends to happen with longer integration and end-to-end tests, which touch large sections of the codebase and increase the chance that the tester forgets to clean up afterward.
The reliable fix is to make isolation automatic rather than something a person has to remember. The textbook approach is to set up the data each test needs before it runs and tear it down to clean up after. However, the teardown step depends on the tester and is easy to skip, so lean on mechanisms that reset the state for you. Wrapping each test in a database transaction and rolling it back at the end is a good default. It's fast, and the cleanup happens whether or not you remembered to write it. Maintaining proper “data hygiene” for your automated tests will save you countless headaches as the test suite grows.
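As a rough sketch of the transaction-and-rollback idea, here is what it can look like with Python's built-in `sqlite3` module. The `rolled_back`, `make_db`, and `count_users` names and the `users` table are hypothetical, and I'm assuming a connection opened with `isolation_level=None` so the transaction boundaries are explicit. Many frameworks and ORMs offer their own version of this, which is usually the better choice.

```python
import sqlite3
from contextlib import contextmanager
from typing import Iterator


def make_db() -> sqlite3.Connection:
    # isolation_level=None puts sqlite3 in autocommit mode, so the only
    # transactions are the ones we open explicitly with BEGIN.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE users (name TEXT)")
    return conn


def count_users(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]


@contextmanager
def rolled_back(conn: sqlite3.Connection) -> Iterator[sqlite3.Connection]:
    """Run a test body inside a transaction that is always rolled back."""
    conn.execute("BEGIN")
    try:
        yield conn
    finally:
        # Runs whether the test passed, failed, or raised.
        if conn.in_transaction:
            conn.execute("ROLLBACK")
```

A test wraps its body in `with rolled_back(conn):` and writes whatever rows it needs. When the block exits, pass or fail, those rows are gone, so the next test starts from the same state.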
Test ordering dependencies
The ideal way to write and execute automated tests is to allow them to run independently from one another. That means any test case can be run in isolation or in a different order and always produce the same result. If a test only passes when run in a specific order, it’ll be prone to flakiness. Like the issues mentioned above with shared state, this problem also surfaces mostly on end-to-end tests, where it can get tricky to set up and tear down data so it slips by unnoticed. Shared state is often the cause, with order dependence being the symptom.
Dealing with flakiness stemming from the order in which the tests run starts from the moment the test is created. Developers and testers need to make sure each test passes on its own, without leaning on others. Although running each test in isolation helps, a test that passes on its own isn't proven free of flakiness. An ordering dependency can hide until the whole suite runs together. A good way to smoke this out is to have your tests always run in random order. Many test frameworks, or plugins for them, provide a flag to randomize order, plus a seed number you can record from each run. The recorded seed lets you replay the exact failing sequence so you can reproduce and address the issue.
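The reason a recorded seed can replay a failure is that a seeded shuffle is deterministic. Here is a small sketch of the mechanism in Python; `shuffled` is a hypothetical helper of mine, not how any specific framework implements it.

```python
import random
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def shuffled(tests: Sequence[T], seed: Optional[int] = None) -> Tuple[int, List[T]]:
    """Return the seed used and a shuffled copy of `tests`.

    With no seed, a fresh one is picked. Passing a recorded seed back in
    reproduces the same order, as long as the input list is the same.
    """
    if seed is None:
        seed = random.randrange(2**32)
    order = list(tests)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(order)
    return seed, order
```

A runner built on this would log the seed before executing anything, so the value survives even if the run crashes partway through. Reproducing a failure is then a matter of calling `shuffled(tests, seed=<logged value>)` and running the tests in that order.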
Tests using dates and time
Tests that rely on managing dates and times are very tricky to get right. We need to consider time zones when users are located across the globe. For instance, I can write a test to check that a date was set for today, but when someone is in a wildly different time zone, it can fail by showing the previous day instead. Date math is also complicated, such as using relative dates like “1 month ago,” since every month has a different number of days.
One way to make date/time tests more stable is to adjust the underlying code to permit passing in a reference time as a parameter so relative calculations like “1 month ago” resolve from a fixed point instead of whatever the clock happens to say. If that’s not possible, a common way testers can deal with relative times is to use a library that helps mock or “freeze” time so it won’t matter where or when the tests run. In addition, it’s a good idea to audit the time zone settings to make sure they’re consistent in every environment. An easy way to maintain consistency is to store dates and times in UTC on the backend and convert them to the user’s time zone on the frontend.
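Here is a minimal sketch of the reference-time approach in Python. The `one_month_ago` function is hypothetical, and clamping to the last day of a shorter month is just one possible policy that I'm assuming for the example; your application may define “1 month ago” differently.

```python
import calendar
from datetime import datetime, timezone
from typing import Optional


def one_month_ago(now: Optional[datetime] = None) -> datetime:
    """Return the same day in the previous month, relative to `now`.

    `now` defaults to the current UTC time, but tests can pass a fixed
    value. Assumed policy: if the previous month is shorter, clamp to
    its last day (March 31 becomes the last day of February).
    """
    if now is None:
        now = datetime.now(timezone.utc)
    if now.month > 1:
        year, month = now.year, now.month - 1
    else:
        year, month = now.year - 1, 12
    last_day = calendar.monthrange(year, month)[1]
    return now.replace(year=year, month=month, day=min(now.day, last_day))
```

Production code calls `one_month_ago()` with no arguments, while a test pins the reference point, for example `one_month_ago(datetime(2024, 3, 31, tzinfo=timezone.utc))`, and gets the same answer no matter when or where it runs.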
Lack of CI resources
Some types of test flakiness rarely show up in a developer’s or tester’s system. Our PCs and laptops are usually equipped with multi-core CPUs, gigabytes of RAM, and super-fast disk drives that help stabilize test runs. However, the entry-level tiers of continuous integration environments aren’t as generous with the resources they provide. Most systems running CI jobs are significantly underpowered compared to the average developer PC, particularly with disk I/O. With fewer resources at hand, it’s common to have heavier test scenarios flake during a CI run.
Organizations typically deal with the issue by paying for better CI runners. Most CI services let you bump up the power of the runners, and it’s the easiest way to resolve performance-related test flakiness, but it can come at a hefty cost. If pricing gets out of hand, an alternative is to use self-hosted runners, which can significantly cut down on expenses. I’ve used low-powered mini PCs that cost less than $200 to self-host CI jobs, and it’s surprising how well they work. Both options assume the test is genuinely resource-starved, though. If it's flaky because of a timing or state bug, a faster runner just hides the failure instead of removing it, so make sure you're paying to fix the right problem.
Flakiness Can Be Reduced but Never Fully Eliminated
No matter how careful you are in writing your tests or setting up your test environment, it’s inevitable that some flakiness will still creep in, even more so now that AI tools churn out tests in bulk without the care around timing, state, and cleanup that the sections above are about. The causes covered in this article are just a few of the ones you’ll likely run into. Still, following the advice here will make your tests more stable and keep them that way over time, even when other factors introduce flakiness.
The tips above will help you cut down on flakiness, but know that some will always slip through. Even the most careful developer or tester isn't immune to having some of their automated tests go flaky. That's where the danger I mentioned earlier shows up. Like the villagers who stopped believing the boy who cried wolf, you'll get used to waving off failures even if they’re real.
TestNod will close that gap for you. Your CI service will tell you when a test fails, but it won’t keep track of past failures to tell you whether the current failure is due to flakiness. TestNod tracks which test cases didn't pass and when, so the moment a test breaks, you know if you're looking at a flake or a genuine regression instead of guessing or spending time and money on a re-run to find out. When you’re able to catch legit failing tests before they ship, your team can trust your test suite more and ship faster.
TestNod is currently in private beta, but I’m looking for a few organizations that want better visibility into their automated test suites and help stamping out flakiness so their development workflow runs as smoothly as possible. Visit https://testnod.com/ and join the waitlist today.