TDD in Practice, Not in Theory

Strict TDD is a religion. Tests-as-you-go is engineering. The distinction matters more than the dogma.

There's a particular kind of engineer who argues about test-driven development at conferences but doesn't write tests at home. This is the worst mix of dogma and practice — they defend the purity of red-green-refactor while shipping untested code. Somehow, this is how TDD is often discussed in our industry.

Let me give a more honest accounting.

TDD in the strict Kent Beck sense, as laid out in his book Test-Driven Development: By Example (2002), means writing the failing test first, writing the minimum code to pass, refactoring, repeating, and never skipping the order. It is a real practice with real merits. It forces you to think about the interface before the implementation, produces high test coverage, gives constant feedback, and tends to produce smaller, more decomposable code.

The downsides are real too. Strict TDD is slow during exploration phases when you don't know what you're building. Tests written before the design stabilizes often test the wrong things and get discarded, so you pay for them twice. Writing failing tests first works better in domains with well-understood inputs and outputs (parsers, business logic, math-heavy code) and worse in emergent design domains (UI work, exploratory data pipelines, integration with external systems).

Almost nobody who claims to do TDD actually does strict TDD. Good engineers write tests close to writing code, not necessarily before, and prioritize tests where being wrong is costly. This approach, call it "tests-as-you-go" or "test-first-ish," captures most of strict TDD's value without the dogma.

Here's how I think about test discipline:

What to test heavily. Anything where being wrong is expensive. Business logic that calculates money. State machines that must be correct. Data transformations with uncontrolled inputs. Edge cases in date/time handling, character encoding, currency conversion. Boundary conditions around limits (pagination, rate limits, retries). Code behind an API contract observable to customers. Code handling authorization. Cryptographic code (which you shouldn't write, but if you do, test it).

For all of these, write the tests. Before, after, during: order matters less than existence. When you check coverage on this code, look at branch coverage, not line coverage; the interesting bugs hide in untaken branches.
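To make the branch-versus-line point concrete, here is a minimal sketch. The pricing rule, the constants, and the function name are all hypothetical, invented only for illustration.

```python
FREE_SHIPPING_THRESHOLD_CENTS = 5_000  # hypothetical business rule
FLAT_FEE_CENTS = 499                   # hypothetical flat fee


def shipping_fee_cents(order_total_cents: int) -> int:
    """Return the shipping fee; orders at or above the threshold ship free."""
    fee = FLAT_FEE_CENTS
    if order_total_cents >= FREE_SHIPPING_THRESHOLD_CENTS:
        fee = 0
    return fee


def test_large_order_ships_free():
    # Alone, this test executes every line of shipping_fee_cents (100% line
    # coverage) yet never takes the path where the `if` is false.
    assert shipping_fee_cents(10_000) == 0


def test_fee_on_both_sides_of_the_threshold():
    # These assertions cover the untaken branch and pin down the boundary,
    # so changing `>=` to `>` would now fail.
    assert shipping_fee_cents(4_999) == FLAT_FEE_CENTS
    assert shipping_fee_cents(5_000) == 0
```

With only the first test, a line-coverage report says the function is fully covered. A branch-coverage report (for example, coverage.py run with its `--branch` option) would flag the path the second test exercises.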

What to test lightly. Integration glue code. UI rendering (a snapshot test or two suffices). Code where tests mostly re-assert obvious behavior. Code to be discarded in the next release. Configuration and wiring code, where the real check is whether the production deploy succeeds, not whether a unit test passes.

What to barely test at all. Throwaway prototypes. Spikes. Code you'll delete next week. The Bubble version of your MVP. Anything where being wrong means "we throw it away and learn." Test fewer of these than your conscience suggests; you're trading test-writing time against learning velocity, and at the prototype stage, learning is the asset.

Integration tests are underrated. A high proportion of production bugs occur at seams: between services, between modules, between the app and the database. Unit tests don't catch these. In my experience, teams that invest in integration tests (a real database, real network calls to test doubles, real calls through the layers of the system) have noticeably fewer production incidents than teams that don't, regardless of unit test coverage. Integration tests cost more per test but deliver more value per test, so the investment pays off.
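As a rough sketch of what a test at the database seam looks like, the example below runs a repository against a real SQL engine instead of a mock. The `UserRepository` class and its schema are hypothetical, and in-memory SQLite is only a stand-in; ideally you would run the same test against the engine you use in production.

```python
import sqlite3


class UserRepository:
    """Hypothetical repository; the schema and method names are illustrative."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS users ("
            "id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)"
        )

    def add(self, email: str) -> int:
        with self.conn:  # commits on success, rolls back on error
            cur = self.conn.execute(
                "INSERT INTO users (email) VALUES (?)", (email.strip().lower(),)
            )
        return cur.lastrowid

    def find_id(self, email: str):
        row = self.conn.execute(
            "SELECT id FROM users WHERE email = ?", (email.strip().lower(),)
        ).fetchone()
        return row[0] if row else None


def test_duplicate_email_is_rejected_by_the_database():
    conn = sqlite3.connect(":memory:")  # a real engine, not a mock
    repo = UserRepository(conn)
    first_id = repo.add("Ada@Example.com")

    try:
        repo.add("ada@example.com ")  # same address after normalization
        rejected = False
    except sqlite3.IntegrityError:
        rejected = True

    assert rejected
    assert repo.find_id("ADA@example.com") == first_id
    assert conn.execute("SELECT COUNT(*) FROM users").fetchone()[0] == 1
```

The uniqueness rule lives in the schema, not in Python, so a unit test with a mocked connection could not tell you whether it holds. That is the kind of seam bug this style of test is meant to catch.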

End-to-end tests are usually overrated. Selenium suites, Playwright suites that exercise the whole UI, big tests that take 20 minutes to run — these feel comprehensive but are slow, flaky, expensive to maintain, and produce ambiguous failures. ("The login button is broken" — is it the JavaScript, auth service, database, CSS?) Have a few true E2E tests for critical paths (sign-up, checkout, main user flow). Don't aim for full E2E coverage. The math doesn't work.

Property-based testing is the secret weapon nobody uses. Tools like Hypothesis (Python), QuickCheck (Haskell, ported to many languages), or fast-check (JavaScript) generate many inputs from a description of the input space, check that a property you state holds for every one of them, and shrink any failure to a small counter-example. Hillel Wayne's writing on property-based testing as the next step after TDD is the best operator-grade introduction. For algorithmic code, this catches bugs that example-based tests will never catch. The investment to learn it is one afternoon. The payoff is enormous on the right code. Mostly unused in our industry. Try it.
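To show the idea without depending on any library, here is a hand-rolled sketch using only the standard library. It is not how Hypothesis or QuickCheck work internally; real tools add richer input generators and automatic shrinking of failing cases. The run-length encoder is a hypothetical example chosen because it has an obvious round-trip property.

```python
import random


def rle_encode(text: str) -> list:
    """Run-length encode text into (character, count) pairs."""
    runs = []
    for ch in text:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def rle_decode(runs: list) -> str:
    """Inverse of rle_encode."""
    return "".join(ch * count for ch, count in runs)


def random_text(rng: random.Random, max_len: int = 20) -> str:
    # A tiny alphabet makes long runs and empty strings likely.
    length = rng.randint(0, max_len)
    return "".join(rng.choice("ab ") for _ in range(length))


def check_rle_properties(trials: int = 500, seed: int = 0) -> None:
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    for _ in range(trials):
        text = random_text(rng)
        runs = rle_encode(text)
        # Property 1: decoding an encoding returns the original text.
        assert rle_decode(runs) == text, f"round trip failed for {text!r}"
        # Property 2: adjacent runs never repeat the same character.
        assert all(a[0] != b[0] for a, b in zip(runs, runs[1:])), (
            f"unmerged runs for {text!r}"
        )


check_rle_properties()
```

Note that neither property names a specific expected output. You state what must always be true and let the generator hunt for an input that breaks it.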

Test brittleness is a real cost. If tests break every time you refactor, they're holding your code back, not protecting it. Tests should assert on behavior, not implementation. A test that mocks the database, cache, logger, and queue ends up testing whether the code calls the mocks as expected. That's not a test of behavior. It's a test of one possible implementation. Refactor the implementation, and you must rewrite the test. After a few cycles, the team stops refactoring because it breaks tests, and the codebase ossifies. This is one of the most insidious failure modes in a tested codebase.

The fix is to assert on observable behavior — the output, state, external side effects. Avoid mocking inside your own bounded context; mock only at the boundary (third-party API, queue, email service). Use real instances of your own classes wherever possible.
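A small sketch of that fix follows. Every class here is hypothetical: `InMemoryUserStore` stands for code inside your own bounded context (used for real), and `RecordingMailer` is a hand-written test double for a third-party email service at the boundary.

```python
class InMemoryUserStore:
    """Our own code: the test uses the real class, not a mock."""

    def __init__(self):
        self._emails = set()

    def add(self, email: str) -> bool:
        if email in self._emails:
            return False
        self._emails.add(email)
        return True

    def contains(self, email: str) -> bool:
        return email in self._emails


class RecordingMailer:
    """Test double for the external email service (the boundary)."""

    def __init__(self):
        self.sent = []

    def send(self, to: str, subject: str) -> None:
        self.sent.append((to, subject))


class SignupService:
    def __init__(self, store, mailer):
        self._store = store
        self._mailer = mailer

    def sign_up(self, email: str) -> bool:
        normalized = email.strip().lower()
        if not self._store.add(normalized):
            return False  # already registered: no second email
        self._mailer.send(normalized, "Welcome")
        return True


def test_sign_up_stores_user_and_sends_one_welcome_email():
    store, mailer = InMemoryUserStore(), RecordingMailer()
    service = SignupService(store, mailer)

    assert service.sign_up(" Ada@Example.com ") is True
    assert service.sign_up("ada@example.com") is False

    # Assert on output, state, and the external side effect only.
    assert store.contains("ada@example.com")
    assert mailer.sent == [("ada@example.com", "Welcome")]
```

The test never checks which internal methods were called or in what order, so `SignupService` can be restructured freely as long as the return values, the stored state, and the outgoing email stay the same.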

Speed matters more than people admit. A test suite that takes 20 minutes to run is a test suite people skip. Engineers will run "just the relevant tests" before pushing. The CI will run the full suite, but a 20-minute CI loop means PR reviews happen in batches and engineering velocity craters. Aim for a unit suite under 30 seconds, an integration suite under three minutes, an E2E suite under ten. If you're over these numbers, fixing test suite speed is one of the highest-leverage things an engineering team can do.

Flaky tests are bugs. Every time a test fails intermittently and someone re-runs CI to make it pass, the team trains itself to ignore test failures. Eventually, a real test failure gets re-run and ignored too. Flaky tests should be treated like production incidents — investigate, root-cause, fix. If you can't fix it, delete the test. A deleted test is better than a flaky test, because a deleted test forces you to acknowledge what you don't have, while a flaky test gives false confidence.
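Root causes vary, but a hidden dependence on the wall clock is one common source of flakiness, and it illustrates what "fix" usually means: remove the nondeterminism rather than retry around it. The sketch below assumes a hypothetical `Session` class; the time is passed in explicitly instead of being read from the system inside the method.

```python
from datetime import datetime, timedelta, timezone


class Session:
    """Hypothetical session with a fixed time-to-live."""

    def __init__(self, created_at: datetime, ttl: timedelta = timedelta(minutes=30)):
        self.expires_at = created_at + ttl

    def is_expired(self, now: datetime) -> bool:
        # `now` is injected by the caller, so tests control time exactly.
        return now >= self.expires_at


def test_session_expires_exactly_at_ttl():
    start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    session = Session(created_at=start)

    # No sleeping and no reading the real clock: same result on every run.
    assert not session.is_expired(start + timedelta(minutes=29, seconds=59))
    assert session.is_expired(start + timedelta(minutes=30))
```

A version that read the current time inside `is_expired` and slept in the test would pass or fail depending on machine load. This version is also faster, which helps with the suite-speed budget.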

Coverage numbers are vanity. A team with 90% line coverage and bad tests can easily have worse quality than a team with 40% coverage and good tests. Don't set coverage targets. Set "we test the things that matter, and we know which things matter" as the bar. This is harder to measure and far more useful.

AI-generated tests are now the default, and that's the new failure mode. Claude Code, Cursor, and Codex will generate tests alongside the code they write. The output is usually fine for easy cases and dangerously thin on hard ones. Worse, the model often writes tests that match its own implementation rather than tests that verify the intended behavior, meaning tests pass even when the code is wrong, because both are wrong in the same direction. The discipline: read the tests as critically as you read the code, and add the edge cases the model didn't think of. The cost of writing tests has fallen; the cost of knowing what to test has not.
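Here is a deliberately small illustration of "wrong in the same direction." The function is intentionally buggy, and both checks are hypothetical; the first mirrors the implementation the way a generated test often does, and the second takes its expected values from the specification (the Gregorian calendar).

```python
def is_leap_year(year: int) -> bool:
    # Deliberately buggy: ignores the century rules of the Gregorian calendar.
    return year % 4 == 0


def mirrored_check_passes() -> bool:
    # Expected values are recomputed with the same rule as the implementation,
    # so the check can only ever agree with the code, bug included.
    return all(is_leap_year(y) == (y % 4 == 0) for y in range(1890, 2030))


def spec_check_passes() -> bool:
    # Expected values come from the calendar itself, not from the code.
    known = {1900: False, 2000: True, 2023: False, 2024: True}
    return all(is_leap_year(y) == expected for y, expected in known.items())


print(mirrored_check_passes())  # True: the check agrees with the bug
print(spec_check_passes())      # False: the year 1900 exposes it
```

When reviewing generated tests, one quick question to ask of each assertion is where the expected value came from. If the answer is "the same logic as the code," the test is not verifying much.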

Finally: the cultural part. Engineers who care about testing tend to ship more reliable software, and those who don't tend to ship more bugs. This is true on average. But it is not true that the most rigorous testers are the best engineers. The best engineers I've worked with have nuanced views about what to test, when to test, and what testing costs. The worst-tested codebases come from engineers who don't write tests because they "know their code works" (fragile arrogance), or those who test everything with full mock pyramids, resulting in change-resistant codebases (fragile rigor). The right disposition is in the middle: tests are a tool, the tool has costs, use the tool when the cost is worth it.

Write tests. Write fewer than dogma demands, more than your instincts suggest, and target them where being wrong is expensive. That's TDD in practice. The theory is fine. The theory will not ship the product.