
@lmmx
Created October 26, 2025 10:49
Node outlines by Claude Sonnet 4.5 of Test Doubles, ch. 13 of "Software Engineering at Google" https://abseil.io/resources/swe-book/html/ch13.html

Test Doubles Overview

  • Stand-ins for real implementations in tests (like stunt doubles)
  • Three techniques: faking, stubbing, interaction testing
  • Google learned the hard way: overusing mocking frameworks produces high-maintenance tests that rarely find bugs
  • Practices vary widely across Google teams

Making Code Testable

  • Testability requires upfront investment (harder to retrofit later)
  • Use dependency injection to create "seams" for test doubles
  • Mocking frameworks reduce boilerplate but come with major caveats

Prefer Real Implementations

  • First choice: use real implementations (same as production)
  • "Classical testing" vs "mockist testing" - Google found classical scales better
  • Real implementations give confidence; test doubles isolate but don't prove correctness
  • Use real implementations when: fast, deterministic, simple dependencies
  • Trade-offs to consider: execution time, determinism/flakiness, dependency construction complexity

Fakes: The Best Test Double

  • Lightweight implementation behaving like real thing (e.g., in-memory database)
  • Single fake can radically improve testing experience across organization
  • Must maintain fidelity to API contracts (same inputs → same outputs)
  • Fakes need their own tests (contract tests against real implementation)
  • Team owning real implementation should own the fake
  • If no fake exists: ask owners to create one, write your own wrapper, or use real implementation

Stubbing: Use Sparingly

  • Quick way to hardcode return values inline
  • Dangers: tests become unclear (extra code obscures intent), brittle (leaks implementation details), less effective (no fidelity guarantee, can't store state)
  • Appropriate use: when you need specific return value to reach certain state, and each stub directly relates to assertions
  • Still prefer fakes/real implementations even when stubbing seems appropriate

Interaction Testing: Avoid When Possible

  • Validates function calls without executing them
  • Problems: can't prove system works (only that functions were called), exposes implementation details ("change-detector tests")
  • Appropriate when: can't do state testing (no real implementation/fake), or call count/order matters (e.g., caching)
  • Best practices: only for state-changing functions (sendEmail, not getUser), avoid overspecification (use any() for irrelevant args)
  • Not a replacement for state testing - supplement with integration tests

Key Principles

  • Prefer real implementations > fakes > stubbing > interaction testing
  • Test behavior through state, not through validating internal calls
  • No exact answers - engineer judgment and trade-offs required
  • Eventually need larger-scope tests to exercise real dependencies
  • Test Doubles
    • Introduction

      • Unit tests critical but difficult for complex code
      • Example: testing function that hits external server + database
      • Test double definition: object/function standing in for real implementation (like stunt double)
      • Avoid term "mocking" (ambiguous)
      • Uses: substituting simpler implementations (in-memory DB), validating system details, triggering rare error conditions
      • Enable small tests despite production code needing multiple processes/machines
      • Test doubles more lightweight than real implementations
      • Complications and trade-offs introduced
      • Google's experience: benefits when used properly, negative impact when misused
      • Historical lesson: danger of overusing mocking frameworks
        • Initially seemed perfect for every case
        • Easy to write focused, isolated tests
        • Years later: high maintenance cost, rarely found bugs
        • Pendulum swinging back toward realistic tests
      • Practice varies widely across teams at Google
        • Inconsistent knowledge
        • Inertia in existing codebases
        • Short-term ease vs long-term consequences
    • The Impact of Test Doubles on Software Development

      • Basic concepts foundation for best practices

      • Testability

        • Code is testable if written to allow unit tests

        • Seam: makes code testable by allowing test doubles

        • Enables using different dependencies in tests vs production

        • Dependency injection

          • Common technique for introducing seams
          • Classes receive dependencies as parameters instead of instantiating directly
          • Enables substitution in tests
          • Example: PaymentProcessor constructor accepts CreditCardService
          • Production passes real implementation, tests pass test double
          • Automated DI frameworks reduce boilerplate (Guice, Dagger at Google)
          • Dynamic languages (Python, JavaScript) can replace functions/methods
          • DI less important in dynamic languages
        • Testability requires upfront investment

          • Critical early in codebase lifetime
          • Later = more difficult to apply
          • Code without testing in mind needs refactoring/rewriting before adding tests
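The constructor-injection seam above (the chapter's PaymentProcessor/CreditCardService example) can be sketched in Python; the method names (`make_payment`, `charge`) and the fake class are illustrative assumptions, not from the book:

```python
class CreditCardService:
    """Production implementation; charge() would call a payment server."""

    def charge(self, card_number: str, amount: int) -> bool:
        raise NotImplementedError("network call in production")


class PaymentProcessor:
    def __init__(self, service: CreditCardService):
        # The constructor parameter is the "seam": production code passes
        # the real CreditCardService, tests pass a test double.
        self._service = service

    def make_payment(self, card_number: str, amount: int) -> bool:
        if amount <= 0:
            return False
        return self._service.charge(card_number, amount)


class FakeCreditCardService(CreditCardService):
    """Test double: records charges in memory instead of hitting a server."""

    def __init__(self):
        self.charges = []

    def charge(self, card_number, amount):
        self.charges.append((card_number, amount))
        return True


processor = PaymentProcessor(FakeCreditCardService())
assert processor.make_payment("4111000011110000", 100) is True
```

A DI framework (Guice, Dagger) automates this wiring, but the seam itself is just the parameter.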
      • Applicability

        • Mocking frameworks
          • Software library for creating test doubles within tests
          • Creates "mock" with inline-specified behavior
          • Reduces boilerplate vs defining new classes
          • Example: Mockito for Java
          • Available for most major languages
          • Google uses: Mockito (Java), googlemock (C++), unittest.mock (Python)
          • Significant caveats: overuse makes codebase harder to maintain
          • Problems covered later in chapter
      • Techniques for Using Test Doubles

        • Three primary techniques

        • Brief intro for quick overview

        • Detailed discussion later

        • Engineer awareness of distinctions helps choose appropriate technique

        • Faking

          • Lightweight API implementation
          • Behaves like real implementation
          • Not suitable for production
          • Example: in-memory database
          • Often ideal technique when test double needed
          • May not exist for needed object
          • Writing one challenging: must ensure similar behavior now and future
        • Stubbing

          • Giving behavior to function with no behavior on its own
          • Specify exact return values
          • Example: when(...).thenReturn(...) in Mockito
          • Typically done through mocking frameworks
          • Reduces boilerplate
          • Quick and simple but has limitations (discussed later)
        • Interaction testing

          • Validate how function is called without calling implementation
          • Test fails if function not called correctly (not at all, too many times, wrong args)
          • Example: verify(...) in Mockito
          • Sometimes called "mocking" (avoid this term - confusing)
          • Typically done through mocking frameworks
          • Reduces boilerplate for tracking calls and arguments
          • Useful in certain situations but avoid when possible
          • Overuse causes brittle tests
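The chapter's Mockito idioms map directly onto Python's `unittest.mock` (which the chapter lists as Google's Python mocking framework); a minimal sketch, with an assumed `CreditCardService` class for illustration:

```python
from unittest import mock


class CreditCardService:
    def charge(self, amount: int) -> bool:
        raise NotImplementedError("real implementation calls a server")


service = mock.create_autospec(CreditCardService, instance=True)

# Stubbing: specify an exact return value inline
# (~ when(...).thenReturn(...) in Mockito).
service.charge.return_value = True
assert service.charge(100) is True

# Interaction testing: validate how the function was called, without ever
# running the real implementation (~ verify(...) in Mockito).
service.charge.assert_called_once_with(100)
```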
    • Real Implementations

      • Prefer Realism Over Isolation

        • First choice: use real implementations (same as production)
        • Higher fidelity when executing code as in production
        • Preference developed over time at Google
        • Saw overuse of mocking frameworks pollute tests
          • Repetitive code
          • Out of sync with real implementation
          • Made refactoring difficult
        • Known as "classical testing"
        • Contrast: "mockist testing" prefers mocking frameworks
        • Google found mockist testing difficult to scale
        • Requires strict design guidelines
        • Most Google engineers write code suitable for classical testing
        • Real implementations make system under test more realistic
        • All code in real implementations executed in test
        • Test doubles isolate system under test from dependencies
        • Prefer realistic tests for confidence
        • If unit tests rely too much on test doubles: need integration tests or manual verification
        • Extra tasks slow development, allow bugs to slip through
        • Replacing all dependencies arbitrarily isolates implementation
        • Good test should be independent of implementation
        • Should test API, not implementation structure
        • Test failing from bug in real implementation is good
        • Indicates code won't work in production
        • Bug can cause cascade of test failures
        • Good developer tools (CI) make tracking failures easy
      • When Should You Use a Real Implementation?

        • Preferred if fast, deterministic, simple dependencies

        • Use for value objects (money, date, address, collections)

        • For complex code: often not feasible

        • No exact answer - trade-offs to consider

        • Execution time

          • Unit tests should be fast
          • Want quick feedback during development
          • Want quick finish in CI
          • Test double useful when real implementation slow
          • No exact threshold for "too slow"
            • 1ms added per test: not slow
            • 10ms, 100ms, 1s, etc: depends on context
          • Depends on productivity loss, number of tests using implementation
          • 1s extra reasonable for 5 tests, not for 500
          • Borderline: simpler to use real implementation until too slow
          • Then update to test doubles
          • Parallelization helps reduce execution time
          • Google infrastructure: trivial to split tests across servers
          • Increases CPU cost, large developer time savings
          • Trade-off: real implementation increases build times
          • Must build real implementation + all dependencies
          • Scalable build systems (Bazel) help with caching
        • Determinism

          • Deterministic: for given version, test always same outcome (always pass or always fail)
          • Nondeterministic: outcome can change even if system under test unchanged
          • Nondeterminism leads to flakiness
          • Occasional failures even with no changes
          • Flakiness harms test suite health
          • Developers distrust results, ignore failures
          • If rare flakiness: might not warrant response
          • If frequent: replace real implementation with test double
          • Real implementation more complex than test double
          • Increases nondeterminism likelihood
          • Example: multithreading can cause occasional failures
          • Output differs based on thread execution order
          • Common cause: code not hermetic
          • Dependencies on external services outside test control
          • Example: reading web page can fail (server overloaded, content changes)
          • Use test double instead
          • If not feasible: hermetic server instance (life cycle controlled by test)
          • Hermetic instances discussed in next chapter
          • Another example: code relying on system clock
          • Output differs based on current time
          • Test double can hardcode specific time
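The clock example above can be made concrete: passing the time source through a seam lets the test hardcode a specific time, removing the nondeterminism. The `greeting` function is a hypothetical stand-in for any clock-dependent code:

```python
import datetime


def greeting(now_fn=datetime.datetime.now):
    """Depends on the clock only through the now_fn seam."""
    hour = now_fn().hour
    return "Good morning" if hour < 12 else "Good afternoon"


# The test double hardcodes a specific time, so the test is deterministic:
# it produces the same outcome no matter when it runs.
assert greeting(lambda: datetime.datetime(2025, 1, 1, 9, 0)) == "Good morning"
assert greeting(lambda: datetime.datetime(2025, 1, 1, 15, 0)) == "Good afternoon"
```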
        • Dependency construction

          • Real implementation: must construct all dependencies
          • Entire dependency tree: object + its dependencies + their dependencies, etc.
          • Test double often has no dependencies
          • Much simpler to construct
          • Extreme example: new Foo(new A(new B(new C()), new D()), new E(), ..., new Z())
          • Time-consuming to determine construction
          • Tests need constant maintenance when constructors change
          • Tempting to use test double (trivial construction)
          • Example: @Mock Foo mockFoo;
          • Creating test double simpler but significant benefits to real implementation
          • Significant downsides to overusing test doubles
          • Trade-off needed
          • Ideal solution: use same object construction as production
          • Factory method or automated dependency injection
          • Object construction needs flexibility for test doubles
          • Can't hardcode production implementations
    • Faking

      • If real implementation not feasible: fake often best option

      • Fake preferred over other techniques

      • Behaves similarly to real implementation

      • System under test can't tell difference

      • Example: fake file system with in-memory storage
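A minimal in-memory fake, modeled on the chapter's database example (the API shape is an assumption). It keeps fidelity to the contract discussed later — save errors on duplicate IDs — and fails fast on paths it doesn't support:

```python
class ItemExistsError(Exception):
    pass


class FakeDatabase:
    """In-memory fake: behaves like the real database from the system
    under test's perspective, but stores rows in a dict, not on disk."""

    def __init__(self):
        self._rows = {}

    def save(self, item_id, item):
        if item_id in self._rows:
            # Same contract as the real implementation: duplicate IDs error.
            raise ItemExistsError(item_id)
        self._rows[item_id] = item

    def get(self, item_id):
        return self._rows.get(item_id)

    def run_query(self, query):
        # Fail fast: this fake doesn't implement raw queries, and raising
        # here tells the test author the fake isn't appropriate for that use.
        raise NotImplementedError("raw queries not supported by this fake")


db = FakeDatabase()
db.save(1, "widget")
assert db.get(1) == "widget"
```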

      • Why Are Fakes Important?

        • Powerful testing tool
        • Execute quickly
        • Effectively test code without real implementation drawbacks
        • Single fake can radically improve testing experience
        • Many fakes = enormous boost to engineering velocity
        • Where fakes are rare: slower velocity
        • Engineers struggle with real implementations (slow, flaky tests)
        • Or resort to stubbing/interaction testing (unclear, brittle, less effective)
      • When Should Fakes Be Written?

        • Requires more effort and domain experience
        • Must behave similarly to real implementation
        • Requires maintenance when real implementation changes
        • Team owning real implementation should write and maintain fake
        • Trade-off: productivity improvements vs costs of writing/maintaining
        • Few users: might not be worth it
        • Hundreds of users: obvious productivity improvement
        • Create fake only at root of code not feasible for tests
        • Example: if database can't be used, fake the database API itself
        • Not each class calling database API
        • Maintaining fake burdensome if duplicated across languages
        • Solution: single fake service implementation
        • Client libraries send requests to fake service
        • More heavyweight (cross-process communication)
        • Reasonable trade-off if tests still execute quickly
      • The Fidelity of Fakes

        • Most important concept: fidelity
        • How closely fake behavior matches real implementation
        • If behavior doesn't match: test not useful
        • Test might pass but code path might not work in real implementation
        • Perfect fidelity not always feasible
        • Fake necessary because real implementation unsuitable
        • Example: fake database doesn't store on hard drive (uses memory)
        • Primarily: maintain fidelity to API contracts
        • For any input: same output and state changes as real implementation
        • Example: database.save(itemId) saves when ID doesn't exist, errors when exists
        • Fake must conform to same behavior
        • Think of perfect fidelity from test's perspective
        • Example: hashing API fake doesn't need exact same hash values
        • Tests care about unique hash for given input, not specific value
        • If API contract doesn't guarantee specific values: fake still conforming
        • Other examples where perfect fidelity not useful: latency, resource consumption
        • Can't use fake if explicitly testing these constraints (performance tests)
        • Resort to other mechanisms (real implementation)
        • Fake might not need 100% functionality
        • Especially behavior not needed by most tests (rare error handling)
        • Best to fail fast: raise error if unsupported code path executed
        • Communicates fake not appropriate in this situation
      • Fakes Should Be Tested

        • Fake must have own tests
        • Ensures conformance to API of real implementation
        • Without tests: behavior can diverge as real implementation evolves
        • One approach: contract tests
        • Write tests against API's public interface
        • Run tests against both real implementation and fake
        • Tests against real implementation slower
        • Downside minimized: only run by fake owners
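The contract-test idea can be sketched as a single test body run against both implementations. Here `RealDatabase` is a placeholder (in a real codebase it would talk to an actual database, which is why only the fake's owners run that slower leg):

```python
class FakeDatabase:
    """In-memory fake of the database API."""

    def __init__(self):
        self._rows = {}

    def save(self, item_id, item):
        if item_id in self._rows:
            raise KeyError(item_id)
        self._rows[item_id] = item

    def get(self, item_id):
        return self._rows.get(item_id)


class RealDatabase(FakeDatabase):
    """Placeholder for the real implementation in this sketch only."""


def check_save_contract(db):
    """Contract test: written against the public API, so the same body
    runs against both the real implementation and the fake."""
    db.save(1, "a")
    assert db.get(1) == "a"
    try:
        db.save(1, "b")
    except KeyError:
        return  # duplicate IDs must error, per the API contract
    raise AssertionError("save() accepted a duplicate ID")


for impl in (FakeDatabase, RealDatabase):
    check_save_contract(impl())
```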
      • What to Do If a Fake Is Not Available

        • First: ask API owners to create one
        • Might not be familiar with fakes concept
        • Might not realize benefits
        • If owners unwilling/unable: write your own
        • One way: wrap all API calls in single class
        • Create fake version not talking to API
        • Simpler than faking entire API
        • Often need only subset of API behavior
        • At Google: some teams contributed fake to API owners
        • Allowed other teams to benefit
        • Finally: settle on real implementation (deal with trade-offs)
        • Or resort to other test double techniques (deal with their trade-offs)
        • Think of fake as optimization
        • If tests too slow with real implementation: create fake for speed
        • If speedup doesn't outweigh creation/maintenance work: stick with real implementation
    • Stubbing

      • Way for test to hardcode behavior for function with no behavior

      • Often quick and easy to replace real implementation

      • Example: simulating credit card server response

      • Easy to apply: tempting to use when real implementation not trivial

      • Overuse causes major productivity losses for maintenance

      • The Dangers of Overusing Stubbing

        • Tests become unclear

          • Stubbing involves writing extra code to define behavior
          • Extra code detracts from test intent
          • Difficult to understand if unfamiliar with implementation
          • Key sign stubbing inappropriate: mentally stepping through system under test
          • To understand why functions are stubbed
        • Tests become brittle

          • Stubbing leaks implementation details into test
          • When implementation changes: update tests
          • Ideally: test changes only if user-facing behavior changes
          • Should be unaffected by implementation changes
        • Tests become less effective

          • No way to ensure stubbed function behaves like real implementation
          • Example: when(stubCalculator.add(1, 2)).thenReturn(3)
          • Hardcodes part of contract
          • Poor choice if system under test depends on real contract
          • Forced to duplicate contract details
          • No guarantee contract is correct (no fidelity guarantee)
          • No way to store state with stubbing
          • Difficult to test certain aspects
          • Example: database.save(item) then database.get(item.id())
          • Real implementation/fake: both access internal state
          • Stubbing: no way to do this
          • Example of overuse: test with many when() statements
          • Example of refactored test: shorter, no implementation details exposed
          • No special setup needed: credit card server knows how to behave
          • Don't want test talking to external server
          • Fake credit card server more suitable
          • If fake unavailable: real implementation with hermetic server
          • Increases execution time
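The "no way to store state" danger can be shown directly: with a stub, `save()` and `get()` share no state, so the test must hardcode the very answer it later asserts. A sketch using `unittest.mock`, with an assumed `Database` API:

```python
from unittest import mock


class Database:
    def save(self, item): ...
    def get(self, item_id): ...


stub_db = mock.create_autospec(Database, instance=True)

# The test duplicates a contract detail it cannot guarantee is correct...
stub_db.get.return_value = "widget"

stub_db.save("widget")  # has no effect on what get() returns

# ...so this assertion passes even if save() is completely broken.
assert stub_db.get(1) == "widget"
```

A real implementation or a fake would access shared internal state here, making the save-then-get round trip a meaningful test.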
      • When Is Stubbing Appropriate?

        • Not catch-all replacement for real implementation
        • Appropriate when needing function to return specific value
        • Gets system under test into certain state
        • Example: requiring non-empty list of transactions
        • Function behavior defined inline
        • Can simulate wide variety of return values or errors
        • Might not be possible to trigger from real implementation/fake
        • Each stubbed function should have direct relationship with test assertions
        • Purpose should be clear
        • Test typically should stub small number of functions
        • Many stubbed functions: less clear tests
        • Can be sign of stubbing overuse
        • Or system under test too complex (should refactor)
        • Even when appropriate: real implementations/fakes still preferred
        • Don't expose implementation details
        • Give more correctness guarantees
        • Stubbing reasonable as long as usage constrained
        • Tests shouldn't become overly complex
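The appropriate case above — stubbing a specific return value to reach a state the assertion needs — might look like this, with `TransactionStore` and `has_activity` as hypothetical names:

```python
from unittest import mock


class TransactionStore:
    def recent_transactions(self):
        raise NotImplementedError


def has_activity(store) -> bool:
    """System under test: the behavior we actually want to assert on."""
    return len(store.recent_transactions()) > 0


# Appropriate stubbing: the single stub exists only to put the system
# under test into the required state (a non-empty transaction list),
# and it relates directly to the assertion below.
stub_store = mock.create_autospec(TransactionStore, instance=True)
stub_store.recent_transactions.return_value = ["t1", "t2"]

assert has_activity(stub_store) is True
```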
    • Interaction Testing

      • Validate how function is called without calling implementation

      • Mocking frameworks make interaction testing easy

      • Important to perform only when necessary

      • Keeps tests useful, readable, resilient to change

      • Prefer State Testing Over Interaction Testing

        • State testing preferred over interaction testing
        • State testing: call system under test, validate correct return value or state change
        • Example: sorting numbers, validating sorted result
        • Doesn't matter which algorithm used
        • Interaction testing example: can't determine numbers actually sorted
        • Test doubles don't know how to sort
        • Only tells you system under test tried to sort
        • At Google: emphasizing state testing more scalable
        • Reduces test brittleness
        • Easier to change and maintain code over time
        • Primary issue: can't tell system under test working properly
        • Only validates certain functions called as expected
        • Requires assumption about code behavior
        • Example: "If database.save(item) called, assume item saved"
        • State testing validates this assumption
        • Actually saves and queries to validate existence
        • Another downside: utilizes implementation details
        • To validate function called: expose that system under test calls function
        • Similar to stubbing: extra code makes tests brittle
        • Leaks implementation details into tests
        • Some Google engineers call these "change-detector tests"
        • Fail in response to any production code change
        • Even if behavior unchanged
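The sorting example above, sketched both ways: the state test validates the actual result, while the interaction test passes even when the double returns garbage:

```python
from unittest import mock


def sort_numbers(numbers, sorter=sorted):
    """System under test; the sorter seam exists only for this demo."""
    return sorter(numbers)


# State test: validates the return value, regardless of which algorithm
# (or library call) produced it.
assert sort_numbers([3, 1, 2]) == [1, 2, 3]

# Interaction test: only checks that a sorter was invoked. It passes even
# though this double returns an UNSORTED list -- it cannot tell us the
# numbers were actually sorted.
mock_sorter = mock.Mock(return_value=[3, 1, 2])
sort_numbers([3, 1, 2], sorter=mock_sorter)
mock_sorter.assert_called_once_with([3, 1, 2])
```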
      • When Is Interaction Testing Appropriate?

        • Some cases warrant interaction testing:
        • Cannot perform state testing: unable to use real implementation or fake
        • Real implementation too slow, no fake exists
        • Fallback: interaction testing to validate certain functions called
        • Not ideal but provides basic confidence
        • Differences in number/order of calls would cause undesired behavior
        • Interaction testing useful: difficult to validate with state testing
        • Example: caching feature should reduce database calls
        • Verify database not accessed more than expected
        • Mockito example: verify(databaseReader, atMostOnce()).selectRecords()
        • Interaction testing not complete replacement for state testing
        • If can't perform state testing in unit test: supplement with larger-scoped tests
        • Larger-scope tests perform state testing
        • Example: unit test validates database usage via interaction testing
        • Add integration test performing state testing against real database
        • Larger-scope testing important for risk mitigation
        • Discussed in next chapter
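The caching case — where call count matters and state testing can't easily see it — translates from the chapter's Mockito `verify(databaseReader, atMostOnce()).selectRecords()` into `unittest.mock` roughly as follows (the `CachingReader` class is an assumed sketch):

```python
from unittest import mock


class DatabaseReader:
    def select_records(self):
        raise NotImplementedError


class CachingReader:
    """Caches the first read so repeated calls hit the database only once."""

    def __init__(self, reader):
        self._reader = reader
        self._cache = None

    def records(self):
        if self._cache is None:
            self._cache = self._reader.select_records()
        return self._cache


db = mock.create_autospec(DatabaseReader, instance=True)
db.select_records.return_value = ["r1"]

cache = CachingReader(db)
cache.records()
cache.records()

# Interaction test: the caching behavior is exactly about call count,
# which the returned value alone cannot reveal.
db.select_records.assert_called_once()
```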
      • Best Practices for Interaction Testing

        • Following practices reduce impact of downsides

        • Prefer interaction testing only for state-changing functions

          • System under test calls dependency function: falls into two categories
          • State-changing: observable side effects (sendEmail, saveRecord, logAccess)
          • Non-state-changing: returns value, no side effects (getUser, findResults, readFile)
          • In general: perform interaction testing only for state-changing functions
          • Non-state-changing interaction testing usually redundant
          • System under test uses return value for other work you can assert
          • Interaction itself not important for correctness (no side effects)
          • Makes test brittle: update test when interaction pattern changes
          • Less readable: additional assertions obscure important assertions
          • State-changing interactions represent useful work changing state
          • Example: testing both types
          • addPermission() state-changing: reasonable to test interaction
          • getPermission() non-state-changing: not needed
          • Clue: getPermission() was already stubbed earlier in the test, so also verifying the call adds nothing
        • Avoid overspecification in interaction tests

          • Test behaviors rather than methods (from Unit Testing chapter)
          • Test method should verify one behavior
          • Not multiple behaviors in single test
          • Apply same principle to interaction testing
          • Avoid overspecifying which functions and arguments validated
          • Leads to clear, concise tests
          • Tests resilient to changes outside test scope
          • Fewer tests fail if function call changed
          • Example of overspecification: test validates user name in greeting
          • Test fails if unrelated behavior changed
          • Validates all setText() arguments
          • Fails if setIcon() not called (incidental behavior)
          • Example of well-specified tests: behaviors split into separate tests
          • Each test validates minimum necessary for correctness
          • Uses eq() for relevant arguments, any() for others
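The well-specified version of the chapter's greeting example, in `unittest.mock` terms, uses `mock.ANY` where Mockito uses `any()`; the `UserPrompt` API shape here is an assumption:

```python
from unittest import mock


class UserPrompt:
    def set_text(self, *args): ...
    def set_icon(self, icon): ...


def show_greeting(prompt, name):
    """System under test (illustrative)."""
    prompt.set_text(name, "Hello!", "Close")
    prompt.set_icon("smiley")  # incidental behavior, not what we're testing


prompt = mock.create_autospec(UserPrompt, instance=True)
show_greeting(prompt, "Fake User")

# Well-specified: assert only the argument this behavior cares about (the
# user's name) and use mock.ANY for the rest, so the test won't fail when
# unrelated text or the icon changes.
prompt.set_text.assert_called_once_with("Fake User", mock.ANY, mock.ANY)
```

An overspecified test would instead pin every `set_text` argument and also verify `set_icon`, failing on any incidental change.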
    • Conclusion

      • Test doubles crucial to engineering velocity
      • Help comprehensively test code
      • Ensure tests run fast
      • Misuse: major drain on productivity
      • Can lead to unclear, brittle, less effective tests
      • Important for engineers to understand best practices
      • Often no exact answer: real implementation vs test double
      • Or which test double technique to use
      • Engineer might need trade-offs for their use case
      • Test doubles great for working around difficult dependencies
      • To maximize confidence: still want to exercise dependencies in tests
      • Next chapter: larger-scope testing
      • Uses dependencies regardless of suitability for unit tests
      • Even if slow or nondeterministic
    • TL;DRs

      • A real implementation should be preferred over a test double
      • A fake is often the ideal solution if a real implementation can't be used in a test
      • Overuse of stubbing leads to tests that are unclear and brittle
      • Interaction testing should be avoided when possible: it leads to tests that are brittle because it exposes implementation details of the system under test