Knowing how to troubleshoot tests is an essential software skill

Thomas Barrasso
Oct 7
NASA First Launch, 1950 🚀

NASA’s earliest attempts at launching rockets were part of the Bumper program, comprising unmanned rockets launched in the late ‘40s and early ‘50s. The next decade saw many rocket launches, some successful, others bursting into flames.

Then, in 1961, the Soviet Union successfully sent cosmonaut Yuri Gagarin aboard the first manned rocket, making him the first human in history to journey to outer space.

Contrast that with Wan Hu, a legendary Chinese figure described as the “first astronaut.” For his ascent, he was seated in a chair attached to several dozen rockets. Most versions of this story are apocryphal, with Wan Hu not surviving to make another attempt at spaceflight.

Wan Hu, Wikipedia

Wan Hu’s tale is a cautionary one: we should first test our ideas in a constrained environment before risking important resources (like our lives).

It is now common to see software in aviation, automotive, pharmacy, banking, and education. This makes software quality more important than ever, with software testing a vital part of improving quality.

Why Test Software

Unit tests are the front-line defense against bugs and errors. Without automated tests, production deployments are more like crossing your fingers and hoping for the best. Ultimately, the end-users become your test subjects.

Automated testing catches bugs early in the software development life cycle (SDLC), leading to measurable cost savings and overall improvements in software quality.

Yet, even when requirements are crystal clear and stakeholders are in alignment, engineers make mistakes translating their understanding of the requirements into executable code.

Bugs can take many forms, including errors in logic, arithmetic, and connectivity, as well as the result of obsolete dependencies, assumptions about data, poor documentation, or lack of proper version control. Errors are also commonly the result of unforeseen side effects.

Unit tests force us to think critically and verify that our code exhibits certain behaviors. In this way, tests serve the same purpose as an editor and can guard against subtle typos that might result in syntax or type errors.

For interpreted languages like Ruby, this is especially helpful because many of these exceptions are run-time errors rather than compile-time errors.

When unit tests are combined and run against real data between multiple, production-like systems, they become integration tests. This is where Behavior Driven Development (BDD) frameworks, like RSpec, can be both exceptionally powerful and frustrating.

Much like Rails, RSpec, in combination with libraries like factory_bot, DatabaseCleaner, Capybara, Timecop, and VCR, provides functionality that can easily seem like Ruby Magic™. But once you are familiar with these tools, it becomes possible to test each system in concert and in isolation.

After spending time writing (and rewriting) many RSpec tests, I want to point out the subtle and not-so-subtle issues I have encountered while writing and debugging tests.

How RSpec Tests Fail

Tests fail consistently or intermittently.

Reproducibility in tests is important because without it, developers question the value of the tests and eventually testing itself (think, Boy Who Cried Wolf Syndrome).

Consistently failing tests are easier to catch and commonly the result of mental fatigue.

Tests that fail intermittently are called flaky tests and are more difficult to analyze. They are usually caused by non-deterministic behavior, unavailable external services, non-transactional changes, assumptions about data, side-effects, and edge cases related to dates and time.

Here’s why RSpec tests fail and how to fix them.


Stubbing too much or unintentionally stubbing.

Stubs provide canned answers to calls made during the test, usually not responding at all to anything outside what’s programmed in for the test. — Martin Fowler, Mocks Aren’t Stubs

RSpec does not explicitly require the use of test spies to make message expectations. Instead, it offers the receive matcher, which can be used on any object to make expectations about branching behavior.

While this approach offers more concise and convenient syntax, it can also be easy to forget that the receive matcher stubs the target method by default.

Real estate 101 tells us that home prices only ever go up.

Now, we might want to test that assumption in RSpec:

When we run this test, to our surprise, it fails with the following error:

Failure/Error: expect(house.price).to be > 649000
expected: > 649000
got: 649000

As it turns out, the receive matcher stubs the target method unless we use and_call_original to un-stub. This no-oped appreciate_a_lot, and the result is that the house price remains unchanged.

One strategy is to separate branch testing from result testing. This keeps test cases concise and avoids inadvertently stubbing methods.

A more general approach is to stub as little as necessary to avoid discrepancies between test and production code paths. Most importantly, unlike the example above, don’t stub the system under test.

TL/DR: Remember, stubbing is the default behavior of the receive matcher; use and_call_original to un-stub but retain message expectations.

In-Memory and In-Database Discrepancies

There are many ways to find and update active record models, but by default, active record will use an integer column named id as the table’s primary key.

Then, with Rails 4.2, came GlobalID, an app-wide uniform resource identifier (URI) for models. Rails features like active job support GlobalID, making it easier to pass models directly as job arguments.

However, this convenience also masks what is happening behind the scenes, which can cause confusion during testing.

Active job uses GlobalID::Locator.locate internally to deserialize the model GlobalID. It then performs a database query using the primary key.

Now, it is tempting to write a test like this one, as on the surface it looks like we are passing the user object directly.

The first test fails with an error like this one:

Failure/Error: expect(user).to receive(:perform_send_welcome_email!)

(#<User:#..>).perform_send_welcome_email!(*(any args))
expected: 1 time with any arguments
received: 0 times with any arguments

The second test also fails, but with a different error:

Failure/Error: expect(user.welcome_email_sent?).to eq(true)

expected: true
got: false

(compared using ==)

Both of these tests fail because the subject under test, a specific User, is the same in-database but not in-memory. Database changes are not automatically propagated to models in memory.

The first test could instead use expect_any_instance_of(User), while the second can be fixed easily with #reload, which re-syncs the active record model with the database.

TL/DR: Be cognizant of in-memory and in-database discrepancies; use #reload as necessary to re-sync a record with the database.


Sometimes it passes, sometimes it fails.

Random Number Generators


Can a computer generate a truly random number? Not really, at least not without special hardware. Instead, computers simulate randomness by drawing from a pool of entropy like cursor movements, keystrokes, hard-disk access, etc.

These “random” sources are then combined in a single location (like /dev/random) for applications that require pseudo-random behavior.

Regardless of whether the behavior is truly-random or pseudo-random, testing randomness can be quite difficult. Unsurprisingly, it is also an obvious source of non-deterministic behavior. Consider a Dice class:
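A minimal sketch of such a class (the details are assumptions):

```ruby
class Dice
  SIDES = 6

  # Returns a pseudo-random integer between 1 and SIDES, like a real die.
  def roll
    rand(1..SIDES)
  end
end
```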

Calling roll should return a random number between one and six, just like a real die.

Next, we write tests to verify this behavior. Perhaps we also include a test that verifies that the dice is not somehow weighted (or maybe this code is for a gambling service and we want skewed results).

We run our tests locally, and they pass. They pass again in CI. A few days later, we notice an intermittent failure on the second test. The easiest solution would be to avoid randomness altogether. When that is not possible, we can instead provide a “seed” to the Random class.

Note: This is the same process we use when debugging randomly-ordered RSpec tests by manually setting the --seed option.

Pseudo-random number generators (PRNGs) are just functions that map one input to a seemingly-random, but deterministic sequence. Using a fixed seed provides us with a fixed sequence, and thus deterministic results.
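A sketch of a seedable Dice (the seed value and structure are assumptions):

```ruby
class Dice
  def initialize(seed = 1234)
    # A seeded PRNG instead of the global Kernel#rand.
    @prng = Random.new(seed)
  end

  def roll
    @prng.rand(1..6)
  end
end

# Two dice built from the same seed yield the identical sequence,
# so any total computed from the rolls is deterministic.
a = Dice.new
b = Dice.new
Array.new(100) { a.roll } == Array.new(100) { b.roll } # => true
```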

Now, the total of the rolls will always be the same, and the test will pass consistently! However, we have also introduced an inconsistency between test and production conditions. Fortunately, this scenario is not that common.

Unique constraints

Randomness is more likely to show up as collisions between randomly generated data in a model factory with unique constraints.
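For instance, a factory attribute built from rand(1_000) (a hypothetical example) is far more collision-prone than intuition suggests, per the birthday problem:

```ruby
require "securerandom"

# Sketch: a "description" drawn from only 1,000 values collides quickly
# across a test suite, while UUIDs effectively never do.
descriptions = Array.new(100) { "House #{rand(1_000)}" }
uuids        = Array.new(100) { SecureRandom.uuid }

puts "description duplicates: #{descriptions.size - descriptions.uniq.size}"
puts "uuid duplicates:        #{uuids.size - uuids.uniq.size}"
```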

The odds of collision with the ObjectId or similar UUIDs are exceptionally low, but the probability of duplication with the description is 1 in 1,000. When randomly generated properties are necessary, the easiest solution is to pick from a larger pool of random values.

Note: Faker is another great tool that can be used in test factories to create unique, structured data without the risk of collisions.

TL/DR: Avoid testing randomness. If you can’t, specify a seed and/or use a more random source like SecureRandom.

State Preservation Across Tests

Arne Hartherz’s article summarizes this well: Using before(:all) in RSpec will cause you lots of trouble unless you know what you are doing.

before(:all) creates data outside of transaction blocks, persisting changes across tests.

Best practice (which RSpec supports via the --order rand option) is to run tests in random order to reveal implicit dependencies between tests. Data that persists between randomly ordered tests naturally results in inconsistencies.

Here is one example of how not to use before(:all):

If the tests are run in the order they are defined, the second test will fail:

Failure/Error: raise AlreadyVotedError

# ./voter.rb:26:in `vote'

This is why RSpec offers the lazy-evaluated let and eager-evaluated let! helper methods. When to use RSpec let offers more details, but in short:

Always prefer let to an instance variable.

TL/DR: Always use let, do not use before(:all).

Network Calls

Most applications rely on third-party services like Stripe or Square for payment processing, Splunk for logging, or New Relic for monitoring.

That means that applications generate lots of network requests. While we can trust that the engineers for these service providers have thoroughly tested their client libraries, it is important to test the use of these libraries within our applications as well.

Unfortunately, not all client libraries will support an environment for integration testing. Even if they did, networking issues could lead to transient failures in tests like this one. It is best not to test beyond the remote boundary when writing contract tests.

# --- Caused by: ---
# Net::HTTPServerException:
# 503 "Service Unavailable"

That is where tools like VCR are useful!

VCR lets developers “record and replay” HTTP interactions, including those generated by third-party tools. This makes tests fast, deterministic, accurate, and not reliant on networking availability or external services.

Tests are wrapped in a use_cassette block so that on their first execution, a YAML file gets created to save the response.

For more in-depth information, RubyGuides has a good tutorial on using and configuring VCR.

TL/DR: Use pre-recorded HTTP responses in tests to improve overall speed, reliability, and accuracy.


Not freezing time.

There are 28–31 days in a month, 365 days in a year, and 2,027 days in which Nixon was president. Actually, there are about 365.25 days in a Julian year (365.2422 to be more precise).

Alternatively, every fourth year can be called a leap year where we just tack on an extra day. February always gets that day, because it is the shortest month.

Some countries like the USA (and territories) span 11 timezones, while others like China observe just a single timezone (UTC+8). Then, for some crazy reason, twice a year a handful of countries increment or decrement their clocks in an attempt to coordinate working hours with sunlight.

The tests work reliably, between 01:00–23:00.

Needless to say, time is very complicated and testing time at or across boundaries is incredibly difficult. I have heard these kinds of tests referred to as Cinderella Tests because they turn into pumpkins at midnight.

To make these tests more reliable, it is best to become a Time Lord… or at least learn how to freeze and travel time in RSpec.

Here is an example with a Venue model that has many Events. If we want a list of upcoming concerts, we can compare start times to the current day.

Then, during testing, we want to make sure this scope returns concerts that happened today or will happen in the future.

This test looks fine. It even derives start times from a single source to guarantee consistent deltas. A problem arises if this test runs at a day boundary, for example, just before midnight (11:59:59pm).

If enough time elapses during concert creation, upcoming_concerts might get called the next day, causing the first expectation to fail.

The easiest way to resolve this with Timecop is to freeze time before :each and return after :each. This way, time elapses between but not during tests.

TL/DR: Use Timecop to freeze and travel time predictably in RSpec; test time boundaries and edge cases defensively.

Side Effects

Active record callbacks

Active record offers convenient life cycle callbacks before and after state alteration that are easy to “set and forget”. Thorough testing should include these callbacks, but sometimes it is necessary to sidestep them.

That is where skip_callback comes in. It can be used both in a spec factory as well as individual tests. Consider a User model:
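A sketch of such a model (assuming Rails; the stand-in base class below only lets this snippet load outside an app, where User would inherit from ApplicationRecord):

```ruby
# Hypothetical stand-in for ActiveRecord's callback registration.
class ApplicationRecord
  def self.after_create(*methods)
    (@after_create_callbacks ||= []).concat(methods)
  end
end

class User < ApplicationRecord
  after_create :send_welcome_email

  private

  # "Set and forget": every created User enqueues a welcome email job.
  def send_welcome_email
    SendWelcomeEmailJob.perform_later(self)
  end
end
```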

While it is possible to stub send_welcome_email, in a real-world application this method might be buried deep in a stack trace and turn out to be only one of many callbacks.

Or, perhaps the callbacks were meant to run before a job asynchronously as above, but all jobs are being run inline for a specific test. Whatever the case, it is possible to instruct active record to skip callbacks for a specific model.

It may be that a User model is required for an Address to be considered residential.

However, creating and associating a User with the Address inside of a perform_enqueued_jobs block triggers an after-create callback, and ultimately the SendWelcomeEmail job.

That said, skipping callbacks should be avoided in favor of more explicit factories, recorded network calls, and environment-specific configurations.

For example, in the “test” environment, sending an email could be handled globally within ActionMailer. If you do choose to skip_callback, remember to call set_callback to restore callbacks for other tests.

TL/DR: skip_callback can avoid active record callbacks, but use cautiously.

Active Job QueueAdapters

A common use case for active record callbacks is to enqueue an active job. Under the hood, active job is configured to use a specific QueueAdapter. This adapter determines the queue order (like FIFO, LIFO, etc).

A common adapter for RSpec is the TestAdapter, which can be used to verify that a specific job was enqueued successfully.

However, TestAdapter does not actually perform the job by default!

Depending on what you are testing, there are other adapters like the InlineAdapter that execute jobs immediately by treating perform_later calls like perform_now.

Alternatively, the ActiveJob::TestHelper module provides perform_enqueued_jobs, which, as its name suggests, actually performs the enqueued jobs synchronously.

As with callbacks, there is value to testing both with and without actually performing the job. RSpec provides helpful active job matchers like the have_enqueued_job matcher.

These helper methods allow for separation of concerns, making it possible to test the logic of a job in one spec, and the logic that triggers the job in another.

TL/DR: Use TestAdapter to track and perform enqueued active jobs.

Too Specific

Inconsistently ordered data.

Collections like Hashes and Arrays are used to store related data, and both enumerate in insertion order.

During testing, this leads to comparisons that are implicitly order-specific, often when creating and comparing collections of active record models. For instance, a Cat can have many toys.

A simple test is then written to confirm that a Cat can in fact have many toys.

factory_bot’s create_list helper creates three toys associated with our cat, Lovie.

The issue here is that while create_list will return three unique toys ordered consistently by creation, the association will return based on the scope ordering. If no order is provided, it defaults to order by ID.

For sequential IDs, this should not pose an issue because ordering by creation or by sequence is identical. However, the first four bytes of a BSON ObjectId represent seconds since the Unix epoch, so records created within the same second are not guaranteed a stable order.

Instead of using eq to compare two arrays, we can use the redundantly-named match_array matcher, which is independent of order.

TL/DR: Use hash_including, include, and match_array when comparing collections independent of ordering.

Negative Test Expectations

Negative tests are a special case. Unlike positive tests, it is very easy to write overly-specific tests that ultimately test nothing. Consider a code path within a method that is not expected to raise a custom error.

A test for price_per_sq_ft might look something like this:

The first test expects an error to be raised, while the second test expects an error not to be raised. The issue with the second test is subtle, but thankfully is such a common occurrence that RSpec actually warns developers by default.


Using expect { }.not_to raise_error(SpecificErrorClass) risks false positives, as literally any other error would cause the expectation to pass, including those raised by Ruby (e.g. NoMethodError, NameError, and ArgumentError), meaning the code you are intending to test may not even get reached.

Instead, consider using `expect { }.not_to raise_error` or `expect { }.to raise_error(DifferentSpecificErrorClass)`.

The RSpec warning clearly explains the issue with this test.

In the example above, the second test actually raises TypeError: nil can’t be coerced into Fixnum as nowhere have we defined @sq_ft! That is the problem with overly-specific negative tests, they can miss real problems like this one.

TL/DR: Favor positive tests over negative tests; write negative tests more broadly, especially when it comes to error handling.

Final Thoughts

The bitterness of poor quality remains long after the sweetness of meeting the schedule has been forgotten.

Test failures can be frustrating, especially when it is not clear why they are failing. However, try not to lose sight of the overall goal of testing: to validate behavior and alert engineers to the unintended consequences of potential changes.

As frustrating as failing tests can be, having no tests leads to unpredictable and unreliable deployments. As software systems grow in size and complexity, so too do the risks of going straight into production. The alternative is to push code and just cross your fingers.

Of course, if after reviewing these examples your tests are still failing, there is always the possibility that it is because there’s a bug!

After all, that is what tests are designed to do: identify bugs earlier in the SDLC. It is better (and cheaper) to catch bugs during testing than in production.

Happy bug squashing!

Better Programming

Advice for programmers.

Thanks to Zack Shapiro
