
Dilemmas Of Testing In Production (And A Way Forward)

The problem of how and when to test in production is a perennial conundrum that far predates the current dominance of cloud computing and widgetized, containerized software services. 

In the Bronze Age of software development (i.e. 2005), the issue was straightforward, but in a very frustrating way: there were so few options for doing it effectively, and many dangers that seemed to outweigh the usefulness of testing in production (hereafter, TIP for short).

The Catch-22

You desperately want to find bugs that only manifest in the “real world” of production. Yet provoking those bugs may destroy that very same production environment. TIP is a classic example of a Catch-22. 

The more effective it is at manifesting catastrophic bugs — which is what you want — the greater a danger it is to the business continuity of your company itself. If your business is transaction processing, for example, you can’t afford for your production environment (PE) to be down for even a minute.

The Dangers Of Testing In Production

Recall that the Chernobyl reactor meltdown was caused by a safety drill conducted on the live nuclear reactor. In its production environment. Kind of sobering, yes?

Because of the dangers posed to business continuity inherent in TIP, it historically could only be done for very short periods of time, and usually towards the end of the testing cycle as a final check. Which is a very bad idea, for two reasons.

Testing In Production Turns Up Systemic Issues

First, waiting until the final phase of testing to perform TIP kind of defeats its purpose. Because significant issues uncovered by TIP will, by definition, be systemic. How could they be otherwise, if everything else works fine in non-production environments?

And systemic software issues are far and away the most time consuming to diagnose, fix, and retest. And the likelihood of fixes for systemic problems working the first time around is very small, because of the irreducibly complex nature of systemic issues, and the forensic obscurity of their causes.

So saving this testing for the end is recklessly end-loading significant risk to your delivery schedule. This approach pretty much guarantees you will have to add at least a month or two (if not more) to address the issues found by TIP. Which is not a good look for a product delivery team, trust me.

Testing In Production Turns Up Load, Scalability, and Stability Issues

Second, and closely related to the first, is that the most common systemic faults exposed in TIP involve:

  1.  Load and scalability problems
  2.  System stability/MTBF failures

And these are, unfortunately, the greatest threats to the success of your product in the field. 
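To make the second threat concrete: MTBF is easy to quantify once you track production incidents. A minimal sketch, with invented incident timestamps:

```python
from datetime import datetime

def mtbf_hours(failure_times):
    """Mean time between failures: average gap between consecutive failures."""
    gaps = [
        (b - a).total_seconds() / 3600
        for a, b in zip(failure_times, failure_times[1:])
    ]
    return sum(gaps) / len(gaps)

# Hypothetical incident log: three production failures, 60 hours apart.
incidents = [
    datetime(2023, 1, 1, 0, 0),
    datetime(2023, 1, 3, 12, 0),
    datetime(2023, 1, 6, 0, 0),
]
print(mtbf_hours(incidents))  # 60.0
```

A falling MTBF trend across test cycles is exactly the kind of systemic signal that a one-shot, end-of-cycle TIP window can never surface.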

No customer is going to rip out your gorgeous, state-of-the-art distributed cloud solution because Feature 9.19.D failed in production. Seriously. You can just patch that and move on. But if your product is bringing the customer’s own business processes to their knees because of performance and stability failures, the customer will not be so understanding and forgiving.

So purely from a business point of view — both the customer’s and yours — saving TIP until the tail end of the release cycle is a classic, and devastating, failure pattern.

The Case For Completing Testing In Production

Which means that, contrary to popular software dev folklore, TIP must take place as early as possible in the test cycle. And must be repeated throughout that test cycle. Otherwise you’re playing a game of chicken with your product release.

This just brings us back to our original Catch-22 however. Hence the frustration inherent in trying to come to grips with how to strategize the implementation of TIP. 

The more often you do it, the more diagnostic your testing will be, and the closer defect detection will come to the point where the defect was injected. Yet for this very reason you are increasing the likelihood that your testing will cause production to fail, or be severely impaired.

This problem is just the general version of the classic problem that has dogged performance and load testing for decades. Everyone saves it for the end, when it is the least effective and most costly to the project. 

The Feature Fallacy

Historically this has been due to what I call the feature fallacy. That is the notion that you can’t test for systemic issues, like performance, until the product is “feature complete.” So it has to wait until the end.

But this is just nonsense. The chance of a single feature crashing an entire system is minuscule. Load, performance, and stability failures are themselves caused by high-level systemic flaws in architecture and system design, particularly relating to memory management and resource usage. There is no rational reason to see “feature completeness” as a gating item for integrated TIP.

The Impact of Modern Distributed Architectures

This has seemingly become less of an issue as software itself has migrated from a feature-centric model to a service-centric model. But the basic problem remains. 

Instead of thinking you need all the features complete before you can TIP, the fallacy has simply migrated upwards, transforming into the fallacy of thinking all the services (however containerized) must be complete before TIP is feasible.

There is more apparent validity to that assumption in distributed, service-architected software systems. Because these services are by definition fundamental building blocks of the system as a whole. Something atomic, individual features never have been. Even so, this still leads to the end-loading of the risks associated with TIP, and all its associated dangers.

Containerized Service Software

Paradoxically, a distributed, service-based solution paradigm makes the problem even harder to solve. Not easier.

The promise of containerized service software is that this architecture makes failures of particular services, or clusters of services, much easier to detect and isolate from the rest of the system. Thus making it far less likely that any local failure will bring the whole system down. 

It also means, in theory, that fixing and updating those self-contained service clusters will be faster and less risky. Spinning up new versions of specific services and their containers (and perhaps their container mapping model) is a far less fraught and complicated task than doing a complete rebuild and redeployment of your entire system. And that really is progress.

Here’s The Catch

There’s a fateful asterisk at the end of all that good news.

The containerization of services is not the same thing as isolating them from one another, as we all know. Each service must still be able to communicate and cooperate with other services or groups of services. The resulting level not only of interdependence, but of levels and layers of interdependence, is very, very high. 

So a containerized, service-based architecture has merely reincarnated the problem of component interdependency inherited from earlier models of software architecture. The components of service architectures are just more granular and for that reason — perhaps — more clearly delimited and more easily understood.

But any time you have high levels of interdependence and inheritance among software or service modules, you also have a high level of the risk of general system failures as service-level failures cascade through their dependency trees. Going viral, as it were. Since atomic services, unlike atomic features, are by definition system components.

The Atomization Of Software Services

This risk also exists at the meta-service level: in the dependency maps that define which services, and families of services, depend on which others, and for which functions or processes. And whether these dependencies are upstream or downstream. Constrained or unconstrained, etc. 

In other words, the very atomization of software services creates the need for precise mappings of how they must relate to and cooperate with one another. Since it is the nature of this architecture, normally, that individual services don’t encapsulate the entirety of a process or system capability. Which actually further complicates the problem of TIP, instead of simplifying it.
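One way to see why those mappings matter for TIP: from even a crude dependency map you can compute the "blast radius" of a single failing service, i.e. everything that could cascade when it goes down. A minimal sketch, with invented service names:

```python
# Hypothetical dependency map: service -> services it depends on.
DEPENDS_ON = {
    "checkout":  ["payments", "inventory"],
    "payments":  ["auth", "ledger"],
    "inventory": ["auth"],
    "ledger":    [],
    "auth":      [],
    "search":    ["inventory"],
}

def blast_radius(failed, depends_on):
    """All services that transitively depend on `failed` (the cascade path)."""
    # Invert the map: service -> services that directly depend on it.
    dependents = {s: set() for s in depends_on}
    for svc, deps in depends_on.items():
        for d in deps:
            dependents[d].add(svc)
    # Walk upward from the failed service through the dependency tree.
    hit, stack = set(), [failed]
    while stack:
        for upstream in dependents[stack.pop()]:
            if upstream not in hit:
                hit.add(upstream)
                stack.append(upstream)
    return hit

print(sorted(blast_radius("auth", DEPENDS_ON)))
# ['checkout', 'inventory', 'payments', 'search']
```

A "leaf" service like auth failing takes four others with it, which is precisely why a local failure during TIP is never guaranteed to stay local.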

The irreducible complexity and messiness of system testing, and therefore its inherent risks, remain, regardless of the architectural methods currently in use.

A Path Out Of The Dilemma

When and how to test in the PE are never going to be easy questions to answer. Both practically and conceptually. Let’s address each of those aspects separately.

The Practical Side

On the practical level the Catch-22 presented by TIP is partly an artifact of how stark the gulf almost always is between the structure and capacities of the test environment (TE) — especially in the case of test data and transaction traffic — and the realities of the PE. 

The latter is almost always orders of magnitude more complex, and its actual data/transaction stream much larger and more variable, across all parameters, than what is available in the test environment. This yawning chasm exists largely due to cost constraints. 

In my experience, most companies are loath to spend the money necessary to create a test environment that is even a half-scale model of the PE. Which is very shortsighted, because it makes the business a captive to the Catch-22 described at the beginning of this article.


The solution to this problem is not simply to spend more money on a single, more diagnostic TE. Even a greatly enhanced TE will not be entirely diagnostic against the PE. The capability gap remains too wide, the contrast too stark, to escape our Catch-22.

It would be a more useful solution if, instead of having only one TE, however upgraded, companies would invest in a series of progressively more complex and diagnostic TE’s. 

In particular, invest in one TE that is optimized for system testing, and another that is optimized for specific feature or capability testing. Isolating those functionalities in different environments would make testing much more targeted, streamlined, and diagnostic across the board.

In other words, make the gap between the TE(s) and the PE a ladder, not a canyon. So that many of the problems that in the past you could only uncover in the PE could be detected earlier in a series of targeted TEs. Thus decreasing the time and risk of testing in production. And also allowing you to discover serious problems long before the final phase of the release cycle.
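That ladder can even be expressed as plain configuration that a CI pipeline promotes a build through, one rung at a time. A minimal sketch, with invented tier names and scale factors:

```python
# Hypothetical ladder of test environments, each a closer approximation of
# production. `scale` is the fraction of production traffic/data volume.
LADDER = [
    {"name": "te-feature", "scale": 0.01, "focus": "feature/capability tests"},
    {"name": "te-system",  "scale": 0.10, "focus": "integration & dependency tests"},
    {"name": "te-load",    "scale": 0.50, "focus": "load, stability, MTBF tests"},
    {"name": "production", "scale": 1.00, "focus": "final, time-boxed TIP"},
]

def next_rung(current, ladder=LADDER):
    """A build promotes one rung at a time; no leaps straight to production."""
    names = [env["name"] for env in ladder]
    i = names.index(current)
    return names[i + 1] if i + 1 < len(names) else None

print(next_rung("te-system"))  # te-load
```

The point of the sketch is the shape, not the numbers: each rung catches a class of problems so the next rung, including production itself, faces a smaller unknown.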

The Conceptual Side

A possible solution on the conceptual side of the problem is closely related to the practical steps described above.

In previous articles, I have focused on the prevalence of what I call “the empirical fallacy” in software testing. By this I mean the notion that you have to create, or recreate, an issue in real time, before your own eyeballs, with the full, actual product before you can detect it. Engineers subscribe to this fallacy as well. 

Many times in my career I’ve had the experience that when an engineer is confronted with an alarming systemic problem, they default to the posture of, “Well, I can’t see it happening in real time, right in front of me, so I don’t know how to diagnose or fix it.”

The irrationality and wastefulness of this approach, on the part of both QA and engineering, I hope is obvious. Because it betrays an almost childlike level of forensic imagination and intelligence in both functions. 

If we all are really the brilliant, out-of-the-box-thinking boffins we like to think we are, and constantly advertise ourselves to be so to everyone else, why do we need to “see it happen” before we can theorize about why it might be happening in the first place? 

It’s like a homicide detective shrugging and saying she can’t solve a murder because she wasn’t there when it happened.

Systemic Issues and Architectural Issues

Systemic issues are almost always architectural issues. It stands to reason, then, that they should be taken into account from the very start, and their development and testing prioritized. 

Such that, for example, load, performance, and stability testing could begin in a properly configured TE very early in the development cycle, long before the product was feature complete.
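Such early load testing need not wait for tooling, either. A minimal sketch using only the standard library, where `handle_transaction` is a hypothetical stand-in for whatever entry point exists early in development:

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def handle_transaction(i):
    """Hypothetical stand-in for the system entry point under test."""
    time.sleep(0.001)  # simulated work
    return i

def load_test(n_requests=200, concurrency=20):
    """Hammer the entry point concurrently and report latency percentiles."""
    latencies = []
    def timed_call(i):
        start = time.perf_counter()
        handle_transaction(i)
        latencies.append(time.perf_counter() - start)
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(timed_call, range(n_requests)))
    return {
        "p50_ms": statistics.median(latencies) * 1000,
        "p95_ms": sorted(latencies)[int(len(latencies) * 0.95)] * 1000,
    }

print(load_test())  # e.g. {'p50_ms': 1.2, 'p95_ms': 1.9}
```

Run against a skeletal build in an early TE, even a crude harness like this establishes a latency baseline, and any later regression in that baseline is a systemic signal caught months before production.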

Doing this would require a major reorientation of how most development projects are organized and scheduled. This prioritization, for example, poses a major challenge to agile methodology. 

Because the development of systemic properties of distributed software cannot be divided into sprints; it cannot be fractionalized the way features are in agile development. It has to be designed and implemented holistically.

But there’s no reason these adjustments can’t be made, even if they take people out of their comfort zones for a while. The practical, and lasting, benefits of this reorientation will more than compensate for the changes.

Removing Reliance On The Empirical Fallacy

Taking this idea of removing reliance on the empirical fallacy further, it becomes apparent that there is no reason engineering and QA cannot collaborate on developing structured, targeted tests on the performance architecture of the software system, in lieu of trying to test everything at once in a brief, yet hazardous, production testing window. 

In short, based on engineering’s knowledge of how they actually architected the system, they should be able to hypothesize what its performance and stability weak points and bottlenecks might be, before formal testing begins, without having to first see them fail before their very eyes in production. Or worse, in the field.

QA could then take these scenarios and build a test apparatus designed to stress them. So that they fail — or fail to fail — in an environment and context that are already isolated and delimited, and therefore diagnostic for engineering.

These two approaches — a tiered set of progressively complex TEs coupled with a predictive risk model based on prior knowledge of the product’s architecture — would both greatly reduce the risks currently inherent in production testing and greatly enhance its diagnostic value.

The decision to test in the PE should not be the equivalent of flipping on a light switch. Going in an instant from complete darkness to blinding light. And then only being able to leave the light on for thirty seconds. 

Testing in the live PE should rather be the culmination of a phased process of progressive approximation to the PE in well-designed TEs. A process that progressively reduces the risks and limitations of testing in production in the first place.

Final Thoughts

Of course you should be testing in the PE. But it should never be something you do only at the very end, having up to that point only tested in vastly inferior TE’s that bear little resemblance to the live, customer-facing environment where the product must succeed or fail, for all the world to see.

As with most other problems of QA theory and practice, the solution lies in processes that enable the progressive, rational reduction of risk over time. Not quantum leaps into the unknown in the final hours of the project. It’s a product release, after all. Not a moon shot.

My thoughts here may not be exactly your own, and they don’t need to be. But if you focus on the logical principles that motivate them, you will have no trouble finding your own way to effectively, and painlessly, test in production.

As usual, all the best in your testing, and in your thinking about testing.

For more from experts in QA and software testing, subscribe to the QA Lead newsletter
