Reducing Defect Cost

The conventional wisdom is that defects cost much more, some say exponentially more, the later in the development lifecycle they are found. This is an obvious generalization rather than a universal fact. There are many variables, such as the type of defect, the type of application, the development techniques being used, and the organizational structure. Still, many people seem to accept the generalization without investigating whether it’s generally valid.

I’ve had discussions with developers who believe all defects can be avoided if only we work at it hard enough. This view, combined with the exponential cost belief, leads them to conclude that we should avoid defects at all costs. Although I believe that most teams can significantly improve the quality of their software by using development techniques that catch defects earlier, I also believe it’s wise to consider the cost of avoiding some defects. An obsessive focus on zero defects, assuming that’s even a feasible goal, can in some cases reduce a team’s ability to deliver business value. The following are some tradeoffs to consider.

  • Likelihood of occurrence (can be qualitative)
  • Difficulty of testing for the defect (effort cost)
  • Difficulty of fixing the defect, if the defect occurs
  • Business impact, if the defect occurs

For example, a defect that we think is likely to occur, is easy to test for, is difficult to fix, and would have a significant business impact is one we’d want to invest significant effort in avoiding. A defect that is unlikely, very difficult to test for, easy to fix if it happens, and would have little or no business impact is one we may want to take our chances on without tests. This doesn’t rule out doing some testing; I’m referring specifically to automated regression tests.
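
To make these tradeoffs concrete, here is a rough scoring sketch. The factor names, the 1-3 scale, and the equal weighting are my own illustrative assumptions, not an established formula:

    /**
     * A rough heuristic for deciding how much automated-test effort a
     * potential defect deserves. The weights and scale are illustrative
     * assumptions, not a published formula.
     */
    public class DefectTriage {

        /** Each factor is scored from 1 (low) to 3 (high). */
        public record Assessment(int likelihood, int testingCost,
                                 int fixCost, int businessImpact) {

            /** Higher score means a stronger case for tests up front. */
            int testInvestmentScore() {
                // Likelihood, fix cost, and impact argue for tests;
                // a high testing cost argues against them.
                return likelihood + fixCost + businessImpact - testingCost;
            }
        }

        public static void main(String[] args) {
            // Likely, easy to test, hard to fix, big impact: invest in tests.
            var invest = new Assessment(3, 1, 3, 3);
            // Unlikely, hard to test, easy to fix, little impact: take a chance.
            var skip = new Assessment(1, 3, 1, 1);

            System.out.println("invest score = " + invest.testInvestmentScore()); // 8
            System.out.println("skip score   = " + skip.testInvestmentScore());   // 0
        }
    }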

Most defects have characteristics somewhere between these two extremes and can be detected in a cost-effective manner using test-first or test-driven development techniques. However, other types of defects are very difficult to find before a system is deployed. Often these defects involve interactions with external systems outside our control, or conditions that are difficult to predict. A defensive strategy may be the best option for these defects: a team can use agile deployment techniques or modular architectures to reduce the risk when a defect is detected in a deployed system. Examples of agile deployment techniques include network launching or updating of client applications and the ability to patch running servers.
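
To show what patching a running server can look like, here is a minimal sketch. It routes requests through an atomically swappable handler; the names are hypothetical, this is not the design of the system described below, and a real server would also need versioning, classloader isolation, and draining of in-flight requests:

    import java.util.concurrent.atomic.AtomicReference;

    /**
     * Routes requests through a swappable handler so a fixed
     * implementation can be installed without stopping the server.
     */
    public class PatchableEndpoint {

        public interface Handler {
            String handle(String request);
        }

        private final AtomicReference<Handler> current;

        public PatchableEndpoint(Handler initial) {
            this.current = new AtomicReference<>(initial);
        }

        /** Called on the request path; always sees a complete handler. */
        public String serve(String request) {
            return current.get().handle(request);
        }

        /** Called by the patch mechanism; new requests pick up the fix
         *  immediately while the server keeps running. */
        public void patch(Handler fixed) {
            current.set(fixed);
        }
    }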

The server software I helped develop at a previous employer had some of these capabilities. The server design allowed it to be patched without any downtime experienced by the client software. The same mechanisms also supported failover and load balancing within clusters of server processes. We felt it was important to detect defects as early as possible and had thousands of tests to support that goal. However, when an occasional defect leaked through to production, we could often diagnose and patch it very quickly without system downtime.

The defects found in production did cost more than if they had been caught earlier. However, the increase was far smaller than the exponential, order-of-magnitude jump in cost that some teams experience. We could also use the same mechanisms to quickly add or modify features based on customer feedback (specifically, feedback from numerous end users rather than a proxy customer) after deployment.

Defects in the system I’m describing were relatively rare because of the XP-style development techniques used to develop it. We had ~6000 unit tests and ~600 system-level functional and acceptance tests. The system was also manually tested in a replica of the production environment when complex new functionality was added. At times, actual end users volunteered to test features in this preproduction staging area. When there was a server-side defect, the following was the typical scenario.

  1. Diagnose the problem. This typically required an hour or less because the existing tests ruled out many possible causes of the problem.
  2. Write a failing test (a sketch of such a test appears after this list). This usually required 15-30 minutes, although it might take only a few minutes depending on the problem.
  3. Implement the fix and verify that all tests pass. This usually required 10-15 minutes; most of the problems were relatively minor, and the test suite ran in less than 5 minutes.
  4. Patch the server. The server could be patched while running; applying a patch took about 5 minutes or less, and users experienced no downtime.
  5. Have operations verify the patch. This usually required about 5-10 minutes.
  6. Merge the changes into the development branch of our SCM system. This usually took less than 10 minutes, including running all tests again.
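
As an example of step 2, a failing regression test might look like the sketch below. The OrderPricer class and its discount defect are hypothetical; the point is the pattern: the test reproduces the reported problem, fails against the unfixed code, and passes once the fix is in.

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import org.junit.jupiter.api.Test;

    class OrderPricerRegressionTest {

        /** Hypothetical class under test, shown here already fixed:
         *  the discount is applied before tax rather than after. */
        static class OrderPricer {
            private final double taxRate;
            OrderPricer(double taxRate) { this.taxRate = taxRate; }
            double total(double amount, double discount) {
                return amount * (1 - discount) * (1 + taxRate);
            }
        }

        @Test
        void discountIsAppliedBeforeTaxNotAfter() {
            OrderPricer pricer = new OrderPricer(0.10);
            // Reported defect: tax was computed on the pre-discount
            // amount, producing 100.00 for this order instead of 99.00.
            // This assertion failed until the fix was applied.
            assertEquals(99.00, pricer.total(100.00, 0.10), 0.001);
        }
    }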

Summing the higher side of each step gives about 2 hours (60 + 30 + 15 + 5 + 10 + 10 = 130 minutes). Assuming a developer pair worked on most of the steps, the effort would be about 4 person-hours, or 0.5 person-days. Compare this to the hypothetical (but presumably realistic) examples described in another article on defect cost [1], where the cost ranged from a high of 20 person-days per fix to a low of 5 person-days per fix: 10 to 40 times our typical cost. I realize the techniques we used will not be applicable to all applications, but they are worth considering given the potential cost reductions.

On the client side, we dynamically deployed the application over the network using Java Web Start. This was relatively agile, although users had to stop and restart their client programs to receive patches or new features, which made postdeployment defects in the client somewhat more costly. Unfortunately, UI defects were also more difficult to catch early, although many were caught during development using TDD techniques.
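
For readers unfamiliar with Java Web Start, deployment is driven by a JNLP descriptor along the lines of the sketch below. The codebase URL, jar name, and main class are placeholders; Web Start checks the descriptor and jars at launch, which is why users must restart the client to pick up a patch.

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Minimal JNLP descriptor; URLs and names are placeholders. -->
    <jnlp spec="1.0+" codebase="https://example.com/app" href="client.jnlp">
      <information>
        <title>Example Client</title>
        <vendor>Example Corp</vendor>
      </information>
      <resources>
        <j2se version="1.5+"/>
        <!-- A changed jar is downloaded on the next launch. -->
        <jar href="client.jar" main="true"/>
      </resources>
      <application-desc main-class="com.example.client.Main"/>
    </jnlp>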

Modular design is another application development strategy that can reduce the cost of defects that slip through testing. Changes to defective code are less likely to ripple throughout the software, making the fix easier to implement and deploy.
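
As a small sketch of that idea (the interface and classes are hypothetical), a defective implementation hidden behind a stable interface can be fixed and redeployed without touching its callers:

    /** Callers depend only on this interface, so a fix to a defective
     *  implementation stays inside one module. */
    public interface TaxCalculator {
        double taxFor(double amount, String region);
    }

    /** The defect lives here; fixing it changes only this class, so
     *  the interface, and hence every caller, is unchanged. */
    class RegionalTaxCalculator implements TaxCalculator {
        @Override
        public double taxFor(double amount, String region) {
            return amount * rateFor(region);
        }

        private double rateFor(String region) {
            return "EU".equals(region) ? 0.20 : 0.08; // illustrative rates
        }
    }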

Agile software development techniques claim to reduce the cost of change, and I believe this is a valid claim. It’s not clear to me why this effect would apply only to new features and not to defect fixes. One possibility is the organizational overhead associated with defect management, which varies widely from company to company. Reducing the layers of bureaucracy involved in fixing a reported defect will reduce both the financial cost and the response time. One strategy here is to introduce more agility into the organization itself. If the organizational overhead cannot be reduced, the tradeoffs mentioned earlier shift toward using increasingly costly methods to find the most difficult defects.

These strategies can reduce the cost of defects found after deployment. However, they should not be used as an excuse for inadequate testing before delivering software. Most defects can be avoided in a very cost-effective way using agile development practices. When a defect is found in a deployed application, it should be evaluated to determine whether it could have been detected easily earlier in the development lifecycle. An agile team will use this feedback and adapt its testing techniques accordingly.

References
