Sampling, applied to software forecasting

troy.magennis · Forecasting using data · Jul 7, 2017

Chapter 6

This chapter takes the statistical theory introduced in chapters 4 and 5 and demonstrates how to apply sampling to specific software and project forecasting problems. Obtaining a few reliable samples outperforms gut instinct alone, and it is much simpler than most people imagine.

Goals of this chapter –

  • How to use sampling to estimate story count for a larger feature or project
  • How to use sampling to estimate the un-discovered defects in a feature
  • How to use sampling to estimate the chance an issue exists after looking and not finding anything

Story count estimation — How big?

Unless you are the solo developer of a single-person startup, or act like you work for one, there comes a time when multiple people need to understand what is being built. An initial spark of an idea needs to be described for others to build: first at a high level for discussion and prioritization, and eventually in enough detail to build what was intended.

In the Agile software development world, “user stories” are an important part of communicating a shared vision, taking an idea to features, and features to a product someone will buy. Humans are adapted to recollect and take meaning from stories, whether around a campfire or a conference table. Communicating work as a story builds a clearer picture of how a customer’s problem will be solved, making the resulting feature a better match for that need. Often bigger ideas are recorded as “epics” and only broken down into more manageably sized “stories” when they become viable. Stories and epics come in all sizes and shapes. Teams build their own rhythm around the sizes they prefer, and patterns emerge in their historical epic and story data. We use the typical number of stories per epic to forecast the likely story count of future features and products.

Why is having a story count important? Story count estimates help with forecasting size and, eventually, delivery time. We can divide the remaining story count by the current pace of delivery to estimate how much longer the work will take for this team with their story size pattern. Just like car travel: knowing the current speed of the car and how many miles or kilometers remain, we can estimate arrival time. In the software world we can use the pace at which stories are completed as the equivalent of speed, and the number of stories remaining as the distance still to travel. If it were this simple, this book would end about now. Story count is a necessary but not sufficient part of the forecasting equation.
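As a minimal sketch of that division (the numbers happen to match the case study later in this chapter: roughly 1,700 stories at about 10 stories per week):

```python
# Minimal sketch: estimate remaining calendar time from story count and pace.
# The numbers mirror the case study later in this chapter.

remaining_stories = 1700     # estimated stories still to deliver
stories_per_week = 10        # current delivery pace (throughput)

weeks_remaining = remaining_stories / stories_per_week
print(f"Roughly {weeks_remaining:.0f} weeks remaining at the current pace")  # ~170
```

As the paragraph above notes, this simple ratio is only the starting point; the rest of the forecasting problem is the uncertainty around both numbers.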

The goal of any estimate, just like in the tank problem, is to be better than the method currently used. For story count estimates of a project, it can be torturous to assemble a group of informed people to hash out the details. The alternative is someone just making up a number by gut instinct alone. It might surprise you, but that’s not a bad way. It’s just terrible if it’s your ONLY way. Think back to the Tank Problem estimates. Intelligence community estimates put the number at 1,000. The researchers put the number around 350, but they used multiple methods. They used the two samples from the transmissions, and they used the 96 dolly wheel samples. They used the Taxi Equation, the double-the-average method, and the double-the-median method. They had the data, and it took seconds to apply multiple techniques. Only when there was close agreement around the 350 number did it seem the more plausible answer. Creating more possibilities meant more confidence in the probabilities that were in agreement. We can’t know that the intelligence estimate isn’t using information we don’t have, so it should be considered just another input to the process. Quiet people in the room might know something we don’t, so average all methods you deem reliable, even those you don’t believe or that come from people you don’t personally like or trust.

Case study — estimate the number of stories for replacing a major website

I often get called in to participate in forecasting proposed major projects. Nothing I do works only on large projects, but the dollar values at risk mean it is seen as a good return on investment to have an outsider double-check. This proposed project was a top-to-bottom replacement of an existing website that integrated many sources of news data. The website also endures rapid traffic fluctuations during local unfolding news events. Replacement projects are always difficult to size. Often what has been learnt from customer feedback over five or ten years of development is not obvious or written down, causing these rewrite projects to incur a high degree of feature discovery risk not evident on the surface. As an aside, I suffer from what my wife calls “the eyebrows of judgement.” I don’t play a lot of poker. I have trouble biting my tongue, so it takes all my effort not to walk out on some projects within the first hour because they are glowing hot red with risk. Initially, this was one of those engagements.

The company had assembled a large consulting firm with their subject matter experts and development teams to break down the (known) existing website features into epics and stories. Alarming to me was the intention that this large group of people, fifteen-plus detail-oriented engineers, was going to systematically analyze each proposed feature and perform story breakdown. How many features? 328. That’s right, there were 328 top-level features (epics) known so far for replacing this website. And the intention was to break every single one of them down into stories, then estimate those stories to deliver an estimate of the time and resources needed.

I bill by the day, so this was good news for my retirement fund. Knowing how powerful statistical sampling is, my recommendation was that we could get to a reliable story count estimate by breaking down a subset of epics and extrapolating from those. How many? Nine minimum, thirty if over-achieving. Some epics had already been broken down into stories, and this continued for a few more days after I arrived, giving 45 epic story count samples. Table 4 shows those samples; each number is the count of stories for one epic.

Table 4 — Samples of story counts for randomly chosen features (epics) from a total pool of 328.

Having these samples, it was time to perform some analysis. First, I needed to confirm the samples were reliable. I asked how the 45 epics were chosen and determined they weren’t all from the same functional area of the website, or from an especially technical or especially simple area. The epics came from multiple parts of the website and no obvious pattern emerged. Happily, I moved on to confirming the counts had been performed with some diligence. I don’t take anyone’s word about randomness, so I used a simple technique to satisfy myself: checking that the story counts per epic followed a simple heuristic called Benford’s Law. This will be fully explained later, but its essence is that genuinely random numbers have a telltale pattern in how often their first and second digits occur. It’s not perfect, but it often highlights human invention versus a random process. Humans aren’t very good at inventing random numbers, it turns out. Our desire to make numbers look random causes patterns that can be identified by an overuse of some digits and an under-use of others. This is one reason some people get their tax returns audited; tax officials use Benford’s Law to detect possible fraud in deductions and expense amounts. I didn’t see anything unexpected in the first and second digits of the story counts, and this increased my confidence that these samples were reliable.
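As a minimal sketch of the kind of first-digit check described above (the story counts below are made-up placeholders, not the Table 4 samples):

```python
# Minimal sketch of a Benford's Law first-digit check on story counts per epic.
# The sample data is a made-up placeholder, not the actual Table 4 samples.
from collections import Counter
from math import log10

story_counts = [3, 12, 5, 8, 21, 4, 7, 15, 9, 6, 11, 2, 14, 5, 8]  # hypothetical

first_digits = [int(str(n)[0]) for n in story_counts]
observed = Counter(first_digits)
total = len(first_digits)

print("digit  observed  benford")
for d in range(1, 10):
    benford = log10(1 + 1 / d)            # Benford's expected first-digit frequency
    obs = observed.get(d, 0) / total
    print(f"{d:5d}  {obs:8.1%}  {benford:7.1%}")
```

With only 45 samples in a narrow range this is a rough smell test for invented numbers, not a formal statistical test.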

Having 45 reliable samples, what could we do with them? If we assume that the pattern of stories per epic seen in these 45 will hold across the entire 328 epics, I could simply sum the number of stories in the 45 sampled epics and scale that sum up by the proportion of the total, as shown in Equation 7.

Equation 7 — Extrapolating the sum of samples to a total amount: estimated total stories = sum of sampled story counts × (total epics ÷ epics sampled)

1713 is our first estimate. Is it better than gut instinct? Hard to tell; everyone’s gut instinct is different and subject to wishful thinking. Taking a page out of the book of the researchers on the Tank Problem, how else can we come up with an estimate? Average multiplied by total epic count. Median multiplied by total epic count. Four groups of ten samples (the first four columns in Table 4) extrapolated using the technique shown in Equation 7. Seven different estimates that we can average; one of them must be right!

  • 45 samples extrapolated for 328 epics = 235 × (328/45) = 1713
  • Average of the 45 samples × 328 epics = 1640
  • Median of the 45 samples × 328 epics = 984
  • First 10 samples extrapolated for 328 epics = 1509
  • Second 10 samples extrapolated for 328 epics = 2263
  • Third 10 samples extrapolated for 328 epics = 1345
  • Fourth 10 samples extrapolated for 328 epics = 1312

Lowest = 984. Average = 1538. Median = 1509. 80th Percentile = 1698. Highest = 2263.
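As a minimal sketch, the summary line above can be reproduced from the seven estimates (NumPy’s default linear percentile interpolation happens to give the same 80th percentile figure):

```python
# Minimal sketch: summary statistics over the seven estimates listed above.
import numpy as np

estimates = [1713, 1640, 984, 1509, 2263, 1345, 1312]

print(f"Lowest  = {min(estimates)}")                    # 984
print(f"Average = {np.mean(estimates):.0f}")            # 1538
print(f"Median  = {np.median(estimates):.0f}")          # 1509
print(f"80th    = {np.percentile(estimates, 80):.0f}")  # 1698
print(f"Highest = {max(estimates)}")                    # 2263
```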

My final estimate was 1700. I took the 80th percentile value and rounded it up so it wasn’t read as more precise than it was; after all, 1698 looks like it was computed with more precision than 1700. The client, knowing its current teams deliver approximately 10 stories a week, immediately saw that this was a HUGE project (170 weeks) and wisely reconsidered alternatives, deciding to adopt a progressive replacement of key parts of the website. Sampling built a reliable enough story that it had more credibility than me just saying “this is huge, please don’t do it all at once.” It also makes a great story for teaching how sampling cost me a lot of consulting time I need to reclaim by writing this book.

How certain are we in this estimate? We have 7 different possible estimates built using different methods and data. Five were under 1700, one was borderline (1713), and one was way over (2263). That’s a pessimistic 5/7 (71%) or an optimistic 6/7 (86%) depending on how the borderline possibility is handled (excluded or included). These calculations use the formula for computing probability introduced last chapter,

p = number of “desired” possibilities / all possibilities.

Desired, in our case, means equal to or below our specified estimate of 1700 total stories.

5/7 = 0.71 = 71% to 6/7 = 0.86 = 86%

Just as important as the estimate itself, the assumptions made in deriving it need to be captured and confirmed. For this project, here are the assumptions and how each was investigated –

  • The number of samples taken explored the likely range of actual values.

At 45 samples, the estimate is a 96% chance that the next sample falls within the range seen so far, and only a 2% chance that the next sample is above the highest seen (see Table 3 for a reminder on how this is calculated, and the sketch after this list).

Possible next step to improve on this estimate: break down all of the epics into stories, or start work and get actual samples after epics are built in code.

  • The epics sampled were picked at random from all parts of the website.

Confirmed the epics were picked at random from the entire set by asking the people who chose them how they did it.

Possible next step to improve: I would personally have to pick the samples using a random process.

  • The story count estimate for each epic was reliably performed, not just invented under duress.

Performed a Benford’s Law analysis to confirm the data was likely random (more later).

Possible next step to improve: I would need to split the group into two and compare results, or observe the breakdown myself to hear vigorous discussion and discourse.

  • Outlier values were carefully evaluated and accounted for appropriately.

Anything 10x the average was split into multiple epics and added back into the total project epic count. Three outliers of 50+ stories were sampled, increasing the epic count by 15 (5 each).

The feature epic count was increased across the board by the same outlier ratio seen in the samples to account for un-sampled outliers.

Possible next step to improve: split every epic and get an exact count of outliers. Current exposure: given 45 samples, there is a 2% chance something is still above the highest seen. Worst case, 8 more outlier epics (2% of 400) are possible.
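The 96% and 2% figures above come from a simple order-statistics rule (the same one behind Table 3): with n independent samples, the next observation lands above the current maximum with probability roughly 1/(n+1), and inside the range seen so far with probability roughly (n-1)/(n+1). A minimal sketch, assuming that rule:

```python
# Minimal sketch, assuming the order-statistics rule referenced above:
# with n independent samples, the next observation falls above the current
# maximum with probability ~1/(n+1) and within the observed range with
# probability ~(n-1)/(n+1).
def range_coverage(n_samples: int) -> tuple[float, float]:
    within_range = (n_samples - 1) / (n_samples + 1)
    above_highest = 1 / (n_samples + 1)
    return within_range, above_highest

within, above = range_coverage(45)
print(f"Within the range seen so far: {within:.0%}")  # ~96%
print(f"Above the highest seen:       {above:.0%}")   # ~2%
```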

The question to always consider: what would be needed to get the next sizable improvement in accuracy? I write these down so I know what I need to do if I’m asked for more detail. I constantly ask myself, is this next step worth the cost, or is the certainty we have now good enough to force a decision? In this case, this quick analysis forced a (correct, in my opinion) decision. But in the back of my mind I was ready for the next step if asked. That next step was to start building parts of the website and confirming that the actual story counts per epic matched the estimate samples in size and frequency. In other words, we needed actual data to improve the forecast. We weren’t ready to build, so I had gone as far as needed for a decision, and as far as the investment would allow for now.

Tip: Don’t do this by hand. We have a spreadsheet that helps!

Download the Story Count Forecaster spreadsheet (download it here)

Latent Defect Estimation — How many bugs remain?

Not all software is perfect the moment it is written by a sleep-deprived twenty-year-old developer coming off a Game of Thrones marathon weekend. Software has defects. It’s far more likely than not that some un-discovered defects are present in any piece of modern-day software. Should the current version be shipped to start solving customer problems? If more testing is needed, how much more?

It’s not about zero (known) defects. It’s about getting value to the customer faster so their feedback can help drive future product direction. There is risk in too much testing and too much beta trial time for commercially sold software. Getting to a zero known defect count is easy: delete the bug tracking database, or just don’t test for defects at all. The question we need answered is not how many defects we know about, but how many we don’t. If there were an un-discovered defect, how likely is it we would have detected it by now with the testing we have done?

Yes, you heard right. We want an estimate of something we haven’t found yet. In actual fact, we want an estimate of “if it is there, how likely would we have been to find it.” A technique we introduced earlier for biologists counting fish in a pond becomes a handy tool for answering this fishy question as well. How many undiscovered defects are in my code? Can (or should) we ship yet?

The Capture-Recapture Method

Capture-recapture is a way to estimate how well the current investigation for defects is working. The basic principle is to have multiple individuals or groups analyze the same feature or code and record their findings. The ratio of overlap (found by both groups) and unique discovery (found by just one of the groups) gives an indication of how much more there might be to find.

I first encountered this approach by reading the work of Watts Humphrey, notable for the Team Software Process (TSP) and his work at Carnegie Mellon University’s Software Engineering Institute (SEI). He first included capture-recapture as a method for estimating latent defect count as part of the TSP. Joseph Schofield of Sandia Labs has published more recent papers on implementing this technique for defect estimation, and it’s his examples I borrow heavily from here (Schofield R. J., 2008) and (Schofield J. R., 2008).

I feel compelled to say that not coding a defect in the first place is superior to estimating how many you have to fix, so this analysis doesn’t give you permission to skip avoiding defects using any and all extraordinary methods available to you (pair programming, test-driven development, code reviews, earlier feedback). It is far cheaper to avoid defects than to fix them later. This estimation process should be an “also,” and that’s where statistical sampling techniques work best. Sampling is a cost-effective way to build confidence that if something big is there, chances are we have seen and dealt with it.

The capture-recapture method assigns one group to find as many defects as they can in a feature, area of code, or documentation. A second (and third, or fourth) group tests and records all the defects they find. Some defects found will be duplicates, and some will be uniquely discovered by just one group. Just as the ratio of total tagged fish to recaptured tagged fish was used to compute the total fish in the pond in Chapter 1, the ratio of commonly found to uniquely discovered defects allows the total defect count to be estimated.

If two independent groups find exactly the same defects, it is likely that the latent defect count is extremely low. If each independent group finds only unique defects, then it’s likely that test coverage isn’t high and a large number of defects remain to be found; testing should continue. Figure 9 shows this relationship.

Figure 9 — The capture-recapture method uses the overlap from multiple groups to scale how many undiscovered defects still exist. It assumes both groups feel they have thoroughly tested the feature or product.

Equation 8 shows the two-part calculation required to estimate the number of un-discovered defects. First, the total number of defects is estimated by multiplying the count of defects found by group A by the count of defects found by group B, then dividing by the count of defects found by both groups (the overlap). The second step subtracts the currently found defect count (no matter who found it) from the estimated total. The result is the number of defects still un-discovered.

Equation 8 — Estimated total defects = (defects found by A × defects found by B) ÷ defects found by both; un-discovered defects = estimated total - unique defects found so far

Figure 10 shows a worked example of capturing which defects each group discovered and feeding them into Equation 8. Three defects are estimated to be latent, still undiscovered. This estimate doesn’t say how important they are, or whether it’s worth proceeding with more testing. It does say that it’s likely two-thirds of the defects have been found, so things are looking on the right path to success.

Figure 10 — Example capture-recapture table and calculation to determine how many defects remain un-discovered.
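A minimal sketch of the Equation 8 calculation follows; the defect counts are hypothetical placeholders, not the numbers from Figure 10:

```python
# Minimal sketch of the capture-recapture (Equation 8) calculation.
# The defect counts below are hypothetical placeholders, not Figure 10's data.

found_by_a = 5       # defects reported by group A
found_by_b = 4       # defects reported by group B
found_by_both = 2    # defects reported by both groups (the overlap)

estimated_total = (found_by_a * found_by_b) / found_by_both
unique_found = found_by_a + found_by_b - found_by_both  # count each defect once
undiscovered = estimated_total - unique_found

print(f"Estimated total defects: {estimated_total:.0f}")  # 10
print(f"Found so far:            {unique_found}")         # 7
print(f"Estimated undiscovered:  {undiscovered:.0f}")     # 3
```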

To understand how Equation 8 works, we need to revisit the fish-in-the-pond capture-recapture equations shown in Chapter 1. The basic capture-recapture ratio formula needs to be rearranged to solve for the total fish in the pond, which in this context is the total number of defects for our feature or code. Equation 9 shows this transition step by step (thanks to my lovely wife for the algebra help!).

Equation 9 — The geeky math. You don’t need to remember this. It shows how to get from the fish in the pond equation to the total defects equation.
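For readers who want the algebra, here is a sketch of the rearrangement Equation 9 presumably walks through, under the mapping tagged fish → defects found by group A, recaptured sample → defects found by group B, and tagged recaptures → defects found by both groups:

```latex
% Sketch of the rearrangement, assuming the mapping described in the lead-in.
\[
\frac{\text{tagged in pond}}{\text{total in pond}}
  = \frac{\text{tagged in recaptured sample}}{\text{recaptured sample size}}
\quad\Rightarrow\quad
\text{total in pond}
  = \frac{\text{tagged in pond} \times \text{recaptured sample size}}
         {\text{tagged in recaptured sample}}
\]
\[
\text{total defects}
  = \frac{\text{found by A} \times \text{found by B}}{\text{found by both}},
\qquad
\text{undiscovered} = \text{total defects} - \text{unique defects found}
\]
```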

Like all sampling methods, it’s only as valid as the samples obtained. The hardest part is getting multiple groups focused on reporting everything they see. The duplicates matter, and people are so used to NOT reporting something already known that it’s hard to get them to do it. I suggest using paper to capture these reports. Give each group a different color of post-it note pad to capture defects, then organize them on a whiteboard or wall after both groups conclude testing. Identify duplicates by collating them on the whiteboard, sticking them together if they are the same defect, as shown in Figure 11.

Figure 11 — Tracking defects reported using post-it notes.

This type of analysis takes effort, but the information it yields is a valuable yardstick of how releasable a feature currently is. It’s not a total measure of quality; the market may still not like the solution as developed, which is why there is risk in not deploying it, but customers certainly won’t like it more if it is defect-ridden. Customers need a stable product to give reliable feedback about improving the solution you imagined, rather than just reporting that it looks wrong. The two main capture-recapture experiment vehicles are bug-bash days and customer beta test programs.

Bug-Bash Days

Some companies have bug-bash days, where all developers are given dedicated time to look for defects in certain features. These are ideal days to set multiple people the task of testing the same code area and performing this latent defect analysis. It helps to have a variety of skillsets and skill levels perform this testing; it’s the different approaches and expectations in using a product that kick up the most defect dust. The only change from a traditionally run bug-bash day is that each group keeps individual records of the defects they find.

To set up the capture-recapture experiment, dedicate time for multiple groups of people to test independently, as individuals or small groups. Two or three groups work best. Working independently is key. They should record their defects without seeing what the other groups have found. Avoid having the groups use a common tool; even though you instruct them not to look at other groups’ logged defects, they might (use post-it notes as shown earlier in Figure 11). They should be told to log every defect they find, even if it’s minor. They should only stop once they feel they have given the feature a good thorough look and would be surprised if they had missed something big.

Performing this analysis for every feature might be too expensive, so consider doing a sample of features. Choose a variety of features that might be key indicators of customer satisfaction.

Customer Beta Programs

Another way of getting this data is by delivering the product you have to real customers as part of a beta test program. Allocate members at random to two groups; they don’t even have to know which group they are in, you just need to know during analysis. Capture every report from every person, even if it’s a duplicate of a known issue previously reported. Analyze the data from the two groups for overlap and uniqueness using this method to get an estimate of latent defects.

Disciplined data capture requires that you know which group each beta tester is in. A quick way is to use the first letter of the customer’s last name: A–K is group A, L–Z is group B. It won’t give exactly equal membership counts, but it is an easy way to get roughly two groups. Find an easy way in your defect tracking system to record which groups reported which defects. You need a total count found by group A, a total count found by group B, a count of defects found by both, and a total number of unique defects reported. If you can, add columns or tags to record “Found by A” and “Found by B” in your electronic tools and find a way of counting based on these fields. If this is difficult, set a standard for defect titles by appending an “(A)”, “(B)”, or “(AB)” string to the end of the title. Then you can count the defects found only by A, only by B, and by both by hand (or, if clever, by search).
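A minimal sketch of that bookkeeping, assuming the “(A)”, “(B)”, “(AB)” title convention above (the last names and defect titles are hypothetical):

```python
# Minimal sketch: assign beta testers to groups by last name, then tally
# defects from titles tagged "(A)", "(B)", or "(AB)". All data is hypothetical.

def group_for(last_name: str) -> str:
    """A-K -> group A, L-Z -> group B (rough split by first letter)."""
    return "A" if last_name.strip().upper()[0] <= "K" else "B"

print(group_for("Magennis"))  # 'B' -- M falls in L-Z

defect_titles = [
    "Login page spinner never stops (A)",
    "Search results truncated (AB)",
    "Broken image on article page (B)",
    "Timezone wrong on timestamps (A)",
]

found_by_a = sum(t.endswith(("(A)", "(AB)")) for t in defect_titles)
found_by_b = sum(t.endswith(("(B)", "(AB)")) for t in defect_titles)
found_by_both = sum(t.endswith("(AB)") for t in defect_titles)
unique_found = len(defect_titles)

estimated_total = (found_by_a * found_by_b) / found_by_both
print(f"Estimated latent defects: {estimated_total - unique_found:.1f}")
```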

There will be a point of diminishing returns on continuing the beta, and this capture-recapture process could be used as a “go” indicator that the feature is ready to go live. In this case, keep the analysis going until the latent defect count falls to a lower trigger value that indicates deployment quality. Using this analysis could shorten a beta period and get a loved product into customers’ hands earlier, with the revenue benefits that brings.

Tip: Don’t do this by hand. We have a spreadsheet that helps!

Download the Latent Defect Estimation spreadsheet (download it here).

Download a single page sheet Latent Defect Estimation (download it here)

Statistical Rule of Three

The Rule of Three answers a slightly different question than how many defects sit undiscovered in a piece of code. This rule answers: “After checking a number of test samples and not finding ANY issues, how likely is it that there are none?” The Rule of Three says that if you are truly looking at random and don’t find anything, there is a simple bound on how likely it is that problems remain in the total population. The complex formula for the probability an issue still exists after N samples, having seen none, is –

3/N

That’s right, 3/N. It seems like it should be more complicated, but because of some handy approximations that hold when the numbers in the longer formula are close to zero, much of the complexity can be eliminated. I first heard of this through John Cook, the unflappable mathematics guru mentioned in chapter 2. He offers insights from years of consulting and teaching statistics in many fields. He wrote an insightful blog post introducing the logic behind, and mathematical proof of, the Rule of Three.

Estimating the chances of something that hasn’t happened yet

Suppose you’re proofreading a book. If you’ve read 20 pages and found 7 typos, you might reasonably estimate that the chances of a page having a typo are 7/20. But what if you’ve read 20 pages and found no typos? Are you willing to conclude that the chances of a page having a typo are 0/20, i.e. the book has absolutely no typos?

The rule of three gives a quick and dirty way to estimate these kinds of probabilities. It says that if you’ve tested N cases and haven’t found what you’re looking for, a reasonable estimate is that the probability is less than 3/N. So in our proofreading example, if you haven’t found any typos in 20 pages, you could estimate that the probability of a page having a typo is less than 15%. …

Note that the rule of three says that your probability estimate goes down in proportion to the number of cases you’ve studied. If you’d read 200 pages without finding a typo, your estimate would drop from 15% to 1.5%. But it doesn’t suddenly drop to zero. I imagine most people would harbor a suspicion that there may be typos even though they haven’t seen any in the first few pages. But at some point they might say “I’ve read so many pages without finding any errors, there must not be any.”

(Cook, 2010)

The key takeaway is -

“Just remember that if you haven’t seen something happen in N observations, a good estimate is that the chances of it happening are less than 3/N.”

The mathematical proof is a little more complex, so I’ll let John explain -

“What makes the rule of three work? Suppose the probability of what you’re looking for is p. If we want a 95% confidence interval, we want to find the largest p so that the probability of no successes out of n trials is 0.05, i.e. we want to solve (1-p)^n = 0.05 for p. Taking logs of both sides, n log(1-p) = log(0.05) ≈ -3. Since log(1-p) is approximately -p for small values of p, we have p ≈ 3/n.”

Applying this to software development, the rule comes in handy for code reviews, specification analysis, or testing. It helps you understand how much random checking should occur to avoid having to check everything. This is handy when it’s expensive to check everything, but it feels too risky to check nothing.

I’ve seen it applied recently to gain confidence that a set of servers was deployed and configured correctly: 30 servers were checked at random from a pool of 100. When these were all hand-checked and their configuration confirmed (no extra accounts, the right services running, etc.), there was likely less than a 10% chance that any server was ill-configured. Not certain, but this sample gave enough confidence to go live and continue business without having to check every server immediately. Of course automation could have checked (and in later releases did check) all 100; this was an additional check to confirm likely success.
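A minimal sketch of that calculation, using the server numbers above and the page counts from John’s proofreading example:

```python
# Minimal sketch of the Rule of Three: after n clean samples with no issues
# found, the chance an issue exists is roughly bounded by 3/n (95% confidence).

def rule_of_three(n_clean_samples: int) -> float:
    return 3 / n_clean_samples

print(f"{rule_of_three(20):.1%}")   # 20 clean pages   -> <15% (proofreading example)
print(f"{rule_of_three(30):.1%}")   # 30 clean servers -> <10%
print(f"{rule_of_three(200):.1%}")  # 200 clean pages  -> <1.5%
```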

The beauty of this formula and technique, once you get a feel for it, is that it demonstrates just how quickly uncertainty decreases with sampling, while never reaching zero unless every member of your group is checked. Figure 12 charts the Rule of Three equation. You can clearly see how fast the chance drops, from a 75% chance that something undiscovered exists at just a few samples down to a 10% chance by 30 samples. It also shows that the result NEVER reaches zero chance. Sampling helps you be more informed, not certain!

Figure 12 — The equation 3/N charted. Samples along the x-axis; the probability something might exist that you haven’t seen yet on the y-axis.

Summary

Sampling is a powerful technique for cost-effectively getting probabilities about story counts, defect counts, and the likelihood that something not yet seen is there but not found. Fewer samples than you might expect are needed to get answers that are likely better than gut instinct alone.

Key points and tips discussed in this chapter:

  • Sampling does solve real world software problems.
  • Always document the process and assumptions used in a forecast estimate.
  • Always assess what the next step would be for improving an estimate. Stop when you have enough for an answer.
  • Use sampling techniques to know when to stop: for example, whether to keep looking for defects or to ship with confidence.
