Estimating, How Big

troy.magennis
Forecasting using data
21 min read · Jul 8, 2017

Chapter 7 — DRAFT

This chapter looks at estimating how big a feature or project may be. Knowing how big something is provides an important input for any model that forecasts how long it will take or how much it will cost.

Goals of this chapter

  • Learn three different approaches to estimating project or feature size
  • Learn what approaches are preferred and why

Author’s NOTE: I’m not totally happy with this chapter. It spends too much time explaining things I don’t think should be used. I would love your feedback on what you think is the right content for this chapter.

Estimating Feature or Project Size

How much work needs to be completed is an obvious input for any time and cost forecasting model. Without an estimate of size, a tiny project would be indistinguishable from a massive project in delivery time, which doesn’t make logical sense. Size is the starting point for judging progress and telling when “finished” is reached.

“Finished” often has an evolving definition. Don’t rush understanding what “finished” means. If some parts of a feature can be delivered separately, then it may be a good strategy to split the feature or project into two different deliverables. You don’t have to ship them separately, but you have that option if you build iteratively. This can help get feedback and/or revenue earlier, and it definitely helps focus on delivery.

Split a feature or project if possible

Splitting features and projects into smaller batches helps keep uncertainty to manageable levels. People are better at judging small to medium sized features and projects than large and huge ones. Always look for ways to split work into smaller batches before estimating and forecasting begins.

Once we have a unit of deliverable work we want to estimate size for, the next decision we have to make is what unit of measure “size” will be expressed in. Commonly used size units in the software world are –

  • Count of features or feature stories (lowest effort)
  • Sum of size buckets (small, medium, large) for feature stories
  • Sum of arbitrary size effort units called “story points”
  • Sum of expected calendar time to build the feature stories (maximum effort)

The choice boils down to how you can estimate or measure completion pace. For our car journey example, we use miles (or kilometers) for distance to travel which is equivalent to size in this software model. We measure pace in miles or kilometers traveled per hour. Although we can convert from miles to kilometers, we tend to stick with the same unit for size and pace to avoid simple mistakes (like the Mars Climate Orbiter[1]). Forecasting how long something will take is often as simple as dividing the amount of work (size) by the delivery pace. This tells us how many units of “pace” it takes to travel or complete all work. In other words, how long.
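To make the arithmetic concrete, here is a minimal sketch of the size-divided-by-pace calculation. The numbers are made up; the point is simply that size and pace must share the same unit (stories, in this case) before the division means anything.

```python
# A minimal sketch of "how long = size / pace", using made-up numbers.
# Keep size and pace in the same unit (stories) to avoid Mars-Orbiter-style mix-ups.

remaining_stories = 120        # size estimate, in stories (hypothetical)
stories_per_week = 8           # measured completion pace (hypothetical)

weeks_remaining = remaining_stories / stories_per_week
print(f"Roughly {weeks_remaining:.0f} weeks of work remaining")  # -> 15 weeks
```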

A general rule when picking a measurement unit for estimating is to use the one that gives maximum predictive power for the least effort. If two units of estimation give equal predictive power, use the one that takes the lowest effort to gather. All of the options we have for software size require some breakdown of features into the stories that will build that feature. The first and easiest option is to just get a total story count. This is a perfectly fine choice as long as it proves to be predictive.

Total story count as size, and completed stories per week as pace, is often a winning combination. If the mix of different effort stories varies massively week to week, then it will fail, and a measure that incorporates a scaled size for each story (story points or a calendar time estimate) will be necessary. I’ve not seen this occur in software projects unless a few particular feature stories were really another feature in magnitude. If there are outliers like this among the feature stories, they either need to be broken down into smaller stories, or some other way of showing their different magnitude is needed. The most common way is to allocate “story points” to the stories: bigger stories are allocated more points.

I’ve not seen story point forecasts perform better than plain story count when a reliable sampling approach has been applied to the story count estimate. In Chapter 6, we saw a real world application of sampling to estimate total story count. There would have been no utility in allocating story point estimates in that process. A decision was made without the extra team effort of allocating points, and segmenting the data and comparing each segment (back-testing) showed no evidence of significant outliers that would invalidate a story count forecast. No story point estimates were necessary in that case.

If you want to confirm that story-point estimating is unnecessary in your data, take random groups of previous or current story point estimates and calculate the average of each group (a quick sketch of this check follows below). If the averages are stable, then story count would give the same predictive power without the effort of the team arguing for days on end (or homicides or suicides). I tested this assumption on data from one hundred teams doing every task from marketing to code development. When teams use the typical Fibonacci scale of 1, 2, 3, 5, 8, 13, 21, the average was (and will be) slightly less than 5 and the median was (and will be) around 3. Test your data, and if you see this outcome, stop performing story estimation using planning poker (if you don’t know what this is, we discuss it soon) if it is JUST for forecasting purposes. You might still do point estimation to clarify that everyone understands what is being built, but throw the points away when predicting how long.
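Here is a rough sketch of that check. The Fibonacci-scale history and group size are invented for illustration; run the same idea over your own past estimates. If the random group averages barely move, summed points will forecast no better than a plain count.

```python
import random
import statistics

# Sketch: do random groups of story point estimates have a stable average?
# The history below is invented; substitute your own team's past estimates.
random.seed(7)
history = random.choices([1, 2, 3, 5, 8, 13, 21],
                         weights=[10, 20, 30, 20, 12, 6, 2], k=500)

group_averages = [statistics.mean(random.sample(history, 30)) for _ in range(10)]

print("Group averages:", [round(a, 1) for a in group_averages])
print("Spread:", round(max(group_averages) - min(group_averages), 1))
# A small spread relative to the average means points add little beyond story count.
```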

Doesn’t story size matter?

Probably not. It depends on your development and delivery process, but often system factors account for more of the elapsed delivery time than different story sizes.

Consider commuting to work by car each day. If the road is clear of traffic, then the distance traveled is probably the major cause of travel time. At peak commute time, it’s weather and traffic congestion, and it would be difficult to estimate travel time based on distance alone. For software development, if one person (or a team) could pick up the work and remain undisturbed from start to delivery, then story point effort estimates will match delivery time. If there are hand-offs, dependencies on other teams, expedited production issues or other delays, then actual delivery time will diverge from the estimate.

This is measured as “process efficiency”, the ratio of hands-on time to total elapsed time. For software development this is often between 5–15%, meaning that even if we nailed the effort estimates in points, we would only be accurately predicting 5–15% of elapsed delivery time! We need to find ways to accurately forecast (or remove) the non-work time influenced by the entire system.
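As a quick worked example of the process efficiency arithmetic, with assumed numbers:

```python
# Process efficiency = hands-on time / total elapsed time. Numbers are assumed.
hands_on_days = 3     # time someone actively works the story
elapsed_days = 30     # calendar time from start to delivery

efficiency = hands_on_days / elapsed_days
print(f"Process efficiency: {efficiency:.0%}")                             # -> 10%
print(f"Elapsed time driven by waiting and delays: {1 - efficiency:.0%}")  # -> 90%
```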

If it’s not obvious, I have a clear bias towards story count. It’s not that story points are wrong, but I’ve seen them overused and given mystical predictive properties that don’t exist in the real world. Just like we saw in Chapters 4, 5 and 6 though, I encourage you to use every method you have at your disposal and to compare and average the results. Once you determine that points and count yield similar results, drop the one that takes more effort and free teams from story point estimation.

Whatever method is chosen, always estimate a range rather than a single number. The ideal range target to set teams is a 90 percent chance that the eventual actual will fall within the estimated range. A 90 percent range means that 10 out of 100 times the actual will fall outside the range, roughly 5 below the low and 5 above the high. If you ask for a guarantee of minimum and maximum, people often make the range huge to account for every conceivable issue they can think of, no matter how remote. This uncertainty propagates into the final forecast, giving results like “sometime between 2016 and 2018.” I didn’t need an estimate to work that out!
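To see how range width propagates, here is a small simulation-style sketch in the spirit of the earlier chapters. The weekly pace samples and story count ranges are invented; the comparison simply shows that an everything-included “guaranteed” range swamps the forecast.

```python
import random

# Sketch: how the width of a size range propagates into a completion forecast.
# Pace samples and story ranges are hypothetical.
random.seed(1)
weekly_pace_samples = [5, 7, 6, 9, 4, 8, 6, 7]   # stories finished per week, observed

def forecast_weeks(low_stories, high_stories, trials=10_000):
    results = []
    for _ in range(trials):
        size = random.randint(low_stories, high_stories)  # size drawn from the range
        weeks, done = 0, 0
        while done < size:
            done += random.choice(weekly_pace_samples)    # one simulated week
            weeks += 1
        results.append(weeks)
    results.sort()
    return results[int(trials * 0.05)], results[int(trials * 0.95)]  # 90% interval

print("90% range estimate (80-120 stories):   ", forecast_weeks(80, 120))
print("'Guaranteed' estimate (40-400 stories):", forecast_weeks(40, 400))
```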

There are multiple ways to get a story count estimate. These same methods work for any other unit you choose as well; I’m just going to focus on story count for this discussion. Here is a short list of facilitation techniques, ordered from most preferred to least preferred, that achieve rapid, reliable estimates of story count based on cognitive science.

Relative Estimation (AKA. Reference Class Forecasting) — EASIEST and BEST (IMHO)

One way to get a quick estimate of how many stories a feature is composed of is to see how many stories it actually took to deliver a similar feature. Keep an ongoing record of the number of built stories for delivered features. Make these reference features available to the team when they are estimating a proposed future feature.

A powerful way to use reference class estimation is to print prior features and their estimates and actuals on individual sheets of paper. Position them along a table from smallest size to largest. Ask the team to compare the proposed feature with those prior features and position a post-it note where they feel it fits along the continuum. Get them to pick an optimistic and a pessimistic size, and make those the low estimate and high estimate for this feature. When that feature is completed, add it to the list of reference features for next time.

A typical relative story count estimation process is (also see Figure 7–1) –

1. Lay out prior features across a table from lowest story count to highest story count.

2. Introduce the feature being estimated. Someone needs to explain what the feature is in enough detail for the team to understand likely stories and difficulties.

3. The group picks a position on the table where this feature fits relative to the other prior features.

4. There should be some debate. Aim to get the group to agree on an optimistic (low count) and a pessimistic (high count) similar feature. Use these as the low count estimate and high count estimate.

Figure 7–1 : Reference class forecasting story count by comparing to previously completed features. Step 1, introduce prior features. Step 2, introduce the new feature. Step 3, position and estimate the story count range. Note: Feature 3 is still going, so no “actual” count.

When a team gets used to finding similar features from the library of historical delivered features, this type of estimation will be rapid and painless. The team can self-calibrate by spending some time comparing their estimates against the actual count for delivered features. Teams quickly identify why they missed and apply that knowledge to future feature estimates. In a perfect world, story counts for each feature would be an asset that organizations cherish and lovingly maintain.

To capture the historical data and help the team maintain feature count calibration over time, I keep data in the form shown in Table 7–1. It doesn’t have to be complex. Reviewing it prior to assessing a proposed feature helps the team remember successes and misses.

Table 7–1: Example of a historical story count table. Keeps a record of prior estimates and how they turned out. Helps the team learn why they missed an estimate range, and find similar features next time.
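Kept electronically, the record needs nothing fancier than the feature name, the estimate range, the actual count, and a note about why it landed where it did. A sketch of one possible shape follows; the field names and most of the numbers are invented, with Feature 2 and Feature 3 following the examples used in this chapter.

```python
# A sketch of the Table 7-1 record, with invented field names and mostly invented data.
reference_features = [
    {"feature": "Feature 1", "low_est": 4, "high_est": 9, "actual": 7,
     "notes": "Hypothetical: integration took longer than expected"},
    {"feature": "Feature 2 (email template)", "low_est": 2, "high_est": 5, "actual": 3,
     "notes": "Delivered; used as a reference class later in this chapter"},
    {"feature": "Feature 3", "low_est": 6, "high_est": 12, "actual": None,
     "notes": "Still in progress, no actual count yet"},
]

# Lay the features out from smallest to largest (use the high estimate when unfinished).
for f in sorted(reference_features,
                key=lambda f: f["actual"] if f["actual"] is not None else f["high_est"]):
    actual = f["actual"] if f["actual"] is not None else "in progress"
    print(f'{f["feature"]}: estimated {f["low_est"]}-{f["high_est"]}, actual {actual}')
```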

Beyond its efficiency, there is real science behind why this way of estimating will perform better than others. It forces the group to take an “outside view” of the feature and avoid dwelling on this feature’s specifics before understanding how much work things like it have historically required. It stops the group falling into a bias known as the Planning Fallacy.

The planning fallacy bias looms large whenever a bigger plan is built from smaller tasks or steps. When planning in detail, estimates more often fall on the optimistic (best-case) side, causing chronic underestimation when combined. Planning each small step or task individually misses the system level delays and difficulties that interconnect each step or task. Taking an outside view by looking at how long (or big) roughly similar previous features took includes not only the steps and tasks, but also the wider system delays that are hard or impossible to predict in advance.

This technique is also known as Reference Class Forecasting, described in works by Amos Tversky and Daniel Kahneman (Kahneman & Tversky, Prospect Theory: An Analysis of Decision under Risk, 1979) (Kahneman & Tversky, Intuitive prediction: Biases and corrective procedures, 1977). The concepts behind reference class forecasting earned Kahneman a Nobel Prize in Economics. Their recommendation is to find similar past situations, use those as the base level, and adjust your case up and down based on specific context and circumstances. Similar work and practical conclusions come from Bent Flyvbjerg, who looks at how major infrastructure projects have performed poorly versus budget (Flyvbjerg, 2006).

This is a great habit to get into, whether you are forecasting your software projects and features or just life in general. The key is finding some data that sets the base rate for your estimate, then drawing conclusions for your specific case. Discuss the specific reasons why you might assume your case will be more or less than the base rate, and by how much.

For example, you might be tasked with forecasting how much work is involved in delivering a new email template. Looking at Table 7–1, Feature 2 was an email template. It took 3 stories of work to deliver (counted after delivery, so this is an actual). The new template might consist of 10 embedded fields, double what Feature 2 had. It would be suspicious if the estimate for the proposed feature was less than 3; it’s much more likely to be larger. The team should discuss how many more stories doubling the number of fields adds. They might settle on 50% more, which gives 4.5 stories minimum (rounded up to 5) for the low estimate. Allowing for it being 100% more, double that increase (3 + 2 = 5 for the low, then 5 + 2 = 7) to make the high estimate 7. Final answer: 5 to 7 stories. This fits the historical performance and the logic: it’s a bigger feature, so more work is needed.
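Written out as arithmetic, with the same numbers the discussion above settled on:

```python
import math

# The email template example: base rate from Feature 2, adjusted for double the fields.
base = 3                           # Feature 2 actual story count (the reference class)
increase = math.ceil(base * 0.5)   # team agreed on "50% more": 1.5 extra, rounded up to 2
low = base + increase              # 3 + 2 = 5 stories
high = base + 2 * increase         # allow for 100% more by doubling the increase: 7 stories

print(f"Estimate range: {low} to {high} stories")   # -> 5 to 7 stories
```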

The form of recording historical data shown in Table 7–1 captures both the base rate expected for similar features and the reasons some features encountered a different outcome. If I could urge software companies to do one thing, it would be to keep this data up to date and available to all staff during planning sessions. I often generate this data by diving into historical records, but it’s far easier to just keep an ongoing record over time. The narratives of why estimates missed the actuals are an incredible resource for seeing where system improvements will pay the biggest dividend.

Every feature will be unique, and if the team struggles to find something similar to compare against, the discussion should help them move on to one of the other techniques. The next one is getting the team to estimate a low and high story count range for this specific feature.

Estimating a Low and High Story Count Range with 90% confidence — GOOD

If there aren’t any prior features to compare a proposed feature with, then the low and high boundaries need to be estimated directly. Estimating a range rather than a single point offers multiple advantages. The first is that groups spend less time locked in heated battle about agreement on a single number. The second is that the range estimate more honestly captures the uncertainty of an estimate versus a single number. The team goal is that the eventual actual story count falls within the range given, nine times out of ten (90% confidence).

It takes practice to become competent at estimating a 90 percent confidence range. In his book The Failure of Risk Management: Why It’s Broken and How to Fix It, Douglas Hubbard (Hubbard, 2009) describes techniques to help calibrate experts into giving superior 90 percent range estimates. His techniques take would-be estimators through practice questions to help them understand how over-confident they are, even when they are wrong! It turns out that calibrating on questions that have nothing to do with the feature at hand, just general knowledge, still calibrates how people assess their own biases. Once they are good in general at understanding how certain they are versus how certain they think they are, the same mental muscles apply when forecasting something software related.

Start each estimation session by asking the assembled experts to determine a 90 percent range for a question whose answer can be immediately checked on Wikipedia or Google. Coach the group that it’s not a competition to find an exact estimate; it’s a competition to make sure they are confident the final actual result is somewhere in the range they choose nine out of ten times. The training questions in Douglas Hubbard’s book are simple to research (make sure no smart-phones or laptops are open, Wikipedia knows all), but it is unlikely that anyone in the room already knows the answer. Some examples are: “What is the height of the Statue of Liberty?” and “What was the population of China in 2000?” Your goal is to help the experts improve their mental process for finding the lowest value they think probable, the highest value they think probable, and the all-important — why?
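Scoring the warm-up is simply counting how often the stated ranges contained the actual answer. A sketch, with invented ranges and approximate actuals:

```python
# Each entry is (low, high, actual) for one warm-up question. Ranges are invented;
# the actuals are approximate (Statue of Liberty is about 93 m to the torch including
# the pedestal; China's population in 2000 was roughly 1,263 million).
answers = [
    (80, 120, 93),       # Statue of Liberty height in metres: hit
    (900, 1500, 1263),   # population of China in 2000, in millions: hit
    (5, 15, 21),         # a miss: the actual falls above the stated high
]

hits = sum(1 for low, high, actual in answers if low <= actual <= high)
print(f"Hit rate: {hits / len(answers):.0%} (the target is about 90% over many questions)")
```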

At first, many experts choose a narrow range, almost trying to impress the facilitator with how small a range they can guess. Hubbard (Hubbard, 2009) offers some strategies for getting the expert to understand their certainty better. His advice (and that of others quoted in his book) suggests that humans get very well attuned to risk when money is involved. Offering someone even a hypothetical chance to win or lose a cash reward tunes their risk radar that little bit more. His technique gives the expert a choice, one option being an absolute 9 out of 10 chance of winning, and then asks them whether they prefer their chances on their own estimate or on the fixed choice. Here is a hypothetical conversation that uses this method –

You: “How many stories do you think are in the new feature adding the new hotel booking pages? There are five of them, but let’s just focus on how big delivering one of them is.”

Expert: “5 stories.”

You: “OK, but we are after a low-bound and high-bound story count here, a range that you would be confident, 90% confident to hit.”

Expert: “3 to 7 stories.”

You: “Alright, remember we are looking for a range that would be 90% certain of hitting, missing one time in ten. Let me make it interesting to explore how confident you are in your range. What if I offered you a choice –

Option A: You will win $1,000 if your colleagues agree and their ranges turn out to be similar to yours or narrower than the numbers you gave for the upper and lower bounds. If not, you win nothing.

Option B: You draw a marble at random from a bag of nine green marbles and one red marble. If the marble is green, you win $1,000. If it is red, you win nothing (i.e., there is a 90% chance you win $1,000).

About now, you will see if the expert truly trusts their range, and most often they don’t — they are probably somewhat certain, but not 90% certain just yet. It seems obvious to them that the bag of marbles has a much greater chance of paying out, and that would only be true if they had some misgivings about their range.

Expert: “I might be a bit low on the high estimate. Considering these pages have a lot more dynamic code, and we need to test multiple browsers, in hindsight, I’d like to increase both ranges. Low bound of 5 stories minimum and high-bound of 12 stories.”

This type of technique is called an Equivalence Bet, or the Equivalence Urn method, for subjective risk determination. It works because human intuition grasps that drawing a marble from a container gives every marble an equal probability. It’s easy for the brain to compute the likelihood of drawing the odd-color-out marble, far easier than judging the odds of their own estimate, and the mental process of comparing something that is absolutely 90% against their estimate should help widen the range to an acceptable one, or at least test how confident they are in their peers.

Avoid anchoring the group around a single number plus or minus some variance by phrasing the question like “How many stories?”, which asks for a single value estimate. One technique to solve this dysfunction is to estimate the low count first, then after agreement move on to the high. By explicitly asking “What could the lowest count be?” and “What could the highest count be?” you are anchoring at the edges rather than the center. If average plus-or-minus thinking continues to be a problem, split the group into two and have one group estimate the lower bound count and the other group estimate the higher bound count. Get them to record their estimates secretly on a post-it note. Swap the groups and see how the numbers eventually compare. If they agree, you have a reliable forecast. If they differ, you have more work to do.

A typical story count range estimate process goes like this –

1. Start with some calibration exercises. Nothing too long, just something to reinforce the 90% confidence range estimating process.

2. Introduce the feature being estimated. Someone needs to explain what the feature is in enough detail for the team to understand likely stories and difficulties.

3. Can the feature be split into smaller features and delivered separately? Split the feature if possible, and make it clear to the group what feature you ARE NOW estimating.

4. Discuss different approaches to delivering the feature. Decide what approach the group is pursuing. Failure to do this will have half the group estimating one thing and the other half estimating something else.

5. Ask: What would be the likely minimum number of stories needed to do this feature? Discuss and find some agreement.

6. Ask: What would be the likely maximum number of stories needed to do this feature? Discuss and find some agreement.

7. Challenge the group. Make an equivalence bet. Ask the group whether they would take the bet of winning or losing $1,000 hypothetical dollars on the actual falling in their range, or if they would rather take the chance of pulling a green ball out of a bag with 9 green balls and one red (as discussed earlier). Help them adjust their confidence in the low and high estimates.

Whilst estimating like this is OK in the short term, move to relative (reference class) estimating as soon as you have three or four features under the team’s belt. Performing one-off estimates like this is a stop-gap measure because it lacks the lessons from actual experience.

Estimating a Single Story Count Number — BAD

This is my least favorite way of estimating story counts, and the least successful when it has been used in my presence. It’s also the most common. It involves asking people to give a single estimate of story count. It hides how uncertain a team or group is about a feature, and has the lowest chance of the actual result matching the single spoken estimate. If that introduction isn’t enough to turn you off, it’s also the most time consuming, because the team spends significant time finding agreement and compromise. No one is ever happy.

I had to mention it here because it’s how the client estimated size for the example in Chapter 6. In their case they outsourced facilitation, and with the vendor paid by the hour and believing more talk equaled more accuracy, the vendor made a killing. To be fair, it is also the most common way, popularized by current publications that propose this and only this style of estimation. They even invented games to make it simpler to get a crappy result.

This method will be forced upon you if a forecasting mechanism is used that can’t handle uncertainty. It does take skill to use range estimates instead of point estimates for forecasting, and it’s probably the lack of street knowledge about how to do this that drove you to read this book. I’m hoping by the end of this chapter, single-point estimates won’t be a technique you ever consider using again.

There are two main techniques for getting a single story count –

1. The team breaks down the feature into feature story detail.

2. The team estimates, through voting or consensus, how many stories will be needed.

Groups of people struggle to find consensus. More often than not, when put in a position where individuals have to have a say, decisions fall along organizational power structures. Highest paid person’s opinion. “If we are just making a decision using opinion, then I’d prefer it to be my opinion” is a quote a colleague heard from a software executive[2]. For bold groups of people, the discussion can be exhaustive, diving deep into details because “it has to be right.”

The most important aspect of any estimation method described in this book is that everyone understands exactly what they are being asked to estimate. It sounds simple enough, but a single-line written description of a feature builds very different ideas in individual brains. Often the best way to assess this is to get everyone to disclose their first thoughts about size in unison. If everyone agrees, then it’s likely the group has understood the problem and the answer is likely right. This is a variation of the logic applied in the capture-recapture process discussed for determining latent defect counts, applied here to story count estimates.

For teams who use story points rather than count, the most common technique is one called Planning Poker. Planning Poker assembles a group of “experts” who vote on story size in unison to formulate a group consensus estimate. Voting in unison avoids people being biased by how their boss voted; they have to use their own opinion. That’s perfect logic if everyone has the same experience, the same knowledge of the subject, and the foggiest clue about what was being described. For getting consensus from equal experts, marvelous. For getting a well-considered estimate from people with different levels of skill and subject matter knowledge, appalling.

Planning Poker proponents made some additional alterations that make it worse. They introduced a fixed set of responses using a Fibonacci-style sequence of values: 1, 2, 3, 5, 8, 13, 21. These values are printed on cards, and when a vote is carried out, the assembled group holds up a card in unison. This change was intended to avoid people being unable to agree on small estimate differences, and to show that larger estimates of effort are less precise. But in practice, it just causes 3 or 5 to be the most common answers. Once you limit the responses and apply pressure to make stories smaller, people avoid high numbers unless they think everyone else would agree. You get called upon to explain yourself if you are “different” from the group, and many people don’t feel safe being put in that position. I looked at the estimates of 100 teams from a large software organization and found the estimate patterns almost identical. Some groups centered around 3, some around 5, but universally the higher numbers dropped off exponentially. Only a small part of the available range was being used, making the result really a rough average. (Tamrakar & Jørgensen, 2012) documented similar results in a paper looking at how to improve software estimates.

This narrowing of values causes large forecasting errors even when the group is only one step off. If “3” was agreed and the group was one step off compared to other work of the same effort, they are 1 unit better off if the story was actually a “2”, or 2 units worse off if it was actually a “5”. If they estimated “5” and were one step off, they could be 2 units lower or 3 units higher. This asymmetry of off-by-one errors (bigger errors in the positive direction than the negative) cannot be accounted for by dividing size by some pace or velocity.
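Walking the Fibonacci scale one step in each direction shows the asymmetry directly:

```python
# Off-by-one errors on the Fibonacci scale: the step up is always bigger than the step down.
fib = [1, 2, 3, 5, 8, 13, 21]

for i in range(2, len(fib) - 1):
    est = fib[i]
    step_down = est - fib[i - 1]   # units over-counted if the true size was one step lower
    step_up = fib[i + 1] - est     # units under-counted if the true size was one step higher
    print(f"estimate {est:>2}: one step low is {step_down} off, one step high is {step_up} off")
# estimate  3: one step low is 1 off, one step high is 2 off
# estimate  5: one step low is 2 off, one step high is 3 off   ...and so on
```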

Some great work about the efficacy of Planning Poker comes out of the Simula Research Laboratory in Norway. They have many papers on estimation backed by studies. One particular paper (Haugen & Moløkken-Østvold, 2006), called Planning Poker — Playing for Better Estimates, has a great set of ideas and case study data that shows where Planning Poker works and why.

Planning poker is great for hearing a range of effort estimates. So, even though I say not to use it for forecasting, the process and discussion are a perfect application of Carl Sagan’s baloney detection toolkit. It is exactly the discussions had during planning poker that give the group confidence to weigh in on an estimate. The desired group outcome from planning poker is a shared understanding of the approach to the work, whether there are missed complications and effort, and how much agreement there is among people of different skills and backgrounds. Use this evidence to confirm the right stories are in the backlog, at the right size and split. Throw away the planning poker numbers after that; they offer little insight for forecasting!

Planning poker generates the right baloney detection questions, turning guesses into estimates

Use planning poker to look for complications, story splitting opportunities and alternative approaches. Then throw away the number. Story counts are cheaper to obtain and have proven a stable forecasting value. They also don’t require every piece of work to be analyzed before a forecast can be made.

Taking the best parts of Planning Poker for story count estimation, here is the recommended process –

1. Introduce the feature being estimated. Someone needs to explain what the feature is in enough detail for the team to understand likely stories and difficulties.

2. Can the feature be split into smaller features and delivered separately? Split the feature if possible, and make it clear to the group what feature you ARE NOW estimating.

3. Discuss different approaches to delivering the feature. Decide what approach the group is pursuing. Failure to do this will have half the group estimating one thing and the other half estimating something else.

4. Get the group to write a story count estimate on a piece of paper and then disclose that on the count of one-two-three.

5. Discuss the estimates shown and determine the reason for any major discrepancies. Agree as a group what the final single count estimate will be.

Don’t use the Fibonacci cards. They are for story points, and the results will clump around 3 and 5 far more often than the work itself warrants. Move to relative estimation (reference class forecasting) as soon as you can. Single point estimates are playing in the danger zone.

Summary

This chapter has outlined ways to estimate size quickly, with a strong preference for relative estimation based on something you have delivered before, and for ranges rather than a single value.

Feature and project size estimates are a starting point, but there will always be some work we don’t know about. The next chapter looks at how to estimate how big that growth will be so it is included in our forecasts properly.

[1] See https://en.wikipedia.org/wiki/Mars_Climate_Orbiter. One company used pound-seconds (lbf s), the other newton-seconds (N s). On orbit insertion at Mars, the orbiter flew too close and disintegrated.

[2] Personal communication by email from Donald Reinertsen when discussing how teams prioritize work based on cost of delay.
