How we solved task assessment issues with story points

Here’s how it went down.

Our issue

We have a core team here at Space307. The core team members used to work in one-week sprints, with developers taking on tasks in batches and aiming to complete them by the end of the week. Unfortunately, things would go south more often than we wanted.

Eventually, there came a point where we could no longer ignore the bus factor. Our developers knew their own sections of the code well, but were unfamiliar with others. Because of that, we often experienced delays or had to suspend operations because of vacations, sick leaves and layoffs until we could find someone to fill the gap.

Here’s a bit more background info: The core team consists of backend developers and QA engineers who are simultaneously developing and maintaining about 70 microservices. Our tasks are diverse and don’t depend on long-term product epics much. At the same time, we have our own relatively large and drawn-out technical projects, for example, refactoring activities, or development and implementation of new solutions.

In search of a solution

We realized that we needed to change the way things work, and came up with a new system where tasks are distributed randomly. Any team member could be assigned any section of the code, even if they’ve never worked on it previously. The main benefit we had hoped to derive from this new approach was knowledge exchange. The developer with the most expertise in a given area would manage the process and set deadlines, and in the end, every team member would become a jack of all trades.

After we implemented this new system, we realized that there was a strong correlation between how much any given developer knew about a particular code section and how quickly they could complete their tasks.

So, we started scoring tasks using a simple table: some were rated “Cakewalk,” while others got “This is fine.” However, this scoring system wasn’t helping us meet weekly deadlines with a reasonable number of tasks per employee.

Life-saving story points

We struggled for a while, and then considered introducing the story points system. We’ve all heard about how effective and helpful it is, but none of us had any idea of how to actually use it.

No sooner said than done. We set up sprints in Jira and installed apps on our phones. We even made cards for planning poker, but the main question remained unanswered.

How do we actually rate tasks? Well, let’s talk it over!

What are story points?

In his book Scrum: The Art of Doing Twice the Work in Half the Time, Jeff Sutherland wrote, “Now the job is to figure out just how much effort, time, and money the project will take. As I’ve already pointed out, we humans are absolutely terrible at this, but what we are good at, it turns out, is relative sizing — comparing one size to another.”

When we estimate with story points, we assign a point value to each task. The raw values we assign are unimportant. What matters are the relative values — a story that is assigned a two should be twice as large as a story that is assigned a one. It should also be two thirds of a story that is estimated as three story points.

This kind of evaluation is based on four factors:

Amount of work

This factor reflects the number of actions required to complete the task. Suppose you need to create one form field with a label element. This task is estimated at one story point (SP). Then there’s another task to create 50 fields. They are all exactly the same as the field in the first task, but that doesn’t mean the task will take 50 times longer — perhaps we can just write the code once and reuse it 49 times, as these elements are not interdependent and behave in predictable ways. So in the end, the second task doesn’t require 50 times more work than the first, but only perhaps two or three times as much (simply because of the number of fields).
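
As a tiny, hypothetical sketch of why the work doesn’t scale 50×: the field structure and names below are invented purely for illustration.

```python
def make_text_field(name: str) -> dict:
    """Describe one form field with a label element."""
    return {"name": name, "type": "text", "label": name.replace("_", " ").title()}

# 50 identical, independent fields are one definition plus a loop,
# not 50 times the original effort.
fields = [make_text_field(f"field_{i}") for i in range(1, 51)]
```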

Complexity

Now, let’s look at another task. This time, we need to create 50 fields that accept different input types (numbers only for some, text only for others). Additionally, some fields’ values may change the validation rules for other fields or trigger hiding other fields (for instance, SWIFT codes that vary depending on the country may follow this rule). To complete this task, you also need to test all possible combinations. It’s significantly more complex than our first example.

Risks

If implementing a feature involves changing old code with no automated tests in place, that poses a risk: the result can be unpredictable and may not match the original behavior. Complex mathematical calculations, tight coupling and numerous dependencies between components increase the risk of missing details during testing.

Uncertainty

If the task owner is unable to define clear implementation requirements, or there’s a risk that the requirements may change either during implementation or after the first results emerge, factor uncertainty into the estimate.

The story points system can be presented as the following formula:

Story points = f(Amount of work, Complexity, Risks, Uncertainty)
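
The formula is conceptual rather than literal, but as a rough illustration, here is a Python sketch: the weighting, the factor ranges and the scale values are all assumptions made up for this example, not the exact method described here.

```python
import bisect

FIBONACCI_SCALE = [1, 2, 3, 5, 8, 13, 21]

def story_points(amount: float, complexity: float, risk: float,
                 uncertainty: float, reference_size: float = 1.0) -> int:
    """Combine the four factors into one 'size', compare it to the smallest
    (1-SP) reference task, and snap the result up to the nearest scale value."""
    size = amount * (1 + complexity + risk + uncertainty)  # crude, invented weighting
    relative = size / reference_size                       # relative sizing, not hours
    idx = bisect.bisect_left(FIBONACCI_SCALE, relative)
    return FIBONACCI_SCALE[min(idx, len(FIBONACCI_SCALE) - 1)]

# Twice the work of the reference plus a bit of risk lands on 3 SP:
print(story_points(amount=2, complexity=0, risk=0.5, uncertainty=0))  # -> 3
```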

Now that we know which factors go into the story point assessment, let’s use this model to score a task:

  1. First, choose a reference task. It’s best to go with a task that requires the least effort.
  2. Next, compare other tasks to your reference. If your reference task is estimated at 1 SP, then another, slightly bigger one would probably get 2 SP, a moderately complex task would score 5 SP, and a total mammoth of a task would clock in at a whopping 25 SP.
    Remember that you don’t have to estimate the time required to complete the tasks. You just need to compare them to your reference.
  3. Here comes the first iteration (or first sprint).
    Think about which of the assessed tasks you can complete within one sprint. Then compare the sprint result to your forecast. Your team’s sprint velocity is the total SP value of all tasks completed within the sprint. Plan your next sprint with this number in mind.
  4. After a couple of iterations, you will get an idea of how many SP you can handle per sprint on average. This average is the team’s velocity (see the sketch after this list).
  5. When planning another sprint, consider not just your average velocity, but also how much the team can actually do right now. Sometimes sprints are shorter because of public holidays, or the team is short-handed because someone is ill or on vacation. This is why we need to estimate capacity — velocity adjusted for reality. It indicates how many SPs your team can do over the next sprint, considering the current conditions.
  6. This is a simplified way of explaining the approach, but it conveys the concept well enough.
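
Here is a minimal sketch of steps 3 and 4; the sprint numbers are hypothetical.

```python
from statistics import mean

# SP completed in each of the last few sprints (hypothetical numbers).
completed_sp = [18, 21, 17, 20]

# Per-sprint velocity is the total SP finished in that sprint; the average
# over several sprints is the velocity the team uses for planning.
velocity = mean(completed_sp)
print(f"Average velocity: {velocity:.1f} SP per sprint")  # 19.0 SP per sprint
```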

Scoring methods

Before scoring a task using the story points system, you have to choose a scoring method. Agile development offers several options. In this article, we’ll take a look at three of them — the Fibonacci sequence, t-shirt sizing and planning poker.

The Fibonacci scale is a popular method of SP assessment. The possible SP values are 1, 2, 3, 5, 8, 13, 21 and so on, with each number being the sum of the previous two. This scale reflects the fact that the effort required to complete tasks tends to increase exponentially rather than linearly.

T-shirt sizing is another popular agile scoring technique that uses common t-shirt sizes (XS, S, M, L and XL). It’s often used when the amount of effort required to complete a task is uncertain. T-shirt sizes are a great way to rate tasks quickly without going into too much detail.

Planning poker is a collaborative method of SP rating. Team members discuss their tasks and determine the amount of effort required to complete them. Each team member receives a deck of cards with values corresponding to different SP amounts and picks the card with the estimate they find most fitting for a particular task. After that, the team discusses the estimates and reaches a consensus.

From our experience, the scale or method used for scoring is less important than team coordination and scoring accuracy. Choose an instrument that works best for you. You can review the scoring process later and adjust the scale or method.

We initially picked the Fibonacci sequence. This is what our scale looked like:

It took us some time to get used to the new system, but we adapted after a couple of months, and our team members managed to broaden their knowledge of different parts of the product. What’s more, individual estimates started to line up.

Challenges

1. How to score tasks when opinions differ

We explored several options.

1. Using the highest score

This way, the complexity of tasks is often overestimated, and too few tasks fit in one sprint.

2. Using the lowest score

This approach resulted in underestimating tasks and ruining weekly plans.

3. Rounding up the average score

The result was similar to that of the highest score approach.

4. Rounding down the average score

The complexity of tasks was still underestimated but not as dramatically as in the lowest score approach.

5. Using the median score

This approach turned out to be perfect for our team.

We recommend trying as many approaches as possible, because the result depends on the team.
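
For comparison, here is a small sketch that applies all five approaches to one hypothetical planning-poker round; snapping the averaged results back onto your scale is left to the team.

```python
import math
from statistics import mean, median

# One planning-poker round: each value is one team member's estimate (hypothetical).
votes = [2, 3, 3, 5, 8]

strategies = {
    "highest score": max(votes),
    "lowest score": min(votes),
    "average, rounded up": math.ceil(mean(votes)),
    "average, rounded down": math.floor(mean(votes)),
    # With an even number of votes, the median may fall between two card values.
    "median score": median(votes),
}

for name, score in strategies.items():
    print(f"{name}: {score}")
# highest score: 8, lowest score: 2, average rounded up: 5,
# average rounded down: 4, median score: 3
```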

2. Team capacity and how many SPs are enough for one sprint

The best solution for this problem is patience. Just like choosing the right approach to scoring requires time, the same goes for determining the right amount of SP per sprint. After a few sprints, assess your team’s structure and collect the data. Keep a record of sick leaves, vacations and other absences, and calculate your team’s capacity under ideal conditions. Use these calculations when planning the next sprint and factor in the availability of team members.
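
Here is a rough sketch of that adjustment; every number below is a hypothetical assumption.

```python
# Ideal capacity: the team's average velocity with everyone available.
team_size = 5                  # developers
sprint_days = 5                # one-week sprint
ideal_velocity = 20            # average SP per sprint at full strength

# Next sprint: one developer is on vacation and there is one public holiday.
person_days_ideal = team_size * sprint_days                  # 25
person_days_actual = (team_size - 1) * (sprint_days - 1)     # 16

capacity = round(ideal_velocity * person_days_actual / person_days_ideal)
print(f"Plan roughly {capacity} SP for the next sprint")     # ~13 SP
```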

3. If 5–8 SP is complicated, 11 SP is unrealistic

We realized that a 5-SP task is already too challenging to be completed by one person in one sprint. As such, 8 SP is definitely not achievable within a week. As for 11 SP, we eventually phased out this score altogether because it turned out that even 8-SP tasks were not properly decomposed.

That is why we introduced two rules:

  1. If the task is rated at 5 SP and can be handled by several developers and completed within one sprint, it’s a feature (or a story, depending on which term you prefer).
  2. 8-SP tasks are definitely Epics. They are too big for one sprint and need to be broken down. We’ll talk about the task types in another post.

4. We took on a bunch of 1-SP tasks and were done with them halfway through the week

Some of our sprints comprised a lot of simple 1-SP tasks. These tasks seemed to fill up the capacity on paper but, in practice, they were too easy to last a full sprint. This was an a-ha moment. It turned out that some tasks require only 5–10 minutes to complete. They are too simple to be rated 1 SP, but must still be taken into account within sprints to keep track of weekly performance. Now, we rate these kinds of tasks as 0 SP.

Our solution

Here’s what our current scoring scale looks like:

Outcome

Scoring tasks with story points helped us mitigate the bus factor, correctly assess our sprint capacity and improve planning. The new system made our process transparent. Now we finally know how to answer the fundamental question of the development world: “When will this task be completed?” With story points, we can work closely together and give objective estimates of the effort required to complete tasks. Team members discuss their tasks and make scoring decisions based on group consensus. This approach has been critical to improving team communication and the overall work environment.

Our tech teams are now working on introducing quarterly planning. When you know your team’s velocity, it’s easy to predict how much work can be done over a quarter and factor in vacations, sick leaves, operational activities and other risks. Allocate enough time for task breakdown, discussion and planning to develop a quarterly roadmap that matches reality. This kind of approach will help you eliminate dependencies on other teams and plan everyday activities.
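
A back-of-the-envelope sketch of that quarterly math, again with invented numbers:

```python
# Quarterly forecast derived from a known sprint velocity (hypothetical values).
velocity = 20             # average SP per one-week sprint
sprints_per_quarter = 13  # roughly 13 weeks in a quarter

# Reserve part of the quarter for vacations, sick leaves, operational work,
# decomposition and planning itself; the share is a team-specific assumption.
reserve = 0.25

quarter_capacity = round(velocity * sprints_per_quarter * (1 - reserve))
print(f"About {quarter_capacity} SP of planned work fits into the quarter")  # 195 SP
```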

Alexander Popov, Senior Project Manager at Space307


We are Space307, an international full-service FinTech company. Our team includes more than 350 software development and marketing experts.