Software Engineering Leadership: Past and Present

What has changed about software engineering leadership since Royce published his famous paper?

Milo Todorovich
CodeX
16 min read · Jan 16, 2023

--

Software engineering has reached the prime of life. The term was originally coined in the 1960s to describe the interplay between hardware and software components in a system.

The first software systems were developed in tandem with their hardware. These projects could take 18 months or more. That is a stark contrast to today, when an entire offering can launch in less than a week and then iterate to serve an expanding user base.

A Summary of the Waterfall Paper

I first learned about waterfall during my first job as a consultant in the mid-’90s. In 1970, Dr. Winston Royce presented a paper, Managing the Development of Large Software Systems, sharing his thoughts about leading successful software projects based on his experience building aerospace systems. The diagrams in the paper looked like a cascading flow, a waterfall, from one step in the development process to the next.

Royce defines success as when the system is in “an operational state, on time, and within costs.”

Figure 10 from Managing the Development of Large Software Systems

The context for Royce’s paper comes from nine years of developing software packages for mission planning, commanding, and post-flight analysis.

No matter the size or complexity of the software system you have in mind, there are two minimum steps required to develop software: Analysis and Coding. These two steps suffice when the same person builds and operates the software. A more extensive system is “doomed to failure” if built using just these two steps.

For larger projects to succeed, the required steps are:
1) System Requirements
2) Software Requirements
3) Analysis
4) Program Design
5) Coding
6) Testing
7) Operations

Customers do not see value in steps other than Coding. However, all of the steps in the process are critical to a successful outcome. These steps form the backbone of today’s software development life cycle. The names have changed, but the spirit of each step remains the same.

There is an iterative interaction between the steps. For example, testing can influence Program Design. Likewise, Program Design can influence Software Requirements. These feedback loops could lead to a complete redesign of the entire system and a 100% over-run in time and budget.

To prevent a redesign, Royce introduces an additional step, Preliminary Program Design, before Analysis. The program designer then works with the analysts to surface the consequences of design choices early. The program designer, an experienced engineer, writes an overview document providing “tangible evidence of completion.” This step forces designers to develop a deep understanding of the system, and the document demands that they take a position on areas of the design.

According to Royce, management’s main job is to sell the ideas of a more thorough development model to customers and developers.

Royce’s experience showed that the simple two-step method does not work for larger projects. Furthermore, he contends that the cost of the added steps he proposes is less than the cost of fixing the system later during a redesign.

In Royce’s time, getting to operations was the end of the software process. The only remaining task was to sunset a project after it had served its purpose. Today’s product teams take a different approach, iteratively and incrementally building their product while running their own operations of the system.

How Do Leaders Balance System Reliability and New Features?

You can balance system reliability and new feature development! By defining SLIs and SLOs and tracking your error budget, you can determine how to strike that balance.

Using SLOs to track error budgets is a process, not a destination. You cannot create a project to fix your reliability and then move on. The reliability work will never be “done.” System reliability is an ongoing journey. The world will constantly change, and so will the expectations around your services.

Look at software reliability through your users’ eyes. Keep your humans front and center, track the essential properties of your system, and keep iterating.

How do you measure reliability?

For each service that you run, there are many metrics that you could use to understand reliability. We call each metric a Service Level Indicator, or SLI, an indicator of some property of the service. Reliability combines many factors, including availability, responsiveness, dependability, and quality.

For a web application, we may measure the number of requests made, the response times, return codes, and simultaneous requests.
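
As a rough sketch (the sample data and the thresholds are my assumptions, not from any particular system), a request-based availability SLI boils down to counting good events over total events:

# Sketch: a request-based availability SLI (sample data and thresholds are assumed).
requests = [
  { status: 200, latency_ms: 120 },
  { status: 200, latency_ms: 640 },
  { status: 503, latency_ms: 85 },
  { status: 200, latency_ms: 95 },
]
good = requests.count { |r| r[:status] < 500 && r[:latency_ms] < 300 }
availability_sli = good.to_f / requests.size
puts format("%.1f%% of requests were good", availability_sli * 100)

In production you would compute this from your monitoring system over a rolling window rather than an in-memory array.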

But how do we know how we’re doing? To begin, we have to accept that nothing is perfect, especially in a software system. There will be slow responses, wrong return codes, dropped network connections, and heavy traffic. Things will break.

How do you know if you’re successful?

A Service Level Objective, or SLO, indicates the proper level of reliability for a particular metric within a service. Setting your SLOs goes beyond what I plan to discuss today. However, an SLO is a target that you, your team, and your management can develop and agree to manage.

Once you have your SLOs defined, you can measure whether you are doing better or worse than expected.

You can balance system reliability and new feature development!

An error budget lets you track how the service performs against the SLOs over time. You will have a surplus budget if the service is regularly performing better than your objectives. Conversely, if the system’s metrics worsen, you will be in a budget deficit.
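
As a minimal sketch with made-up numbers (the 99.9% target and the request counts are assumptions, not recommendations), an error budget is simply the allowance of bad events implied by the SLO:

# Sketch: error budget against an assumed 99.9% availability SLO over a 30-day window.
slo_target     = 0.999
total_requests = 2_000_000
bad_requests   = 1_400
allowed_bad    = total_requests * (1 - slo_target)  # roughly 2,000 bad requests allowed
budget_used    = bad_requests / allowed_bad         # fraction of the budget consumed
if budget_used < 1.0
  puts format("Surplus: %.0f%% of the error budget used", budget_used * 100)
else
  puts format("Deficit: %.0f%% of the error budget used", budget_used * 100)
end

Choosing the target itself is the conversation you have with your team and your management.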

Determining the right SLO for your service is an exercise for the reader.

What should you work on?

Now that you’ve defined SLIs and chosen your SLOs, you can use them to determine what to work on next.

When you are in a budget surplus, your service is performing better than expected. A surplus makes introducing new features into the system safe, even desirable.

If you have a budget deficit, your service is not performing as well as expected. Therefore, you should slow down with new features and spend some time improving the reliability of your services.

By defining SLIs and SLOs and tracking your error budget, you can determine how to balance reliability and new features.

It begins and ends with users.

At its core, you must look at reliability through your users’ eyes.

When you are operating software, the proper level of reliability is your most critical operational requirement. Reliability combines many factors, including availability, responsiveness, dependability, and quality. However, before we dig into how we can understand reliability, a few definitions will help.

To begin, we have to accept that nothing is perfect, especially in a software system.

Users are anyone or anything that relies on your service. In the same fashion, a service is anything that has a user. Finally, a system is a set of services working together.

This definition may seem circular, so let’s expand with a simple example.

Imagine a dynamic website used to publish online articles. A writer would come to their browser, type in the address, and see a web application where they could publish articles. The web application, in turn, would persist their writing so the writer could come back and edit their work.

The writer would be the user in this case, and the web application would be the service. Similarly, the web application could be the user with a database server acting like the service. Finally, we could go even deeper, where the database server uses an infrastructure service, such as EC2.

It’s users and services — all the way up and down.
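
To make the layering concrete, here is a toy sketch of the chain just described (the names are purely illustrative):

# Toy sketch of the user/service chain described above.
chain = [
  { user: "writer",          service: "web application" },
  { user: "web application", service: "database server" },
  { user: "database server", service: "EC2 infrastructure" },
]
chain.each { |link| puts "#{link[:user]} relies on #{link[:service]}" }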

How Do Technical Leaders Balance Clean Design and Time To Market?

I was stunned by the comment line I saw in the code.

# I learned this from my CTO :-)
if user_id == 1234
# do some user-specific processing
end

This code was an example of the culture we were building — to ship quickly and constantly clean up after ourselves. A rush of good cheer came to me.

In this case, I was the CTO.

Stop thinking about your solutions as exclusively this option or that alternative. It’s impossible to say there is a best design. It always comes down to the best design for a context. There are always trade-offs involved. You will make different choices in different scenarios.

What’s wrong with this code?

My colleague and I would claim that this code works and solves the problem at hand. So what’s wrong?

There are many sins in programming. For example, global variables, poor naming, and leaky abstractions make code an unmaintainable mess. In this case, the sin is hardcoding — using a specific value rather than an abstraction. Consequently, this code block will run only for the user identified as 1234.

This code solves the problem for user 1234. But, unfortunately, it does not solve that problem for other users, who might be in the same predicament.

What is the right design?

There isn’t a single correct solution. There are, however, solutions that are easier to read, easier to maintain, and easier to change over time.

A more flexible, maintainable design would put the specific values somewhere outside the code.

We could refactor this design to pull these values out over a few different steps:

  • Check against a list of values rather than a single value.
  • Pull the values out into a configuration file.
  • Pull the values out into an environment variable.
  • Retrieve the values from a database.
  • Encapsulate the logic into the domain model, avoiding id values altogether.

Each solution leads to more flexibility in the code and opens the functionality to more users and more situations.
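
As a rough sketch of the environment-variable option (the variable name is invented for illustration), the hardcoded check might become:

# Sketch: the hardcoded id check, moved to configuration (names are hypothetical).
user_id = 1234  # would come from the request in real code
affected_user_ids = ENV.fetch("AFFECTED_USER_IDS", "").split(",").map(&:to_i)
if affected_user_ids.include?(user_id)
  # do some user-specific processing
end

Further down the list, the id check disappears entirely behind a domain-model method, something like user.needs_special_handling? (again, a hypothetical name).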

Why is a hardcode the right thing to do in this context?

We were solving a production issue. After triaging the problem, finding the affected code, and creating a solution, we could roll it out with the hardcode within 30 minutes. The more robust, maintainable solutions would have taken several more hours of development and testing.

We stopped the bleeding by fixing the production issue. We took on technical debt using a hardcode to make the fix quickly.

We made an engineering trade-off between time to market and code quality.

How can you have both?

As soon as we deployed the fix, we went back to the code to pay down the debt. We didn’t let the debt linger, accumulating interest, fees, and penalties. We did not allow the debt to slow us down.

As I get older, I realize that upgrading my thinking from OR towards AND makes sense. My head hurts trying to hold two conflicting ideas at once. Yet, embracing the tension leads to creative opportunities.

Stop thinking about a solution as exclusively this or that. Instead, start looking for ways to incorporate this and that.

In this case, we rolled out a solution quickly and followed it with a robust implementation.

While system reliability and bugs make up part of the engineering workload, the work doesn’t end there. The entire company is counting on the team to add new features that will provide value to customers.

How does today’s engineering leader think about feature work?

Estimates are always wrong. What should leaders do instead?

The model of adaptive project management works. As reality progresses and we learn how long things DO take, we use that learning and adjust the plan. Or rather, we change THE DATE.

After all, everyone concerns themselves with THE DATE. Scope and quality tend to be secondary until they are not. On January 1, you say you’ll ship/deliver on March 31. And then, on March 31, you DO ship and deliver. Without the one feature needed by one executive’s favorite client.

Or you ship on March 31, and the software crashes 1 out of every 20 sessions. 95%, that’s an A, right?

Sounds dire. So how can estimating work?

1. Catalog all the features going into the release.
2. Break these features down into stories.
3. Compare the relative effort of the stories (a guess), and assign “point” values to them. For example, I like a Fibonacci scale, 1–8 or 1–13. If a story is bigger than that, break it down more.
4. Start doing the work.
5. Each week, on a big chart, show:
a. The number of “points” left going into that week.
b. The number of “points” completed during that week.
6. Use that to project THE DATE based on the amount of work left.
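
A minimal sketch of step 6, with invented numbers: the trailing completion rate projects THE DATE from the points remaining.

# Sketch: projecting THE DATE from remaining points and recent velocity (made-up numbers).
require "date"
points_remaining = 120
recent_weeks     = [18, 22, 17]  # "points" completed in each of the last few weeks
velocity         = recent_weeks.sum / recent_weeks.size.to_f
weeks_left       = (points_remaining / velocity).ceil
the_date         = Date.today + (weeks_left * 7)
puts "At ~#{velocity.round} points/week, THE DATE lands around #{the_date}"

As the completion rate shifts week to week, the projection shifts with it, which is the whole point.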

Bring your executive team in on a big secret: estimates are worthless. The planning, identifying the work and breaking it down into stories, holds value.

Estimates for something novel are worthless.

Estimates for something that you’ve done a dozen times can get pretty good. A team that does websites for dentists and has completed 25 such projects over the last few years can get pretty good at estimates. Change the people, and the estimates go back to being guesses. Change the work for that team, and the forecasts become guesses.

Assign relative, “point”-based estimates to work. Track the completion rate for this team in this current situation. Use the trending completion rate to forecast the date.

Knowing where you’re going helps inform the estimate. Keeping quality high and bugs and surprises low helps keep the forecast steady. Choosing less than all the features can bring the date forward. Here is one recipe for keeping quality high and surprises low:

1. Design with ports and adaptors, putting your heavyweight dependencies around the edges.
2. Put all of the logic in an in-memory core.
3. Unit test the core. Check your definition of unit. It should map to a use case or a slice of a story.
4. Check the external dependencies with integrated tests.
5. Acceptance tests define when a story is done.
6. Run the acceptance tests in both slow and fast fashion.
7. Using the same tests, plug in and mock certain layers and interfaces.
8. Red-green-refactor. All the way down.
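
A hedged sketch of points 1 through 3, with invented names: the core use case depends only on a port (here, anything that responds to save), and the unit test plugs in an in-memory adaptor instead of a real database.

# Sketch of points 1-3 (names are invented): core logic behind a port,
# with an in-memory adaptor standing in for the database during unit tests.
class PublishArticle
  def initialize(article_store)  # the port: anything that responds to #save
    @store = article_store
  end

  def call(title, body)
    raise ArgumentError, "title required" if title.to_s.strip.empty?
    @store.save(title: title, body: body)
  end
end

class InMemoryArticleStore       # fast adaptor for testing the in-memory core
  attr_reader :articles

  def initialize
    @articles = []
  end

  def save(article)
    @articles << article
  end
end

store = InMemoryArticleStore.new
PublishArticle.new(store).call("Waterfall, revisited", "Draft body")
puts store.articles.size         # => 1

Acceptance tests then exercise the same use case with the real adaptors plugged in, which is the slow-and-fast split in point 6.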

Learning about estimates from construction

I recently had a contractor come to my house and provide an estimate for the roof and siding. Part of the estimate, and part of the discussion, was that if he uncovered some extra work, we would adjust the forecast. As a result, we would move the timeline, and I would pay for the changes.

Even though this crew has done hundreds of roof and siding projects, his initial estimate was based on the size of the house and what we knew at the time. If we learned more along the way, the forecast would change.

I didn’t ask him to work late or on a Saturday. They didn’t put in 80-hour weeks. In fact, they arrived each day at 8 am, did their work, cleaned up, and left by 4:15 pm. This approach makes sense. They did not want to get so tired that mistakes crept into the project. They also didn’t want those mistakes to cause them personal harm.

I didn’t ask him to skimp on a piece of flashing. I didn’t ask him to spread out the shingles. I knew that doing that would lead to a leaky roof and more extensive problems down the road. Yet, we ask our engineering teams to cut corners, design faster, refactor less, and skip tests, so that we can go faster.

Why do we expect differently from our engineering teams? We don’t believe in the cost of mistakes. We don’t think that a non-optimal design will cost us over time. Yet how often have you carried “technical debt” for the next three years, into the following fifty features, before the team finally threw their hands up in surrender and yelled REWRITE?

The roof crew did learn along the way. The contractor called me several times over two days to tell me what they had learned. We walked through the different options to address the issue. Finally, he asked if I was OK with the change, including a difference in cost.

This is normal and expected in a construction project. So why do we expect differently in our IT projects?

Software projects are like other projects.

Software projects are just like any other project. We start with an informed estimate of the work involved. The estimate informs whether we want to proceed with the project. Then we start the work. As the work gets done, we learn more about it, and the estimate gets sharper. Some of the work changes. We bring that in, update the plan, and update the forecast. This same process applies to any project, whether construction or software.

My roofing project was a two-day affair, and the changes we discovered amounted to about an hour of extra work, which the team absorbed during the second day. So the time change was negligible, and the cost change was a couple of percent. Had the crew discovered something really big, that might have added a day to the project and another 25% to the cost.

Software projects are much more significant. A “major release” could take a quarter or more to complete. Let’s imagine the initial estimate is three months, one quarter. It might take four months to complete the project if we learn things along the way and adjust our plan. This puts us at 33% over! Is that acceptable? If we were to learn A LOT, our 3-month project might take six or even eight months to complete. That’s 100% to 167% over. Is that acceptable?

Bring your executive team in on a big secret: estimates are worthless.

I think it depends on the environment and the standards you have. At the bank, we generally shipped bug-free software every six weeks on a pretty regular cadence. We had one project where we asked to delay by one week. One project in four years! Those were the standards there.

Who Should Lead Engineering Squads?

Engineers concern themselves with the ill-named non-functional requirements: all the other concerns that fall to Engineering, beyond the features in the software.

Performance. Security. Reliability. Monitoring. Capacity. Scalability. Testing. Maintainability. Readability. Operability.

These critical requirements get overlooked when the team leader comes from a non-technical domain.

Ignoring technical requirements leads to lots of incidents, support requests, and bug reports. That becomes a codebase that no engineer looks forward to working with, which becomes a company that engineers start leaving, which results in a loss of accumulated knowledge.

Test the Extremes

Imagine a minimal squad composed of a Product Manager, a Designer, and an Engineer. Let’s remove the roles one by one and see what happens to the squad.

When you remove the Product Manager, you lose out on some of the insight that they provide. The team will stop hearing about customer interviews. Engineers may need to step up and help interpret analytics. Things may go a bit slower, but the work can proceed.

When you remove the Designer, the visual design comes to a halt. The Engineer and Product Manager could collaborate on the user experience. Developers can borrow interactions from other applications. Frameworks such as Bootstrap, Material, and Tailwind allow an Engineer to produce reasonable designs. Again, things may be rough, but the work can proceed.

When you remove the Engineer, development grinds to a halt. New features cannot be developed. Code cannot be refactored. Bugs cannot be fixed. Your only option is to bring in another engineer.

The Engineer is the linchpin to a software team.

It naturally follows that someone from Engineering should be leading the squad.

What Product Is the Engineering Leader Responsible for Building?

I had just joined a new team and asked my new teammates about our architecture. The first question that I needed to answer: What were we building?

Some went into detail on the business split, between B2B and B2C, with a smidge of B2B2C thrown in. Others went into detail about how we used React on the front end. Others discussed Java, and some Go, on the backend. Finally, others started to rattle off a slew of AWS technologies, including lambda functions, fixing slow cold starts, S3 buckets, serverless databases, and Cognito users.

What was I supposed to build?

All of them were right, and none of them were correct.

Thinking through these answers, and putting them together, did help my mental model of how things worked. At the same time, I still knew little about the specifics of the line of business.

I took another attempt, asking about the use cases involved. Here the answers were as varied as the answers about the architecture. I was given access to JIRA boards, lists of API calls, lists of lambda functions, as well as some blank stares. Finally, one person corrected me, saying we don’t use use cases but instead manage our work through stories.

I was running into semantics, vocabulary, and culture. All of these are important to how a team functions. Unfortunately, none of these are apparent in a fast-moving start-up environment.

The team had grown quickly in the last 3 or 4 months. Everyone brought with them their own ideas and their own way of working. Those ideas came from their experience and their past environments. I was doing the same thing, though probably less aware of it at the time.

As each new team member came in, they didn’t go through the same onboarding process. In fact, there was no onboarding at all. Instead, they were added to the team roster and given tasks from JIRA.

I spoke with everyone that I could over the first few days. I spoke with long-time team members (those with more than three months of experience) and newer team members. I asked where they came from, how things stood, and how they thought things should be.

As I formulated my mental model of how things worked, I started to pressure test my ideas. I would explain how I thought the system worked and how we got to this point, and then throw in some ideas of where I felt we needed to go.

I put together my onboarding schedule, learning about the business, the customers, and the technology supporting it. I also put together my cultural agenda — thoughts on how we should be working and how we should be growing.

Rands says, “the process is your product.” The process is how he describes the work of a technical manager. I agree, and I think there is more as well.

The product itself is also my product — the GM hired me. The people (and their growth) are my product — I can tell from their excitement in my conversations with them. Finally, my personal growth is my product — if you’re not learning, you’re dead. (Someone must have said this).

The technical leader shapes the choices and explains the options.

As Engineering leaders, the culture is our product. We set the tone for how we work and how we build things, and we explain why we make one choice over another. That means educating my partners on the business side about the choices we have for delivering software and why we choose one way over another.

Software Engineering Leadership Today

According to Royce, management’s main job is to sell the ideas of a more thorough development model to customers and developers.

The gist of Royce’s message has not changed over the years. Software Engineers need to understand requirements, analyze the steps needed, code and test their product, and put it into operations. The hardware has changed and the timelines have gotten shorter. The industry has moved to deploying most systems on cloud infrastructure, built on commodity hardware.

Modern software teams have adopted a DevOps mindset: you build it, you run it. Teams are organized as cross-functional units that include Product Managers and Designers. Software is built iteratively, tested continuously, and deployed frequently.

Today’s engineering leaders set the tone for how we work, how we build things, and why we make one choice over another.

In this regard, not much has changed about software engineering leadership since Royce published his famous paper.

👏🏻 Give me a clap and “follow” if you enjoyed this article.

📋 About Milo

I am a tech executive, writer, speaker, entrepreneur, and inventor. I’ve been developing software since 1995 and developing teams for the past decade. 🚀

I write articles about software, engineering, management, and leadership.

You can also follow me on Twitter. 🐦
