Lessons on Delivering a Large Project

Jeremy Song
Published in Stochastic Stories
4 min read, Oct 5, 2023

I recently led teams in my org to complete a 2-year project that re-architected a big data processing platform. I gave an internal talk about some lessons we learned in this project. I am sharing those lessons here as I think they might benefit engineers outside my company too.

Incremental Delivery

One of the things we insisted on from the very beginning of the project was incremental delivery: we wanted to deliver small features frequently. Looking back, this was the right decision, first of all because incremental delivery improves morale. We engineers are builders, and we like to see our work make a clear impact as soon as possible. Nothing is more rewarding than seeing a piece of the project completed and bringing us closer to the finish line.

Incremental delivery also gives leadership the confidence that we are making meaningful progress toward our goal. In this project, we used a single, easy-to-understand metric to track progress, and we presented this metric in quarterly business reviews with leadership. I rarely got asked when we would fully launch the project because leadership was able to see the number moving in the right direction every quarter.

Lastly, incremental delivery allows us to benefit from the new system sooner. In our case, the workload has more than doubled since last year. Thanks to incremental delivery, we were able to shift most of that workload to the new system. Without incremental delivery, the legacy system would have had to take all of it. We would have had to either scale up the old system even further, although we knew it was approaching its scaling limit, or throttle our biggest customers.

Software Correctness

In this project, we spent almost 30% of the time just on ensuring software correctness. Due to the nature of the project, a subtle bug in the code generation would have been very difficult to detect and could have had a huge customer impact.

We know that ensuring software correctness and eliminating bugs is difficult, but we made a lot of progress on that front. We used lightweight formal methods and shadow testing. We borrowed the lightweight formal methods from this S3 paper. We also developed a shadow testing framework so that every data processing job is executed on both the old and the new platform and the results are compared. As a result, we have rarely had bugs in production since the launch.
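To make the shadow testing idea concrete, here is a minimal sketch in Python. It is not our actual framework: the names `shadow_run`, `old_platform`, and `new_platform` are hypothetical stand-ins, and a real system would run jobs asynchronously and compare much larger outputs. The core idea is simply to run the same input through both platforms and record any mismatch.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ShadowResult:
    job_id: str
    matched: bool
    old_output: Any
    new_output: Any


def shadow_run(job_id: str, job_input: Any,
               run_old: Callable[[Any], Any],
               run_new: Callable[[Any], Any]) -> ShadowResult:
    """Run one job on both platforms and compare the outputs."""
    # The old platform stays the source of truth for the customer.
    old_output = run_old(job_input)
    try:
        new_output = run_new(job_input)
    except Exception as exc:
        # A crash on the new platform counts as a mismatch, not an outage.
        return ShadowResult(job_id, False, old_output, exc)
    return ShadowResult(job_id, old_output == new_output, old_output, new_output)


# Hypothetical stand-ins for the two platforms (assumptions for illustration).
def old_platform(records):
    return sorted(r * 2 for r in records)


def new_platform(records):
    return sorted(r + r for r in records)


jobs = {"job-1": [3, 1, 2], "job-2": []}
results = [shadow_run(job_id, job_input, old_platform, new_platform)
           for job_id, job_input in jobs.items()]
mismatches = [r.job_id for r in results if not r.matched]
```

Any job ID that ends up in `mismatches` is a candidate bug in the new platform, found before a customer ever depends on its output.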

The lesson here is that traditional unit tests are not enough. Property-based testing and lightweight formal methods are the new unit tests, and the industry is already moving in that direction. The Rust community, for example, has developed several property-based testing and formal verification libraries and tools. Similar property-based testing libraries also exist for popular languages such as Java, Python, and Scala. I strongly encourage everyone to start using them.
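For readers who have not seen property-based testing before, here is a small hand-rolled sketch in Python using only the standard library. Instead of asserting on a few hand-picked inputs, it generates many random operation sequences and checks that an implementation agrees with a much simpler reference model. The `CompactCounter` class and the helper names are hypothetical examples, not code from our project. In practice you would likely use a dedicated library (Hypothesis in Python, for instance), which adds features such as shrinking failing inputs to a minimal counterexample.

```python
import random


class CompactCounter:
    """Hypothetical implementation under test: per-key totals, dropping zero entries."""

    def __init__(self):
        self._counts = {}

    def add(self, key, delta):
        total = self._counts.get(key, 0) + delta
        if total == 0:
            self._counts.pop(key, None)  # keep the map compact
        else:
            self._counts[key] = total

    def get(self, key):
        return self._counts.get(key, 0)


def reference_get(ops, key):
    """Reference model: obviously correct, deliberately simple."""
    return sum(delta for k, delta in ops if k == key)


def random_ops(rng, max_len=20):
    """Generate a random sequence of (key, delta) operations."""
    length = rng.randint(0, max_len)
    return [(rng.choice("abc"), rng.randint(-3, 3)) for _ in range(length)]


def check_matches_reference(trials=500, seed=0):
    """Return a failing operation sequence, or None if every trial agrees."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(trials):
        ops = random_ops(rng)
        counter = CompactCounter()
        for key, delta in ops:
            counter.add(key, delta)
        for key in "abc":
            if counter.get(key) != reference_get(ops, key):
                return ops
    return None


counterexample = check_matches_reference()
```

If `counterexample` is not `None`, it holds the exact sequence of operations that made the implementation diverge from the reference model, which is usually far more informative than a single failed unit test.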

On re-architecting

In this project, the initial goal was just to move the workload from the old platform to the new one. Typically, we would have chosen a lift-and-shift approach. But we asked two important questions: 1) what do customers really care about? 2) what are the use cases that have not been served well?

So we worked backwards from customers. We talked to our customers to get answers to those questions and we used those answers as our North Star to build the new system, which ended up drastically different from the old one.

Innovation should be every software development team’s core value. Don’t be afraid to spend time innovating. You need to work backwards from customers to understand the use cases. You also need to read papers, watch talks, and learn from other people to get inspiration for novel solutions. Always remember that we owe it to our customers to innovate on their behalf, and to do so at pace.

People/team

When people talk about building systems, they typically talk about software and hardware. But re-architecting is not just about changing software and hardware. It’s also about re-architecting the team. In many software companies, the organizational structure mirrors the system architecture. In this talk, VP/Distinguished Engineer Andy Warfield from AWS talked about how the Amazon S3 org structure closely mirrors S3’s system architecture. When the system evolves and discrepancies between the two appear, the org will spend a lot of effort to fix them. That’s just the nature of software.

In this project, we did two things to make sure our org structure mirrored the system architecture. First, we created a virtual team so that we could move extremely fast. A virtual team is a team within a two-pizza team, but we had our own standup process and our own way to track project progress. A small team allowed us to make two-way door decisions very fast.

Second, we realized that there should be a dedicated team to take ownership of one important component in the architecture. So we engaged with our leadership and did a lot of work to make sure the right team was in place to own it. We then helped that team create a roadmap and develop that component.

The lesson here is that, for large projects, if the org structure does not map to the system architecture, it will impact delivery speed. So work with your leadership to fix that.

I am currently a Principal Software Development Engineer at Amazon. All opinions are my own.