Leadership Diary — A Quarter of Culture Change

A series of blogs on learning and practicing leadership.

This is a follow-up to my previous learnings. Since last December, I have changed how the team operates. Instead of having a dozen OKRs, I directed the team to focus strategically on only two goals for the first half of this year. I also set up two feedback mechanisms for the team: (1) a monthly reflection (following Netflix’s start/stop/continue format), and (2) a disciplined monthly team update (what is DONE DONE DONE; I let the team know their performance would be evaluated based on these updates). With these two changes, I am able to view the team through a different lens.

Tons of hidden issues we didn’t even know about last year started surfacing, and the team is able to iterate fast to resolve them. Looking back, not focusing on fewer things caused two problems.

  • The first is about deliverables, which I discussed before: without focus, you can’t actually deliver any meaningful product (a product that is not 100% DONE is not meaningful).
  • The other problem is that you don’t even know there is a problem. You can’t have a good pulse on the team’s performance if there is no focus, and you have no chance to improve if there is no proper mechanism for collecting feedback.

In the past, we came up with half a dozen to a dozen OKRs at the beginning of each quarter, and the more OKRs we had, the prouder we were. Naturally, there were long-term strategic tasks and short-term tactical punch-list tasks. We did a very good job fixing the punch-list items; however, we missed a lot on the longer-term ones. What is funny is that the smaller punch-list items normally got a lot of applause from customers, and we celebrated those. The longer-term ones were harder to accomplish, and we didn’t really pay enough attention to them, not even a progress check: it is longer term, we will get there, one quarter of delay is OK.

Looking back now, we found that the majority of our team’s time was spent supporting customers in a very superficial way. We were tuning their jobs to make them work without understanding how things work under the hood, like IT support.

We had a theme to optimize the I/O performance of TensorFlow jobs, and, surprisingly, we had celebrated 3x improvements a few times. However, many customers were still complaining that their jobs spent 90% of their time doing I/O.

In Dec 2020, I started the changes with the TF sub-team. We set a clear goal for the squad of 4 people: dedicate themselves to improving I/O, with monthly reflection sessions.

After the first month (Dec), in the reflection session, we realized we had delivered almost nothing:

  1. We didn’t even have an agreed-upon KPI. Team members went back and forth on the KPI definition for a month.
  2. We still didn’t have a good way to measure our KPI, even though we had started with the goal of building a benchmark to evaluate training performance.
  3. Our partner team had no idea what we were doing.
  4. Some team members complained that people were sometimes working on the same problems and duplicating work.

After talking to the team, I realized the project lead was burning out and had become a bottleneck for the squad: (1) I thought the person who led the project had already mastered the I/O path of the framework; however, he was not there yet. He still needed time to learn the code. (2) Because I assumed the project lead was technically ready, I stretched him to take on more project management responsibilities and serve as the point of contact (POC) for communicating with other teams. That past month really burnt him out.

The result was that he could neither code well nor manage the project well, and everyone was frustrated. We decided to hand the project lead role to a more senior member of the squad.

In February, we regrouped and reflected on January. January was a shorter month, with 3 weeks of effective work. We found everyone was happier after we changed the project lead. Our partners also got more frequent updates, even though they were still a bit concerned about the progress. The person who led the project in Dec was also happier, since he finally got more breathing room to really look into the problems and the code. We identified a few other tactical problems during the reflection and created action items for them.

At the beginning of March, we had another reflection on February. The reflection on Feb exposed many more fundamental issues in the team. The original plan was to deliver parity of the I/O work (about a 3x improvement compared to the baseline) by mid-March, but by the end of Feb, we found ourselves in a quite desperate state.

We had nothing close to being shipped. We had barely finished a proof of concept to improve I/O performance, and it would take at least another 2 months to finish the project.

We still didn’t have a reliable way to measure our key metrics. The project to create a performance benchmark was poorly done. It was a “manual” benchmark suite: we had to manually log in to 10 boxes to run it, and the results coming out of the benchmark were not self-consistent.
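
To make “self-consistent” concrete, here is a minimal sketch of the kind of check a benchmark suite could run on its own output. It is only an illustration, not what our team built: the function names, the host names, the 5% threshold, and the numbers are all made up. The idea is to repeat the same benchmark several times per box and flag any box whose results vary too much relative to their mean.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples):
    """Return stdev / mean for a list of repeated measurements."""
    if len(samples) < 2:
        raise ValueError("need at least two runs to judge consistency")
    avg = mean(samples)
    if avg == 0:
        raise ValueError("mean is zero; coefficient of variation is undefined")
    return stdev(samples) / avg


def inconsistent_hosts(runs_by_host, max_cv=0.05):
    """Return hosts whose repeated runs vary by more than max_cv.

    max_cv is a hypothetical threshold; pick one that suits the workload.
    """
    return sorted(
        host
        for host, samples in runs_by_host.items()
        if coefficient_of_variation(samples) > max_cv
    )


# Made-up throughput numbers (e.g. samples/sec per run), for illustration only.
example_runs = {
    "box-01": [100.0, 101.0, 99.0],
    "box-02": [100.0, 60.0, 140.0],
}
print(inconsistent_hosts(example_runs))
```

With these made-up numbers, only the second box is flagged, since its runs swing far more than 5% around their mean. A check like this does not explain why the results differ, but it tells you early that the numbers can’t be trusted yet.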

During the reflection, we found four fundamental problems in our team:

  1. The team in general still lacks expertise.
  2. We don’t have the guts to solve unknown problems; we call hard problems mysteries.
  3. We sacrifice quality to chase deadlines.
  4. We have a goal, but team members have no clear direction or priorities.

These problems really surprised me and freaked me out. I had a few tough discussions with the team on these issues (I didn’t handle all the conversations well, which I will discuss further in the next episode, on leading with compassion). The good news is that, even though the feedback was direct and tough, the team took it really well.

For (1), we started a weekly deep-dive session and asked a teammate who has a better understanding to give deep dives on TF internals. I also asked everyone to give deep dives whenever they tackled challenging problems. The deep-dive sessions should cover the topic from the high-level design of the component down to the source code.

For (2), we added a principle to our repertoire:

No mystery, no magic. We are diehard infra engineers, not framework shoppers. There is no magic/mystery in front of us, just source code. We have the confidence to solve all the infra challenges bounded by physics.

For (3), we added another principle:

Quality > deadline. We never sacrifice quality to meet a timeline. If, during execution, we face a dilemma between building a crappy product that meets the deadline and building a quality product that misses it, we always choose to make the product great. In the meantime, we communicate the delay promptly, learn from it, and work hard to catch up.

For (4), we became more disciplined in managing JIRA tickets and created deliverable milestones (as epics). Before starting a milestone, we make sure everyone has clarity and is convinced we are doing the right thing.

Now it is May, and we have had another reflection (we skipped the one for Apr because of a company break and a hackathon week). We observed a clear cultural change and an improvement in productivity.

  1. The team feels more confident solving unknowns. Instead of calling a problem magic, team members now dive into the source code to understand the problem and actively contribute to open source.
  2. Team members have more clarity on what they are working on and what others are working on, and we resolve problems faster.
  3. We are delivering our results, milestone by milestone, and we are on track to deliver on our commitments.

Setting up the right principles for the team is essential. There is no fixed set of principles/values that fits every team or company, but there seems to be a universally applicable path for building the foundation on which to gradually add principles.

First, candid feedback. Setting up a culture of consistently reflecting and providing candid feedback might be the most critical building block of a team’s culture. A team can only improve when everyone on the team keeps improving themselves and helping each other grow.

Second, focus. Get fewer things done better; tactical perseverance won’t save strategic laziness. As the lead of the team, you have to spend time figuring out the mission and the 2 strategic goals your team should focus on now.

After these two, you and the team will have the right mechanisms to scale. Then it is just a matter of time to iterate and build the most suitable set of principles to continuously course-correct and grow.