Since leaving SocialChorus I have been doing an odd combination of management consulting and on-the-ground software design consulting. I have been doing what I am good at: fixing organizations.
Almost always a desperate request comes to me focused on code. There are no tests, or the tests are unreliable. When the code is changed, unexpected bugs appear far from the source. No one feels safe making changes, big or small. The file, class and method sizes are incredibly large, and the coupled dependencies are unknown. The team is stuck and someone in the management chain has asked for help.
The way that I work on these problems is very hands-on. I pair program with people on the team. We start with what they are working on, but I am answering some bigger questions:
- What does it take to set up the application(s) on a dev box from scratch?
- How do developers know what they should be working on?
- What do the stories, bug reports or cards look like?
- How do engineers get through a task?
There is a tactical recipe that is easy to teach:
- Put a section of code under test you can count on.
- Extract that section into one or more classes.
- Write unit tests for the new, small, easy to understand classes.
- Rinse and repeat.
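Here is a minimal sketch of that recipe in Ruby, since the applications in the stories that follow are Ruby apps. Everything in it is hypothetical: the `User` and `DisplayName` classes, the naming rules, and the tiny `check` helper are invented for illustration and stand in for whatever the team's real code and test framework look like.

```ruby
# Hypothetical example: User, DisplayName, and the naming rules are invented.

# Step 2 result: the small class extracted from the large model.
class DisplayName
  def initialize(first_name, last_name, email)
    @first_name = first_name
    @last_name = last_name
    @email = email
  end

  # Full name when any part is present, otherwise the local part of the email.
  def to_s
    parts = [@first_name, @last_name].map { |part| part.to_s.strip }.reject(&:empty?)
    return parts.join(" ") unless parts.empty?

    @email.to_s.split("@").first.to_s
  end
end

# Stand-in for the large legacy model. After the extraction it delegates
# instead of doing the formatting inline.
class User
  def initialize(first_name:, last_name:, email:)
    @first_name = first_name
    @last_name = last_name
    @email = email
  end

  def display_name
    DisplayName.new(@first_name, @last_name, @email).to_s
  end
end

# Tiny assertion helper so the sketch runs without a test framework.
def check(description, expected, actual)
  return if expected == actual

  raise "#{description}: expected #{expected.inspect}, got #{actual.inspect}"
end

# Step 1: a characterization test through the legacy entry point. It is
# written before the extraction and must keep passing afterwards.
user = User.new(first_name: " Ada ", last_name: "Lovelace", email: "ada@example.com")
check("full name via User", "Ada Lovelace", user.display_name)

# Step 3: unit tests aimed directly at the new, small class.
check("only last name", "Lovelace", DisplayName.new(nil, "Lovelace", "x@example.com").to_s)
check("falls back to email", "ada", DisplayName.new("", nil, "ada@example.com").to_s)
check("nothing known", "", DisplayName.new(nil, nil, nil).to_s)
```

The order matters more than the code: the check through `User` comes first and stays green during the extraction, and only then do the edge cases move to the small class, where they are cheap to write and read.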
Usually we can carve out a vein of cleanliness in a couple of days, and that vein will be the river in which our feature or bug travels. We don’t take on the world, just the world that we are in for our story.
If code were the problem, it would be quick work teaching these techniques.
Even when engineering management has escorted me in with a blessing, the resistance on the ground is intense. There is incredible fear. In the past, any change to the code has been dangerous, and not just in the code but in the organization. And that is where the work gets interesting. I climb the ladder from engineering practice to product process, finally arriving at business attitudes. Invariably, I climb with questions all the way to the founder/CEO, where the unintentionally fear-generating attitude originates.
Whether any lasting change can happen in the code usually depends on which stakeholder brought us in, and whether attitudes at the top can change.
This month I worked with two companies: one that I am really hopeful about, and another that I am not. I am going to write a bit about each, starting with the lesser and working towards the greater, with optimism. I think they are interesting stories, but if you are in an impatient mood, everything has already been said above.
Part 1: Simon says
As I said, usually when people want my help it is in the area of code, and later I need to convince them that the problem is higher up. In this company, what they thought they needed was vetting of an architectural plan. So, we started our engagement with a meeting where we looked at a diagram. On the bottom was a cloud with a database and some other stuff; on the top was a series of clouds that also had databases and other bits. There were arrows that connected the upper clouds to the lower cloud. They presented this without explanation and looked at me expectantly.
The company was on the unlucky 13th floor of a downtown corporate building. We sat with spectacular views of the SF Bay, and could see down to the wholly silent hustle and bustle of Market Street. The silencing of this usually vibrant, sometimes cacophonous street strangely mirrored the silence of the meeting and the silence of the engineering culture in general. It was very hard to get any information out of the VP or the team. It was hard to know exactly what they were trying to express with the squiggles and arrows. I had no idea what they were proposing, and though I am quite experienced as a consultant, I had never found myself in this kind of situation.
I have an amazing friend, who despite her talents was reluctant about the consulting direction her life was taking. She said to me, “But what if I show up and they want me to make architectural decisions on the fly without knowing anything about what their problems are?” And I told her flippantly not to worry, that this scenario just didn’t happen. Yet, here I was in her worst nightmare, something I thought unimaginable.
The meeting was also odd in that there was one person who called in, yet was only blocks away in his condo. He would turn out to be key.
After much digging, I found that the team could not make a diagram of their current application architecture. They didn’t have a ready sense of all the clients calling into their primary app or the sub-services at play.
The original mystery diagram was a proposal for a way to spin up many instances of their primary app, cloned but mutated. The upper level of clouds were these clone applications, which were different in that they had read-only database caches and did all their writes through a message queue that would eventually hit the primary app via an API.
I couldn’t imagine what they were trying to achieve with this complexity. This isn’t the way that applications scale.
It took most of a day to get them to admit that their real pain point was around the primary application code base. It was a Rails app, where the User model had 3,000 lines of code. The application controller was of a similar size, with six different ways for a user to authenticate, all of which were multifaceted. There were few tests. The business was wildly successful, and the app was experiencing unacceptable downtime. They didn’t know precisely why, but they thought it was database related.
This diagram of clouds and arrows was a hope and a prayer at addressing downtime with a complicated caching layer, composed of Rails applications that were just as unknowable as the underlying template Rails application.
It is easy to imagine that this was the work of a great many sloppy engineers, who were not as smart as the rest of us. That has never been my experience. Instead I see hard-working developers who want to figure out the right thing, the right way. And so it was here. The hands-on worker bees wanted to test and refactor and clean up the code base. They were skeptical about the diagram. They were also loyal and followed their manager. Brilliant software engineers will follow bad plans put forth by people in power.
Let’s talk about the organization. We were brought in by the VP of Engineering. He was not a man to touch any code at this stage of his career. That’s OK. He was also not of a disposition to listen to his team of code makers. That is not so OK.
I sometimes see teams with an architect who designs the system, but never codes with the team. As a result, no part of the software system communicates with the other parts. It is all theoretical, and when the developer rubber hits the road there is only bad friction and skidding. So it was with the VP of Engineering. He wasn’t able to hear from his team that this wasn’t a viable plan. So his team tiredly rose to the challenge, trying to imagine a way through.
Let’s talk about the man on the phone in the first meeting. It turns out this was the CTO and founder. He was to the VP of Engineering what the VP of Engineering was to the developers. He was someone distant from the problem, dictating an impractical solution. The engineering culture was a three-layer cake of dysfunction, where everyone down the chain had to execute what they knew to be an impossible task, at impossible speeds, perfectly. It was like the games of Simon Says and Telephone combined to bad effect.
Since we were brought in by the VP and unable to get an audience with the very busy CTO, there was really nothing we could do.
Part 2: Blame gaming
This month I spent much more of my time in a company that grew up and became successful without a technical co-founder. Technical companies without a technical leader often treat their engineers like the wheels on a car, saying ‘faster, faster’ without thinking about the engine. It is where the unrealistic sales promises crunch into the reality of time. Engineers suffer impossible schedules and 24-hour pager duty. Retaining engineers becomes harder as developers see their doppelgängers at other companies with a real seat at the table. It is not so much the quality of the engineers that is lessened in this kind of culture, but the bravery and sense of self-worth. Risk taking goes down, and the best practices that the greater community talks about seem like fables from another land.
Somehow, and in some way, a CTO had just been hired to sort out the problems around velocity and stability. This new leader was also put in charge of product, which is critical. We were brought in by the CTO to pair and teach and figure out what was wrong with the process.
Just like with the first company, the code was fragile. Any change in the under-tested code would result in remote failures far from the source. Their strategy was to change as little as possible, and have code reviews with three other people on the team enforcing this conservatism. In addition, they developed local experts, who acted as guard dogs to fenced areas of code. These areas were unofficially owned by them, and the rest of the team stood back heeding their warning, and warping the product to avoid entry into these arenas.
There was a huge fear that developers lived with daily. I thought this fear was only about the fragility of the code. I thought management had given the directive to proceed with caution, always. It wasn’t that direct.
Two weeks in, I encountered my first code ‘crisis’. A deploy we were part of caused background jobs to be only partially available. Jobs slowed in production, and several hours later six engineers and ops people were trying to figure out what was going wrong and why. It turns out that our deploy scripts were unreliable, having deployed to only half the machines. In addition, the integration tests we were depending on gave us repeated false positives, giving us the optimism to believe our code worked. The service we were working on was a hand-crafted Sinatra app, with different initialization processes for workers vs. the application. Our code change had inadvertently jeopardized the worker initialization without affecting the application code at all. There were no tests for this worker code, or its initialization process.
With so many moving parts, moving so wrong, I imagined we would mull it over at a retrospective. I wanted to do a 5 Whys analysis as a team to drill down into the business values that built these processes. The company had its own process for the incident. The VP of Engineering picked our commit as the cause, and asked us for a public post-mortem. While there was language around learning from mistakes, what was missing was team ownership of the problem, process and solution. It was hard for this not to seem like a scapegoating process, and I felt this subtle blame culture was the source of so much developer fear.
Later we learned that the VP was also living in fear. The cofounders had been playing a game of product ‘whack-a-mole’, following any customer request attached to revenue. The spread of the product and the lack of prioritization meant that developers were always rushing through features and encouraged to postpone tests or delay refactors. The company leaders regularly expressed doubt in the engineering team, eliciting a fear that trickled down to every member of the team.
What makes our effectiveness at this company remarkably different is the new CTO. I don’t know what prompted the company to put a seat at the table for technology, but engineering now has an advocate who addresses the root cause of the engineering fear.
Because … the code is just the symptom.