Lessons Learned Spelunking

Published in Bluecore Engineering · 13 min read · Aug 23, 2022

“Rocky formation with rough surface and painting”, ArtHouse Studio via Pexels

Every organization I’ve ever worked for, no matter how large or small, no matter how functional or dysfunctional, has had something that everyone was afraid to touch. Sometimes I’ve been asked to take ownership of such code. Sometimes I’ve been asked to advise.

Recently, I was asked to dig into our file imports pipeline, learn what I could, and provide recommendations to the teams involved. Knowing I wasn’t going to have a long-term relationship with the systems in question meant that I couldn’t follow my normal path — get the basics now and deep-dive opportunistically. I needed more of a plan.

Setting goals

Sometimes I feel like a giant imposter who can’t do anything. Sometimes I feel like a great hero poised to conquer. Sustainability and continuity require a team, and I am definitely just one person. So the first job is setting person-sized goals.

Identify problems, suggest solutions

This was my explicit mandate. Success means learning and applying my experience. Success can’t mean implementing because that is bigger than one person.

Position power can be dangerous. What if I make a mistake? Empowering the teams that are going to own these problems and solutions is the only way to create a margin of error in my thinking. In cases like this, my preferred mode is to use position power to get in front of the right people and appeal to a higher authority if there is a problem later in the process. Flexing whatever position power I have can poison future interactions and create unnecessary risk.

Make consensus possible

Any group of smart, opinionated people is going to have some differences of opinion. With the time I had, reaching consensus on the myriad of issues was unrealistic. Instead, I wanted to get to a point where the base assumptions were aligned enough that we could unblock decisions for the future. It sounds like a little thing, but everyone’s got a point-of-view and it’s easy to get entrenched. It doesn’t matter how good my ideas are if we can’t move the organization to implement them.

Communicate

I’m supposed to provide a recommendation. There isn’t any position power for me to wield here, so I need to convince the whole organization that what I’m recommending fits the problem, provides a path to an improved situation, and is something this company can do. Those three things can feel at odds when the problem is big.

Picking the medium is important, too. In a small organization you can get away with just talking, but we’re not that small anymore. I needed to produce an artifact. If it’s a smaller problem space with a hard decision, presentations can be great. In this case, where it’s a huge problem space with a million little decisions, a document felt better.

Get others to communicate

Our file import system touches a lot of teams. Part of the problem we’ve had so far is that those teams aren’t talking to each other, or at least not hearing each other. If I can put some people together, maybe those relationships can help long-term sustainability. At the very least, I can get everyone speaking the same language.

How did we get here?

Legacy code

A long time ago, I had a friend who had a note on the side of his monitor that read, “Don’t write legacy code.” He explained that legacy code was something that people were afraid to change. Maybe there was a lack of documentation, poor test coverage, no comments, or no contract. Especially in a start-up environment, this can be the norm rather than the exception. If you don’t know that your code is going to survive a month, don’t waste time acting like it’s going to be there in 7 or 8 years. But it might…

Undeclared inbound dependencies

We already know we don’t have a contract, so we’re in trouble. There may also be quirks in declared behavior that something else relies on. Those quirks may not be correct anymore given how the business and other systems have evolved, or may have never been correct in the first place. The worst part is, you can’t know.

Death by a million tiny forks

At Bluecore, we serve several hundred clients. Some have been with us for years (thank you!), some just signed up (thank you, too!). Our file import systems have evolved over that time, sometimes in ways that might break older functionality while adding new and improved versions. The solution to this problem has been feature flags that allow legacy clients to use the legacy code paths, making for an easier transition. It’s an effective solution to maybe the most common class of problem we face — changing the wheels on a moving car. However, the result is that the legacy code persists, complexity increases, and it is nearly impossible to know what is still running and what isn’t.
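One way to make "what is still running" answerable is to count which path each flag actually sends traffic down. The sketch below is only an illustration of that idea, not how our system works: `FlaggedDispatcher`, the client IDs, and the handlers are all hypothetical names invented for the example.

```python
from collections import Counter
from typing import Callable, Dict, List, Set

# Hypothetical sketch: none of these names come from a real codebase.
Handler = Callable[[str], str]


class FlaggedDispatcher:
    """Routes each client to a legacy or current code path and counts usage."""

    def __init__(self, legacy: Handler, current: Handler) -> None:
        self._handlers: Dict[str, Handler] = {"legacy": legacy, "current": current}
        self._legacy_clients: Set[str] = set()
        self.usage: Counter = Counter()

    def use_legacy_for(self, client_id: str) -> None:
        """Turn on the (assumed) per-client flag that selects the legacy path."""
        self._legacy_clients.add(client_id)

    def handle(self, client_id: str, payload: str) -> str:
        """Pick a path from the flag, record that it ran, then run it."""
        path = "legacy" if client_id in self._legacy_clients else "current"
        self.usage[path] += 1
        return self._handlers[path](payload)

    def unused_paths(self) -> List[str]:
        """Paths that have handled no traffic since this dispatcher was created."""
        return sorted(p for p in self._handlers if self.usage[p] == 0)
```

With counts like these collected over a long enough window, a path that never shows up becomes a candidate for deletion rather than a mystery. The window length and where the counts are stored are decisions this sketch leaves open.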

How to proceed

Who to talk to

The easy answer here is everyone, but that’s impractical. Instead, find subject-matter experts, keeping use cases for the whole system in mind. In my case, there were a few camps of people involved:

Platform Engineering: They own the back-end code. There were factions that had focused on different parts of the problem.
Forward-Deployed Engineering (FDE): They write automations using this system, which makes them both consumers of the back-end and the owners of many thousands of lines of code. If the system is clunky to use, it’s FDE who pays most of the price.
Product: They own requirements and the future direction. What we build can’t do everything or we’ll never finish writing it. Their job is to add good features and cut bad ones.
Design: There are some UX and DX questions that Product is raising, which brought in the design team. Knowing it’s one of my weak spots, I tried to listen carefully to what the designer and the product people were trying to do in this area.
Engineering Management: Cost (both monetary and human), reliability, and sustainability are their primary concerns. That goes beyond just having a system that people aren’t afraid to touch.
Senior Leadership: It’s a big deal to our clients, it’s a big deal to finance, it’s a big deal in terms of allocated engineers. I didn’t have the CEO or CFO on this, but our SVP of Engineering had a real interest. He’s also my manager, so that meant he’d be in the loop whether he wanted to be or not.
Me: I’m supposed to be defining a point of view, not just acting as Chief Software Historian. That means that my architectural goals for the company and all my biases will be represented. Better to do it consciously.

In each of these groups, I was able to find someone to talk to. Ideally, I would have gotten in front of the #1 expert for each functional area, but that’s not always possible. The secret is to find several people to talk to. Maybe someone with more availability. Maybe someone where I didn’t have to worry about how my early ignorance impacts my ability to be convincing later. Maybe someone who can just be less formal so we can stumble into new dark corners together. Two invaluable examples:

I have a friend who is a bit of an expert on the code. I sat down with him for two hours to get a dump of his frame for this whole mess. That was a great start that helped carry me through the whole project. I couldn’t take just his opinions wholesale, but I could take his facts and consider his opinions going forward. What he learned had come through a failed project, so having a comparatively untainted messenger was a win for him, too.

Shortly before the start of the project, I happened to do some work with an FDE on an unrelated issue. She isn’t the #1 subject-matter expert with the FDE team, but she knows what she’s doing and was able to explain the FDE workflow. Seeing her pain points helped sharpen my thinking about what a better system might look like. Giving her a sneak-preview of some proof-of-concept code helped validate my assumptions. When it came time to talk to the leadership of the FDE team, I was much better prepared.

Meeting planning

I am lucky that there were not significant political obstacles in this case. Disagreements about tactics and priorities or technical decisions are always there, but I didn’t have to worry about keeping certain people from being in the same room (or Zoom). But even in this mostly collegial environment, there are still rules.

Early meetings should be as small as possible. People will be more open, it’s easier to schedule, and you can worry less about a strict agenda. But the more honest reason is that I don’t need 10 people watching me not know what I’m talking about when I’m learning from an 11th. I want people to be heard, but I don’t want them to feel like they need to sit through hours of meetings where 90% isn’t covering the parts they care about. I feel like I’ve only got a few meeting tokens, so I try to spend them wisely.

The focus on file imports spurred some other people to have their own conversations on the topic. I was in several, but led none of them. I learned a ton! I didn’t have to burn any of my meeting tokens, which was good. I didn’t have control over the agenda, so it didn’t always feel as time-efficient as I might like. In the later meetings like this, I was already nudging the conversation toward my plan, and especially toward my vocabulary (see below).

Vocabulary

Regardless of the medium, I needed a frame. I needed groups to start using the same words to talk about requirements, functions, and failures. Even before I was done with discovery, I started sneaking my chosen vocabulary into the conversation, being as specific as I could. If someone said, “when we make pasta” I’d interject, “spaghetti?” and then let them either correct me or continue. This gave me a chance to get more and more specific, but also allowed any of the experts to tell me if my words were wrong.

I defined some new terms. This was the only secretive part of the process. It was vital that I didn’t accidentally establish words I didn’t want people to use going forward! I struggled. I wrote, erased, rewrote, and erased again. I still don’t like everything I came up with, but it’s good enough. Here were the rules I gave myself:

Terms cannot run contrary to their generally-accepted meaning. I don’t want people to have to unlearn to understand.
Terms should not have a specific meaning in another part of our system. Words like “apply” are going to show up everywhere, but that’s ok as long as it isn’t capital-A “Apply” anywhere.
Avoid codewords and acronyms — we’ve got enough of both to last a lifetime. Jargon creates an in-group and pushes people away.

For this project, my new terms for our file imports pipeline were receive, prepare, transform, split, apply and publish. I’m not going to define them here, but if I tell you that everything we receive becomes a series of events, your guess about the meanings of those terms is probably pretty close.

In my document, I made sure to cast both the existing solution and the recommended changes into those categories. Reinforcing terms at every opportunity is the best way to maximize your chance for adoption.
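If the vocabulary needs to live somewhere other than a document, a closed set of names is one option. The sketch below is a hypothetical illustration: only the six stage names come from this article, while `Stage`, `group_by_stage`, and the idea of tagging notes are assumptions made for the example. It deliberately says nothing about what each stage does.

```python
from collections import defaultdict
from enum import Enum
from typing import Dict, Iterable, List, Tuple


class Stage(Enum):
    # The six terms from the article, in the order the article lists them.
    RECEIVE = "receive"
    PREPARE = "prepare"
    TRANSFORM = "transform"
    SPLIT = "split"
    APPLY = "apply"
    PUBLISH = "publish"


def group_by_stage(notes: Iterable[Tuple[str, str]]) -> Dict[Stage, List[str]]:
    """Group (term, note) pairs by stage, rejecting terms outside the vocabulary."""
    grouped = defaultdict(list)
    for term, note in notes:
        # Stage(...) raises ValueError for any word that is not one of the six.
        grouped[Stage(term.lower())].append(note)
    # Return only the stages that have notes, in the order the enum defines.
    return {stage: grouped[stage] for stage in Stage if stage in grouped}
```

A note tagged with a word outside the list fails loudly, which is the same nudge as interjecting "spaghetti?" in a meeting, just automated.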

Is there truth? There is code

Recognizing the value of language is a good step, but it only matters if you’ve got something to say. Consultant Mike gathered all this information so that Architect Mike can come up with something. That involved a lot of listening, observing, and talking. But there’s one more member of this team of Mikes that has been on the bench until now: Programmer Mike. I am lucky that all of the components were written with languages and tools I know.

File imports are way too big for me to internalize in the few weeks I had. That’s part of the problem I’m trying to solve! But I’d be ignoring the most objective source I’ve got if I didn’t take time to read some code. Which code? How much time to spend on this?

I’d like to give the world a perfect rubric, but I don’t think it’s possible. To be honest, I read a ton of code. In this four-week, part-time project, I spent at least 20 hours just reading. Most of that reading happened while I was writing my document. Because a big portion of the document described the functional units of the existing systems, I stopped to read the significant portions of each component just before I wrote about it. This was not efficient, but it was effective. I was able to link my statements about the component to the implementation, removing doubt. In at least one case, a code owner told me that the implementation of the component I was working from wasn’t the one used in production, providing an important correction.

How to talk smack and alienate people

Not all decisions are great. Not all implementations are great. Maybe the best way to ensure you don’t get good information is to tell everyone involved that they’re stupid, that their work is terrible, and that you’re going to show them how to design better systems, write better code, eat their soup without slurping, and generally be better people. If you think this is actually the case, someone should be working on their résumé rather than on fixing systems.

Instead, try to figure out why decisions were made. As an example, I noted above that my friend who gave me the broad overview had been involved in a failed project in this functional area. There was a lot to learn here, including the assumptions that led a bunch of smart, well-intentioned people to somehow build the wrong thing. In this particular case, the assumption was that there was a reliability problem in one part of the pipeline. There wasn’t, so the additional complexity and cost weren’t justified. But that’s great to know!

There was part of the code that bothered me. It was inefficient and poorly tested. I rage-wrote a replacement. I knew that the author was no longer with the company, so I couldn’t ask him what was up. I was absolutely furious that we would let something like this anywhere near production. I asked around and learned two very important facts: first, the author was an intern who did not get an offer at the end of his internship. Second, this code path was never turned on — the legacy implementation was still handling all production traffic. It should be deleted, but at least it wasn’t hurting anyone.

Lead with empathy. Reserve judgment. You can always get mad later.

Putting it all together

I wrote a document. The reason I chose a document over a presentation is that readers can skip around, click links, and rely on summary paragraphs. Presentations have to be more linear than that. A big reason this mattered to me is that my audience is at all different levels of detail (upper management down to individual contributors), different levels of technical skill, and focused on different areas of this broad system. I would have needed to produce many presentations to serve all these audiences.

My document had an introduction to explain the problem and define my terms. The very next section was a TL;DR (too long; didn’t read) guide. Some readers weren’t going to hang on my every word, and I didn’t want them guessing what the important parts were. The TL;DR guide explicitly told them what to read and what to skim or skip.

The conclusions were specific and written as a bulleted list. There was some commentary in that list, but it was as tight as I was comfortable making it in the time I had. It’s important to answer the “so now what?” question as clearly as possible. If I didn’t, someone else would, and I might not like what they chose.

Because we use Google Docs, it was easy to share with everyone and give them comment access. Lots of tools support this kind of thing. I wanted feedback, both private and public, on what I’d written. Most of the comments were requests for clarification, which I was happy to give: if one person doesn’t understand something, odds are there are more.

Release

I had planned a big reveal. I was going to invite principals from all of the stakeholder groups and unveil my masterpiece like some kind of magic trick. However, I also wanted to share the document with some of my experts before release so I could get their feedback.

It leaked.

I was a little let down that I didn’t get to pull the rabbit out of my hat or saw someone in half, but it was worth it. Everyone was bought in before the general release, so no theatrics were necessary.

What you have to do here depends completely on your audience.

Follow up

I am not a consultant. The document represents the end of a stage of analysis, but I’ve still got a responsibility to make it happen. Many co-workers of mine are impacted by this work and will continue to be for months. Helping engineering teams understand is part of it, but the more important role is probably in advocacy. None of this will matter if we don’t have the time, space, and focus to implement. It’s always hard to keep organizational focus on long-term things and the only antidote I know is for there to be a tenacious person who steps up. At the start, I am that person. Handing off that role properly is at least as important as handing off the technical information. Failing to do so is a guarantee that we never get to the end. I have to find a new advocate and be that advocate until I do.

Summary

Understand why nobody wants to touch that thing
Define the goals for your project
Do you just need the answer and you can drive the solution, or do you need to bring others in to help drive it?
Choose your medium based on your audience(s)
Be honest and humble about what you don’t know
Be intentional about who you talk to and how you talk to them
Early meetings can be small and less formal
Be empathetic about prior decisions
Make everyone speak the same language; invent new vocabulary if necessary
Be ready to advocate for your solutions, even after your analysis is over

Mike Hurwitz is a Principal Software Engineer who has been at Bluecore for four years, mostly focused on infrastructure for data science. He is about to celebrate 24 years as a professional developer.

Interested in helping scale our platform as we continue growing? We’re hiring at Bluecore!