A Saucerful of Technology Secrets (part 1)

Massimo del Vecchio
10 min read · Sep 24, 2019


A few years ago I had the chance to lead the engineering department of a company whose business was in big trouble because of technology: they could not release new products and services to the market, and they had a lot of problems maintaining the existing ones.

I came from consulting and system integration, and it was a great step for me: moving from “this is what you should do” to “let’s do it”. It changed my professional (and personal) life forever.

The moment I stepped into the CEO’s office (on the exact day I started my new job) he just told me: “I believe we have good people but shit technology. You need to help me understand why”. Here we go, rock’n’roll from day one; that dreamed-of Caribbean island was still very far away.

The moment I left his office I was completely sure he was wrong. You cannot have good people and shit technology. “It’s all about people”, I used to say to my clients when I was leading IT programs on their behalf. I had never been so wrong.

This story is not about people but about technology. And how people may screw it up or make it great.

Being brave is not enough

Let me first give some context.

The company was serving about 100 large customers around the globe. Their market model was B2B2C, enabling their customers to deliver mobile services to hundreds of millions of users.

Through the platform, customers were able to run marketing campaigns (across mobile channels), manage subscriptions, provide a few financial services (loans, mobile money…), collect payments…

All platform components were deployed on virtual machines in each customer’s data centre. No containers, no cloud for production services.

The officer in charge before me had decided it was time to evolve the monolithic Java/J2EE platform into a distributed set of Java Spring applications. He bravely started a parallel development effort to migrate everything in one year.

After two years of development they had migrated only 20% of their customer base, and their cost of ownership had basically doubled.

They had to hire engineers to build new capabilities and fix bugs on the new platform, while keeping the entire existing team to maintain the legacy one.

The operations team was managing the two platforms in parallel across multiple clients, with lots of manual deployment and monitoring activities and no documentation at all.

Clients started to leave due to poor operational quality: the number of issues and the time to solve them increased tenfold.

People were spending their days ping-ponging tickets across departments and complaining that someone else was not doing their job.

Attrition was at its highest since the company was founded: people working under huge pressure, with no sign of light at the end of the tunnel, simply left.

We were losing key skills and years of on-the-job knowledge.

If you have really spent years managing large IT departments, all of the above will sound familiar. If it doesn’t, you are lying.

Break Pandora’s box from the bottom

I spent the first two weeks meeting with everyone in the company.

I didn’t miss a single key person across the Engineering, Operations and Customer Support departments.

I was looking for the root cause of the problem, with the objective of defining a strategy to get out of the hole the company was being buried in.

One by one…

Software: the solution was actually quite good. Despite the “feeling” of the other C-levels, the architecture was well designed. Each component of the new platform was self-contained and integrated with all the other components through RESTful APIs. The coding was good: I didn’t review the entire codebase, but what I sampled was good enough. No evident antipatterns.

Some components were actually a bit over-engineered for their scope of work, but that was good for scaling the application in the future. Our CEO was right: with very few exceptions, the quality of the developers was really high.

But, wait… one alarm bell: no unit test coverage and no automated test framework. That was a big miss.

IT Operations: we had very experienced people, some of them with an engineering background. They were able to write their own code to manage monitoring and deployments. They had been running the business on the legacy application for years, with great results.

On paper, their resolution and escalation processes were OK: they had the right tools, the right people, the right process (remember: “on paper”).

Problem #1: ops people didn’t have a clue about the new platform. The only way to learn about it was to “ask the engineers”.

Problem #2: the legacy platform had a dedicated branch for each client installation (i.e. about 100 branches). The new platform partially solved this problem (we will see how in the deployment process).

Customer Support: almost the same as operations, high-quality people. The real issue was that about 50% of the team had been hired in the previous three months. They didn’t have the full history of the clients, nor a full understanding of the product and solution.

Bottom line: both operations and customer support had only a very basic knowledge of the new platform. No documentation was available, and no one from engineering was dedicated to supporting them in closing problems.

Delivery methodology: despite running a stand-up meeting every day, the team was not actually working with scrum.

A product manager was handing over design documents every month, and the development team was working through them in four-week cycles.

The stand-up meeting was just a coffee break between developers and team leaders; they used to tell each other what they already knew.

They didn’t set the objectives of the day, and even when they did, everyone knew those objectives would be missed. They simply didn’t take it seriously.

Developers got their design input from a product manager, wrote a bunch of code and checked it into their repository every week. Then, one week before the release, the product manager ran the acceptance tests. A full waterfall with an agile label.

Deployment process: to understand the complexity, we need to dive into the platform features. I will try to keep it very simple.

The platform required dedicated configurations for each client. These configurations changed the marketing campaign behaviour, the subscription flow, the payment options… Basically, the whole user experience was defined by a set of parameters configured in “yml” files deployed at the customer site.
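To give an idea, here is a minimal, hypothetical sketch of such a file; every parameter name below is invented for illustration, but the flavour (flat lists of behaviour-changing switches, one file set per client) is the point:

```yaml
# client-acme.yml: per-client configuration (all names are hypothetical examples)
campaign:
  enabled: true
  channels: [sms, ussd, push]     # which mobile channels campaigns may use
  max_daily_messages: 3           # throttle per subscriber
subscription:
  flow: double_opt_in             # some clients ran single_opt_in instead
  renewal_retry_hours: 24
payment:
  providers: [carrier_billing, mobile_money]
loans:
  enabled: false                  # financial services only in some markets
```

Now multiply this by thousands of parameters and a hundred client sites.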

Each account manager was in charge of “approving” the deployment every time a new set of capabilities was rolled out to their customer’s site.

They could also ask the operations team to change those configurations at any moment. We are talking about thousands of undocumented parameters in text files. Doomed to create issues.

On paper the solution was brilliant: one single platform, simply configured on site by IT operations. Brilliant becomes dumb when the number of configurations gets huge and no one has a real clue what all of those parameters actually do.

Plus, to make life even worse, some key clients had asked for specific patterns that required software adjustments. Thus, a bunch of configurations didn’t even work for all clients.

This led to a full set of manual checks by business operations during every deployment.

Weekends were actually spent with people chatting all around the globe. A typical conversation went like this:

IT ops pal: “I deployed the campaign engine; you can now run the test campaign”.

Business analyst: “cannot login”

IT ops pal: “sorry, didn’t re-activate access, try now”

Business analyst: “Thanks, can do it now”

Business analyst: “I have created my campaign but I don’t see it in the dashboard”

IT ops pal: “don’t know what you are talking about, will ask engineers”

Engineer: “Dear IT Ops pal, you need to deploy all components, you forgot the cache cleaner”.

IT ops pal: “Please add it to deployment procedure next time”

Engineer: “It’s already there”

A conversation like this could last for hours or days. The result was:

· Business analysts/account managers feeling they were dealing with idiots using an esoteric vocabulary.

· IT ops hating their job, which basically consisted of following hundreds of manual steps that could easily be automated, i.e. doing a job that a very simple piece of software could do (see the sketch right after this list).

· Engineers believing the whole of humanity was a bunch of useless people whose objective was to screw up their sunny weekends.
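That automation really was within reach. As a hedged illustration (the component names and health URLs below are invented), a post-deployment smoke check as dumb as this one would have caught the forgotten cache cleaner before anyone had to ping a business analyst on a Saturday:

```java
// SmokeCheck.java: minimal post-deployment verification sketch (Java 11+).
// The component list and /health endpoints are hypothetical examples.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

public class SmokeCheck {
    private static final List<String> COMPONENTS = List.of(
            "http://campaign-engine:8080/health",
            "http://subscription-manager:8081/health",
            "http://cache-cleaner:8082/health");  // the one everybody forgot

    public static void main(String[] args) {
        HttpClient client = HttpClient.newHttpClient();
        boolean allUp = true;
        for (String url : COMPONENTS) {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            try {
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                if (response.statusCode() != 200) {
                    System.err.println("DOWN: " + url + " (HTTP " + response.statusCode() + ")");
                    allUp = false;
                }
            } catch (Exception e) {
                System.err.println("UNREACHABLE: " + url + " (" + e.getMessage() + ")");
                allUp = false;
            }
        }
        System.exit(allUp ? 0 : 1);  // a non-zero exit blocks the deployment procedure
    }
}
```

Run something like this as the last step of every deployment, and the “you forgot the cache cleaner” chat simply never happens.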

This was happening a few years ago, when microservices, Jenkins, Jira, Docker, cloud-native applications… were already out there, some of them not fully mature but ready to be adopted.

I bet this is still happening in a lot of companies today.

Last but not least:

Organization: engineering was designed around functions. They had heard about “autonomous teams”, but de facto developers depended on dev-ops, quality assurance… people reporting to different managers.

Senior managers’ yearly and quarterly targets were more or less aligned, but there was no alignment on weekly or daily targets.

So, everybody built their own agenda, and we all know how easy it is to cheat a metric at the end of the quarter if it is not linked to measurable business outcomes.

Operations was designed around products, dedicating people to key accounts and sharing some of them across the minor ones.

Customer support was designed around customers.

So: three legs of the organization (Engineering, Operations and Customer Support) with three different operating models.

Is it that simple, then?

It’s not that simple but sometimes it looks that simple

Whenever you find yourself facing a pile of issues like this, the real question is: what is the root cause? What’s the number one problem you need to fix?

Finding that answer is not easy at all. Sometimes we crawl around the symptoms without really catching the cause.

There were some obvious technology and methodology fixes for the problem: the late feedback during product development was a clear sign of a bad scrum implementation, the lack of automated testing in the continuous delivery pipeline was preventing valuable check-ins, the lack of documentation for the huge number of configurations was killing operations and customer support…

All of them were just symptoms. I started to write down some cause-effect chains, and they converged into an evil loop: lots of delivery errors → teams blaming each other → no one stopping to fix the way of working → even more delivery errors.

So, it looks like fixing the delivery approach would have solved all the problems.

But let’s pause on that for a while: why would a team of smart people pick the wrong way of working? Why did they keep running into the same mistakes for years without stopping, taking a deep breath and trying to fix them?

I found the answer in the chain above:

“Lots of delivery errors → Teams blaming each other”.

The question is: why should someone facing an error go down the blaming path instead of fixing it? Why should the first reaction of someone facing an error be to push it into the hands of someone else?

That’s a cultural problem.

But who created that culture? I know of just one case where good people do bad things; it’s called “religion”. Maybe those people were religiously following unwritten rules, or leaders driving them down the wrong path.

Yes, that was the answer I was looking for: fix management behaviour, and you will fix the rest.

Get into the fight, Genchi Gembutsu my friend

I wrote down my priorities:

  1. Fix the blaming game through a cultural shift among leaders
  2. Re-organize the engineering teams
  3. Review the product backlog, including the local configuration review and deployment automation
  4. Fix the scrum meetings, make them effective
  5. Implement a real CD pipeline, including a test framework for automation: (5a) teach the TDD approach to developers, (5b) fix configuration management, (5c) build/reuse a continuous integration pipeline, (5d) design and implement a testing strategy that includes automation (a minimal sketch of one such test follows this list)
  6. Re-organize the operating model across engineering, operations and customer support
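To make 5b and 5d concrete, here is a hedged sketch of the kind of automated test worth pushing into the pipeline: a unit test that fails the build whenever a client configuration file contains a parameter nobody has documented. It uses JUnit 5 and SnakeYAML; the file name and the whitelist are hypothetical examples, not our actual code:

```java
// ConfigurationContractTest.java: sketch of a build-breaking configuration check
// (JUnit 5 + SnakeYAML). File name and whitelist are hypothetical examples.
import org.junit.jupiter.api.Test;
import org.yaml.snakeyaml.Yaml;

import java.io.InputStream;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import static org.junit.jupiter.api.Assertions.assertNotNull;
import static org.junit.jupiter.api.Assertions.assertTrue;

class ConfigurationContractTest {

    // In a real fix, this whitelist lives next to the parameter documentation,
    // so adding a parameter without documenting it breaks the build.
    private static final Set<String> DOCUMENTED_PARAMETERS =
            Set.of("campaign", "subscription", "payment", "loans");

    @Test
    void everyClientParameterIsDocumented() throws Exception {
        try (InputStream in = getClass().getResourceAsStream("/client-acme.yml")) {
            assertNotNull(in, "client configuration file not found on classpath");
            Map<?, ?> config = (Map<?, ?>) new Yaml().load(in);
            Set<String> unknown = new HashSet<>();
            for (Object key : config.keySet()) {
                unknown.add(String.valueOf(key));
            }
            unknown.removeAll(DOCUMENTED_PARAMETERS);
            assertTrue(unknown.isEmpty(), "Undocumented parameters: " + unknown);
        }
    }
}
```

One test like this per configuration area is not a full testing strategy, but it kills the worst failure mode first: parameters nobody knows about silently changing behaviour on a customer site.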

I am not saying this is the priority list for solving any tech organization’s problems. There are a lot of factors influencing your priorities, and they may change with the context, with your attitude, with what you believe is feasible…

For example, I put the operating model re-design last because I knew it was the most complex and would require a lot of time, and because I could get visible results without dramatic changes. At the end of the day, you cannot step into a new company and say “I want to change everything!” without bringing any tangible results first. (Actually, you can do that very easily: just buy the company.)

Fixing the cultural problem was quite fast. My own method is “care about other people’s jobs”. I mean: if operations people are struggling with a deployment problem, go there yourself, understand the problem, and do not step out just because it’s not your area. Show this attitude to your people.

Remember, you lead engineers, and engineers are born to solve problems.

If you find out you cannot do anything to solve it, start thinking about (and pushing into your backlog) what you can do to prevent that problem in the future.

Be careful, I am not saying “show that you care…”, I am saying “really take heed…”.

Act as if you were in charge of everything. People will notice it and, most importantly, your fellow managers will recognize it.

Obviously, you will find someone trying to play it smart and push useless stuff onto your table; that’s human nature. But use logic, prioritize, make your software great again! (Sorry for that, my dear US friends.)

The second step is to push yourself onto the critical path. Look at as many production tickets as you can, dig into why each problem is there, and do not just find someone to solve the specific issue: find a solution that prevents the problem from happening again. This is a mindset, not just a methodology.

That will pay back a lot. People will start trusting you and your organization’s ability to make stuff happen.

And on the other hand, you can’t imagine the value you get from really looking into other managers’ problems: operations, customer support, account management, marketing… From an engineering point of view, that’s a gold mine for producing better solutions.

So, bottom line: put yourself in the other departments’ shoes and help them solve and prevent their problems. As an engineer, that’s what you are paid for.

This is another angle on the Toyota “Genchi Gembutsu” principle: go and see. Do not delegate. As a manager, you need to know what happens and see it with your own eyes. My suggestion: do not Genchi Gembutsu only within your own organization, but walk the entire company’s processes.

That’s just the first step. In parallel, you need to actually fix stuff (yes, it’s not all about being a nice guy!).

In the next article I will focus on how we actually started to fix our stuff: team organization, backlog prioritisation, planning and execution.
