Fighting Conway’s Law: The Reality of Feature Teams

Things to consider when optimizing your org structure

Ofer Karp
The Startup
16 min read · Sep 15, 2020


“Nothing we can’t fix in a day or two” (source: pexels.com)

You probably noticed that every presentation or article about agile software development or microservices architecture can’t be considered scientific or serious unless it somehow mentions Conway’s law.

Back in 1967, when “full stack” still described the state of a data structure, and if someone told you he was “moving to the cloud” you sadly asked him what kind of terminal illness he was suffering from, Dr. Melvin E. Conway wrote:

“Any organization that designs a system will produce a design whose structure is a copy of the organization’s communication structure”

This makes perfect sense. The canonical example in a good presentation about Conway’s law would be an org that is divided into 4 R&D groups, is assigned the task of implementing a 3-pass compiler, and somehow ends up building a 4-pass compiler.

Translation: “Don’t know, it doesn’t seem critical to me”

A bit more of a real-life example that you have probably seen around is an engineering org built around physical layers and professions. Such an org structure typically combines single-function teams and component teams. The R&D in such an org may have one group owning the data layer, a second group responsible for the business logic (“the backend”), and a third group continuously reimplementing the application to use the latest and greatest frontend JS framework (“game changer!”). You will most likely also find a separate team of (frustrated) product managers, sitting as far as possible from a bunch of warriors going by the name “the Ops team”. Note that the Ops guys will always be sitting near the big window, ready to jump and get it done with. Moving on, in one of the corners of the open space (or even better: offshore) you will identify the members of the QA team, who for some reason are the only people smiling and loving this structure. Last but not least, there are always a few people that used to be part of one of these expert teams, are now “between roles”, and are assigned as “project managers” to try to get some stuff to happen in this awful structure.

As Conway claimed back in the ’60s, this kind of org structure creates communication channels that are reflected in the system design, resulting in a system design that is clearly not aligned with the problem domain and the needs of the business. In other words, you have an org structure that, instead of helping you achieve your goals, just adds massive accidental complexity to every project, and is surely counterproductive in terms of building new products and features.

“Maybe. But we do have a well defined release process”

Being engineers, when facing a problem we always start by (over) generalizing it, and then try coming up with a “one size fits all” solution. Introducing: feature teams. Unlike the simple law Conway brought us, they come in many flavours and names: feature teams, scrum teams, product teams, squads, pods. Each flavour comes with its own specific characteristics and agile methodology, and it is important to understand the differences, but from an org structure perspective it always means the same thing: let’s take these physical-layer and profession-based teams, lay them one on top of the other like a fancy layered cake, and then slice the whole thing vertically to create heterogeneous teams.

“See how nice & symmetric it is” (source: pexels.com)

So in the new structure we now have a collection of equally structured multi-purpose teams, each with one developer from what used to be the data group, a couple of developers from the former backend group, and a single frontend developer. They are still the same 4 people, but now they all have a fresh job title: “full stack developers”. In addition to the 4 developers, every team needs a team manager (in some flavours referred to as “master”, which automatically makes all the others slaves), plus one product manager, plus a person who used to be part of the happy QA team and overnight became the “automation guy” of the newly formed team. Did we forget anyone? Yes we did: the Ops team sitting by the window. Let’s just add one member of the good old Ops team to each of the feature teams. This sounds a bit like adding salt to a dish a minute before serving it, so in order to balance the tastes and make it easier for him, let’s now call him “DevOps engineer” and give him a nice salary raise.

We now have the perfect team, so great that it may even be able to build the 3-pass compiler.

“Just say what you need, we will build it”

Such a move, known as the “Inverse Conway Maneuver”, is quite common. I have personally used some version of it in 4 different engineering orgs I have had the pleasure of leading over the last 14 years. Surprising as it may sound, this move does work, at least as long as you aren’t taking it too far. What do I mean by “taking it too far”? Good question, as this is exactly what led me to write this post.

Pros & Cons

Since the pros of an org structure based on feature teams are very well explained in other articles, let’s talk about the cons. What are you losing with the new org structure compared to the old one, and when are you expected to feel the pain?

The short answer is that unless you go too far with the change there are no disadvantages, or at least there are known measures to overcome the issues you will encounter as a result of the change. But what happens when you go “all in” and transform 100% of your engineering org from single-function and component teams into feature teams? Well, the answer starts with “it depends”, but there is a common pattern: the disadvantages will show up around the most strategic areas of your product, and you will only feel the pain in the long term, say 12–18 months into the change.

“Did anyone hear a strange noise?” (source: pexels.com)

To explain it, I want to use 3 concrete examples from 3 different products & engineering orgs I have led or am currently leading:

Example #1: HP Universal CMDB

“It’s not a bug, you just added the contains arrow in the wrong direction”

UCMDB is a product that provides Ops people a near-realtime picture of the topology of the infrastructure and applications they are operating.

It answers questions like: what servers are we running and where? What network devices (switches, routers, etc.) do we have and how are they configured? It also goes up the stack to map running middleware services (databases, app servers, message brokers, etc.), and can go even another level up to tell us which applications are actually deployed on this infra and how these apps are using that middleware.

Having such a picture, especially in realtime, is a key factor when it comes to setting up effective monitoring, performing impact and root cause analysis when something fails (server, DB, app), and several other core operational and security-related processes.

From a pure tech perspective, UCMDB is one of the most interesting products I have ever seen. Imagine a fleet of agents (we called them discovery probes) that use all sorts of network protocols and application identification templates to identify these entities and map the relationships (dependencies) among them, and then store all this data in a system that allows consumers to query the configuration state of the entire datacenter/VPC and even register for specific configuration change events.

The collected data represents a configuration topology, and is stored and managed in a graph data structure. Each vertex in the graph represents a configuration item (for example: a Linux server or a DB) and each edge represents the relationship between two configuration items (for example: runs on). Typically, the type of queries that consumers wanted to execute on the system were topological as well. There is very little value in just getting a list of all servers. Where it starts to be interesting is when you can execute a query that retrieves all servers that have fewer than 8 CPU cores and have MySQL installed on them, where that MySQL is used by at least 1 business application defined as a mission-critical app. Think of it as some kind of pattern matching on a graph.
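To make the “pattern matching on a graph” idea concrete, here is a minimal toy sketch of such a topological query. Everything here (the `CI` class, the relationship names, the query) is my own illustration, not UCMDB’s actual model or its TQL language:

```python
# Toy in-memory configuration graph: vertices are configuration items (CIs),
# edges are named relationships. Names are hypothetical, for illustration only.

class CI:
    def __init__(self, ci_type, **attrs):
        self.ci_type = ci_type
        self.attrs = attrs
        self.edges = []  # list of (relationship, target CI)

    def link(self, relationship, target):
        self.edges.append((relationship, target))

    def related(self, relationship, ci_type):
        return [t for rel, t in self.edges
                if rel == relationship and t.ci_type == ci_type]

def small_mysql_servers(graph):
    """Servers with < 8 cores hosting a MySQL used by a mission-critical app."""
    results = []
    for server in (ci for ci in graph if ci.ci_type == "server"):
        if server.attrs.get("cpu_cores", 0) >= 8:
            continue
        for db in server.related("hosts", "mysql"):
            apps = db.related("used_by", "application")
            if any(a.attrs.get("mission_critical") for a in apps):
                results.append(server)
                break
    return results

# Build a tiny topology and run the query.
app = CI("application", name="billing", mission_critical=True)
db = CI("mysql", version="5.7")
srv = CI("server", name="srv-01", cpu_cores=4)
srv.link("hosts", db)
db.link("used_by", app)
assert small_mysql_servers([srv, db, app]) == [srv]
```

The real engine, of course, ran optimized traversals over millions of vertices rather than a Python list scan, but the shape of the question, a pattern over vertices and typed edges, is the same.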

In today’s tech landscape, we have both open source and commercial graph databases (Neo4J, OrientDB, AWS Neptune, and many others) that can probably provide a similar abstraction layer to what we needed for building UCMDB. But back then the only viable option was to build such a layer ourselves. This was a very advanced piece of software, using relational databases as the storage engine, and an advanced in-memory graph representation with a highly optimized engine that could run graph traversal algorithms (from the most common BFS and DFS to more advanced stuff). We also had a domain-specific language called TQL for defining these topological queries, and a set of sophisticated UI controls that helped users create queries and navigate through the results.

“It’s rock solid, just don’t touch it” (Photo by John Moeses Bauan on Unsplash)

But wait, didn’t I say we are talking about orgs and Conway’s law? How is this relevant? Well, the answer is that it’s an example of a situation where somewhere in the product you have an area with unique capabilities and core IP, and this area needs to be managed differently. Why? Simply because it’s way more complicated than anything else in the system, and because the implication of a problem introduced in this area is dramatic. And by problem I am not even talking about a functional regression or some bug that causes a query to return a wrong result. Even a small, innocent change that slightly increases the average memory consumption of a configuration item (a vertex in the graph) might cause a significant decrease in the system’s capacity and performance. And that would impact many use cases to a level where the entire system becomes useless.
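Back-of-the-envelope arithmetic shows why a “small” per-vertex increase hurts so much. The numbers below are entirely made up for illustration, not actual UCMDB figures:

```python
# Hypothetical numbers, only to show how a small per-vertex overhead
# translates into a system-wide capacity loss.
heap_bytes = 16 * 1024**3                      # 16 GB available for the graph
bytes_per_ci = 1024                            # avg footprint of one CI (vertex)
capacity_before = heap_bytes // bytes_per_ci   # ~16.7M configuration items

bytes_per_ci_after = bytes_per_ci + 64         # an "innocent" 64-byte addition
capacity_after = heap_bytes // bytes_per_ci_after

loss = 1 - capacity_after / capacity_before
print(f"capacity drops by {loss:.1%}")         # roughly 5.9% fewer CIs fit
```

A 64-byte field added to a 1 KB vertex quietly evicts almost 6% of the datacenter picture from memory, which is exactly the kind of regression no functional test catches.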

So yes, we love full stack developers; we want all of our teams to have as much independence as possible, choose the right tool for the job, run as fast as they can, have E2E ownership, and enjoy all the other great benefits that come with a feature team. But when it comes to the core graph engine, I believe it’s better to give up these benefits and stay with the more traditional “domain experts team” model.

Example #2: Perfecto Mobile

“Hi Siri, please play Omer Adam”

Perfecto offers a cloud-based test lab for mobile and web applications. Thousands of real mobile devices of all types & models are placed in Perfecto’s datacenters in different regions, and customers that need to test their mobile app remotely connect to those devices, install their app, and test it. Testing is done either manually (interacting with the remote device as if it were in the tester’s hand) or automatically (using standard test automation frameworks like Appium, Espresso, XCUITest). Users also get access to Perfecto’s test analytics platform, where they can use various dashboards and reports to analyze and optimize quality, coverage and CI efficiency metrics.

The main value proposition is increased test coverage. You can now test your app on a larger variety of devices, OS versions, browsers, etc. The devices are always available and ready in the lab, both the shiny new device model that was just released to market yesterday by Apple, as well as very old devices that you would probably not have in your local test lab, but some of your end users are still calling “my smart phone”.

“Force touch is new? we had it 50 years ago” (Photo by Paweł Czerwiński on Unsplash)

From a tech perspective, Perfecto is a unique animal. What I mean is that the type of engineering challenges we faced were slightly different from those of most SaaS companies. Yes, we did have areas in the product (like the test results analytics platform) which fall under the common type of engineering challenge: building a SaaS product that digests data, processes it, and makes it available to end users as valuable information. I call these “String in, String Out” products. Building them is an art by itself (UX, scale), but in this story’s context, it’s a challenge that is very much suitable for feature teams.

So where was the “going too far” point with Perfecto? Once again, like in the UCMDB example, not all areas in the product were born equal. But unlike in the UCMDB case, with Perfecto it wasn’t about a product area with high inherent complexity and risk of fault, but rather an area and a problem domain that require a different type of SDLC (Software Development Life Cycle). Let me explain.

In order to remotely control mobile devices, you need to have a control layer that exposes APIs for the different types of interactions and potential device states. Things like install an app on the device, launch an app, click, tap, swipe, force touch, pinch & zoom, and many other user interactions. In addition, for real-life test scenarios in modern apps, you also need to be able to inject mocked data into the device, in a way that the tested app would behave as if the data was collected by the device sensors. This includes location-based data which is normally collected by the GPS (testing features like “find store nearby”), injecting images that are typically collected by the camera (testing features like check deposit), injecting audio (testing personal assistants), video (testing streaming apps), and many more.
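As a rough sketch, the shape of such a control layer might look like the abstract interface below. The interface and every method name here are my own invention for illustration, not Perfecto’s actual API:

```python
# Illustrative only: a hypothetical remote-control surface for one device.
from abc import ABC, abstractmethod

class DeviceController(ABC):
    """Abstract control layer for a physical device in the lab.
    Each OS (and sometimes each OS version) gets its own implementation."""

    @abstractmethod
    def install_app(self, package_path: str) -> None: ...

    @abstractmethod
    def launch_app(self, app_id: str) -> None: ...

    @abstractmethod
    def tap(self, x: int, y: int) -> None: ...

    @abstractmethod
    def swipe(self, x1: int, y1: int, x2: int, y2: int, ms: int = 300) -> None: ...

    # Sensor injection: make the tested app believe the data came
    # from real hardware (GPS, camera, microphone).
    @abstractmethod
    def inject_location(self, lat: float, lon: float) -> None: ...

    @abstractmethod
    def inject_camera_image(self, image_bytes: bytes) -> None: ...

    @abstractmethod
    def inject_audio(self, audio_bytes: bytes) -> None: ...
```

The interesting part is not the interface itself; it’s that concrete implementations of it keep breaking underneath you, as the next paragraph explains.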

The thing about such a remote control API is that the implementation is completely different between the operating systems (iOS & Android), and in many cases even between different versions of the same OS. To make it even more challenging, the amount of time you typically have between knowing that the next OS version is going to be incompatible with the current one and the actual release might be very short. Especially with iOS. Although Apple has at least 5 beta releases plus a “golden master” release before the actual major OS release, there have been cases where breaking changes were introduced only a few days before the GA release. This is simply how it works, and there is nothing you can do to change it.

“Keep calm, nothing changes after golden master” (Photo by Gerald Schömbs on Unsplash)

To be able to overcome this inherent domain complexity, and still have a fully working implementation of the device control API hours after you encountered a breaking change on the OS side, you need a few things: a mechanism that allows you to very quickly test what’s working and what’s not with a new OS version (across all device models), an architecture that allows you to quickly plug in code that contains a solution for things that work differently in the new OS version, and, most important in our context, a dedicated team that is always ready to investigate and fix these gaps. This is not a feature team. Members of this team are domain experts in mobile operating systems, domain experts in reverse engineering, with the mentality of a combat search and rescue unit.
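The “quickly plug in code for the new OS version” part can be sketched as a registry that dispatches each interaction to the newest compatible implementation. This is a simplified strategy-pattern illustration of my own, not Perfecto’s actual architecture:

```python
# Illustration: per-OS-version pluggable implementations. When OS N+1
# breaks an interaction, you register a new handler for (interaction,
# min_version) without touching the older, still-working ones.

_handlers = {}  # (interaction, os) -> list of (min_version, func)

def implementation(interaction, os, min_version):
    """Decorator that registers a handler for an interaction on one OS."""
    def register(func):
        _handlers.setdefault((interaction, os), []).append((min_version, func))
        _handlers[(interaction, os)].sort(key=lambda t: -t[0])  # newest first
        return func
    return register

def dispatch(interaction, os, version, *args):
    """Run the newest handler whose min_version covers this OS version."""
    for min_version, func in _handlers[(interaction, os)]:
        if version >= min_version:
            return func(*args)
    raise RuntimeError(f"no {interaction} handler for {os} {version}")

@implementation("force_touch", "ios", min_version=9)
def force_touch_legacy(x, y):
    return f"legacy force touch at ({x},{y})"

@implementation("force_touch", "ios", min_version=13)
def force_touch_new(x, y):
    return f"post-13 force touch at ({x},{y})"

assert "post-13" in dispatch("force_touch", "ios", 14, 1, 2)
assert "legacy" in dispatch("force_touch", "ios", 12, 1, 2)
```

The point of the pattern is isolation: the fix for the new OS lands as one new entry in the registry, which is what lets a dedicated team ship a working control layer hours after a breaking GA release.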

Example #3: WalkMe

WalkMe’s Digital Adoption Platform (DAP) improves the interaction between people and applications. The way I love to think about it is that we are adding a layer of human touch to machine interfaces.

Many modern applications are like buffet restaurants. Lots of options, features, capabilities and concepts, all wrapped in windows, menus, tabs, forms and widgets. When stepping in, a client knows why he chose this specific buffet, and he even knows that he came in order to have their famous cream-cheese & salmon bagel. But now that he is standing in front of the long buffet tables he has no idea where to find the bagels. In many cases, even when he somehow manages to find the bagels table, he faces another life-changing dilemma: should I grab a bagel and then look for the cheese and for the salmon, or maybe there is another table where they serve the whole thing already assembled?

How nice would it be if at the buffet entrance the client could simply say “I am here for the cream-cheese & salmon bagel” and get contextual guidance and step-by-step instructions for where to go and what to grab at each table. Furthermore, how incredible would it be if the system (the buffet) could recognize the client walking in, ask him “the usual?”, and then automatically run through all the tables, collect the ingredients, prepare the dish and serve it to him.

“Thx dad. Finally I understand what you are doing there all day” (Photo by Douglas Bagg on Unsplash)

In a nutshell, that’s what WalkMe does. We provide the buffet owners a clear picture of the experience their clients are getting from the second they enter the buffet until the second they leave. We process these user sessions and translate billions of data points into insights representing different types of users. We then provide the app owner a way to quickly close gaps and reduce friction in the interaction between users and the application.

The way we do this is by adding a virtual layer on top of the UI of the application. On top of any application. Both the one you implemented yourself and are serving to your customers (say in a SaaS model), as well as the marvellous information system that you purchased and rolled out inside your company for employees to leverage and love (Salesforce, Workday, NetSuite, and the like). The virtual UI layer that WalkMe introduces is where you add the “human touch” I was referring to earlier. Everything from usage hints presented as tooltips on the relevant UI controls of the app, to data validations presented as human-friendly error indicators with an explanation of what info should be entered, to step-by-step visual flows that walk the user through the app until he completes the task he needs to complete; you may even go all the way and implement fully automated flows that can be triggered from a WalkMe menu or from our chat bot.

From a technical perspective, you can probably imagine how the platform I am describing can be mapped to a set of products and features. The experience tracking and analysis product (called WalkMe Insights) is one example. Sharing the visual step-by-step walk-throughs you created with other people is another (we call it WalkMe Share). There are many more applications that are a perfect fit for the feature team model.

“Each team has a separate release train” (Photo by Rafik Wahba on Unsplash)

But there is also one thing that behaves a bit differently. In order for all the great stuff that our customers are adding in the virtual UI layer to work, there must be a way to link it to the normal app layer underneath. You can’t put a usage hint on an element without first detecting the element. You can’t validate a value entered by a user into an input field without first detecting the input field and extracting its value. And you can’t trigger a step-by-step walk-through that will show the user how to complete a certain task without being able to detect and automate the interaction with the UI elements that are part of that flow.

Taking it one level up, this problem domain can be described as “understanding an application’s graphical user interface like a human”. To some degree, it’s a superset of the problem you probably faced when you tried to implement automated UI tests for your app using tools like Selenium, UFT or Ranorex, leveraging element detection methods like jQuery, XPath and CSS selectors. If you ever tried, you know how hard it is to get it to work, and especially how hard it is to make it robust enough to still work when the app changes. In our case it’s even harder than in the test automation case. Not only do we need a robust way to detect UI elements in the app, we also need a way to track specific events happening on these UI elements. Otherwise we can’t track the user journey for analytics purposes, and we can’t offer capabilities like the input validation and contextual hints mentioned earlier. Finally, all this magic must be done in real time and in a computationally efficient way, without impacting the performance and memory footprint of the application.
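One common way to make element detection more robust than a single brittle XPath is to match on several signals at once and score the candidates. The sketch below illustrates that general idea only; the signal names, weights and threshold are made up, and this is not WalkMe’s actual algorithm:

```python
# Illustration: score DOM-like elements (modeled as dicts) against a
# recorded multi-signal fingerprint, instead of one exact selector.

def score(element, fingerprint, weights=None):
    """Similarity between an element and a recorded fingerprint."""
    weights = weights or {"id": 3.0, "tag": 1.0, "text": 2.0, "css_class": 1.0}
    total = 0.0
    for signal, weight in weights.items():
        if fingerprint.get(signal) and element.get(signal) == fingerprint[signal]:
            total += weight
    return total

def detect(elements, fingerprint, threshold=3.0):
    """Best-scoring element, or None if nothing is close enough."""
    best = max(elements, key=lambda e: score(e, fingerprint), default=None)
    if best is None or score(best, fingerprint) < threshold:
        return None
    return best

recorded = {"id": "submit-btn", "tag": "button", "text": "Submit"}
page = [
    {"id": "cancel-btn", "tag": "button", "text": "Cancel"},
    # The id changed after a redeploy, but tag and text still match:
    {"id": "btn-4f2a", "tag": "button", "text": "Submit"},
]
assert detect(page, recorded)["id"] == "btn-4f2a"
```

An exact `id` selector would have failed the moment the app regenerated its ids; scoring over multiple signals survives that kind of change, at the cost of having to tune weights and thresholds per application, which hints at why the real problem calls for continuous learning.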

//div[@id='will probably change tomorrow']/span[text()='will not work in any other language'] (Photo by The New York Public Library on Unsplash)

This may sound a bit like rocket science. Unlike some other “understand XXX like a human” problem domains (for example NLU or image analysis), when it comes to GUI understanding there is not much out there in terms of known solutions. Neither in the open source community nor commercially. So we had to build our own. It started with the acquisition of a company called DeepUI, with strong technology and the best experts in this domain, followed by a fairly long research project, and finally the implementation and rollout of a very unique system that is continuously learning the nature of the applications it runs on.

Now, although we call ourselves R&D, we very rarely do the R, and when we do it, we usually fail. I mean it’s usually not that the research itself fails, it’s more like we are getting stuck somewhere between the successful POC and the promised land of production. I have a plan to write a dedicated blog post titled “From quick inception to slow death: The reality of research projects”. But that dedicated post won’t include DeepUI, because somehow our research project didn’t die. On the contrary. It is alive and kicking in production, understanding GUIs of systems that even humans struggle with.

Like most other ML-based systems, the work isn’t done when the system goes live. Now you need to tune the models, add features, optimize performance, and cover cases that you probably didn’t know existed before going live. That’s a huge challenge, in many aspects harder than the research phase and the initial implementation of the system.

In the context of this post, such a challenge isn’t a natural fit for a feature team. Not only does it require strong domain expertise, it also requires a slightly different attitude, a bit less lean and a bit more controlled.

From an org structure perspective, this is where you probably want to form a component team exposing a well-defined API. Then the rest of the org can and should be structured as feature teams, each independently building its own features while leveraging the API provided by the one component team.

“Everything looks like a nail” (Photo by DevVrat Jadon on Unsplash)

Conclusion

Not all problems are born equal, hence not all solutions should look alike. Feature teams are probably the right structure for 90% of your org, but there are product areas and problem domains where it’s worth considering other structures.

It can be an area in your product where the risk of faults is super high. It can be an area in your platform that requires a different development life cycle. It can be a core product capability that recently came out of a research project and is still evolving to its desired shape.

We are always pitching to our teams to choose the right tool for the job. Remember that this also applies to decisions we are making as leaders.


Ofer Karp

Building software that people use and love. Father. Runner. EVP Engineering at WalkMe