Architect Your Org Before You Architect Your System

A lot has been written over the years about Conway’s law, which Melvin Conway introduced in 1968. He posited that software reflects the structure of the organization that produced it. More specifically, a software system mirrors the communication structures of the organization that built it. This matters for systems architecture because the boundaries between teams tend to become the boundaries between systems.

For a long time, our industry architected systems based on a very large requirements document from a customer. The entire architecture would be determined up front with lots of pretty pictures and detailed descriptions. The architecture often took into account the different divisions or teams needed for the project. The work would then be divided up for each of the teams to complete. These teams often worked in communication silos, where every message had to go up to a certain person or system and then back down on the other side.

The most obvious result of this type of organizational structure in software architecture is the Enterprise Service Bus (ESB). In this architecture, all communication goes through a central system which translates the message if needed and sends it back out to the other side. We quickly found that an ESB became a bottleneck as the number of transactions and applications increased. The same observation could have been made of the communication structure of the organization. Project managers and architects often become a bottleneck when all information must be funneled through them before making it to the other teams.
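The bottleneck can be made concrete with some rough arithmetic. The sketch below assumes, purely for illustration, that every application sends exactly one message to every other application; the function names are hypothetical and not drawn from any ESB product.

```python
def hub_load(apps: int) -> int:
    """Messages a central bus must handle when every app messages every other app once."""
    # Each of the `apps` senders has (apps - 1) recipients, and every
    # message passes through the hub.
    return apps * (apps - 1)


def per_app_load(apps: int) -> int:
    """Messages one app handles (sent plus received) when apps talk directly."""
    # An app sends (apps - 1) messages and receives (apps - 1) messages.
    return 2 * (apps - 1)
```

Under that assumption, 10 applications put 90 messages through the hub while each application handles only 18 of its own. The hub's share grows roughly with the square of the number of applications, and the same holds for a project manager or architect who relays every conversation between teams.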

Why didn’t we realize what we were doing? Conway talked about this in 1968! Probably for the same reason we went along with Waterfall for so many years: no one took responsibility for making better decisions. The culture also became saturated with these mindsets quite quickly, which still causes challenges today.

Now that we can see the errors of our past selves, why do we continue to architect systems without architecting our organizations first? Many of the companies built up around technology are already architecting their organizations this way. You can see it in Amazon, Netflix, and Google. Amazon was a small online bookstore. Netflix was a mail-order video store. And Google was a search engine. None of them started out as giants, so any company can do this.

Amazon actually started with a monolith, which makes sense given that their organization looked more like a monolith. They weren’t a behemoth at the time, so their organization had a lot of close, tightly coupled communication patterns. However, they were able to transform their software architecture. They did this by architecting their organization.

Their organization was architected under the concept of the two-pizza team. These are small, persistent teams organized around domains. To communicate, these teams use tools like Slack (a standard protocol using a standard language), shared dashboards (based on shared metrics), and shared tooling. Their applications communicate through standard APIs (a standard protocol using a standard language, like REST) and have aggregated metrics using the same set of tools.

There are many variations of this pattern, but what they have in common is that the teams are small and domain-oriented. Dividing teams by domain allows the right level of abstraction. I believe abstraction is the key to success in our field. If a domain is split across teams, then the abstraction becomes disjointed. This breeds confusion and disjointed communication.

Let’s take a real-world example: There are three teams that work on Unix systems within a company. Their descriptions may make sense internally, but those same descriptions are ambiguous externally. There is a ticketing system which fronts these teams to help customers interface with them. However, the ticketing system is also confusing to the customer. It was largely designed and implemented by a completely separate team without customer input. After the customer enters a ticket, they call a member of the team who should be handling the request. The customer is informed that it’s actually one of the other teams that performs that particular task. So, the customer then calls a member of another team and finds out the original ticket was entered incorrectly. Eventually, the problem is solved, and the customer learns the intimate details of the inner workings of the Unix teams. This is a very cumbersome and inefficient system.

A much more efficient and common model is to have one Unix team with a ticketing system created from the customer’s perspective. The ticketing system is really just a stand-in for an interface of any kind. This single Unix team can be divided, but it should be divided with the proper abstractions and fault lines. This should be done from the perspective of the customer, as well. Not every company will arrive at the same solution here, and that’s OK.
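In software terms, the single Unix team behaves like a facade: the customer talks to one interface, and routing to a subteam happens behind it. Here is a minimal sketch of that idea. The request categories, subteam names, and the default route are all hypothetical, chosen only to illustrate the shape.

```python
from dataclasses import dataclass, field

# Hypothetical categories, worded the way a customer would describe a problem,
# mapped to hypothetical internal subteams.
ROUTES = {
    "new server": "provisioning",
    "login problem": "access",
    "disk full": "operations",
}
DEFAULT_SUBTEAM = "operations"  # assumption: unknown requests get triaged here


@dataclass
class UnixTeam:
    """Single customer-facing entry point; subteams are an internal detail."""
    queues: dict = field(
        default_factory=lambda: {name: [] for name in set(ROUTES.values())}
    )

    def submit(self, category: str, details: str) -> str:
        # Routing lives here, so the customer never needs to know
        # which subteam owns the work.
        subteam = ROUTES.get(category, DEFAULT_SUBTEAM)
        self.queues[subteam].append(details)
        return f"accepted: {category}"
```

The return value deliberately says nothing about which subteam picked up the request. If the team later redraws its internal fault lines, only `ROUTES` changes and the customer's experience stays the same.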

The key part is that each subdivided team has responsibility and authority over the long-term strategy, implementation, and operations of their domain. Splitting these functions can be dangerous. Abstracting strategy from operations removes the critical feedback loop that operating a strategy can provide for developing and modifying that strategy. Conversely, understanding the overall strategy can influence a team’s operational model. The same is true of each relationship among strategy, implementation, and operations.

So, how does this change someone’s job? What do systems engineers do today for Unix systems? They architect the infrastructure and processes so that the security, compliance, and application teams have the tools they need to do their jobs. They fulfill ad hoc requests for new projects. They provision machines and apply base configurations through manual interactions and scripts. They help end users troubleshoot issues as they arise. They automate the detection of system-level issues and attempt to stop them before the end user is affected. They plan the long-term strategy for their domain. And they perform other routine tasks related to systems administration.

The real change is that instead of having multiple teams that each do these tasks for all Unix systems, a company might have a single team in charge of all of them, with each person on the team sharing the same set of responsibilities. Or, in larger organizations, the team may be divided according to platform: Solaris, RHEL, and AIX. This is a sensible abstraction. These teams may need to work together, but their customers will largely be different, and their configurations can be componentized and included as needed for each platform. It also requires that they report to a common manager to ensure continuity across those subdomains.

Another aspect that will change is that more configuration will be kept in code. Instead of SSHing into a host, a systems engineer would update the configuration in a version control system. That change would trigger builds to test it and would ultimately deploy the new configuration to production after it passes through the development and test environments. Systems engineers will also be expected to write services for their domains. These services should become the primary interface between the domain and its customers. No one goes to AWS and enters a ticket to create a VM or get storage. They interact with a service maintained by Amazon engineers who handle that domain space.
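The promotion flow described above can be sketched in a few lines. This is an illustration under stated assumptions, not a real deployment tool: `promote`, `validate`, `verify`, and the `ntp_servers` key are hypothetical names, and the environment order is taken from the paragraph above.

```python
from typing import Callable, List

# Assumed environment order: development, then test, then production.
ENVIRONMENTS = ["development", "test", "production"]


def promote(
    config: dict,
    validate: Callable[[dict], bool],
    verify: Callable[[str, dict], bool],
) -> List[str]:
    """Return the environments a config change reached, in order."""
    # Build step: reject a bad change before it reaches any environment.
    if not validate(config):
        return []
    reached = []
    for env in ENVIRONMENTS:
        reached.append(env)  # stand-in for the actual deployment
        # Stop promoting as soon as one environment fails verification.
        if not verify(env, config):
            break
    return reached


def has_ntp_servers(config: dict) -> bool:
    """Example validation rule (hypothetical): a time source must be configured."""
    return bool(config.get("ntp_servers"))
```

A change that fails verification in the test environment never reaches production, which is the safety property that a manual edit on a live host cannot offer.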

These same concepts can be used on the software side. Software engineers have long dealt in abstractions to ensure their applications can remain decoupled yet cohesive. This is something systems engineers will need to learn moving forward. If we use the most sensible abstractions when creating our organizational structures, then we’ll be able to create abstractions our customers can understand and integrate with. The end goal is a better experience for our customers, so let’s not forget that.