Software architecture needs to be documented. There are plenty of fancy templates, notations, and tools for this, but I’ve come to prefer PowerPoint with no backing template. I’m talking good old white-background slides. These are way easier to create than actual text documents. There are no messy worries over complete sentences. Freedom from grammatical tyranny! For a technical audience, concision and a lack of boilerplate are good things. A nice mix of text, tables, and diagrams gets the point across just fine. As a plus, this is naturally presentable: you don’t need a separate deck to describe your architecture when the deck is the reference document to begin with. As the architecture evolves, the slides evolve.
I’ve done a couple of architecture-level projects in the last two years, one of which was built and delivered as the SaaS platform I’m discussing here. The first thing I captured was a set of three- to five-year business goals, which were covered in an earlier post. The next section in the deck covers the architectural principles.
Architectures exist at a higher level of abstraction than designs. The whole point is that architecture establishes some important constraints but otherwise leaves plenty of room for expansion and choice. No engineer wants to work on a system where their only job is to translate documents to code with no freedom of expression. If we’re successful, then this architecture-level material will be interpreted to produce loads of detailed designs and implementations. In this context, it helps to establish some principles up front. These get the project off on the right foot. Down the road they might serve as tie-breakers for trade-off decisions, give newcomers context, or simply provide inspiration. Principles can also be useful in establishing culture.
Here’s a rundown of the architectural principles that we set out at the start of our SaaS effort.
Use SOA Patterns
We come from a three-tiered background. They aren’t the typical three tiers of a web app, but we have three tiers nonetheless. Roughly, they correspond to “presentation”, “business logic”, and “embedded system”. This legacy architecture has served us well, but there are shades of monolith, especially in that middle tier, that have become a problem over the years.
So our first principle declares right off the bat that our SaaS will be composed of self-contained services accessed via standardized, versioned REST APIs. There is some cost and complexity inherent in SOA but we judged it to be worth the benefits. Also, the architecture stops short of calling for a microservices level of granularity. When decomposing requirements into services we looked for a balance between the single responsibility principle and cohesion. I did not formally use Eric Evans’ Domain-Driven Design, but his definitions of domains and bounded contexts were influential.
In our architecture, any individual service must be authoritative over its domain. To the extent that services need to interact with each other then they use the same REST API that end-user clients use. This is an abstraction that can leak — we are willing to invest in more performant APIs as necessary, but the REST API is the least common denominator that is always available. It is an iron rule that services cannot go “behind each other’s back” by reaching into foreign datastores. We pay a slight penalty for this bit of architectural purity but we also retain the ability to reason about services’ internal state.
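The inter-service rule above can be sketched in a few lines. This is a hypothetical client, not our actual code: the service name, URL, and fields are invented for illustration, and the transport is injectable so the sketch runs without a network. The point it demonstrates is that a peer service is reached only through its versioned REST API, never through its datastore.

```python
import json

class OrderServiceClient:
    """Hypothetical client for a peer service's versioned REST API.
    Services interact only through public APIs, never by reaching
    into a foreign datastore."""

    def __init__(self, base_url, transport):
        # transport is any callable(url) -> response body (str);
        # in production this would be an HTTP library, here it is
        # injectable so the sketch runs without a network.
        self.base_url = base_url
        self.transport = transport

    def get_order(self, order_id):
        # The version lives in the URL so clients can pin to a contract.
        url = f"{self.base_url}/api/v1/orders/{order_id}"
        return json.loads(self.transport(url))

# A stand-in transport simulating the peer service's response.
def fake_transport(url):
    return json.dumps({"id": url.rsplit("/", 1)[-1], "status": "shipped"})

client = OrderServiceClient("https://orders.internal", fake_transport)
print(client.get_order("42")["status"])  # prints "shipped"
```

Because the API is the only doorway, every dependency a service has on another shows up as an explicit client like this, which is what makes the fault-coupling analysis in the next paragraph tractable.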
Fault domains are isolated; a failure of one service does not automatically affect others. Again, there is some leakage here. Services may stand alone on paper but once they act as clients of others then we reintroduce a degree of fault coupling. Fortunately inter-service dependencies are explicit and obvious so we can reason about those as well. This is an improvement over the monolith where dependencies tended to be hidden in a mountain of code and complicated internal call chains.
Finally, the service-oriented approach allows for independent horizontal and vertical scalability. We’ve appreciated being able to scale in both dimensions. Horizontal scalability gives us desirable high-availability (HA) properties. Vertical scalability lets us keep our overall number of service instances manageable. In particular, this allows for deferring some complexities of SOA — like service discovery — until we’ve scaled the business substantially.
Create Services That Can Be Easily Run in Production
Although I had not heard of Twelve-Factor applications when I first drafted our SaaS architecture, I still managed to codify many of their operations-related items. We also happen to follow many other Twelve-Factor recommendations. I think it is good stuff and not a bad place to start for any new project.
This principle requires that service configuration is externalized for increased visibility, understanding, and control. We revision-control config files in both test and production deploys. External API dependencies are explicit in these configs. As mentioned above, this makes it easier to understand inter-service relationships.
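As a minimal sketch of externalized configuration (file contents, key names, and the SVC_ prefix are all invented for illustration, not our actual conventions): the revision-controlled file carries the defaults, external API dependencies appear as explicit entries, and environment variables can override individual settings without editing the file.

```python
import json, os

# A hypothetical service config; in practice this would live in a
# revision-controlled file per environment (test, production).
CONFIG_FILE_CONTENTS = """
{
  "listen_port": 8080,
  "billing_api_url": "https://billing.internal/api/v1",
  "log_level": "info"
}
"""

def load_config(raw, env=os.environ):
    cfg = json.loads(raw)
    # Environment variables override the file, so deploy tooling can
    # adjust a setting without editing revision-controlled config.
    for key in cfg:
        override = env.get("SVC_" + key.upper())
        if override is not None:
            cfg[key] = type(cfg[key])(override)
    return cfg

cfg = load_config(CONFIG_FILE_CONTENTS, env={"SVC_LOG_LEVEL": "debug"})
print(cfg["log_level"])          # prints "debug"
print(cfg["billing_api_url"])    # the external dependency is explicit
```

A reader scanning the config file can enumerate the service's external dependencies at a glance, which is exactly the visibility this principle is after.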
All services write structured log messages, easing aggregation, filtering, and analysis. Bare printf’s to stdout or stderr are forbidden. There are many choices of format here and our implementations usually log in JSON for simplicity and maximum compatibility.
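A structured log line in this style might look like the following sketch (the field names are illustrative, not our actual schema). Each record is a single line of JSON, so aggregation tools can filter on fields rather than grepping free text.

```python
import json, sys, time

def log(level, message, **fields):
    """Emit one structured log record per line (JSON Lines style),
    rather than a bare printf to stdout."""
    record = {"ts": time.time(), "level": level, "msg": message}
    record.update(fields)  # arbitrary key/value context for filtering
    line = json.dumps(record, sort_keys=True)
    sys.stdout.write(line + "\n")
    return line

entry = log("info", "order accepted", service="checkout", order_id="42")
```

Downstream, a query like "all error-level records from the checkout service" becomes a field match instead of a regex over prose.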
Because they must be horizontally scalable, all services are stateless. Anything that must persist lives in an external datastore. This facilitates deployment behind a load balancer without the use of session cookies. Services are also written in a crash-only style so that when they do crash (which is rare but it does happen) they may simply be restarted without manual intervention.
Process- and service-level metrics are exposed via an internally-accessible API endpoint. We initially also surfaced business-level metrics via this same facility but have recently decided to switch them to an event-publisher approach. From an architecture point of view, the important point isn’t the metrics-vs-events debate; it is that the services are instrumented. A full array of metrics and events is available at runtime and we do not rely on logging for this information. Logs are for point-wise debugging only.
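The instrumentation idea can be sketched as a tiny in-process counter registry (the class, metric names, and endpoint path are hypothetical, not our actual implementation). In a real service, the snapshot would back an internal-only HTTP endpoint that a scraper polls at runtime.

```python
import json
from collections import Counter

class Metrics:
    """Minimal in-process metrics registry. In a real service the
    snapshot would back an internal-only HTTP endpoint (e.g. a
    hypothetical /internal/metrics path)."""

    def __init__(self):
        self.counters = Counter()

    def incr(self, name, amount=1):
        # Services call this at the point where the event occurs.
        self.counters[name] += amount

    def snapshot(self):
        # What the internal endpoint would serve to a scraper.
        return json.dumps(dict(self.counters), sort_keys=True)

metrics = Metrics()
metrics.incr("http_requests_total")
metrics.incr("http_requests_total")
metrics.incr("db_errors_total")
print(metrics.snapshot())  # {"db_errors_total": 1, "http_requests_total": 2}
```

The counters live in memory and are queried on demand, which is why none of this information needs to flow through the logs.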
Avoid Single Points of Failure (SPOF)
In a SaaS architecture, this principle almost goes without saying. Service, component, and infrastructure failures are all expected. All services have an HA story and this must incorporate the HA properties (or lack thereof) of any dependencies. The biggest complications arise with the use of external datastores.
This principle was easy to write but is all but impossible to ensure in practice:
You know you have a distributed system when the crash of a computer you’ve never heard of stops you from getting any work done. — Leslie Lamport
So, yeah, this topic is a never-ending source of lessons. In production operations we see failure modes that we’d never considered. But this is still a worthy aspirational goal. We’d like to avoid obvious failures via application of some straightforward design rules.
Bring Features and Capabilities to Market Incrementally
We’re delivering new products to new markets via new channels. There is clearly much that can go wrong. From a software delivery point of view we need to optimize around rapidly delivering small changes instead of large releases with “all-or-nothing” features. This increases responsiveness and minimizes our cost of mistakes.
We control feature visibility and behavior using feature flags. Fundamentally, flags allow us to decouple deployments and releases by turning the choice to expose a feature from a static compile-time or deployment-time decision into a dynamic variable. Targeted roll-outs under the control of a flag allow us to show new features to specific customers (or even specific users) while they remain hidden from others. Flag settings can be controlled from a dashboard without recompiling or redeploying.
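A toy version of the flag mechanics described above might look like this (the store, flag name, and customer identifiers are all invented for illustration). Real systems add percentage rollouts and a dashboard, but the decoupling idea is the same: the release decision becomes runtime data rather than a compile-time or deploy-time choice.

```python
class FlagStore:
    """Hypothetical feature-flag store: a flag can be globally
    on/off or enabled for an explicit set of customers."""

    def __init__(self):
        self.flags = {}

    def set_flag(self, name, enabled=False, customers=()):
        # In production these settings would come from a dashboard,
        # not code, so no recompile or redeploy is needed to change them.
        self.flags[name] = {"enabled": enabled, "customers": set(customers)}

    def is_enabled(self, name, customer=None):
        flag = self.flags.get(name)
        if flag is None:
            return False  # unknown flags default to off
        return flag["enabled"] or customer in flag["customers"]

flags = FlagStore()
# Deployed everywhere, but only visible to one pilot customer.
flags.set_flag("new_checkout", enabled=False, customers={"acme"})
print(flags.is_enabled("new_checkout", customer="acme"))    # True
print(flags.is_enabled("new_checkout", customer="globex"))  # False
```

Flipping `enabled` to True is the "release"; the deploy happened whenever the code shipped.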
Regardless, features aren’t “done” until code is deployed in production and widely available.
Master is Built and Deployed After Every Commit
All commits to our master branch are built, unit-tested, acceptance-tested, and then deployed to a test environment. This happens via a completely automated pipeline. A failure at any stage stops the pipeline, though failures are rare because prospective merges (i.e., GitHub pull requests) run through the same build-and-test cycle beforehand. This is a lightweight form of continuous deployment. I say “lightweight” because every build that comes from the pipeline is deployed to test instead of production. Deploys to our production environment are still manually triggered but can happen at any time; this is continuous delivery. Our rule of thumb is that any build running in test should be capable of being deployed to production without further manual intervention. If this isn’t true, then it is a problem that must be fixed immediately.
Even though we manually trigger production deploys, we use the same automated tooling as in the test environment. We also use the same build artifacts without rebuilding. We’ve done thousands of deploys this way — all automated — and we have seen our tools break in some interesting ways. We just fix issues as we go, further increasing reliability and confidence.
This may seem like a strange principle to codify in an architecture. It’s… a process-oriented thing. Remember when I said that architectural principles can be useful in establishing culture? This is one of those principles. An architecture doesn’t just have to cover development concerns; it can also cover operations- and process-oriented concerns.
Also, based on this principle, 100% of our developers who work on our SaaS write their own unit tests and acceptance tests. It is just “the way we do things”.
If You Build It, You Live in It and Help Run It
This is another culturally-motivated principle. Because my company is not coming from a 24x7 web operations background we don’t have a strong operations capability. Happily, we also don’t have a big “Development vs Operations” cultural divide to overcome either.
By necessity, our engineers contribute to infrastructure and tools development. Just like deployment tooling and test authoring, it is nobody’s dedicated job because it is everybody’s job.
Developers are frequently involved in deployment and operations, including watching their code’s behavior in production. This helps us “run lean” in terms of hiring dedicated operations staff. It also helps reinforce some of the points above. The value of low coupling, statelessness, config file controls, logging, metrics, no SPOF, and high test coverage is really brought home when there is a thin line between you and the end user. It is also interesting that more than one developer has found a latent bug in their code just by watching operational dashboards.
Stick to Value-Added Development
Finally, we make heavy use of free open-source software (FOSS). Entire categories of problems like streaming and cluster scheduling are solved for us. Our challenge is to learn how to best integrate, use, and operate these systems.
Although the architecture has evolved as we’ve gone to market and tested some assumptions, these principles have not changed.
I’ll unpack some of the detail behind these principles in coming posts.