Microservices: Building Things is Easy; Building the Right Thing is Hard

Joy Ebertz
Box Tech Blog
Mar 31, 2018

Historically at Box, a large chunk of our main web application codebase has lived in a single monolith. There are a lot of pros and cons to this, and I would argue that when we started as a small startup, this made the most sense. However, over the years, that monolith has grown steadily and become entangled. Like many other companies with similar monoliths, we got to the point where adding more engineers to this codebase would not allow us to move any faster on the features we wanted to build. The chance of a bug blocking the workflow of other engineers goes up with each person, and the size and complexity built up over years further exacerbate this problem. We had reached a productivity plateau. We decided that the right choice was to split up our monolith into microservices.

As we move our code to microservices, we have a unique opportunity to positively affect our architecture for years to come. We all want to create a separation of concerns and create clear service boundaries, but what does that mean in practice? We know that we want to create APIs that control the entry points into a service, but what do those look like and how do we think about those? How do we make sure that after a lot of work, we don’t turn around and realize that we’ve just created a distributed monolith? It’s easy to build a microservice, it’s hard to build the right microservice; it’s easy to build an API, it’s hard to build the right API.

High Level Service Design

As we started down this path to building microservices, we very quickly identified that we wanted to spend time early on creating a high-level architectural vision and architectural guiding principles. We wanted to make sure that as we built out new microservices, we didn’t end up with a Frankenstein architecture of mismatched sizes and no clear boundaries, or with a distributed monolith.

My team was tasked with creating a blueprint for the long-term vision of Box’s high-level architecture. We thought about questions like:

  • What pieces of functionality are related?
  • Which ones are unrelated?
  • How closely do two different things live to each other?
  • How small should a domain be?

We used Domain Driven Design to examine the Box problem space and divide the entire space into smaller subdomains to consider. We laid out what high level functionality (both existing and from the roadmap) should belong in each subdomain and worked with subject matter experts from both product and engineering to align our breakdowns.

In addition to domains, we also spent a lot of time thinking about how we wanted to approach horizontal splits. Splitting by domain allows us to divide our problem space into vertical chunks, but ideally we also want to have common thoughts and patterns around how we split the space horizontally. For example, we don’t want to mix presentation logic for one particular client into our core domain logic. Likewise, a field that only one client cares about should also be separated from everything else. At a high level, we should have a clear separation between our application services and our domain services. An application service should contain anything that is specific to a particular client or application (or possibly even several clients) while domain services should provide all of the more common business logic and objects. Within the domain services, we further break it down to think about two types of capabilities they can provide — business capabilities and foundational capabilities.

  • A foundational capability is the most basic building block, like a file object. It includes any business logic that is tightly tied to the object and provides a separation from the data model such that the data model could be changed if needed. This capability owns all of its own data.
  • A business capability meanwhile is usually a piece of business logic built on top of one or more foundational capabilities. For example, moving a file from one location to another is a more complex process that may involve multiple foundational capabilities. Business capabilities should still provide common reusable functionality that can be used by multiple clients.
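
To make the distinction concrete, here is a minimal sketch of a business capability built on two foundational ones. Every name in it (`File`, `FileStore`, `FolderStore`, `move_file`) is hypothetical and chosen for illustration; this is not Box's actual code, and the in-memory dictionaries stand in for whatever storage a real service would own.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class File:
    id: str
    name: str
    parent_folder_id: str


class FileStore:
    """Foundational capability (hypothetical): owns file data and hides how it is stored."""

    def __init__(self) -> None:
        self._files = {}  # file id -> File

    def put(self, file: File) -> None:
        self._files[file.id] = file

    def get(self, file_id: str) -> File:
        return self._files[file_id]


class FolderStore:
    """Foundational capability (hypothetical): owns folder data."""

    def __init__(self, folder_ids) -> None:
        self._folder_ids = set(folder_ids)

    def exists(self, folder_id: str) -> bool:
        return folder_id in self._folder_ids


def move_file(files: FileStore, folders: FolderStore, file_id: str, target_folder_id: str) -> File:
    """Business capability (hypothetical): reusable logic composed from both stores."""
    if not folders.exists(target_folder_id):
        raise LookupError(f"no such folder: {target_folder_id}")
    moved = replace(files.get(file_id), parent_folder_id=target_folder_id)
    files.put(moved)
    return moved
```

Neither store knows about the other; only the business capability coordinates them, which is what keeps each foundational capability the sole owner of its data.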

If you want to read more, what we settled on is largely similar to this article by Praful Todkar and Ryan Murray from Thoughtworks.

So this is all great, but how does all of this high-level thinking translate into a service?

Exploring the Content Space

After wrapping up our high-level thinking, we started to dig into a specific area to validate our vision and architectural principles. Specifically, we chose our content (files, folders, etc.) domain because it is the heart and soul of Box. If our proposals didn’t work there, we would need to start over. We put everything, including the data model, on the table when we started thinking about the content space. If we were to start over from scratch and completely redesign our content domain, what would we do? We examined a number of areas that we know are problematic from an implementation perspective as well as areas that are problematic from a user perspective. On top of all of that, we talked with our product managers about our top user requests and what their product vision is for the space. We considered many angles and models. What would it look like to not require that a file lives in a folder? What if we changed our item access permission model to ACLs? What if we allowed groups to own something? Or an enterprise? In the end, we all agreed that while there are things we would change if we were actually starting over, users already expect certain things, so we can’t make any fundamental paradigm shifts. There are still certainly things we can and will do, but we also need to be able to support the existing paradigm in addition to anything else.

Once we settled on starting by moving the existing paradigm, a lot of the other pieces started to fall into place as well. While we’d still want to change many things about our data model, that doesn’t need to be done at the same time as moving our logic into a microservice. If we pull things out correctly, we can create enough abstraction that we can completely change how our data is stored within the microservice later without affecting anything else. This allows us to break up the work into separate, more manageable chunks: getting the items logic into its own service, and getting the logic into an ideal state. This is especially important given how complex and large the space is. On top of all of this, we decided to start with the simplest piece of this puzzle as our very first chunk: looking up the file metadata by id.
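
As an illustration of that abstraction, the sketch below puts a lookup by id behind a small interface so the storage behind it can be swapped without callers noticing. The names (`FileMetadata`, `FileMetadataSource`, `LegacyTableSource`, `RedesignedSource`) and the `file_name` column are assumptions made up for this example, not our real schema or service.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class FileMetadata:
    id: str
    name: str


class FileMetadataSource(Protocol):
    """What callers depend on; says nothing about how data is stored."""

    def find(self, file_id: str) -> Optional[FileMetadata]: ...


class LegacyTableSource:
    """Reads rows shaped like a hypothetical existing table."""

    def __init__(self, rows) -> None:
        self._rows = rows  # file id -> {"file_name": ...}

    def find(self, file_id: str) -> Optional[FileMetadata]:
        row = self._rows.get(file_id)
        if row is None:
            return None
        return FileMetadata(id=file_id, name=row["file_name"])


class RedesignedSource:
    """Reads from a hypothetical future data model."""

    def __init__(self, records) -> None:
        self._records = records  # file id -> FileMetadata

    def find(self, file_id: str) -> Optional[FileMetadata]:
        return self._records.get(file_id)


def get_file_metadata(source: FileMetadataSource, file_id: str) -> FileMetadata:
    """The service entry point: identical behavior for either source."""
    found = source.find(file_id)
    if found is None:
        raise KeyError(file_id)
    return found
```

Moving the logic first means shipping something like `LegacyTableSource`; the data model change can follow later as a second implementation behind the same interface.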

Designing Our New Files API

Once we got to this point, we thought the hard design work was done. We were wrong. We had decided that we were going to allow our monolith and other services to fetch a file by id, but what is that file going to look like? What fields will it have? We knew that our current file object has a lot of extra fields that should probably be owned by different services. We also knew that there are derived values that, while not actually part of the data model, others might consider to be part of a file. In the end we considered 152 fields when deciding what to expose. We could have made this design process much easier by choosing to expose everything currently on the database model or everything currently in our public API. Easy is not always best. In the end, we went with a subset that matches neither of these. By not matching, we were able to make sure that we are exposing a list of fields that are all clearly core to a file and that also allow us the flexibility to change both functionality and the data model in the future.

Our first implementation exposes 11 fields, far fewer than 152, so how did we pick those fields? For each field, I started by placing it into one of four categories:

  • The field was dead (no longer used) or I could easily convince our product manager to get rid of it.
  • The field was clearly related to something that should belong to a different service. For example, there are currently a large number of preview related fields stored on the file that should be owned by and accessed through the preview service.
  • The field was a convenience field. We have a number of fields that are currently exposed so that the webapp or other clients don’t have to make multiple calls. While we will likely eventually want to expose some of these, they should be exposed at a higher level of the stack or as a part of a separate endpoint, not as a part of the domain service’s foundational capabilities. Additionally, some of these convenience fields involve either calling other services or denormalizing information that should be owned by a different service. The former adds lag time into every call even if the caller doesn’t need the field and the latter risks inconsistent data and a number of other problems. Additionally, it goes against the core concept that services should own their own data.
  • The field was a candidate for consideration to be included.

This initial categorization cut the fields down to a much more manageable 26 fields.
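
The triage itself is simple to express. In this sketch the four categories come from the list above, while the field names are invented placeholders rather than real Box fields.

```python
from enum import Enum


class FieldCategory(Enum):
    DEAD = "dead or removable"
    OTHER_SERVICE = "belongs to a different service"
    CONVENIENCE = "convenience field"
    CANDIDATE = "candidate for inclusion"


# Hypothetical field names, used only to illustrate the triage.
CATEGORIZED = {
    "name": FieldCategory.CANDIDATE,
    "description": FieldCategory.CANDIDATE,
    "legacy_flag": FieldCategory.DEAD,
    "preview_status": FieldCategory.OTHER_SERVICE,
    "owner_display_name": FieldCategory.CONVENIENCE,
}


def candidates(categorized):
    """Return the field names that survive the first pass, sorted for stable output."""
    return sorted(
        name for name, category in categorized.items()
        if category is FieldCategory.CANDIDATE
    )
```

Recording a category for every field, rather than just keeping a list of survivors, also leaves a trail of why each excluded field was left out.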

From here, I considered a couple of additional things. First I considered if the fields made sense as a part of the main file object or if they made more sense as a separate entity. For example, I found that we had a number of fields that were basically some event that happened on the file, who did it and when (last edited, restored from trash, etc). While these are all clearly related to the file object, they are also different from the main file metadata such as name and description. I also recognized that we may want to add more fields similar to these in the future. As such, we decided it made more sense to have a separate endpoint that returned file activities rather than including these fields in the main files endpoint.
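
One way to picture the separate activities entity is below. The shape (`FileActivity` with `kind`, `actor_id`, and `occurred_at`) is an assumption for illustration, not the design we shipped.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FileActivity:
    file_id: str
    kind: str  # hypothetical values such as "edited" or "restored_from_trash"
    actor_id: str
    occurred_at: datetime


def activities_for(file_id: str, log):
    """Return one file's activities, newest first."""
    return sorted(
        (activity for activity in log if activity.file_id == file_id),
        key=lambda activity: activity.occurred_at,
        reverse=True,
    )
```

Because each event is a row rather than a trio of fields on the file, a new kind of activity can be added later without changing the shape of the file object.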

Secondly, I considered what the entire suite of APIs might look like as a whole. For example, we plan to have both a file and a file version endpoint. Given that, what fields might make sense on the file endpoint? Which ones on the file version? Both? How might these work together? What would a POST to the endpoint look like? Does that change anything? Interestingly, how these things are exposed also changes available functionality. For example, if we have the description as a property of the file rather than the file version, then we won’t be keeping versions of it and the expected behavior will be that if I revert to a previous version of the file, the description won’t change.
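
The description example can be sketched in a few lines. The classes and the `content_hash` field here are hypothetical; the point is only where `description` lives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileVersion:
    number: int
    content_hash: str  # hypothetical stand-in for the versioned content


@dataclass
class File:
    id: str
    description: str  # lives on the file, so it is not versioned
    versions: list    # list of FileVersion
    current_version: int

    def revert_to(self, number: int) -> None:
        """Point the file at an earlier version; file-level fields are untouched."""
        if all(version.number != number for version in self.versions):
            raise ValueError(f"unknown version: {number}")
        self.current_version = number


doc = File("f1", "first draft", [FileVersion(1, "aaa")], current_version=1)
doc.versions.append(FileVersion(2, "bbb"))
doc.current_version = 2
doc.description = "final copy"
doc.revert_to(1)
# doc.current_version is now 1; doc.description is still "final copy"
```

Had `description` been a field on `FileVersion` instead, the same revert would have brought back "first draft", so the placement of a field is a product decision as much as an API one.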

Once we narrowed down the fields, we were done, right? Wrong. Even once we had decided which fields made sense, we got into a whole new round of considerations. What should we name everything? What are our responses for edge cases? How should various error cases be exposed? How should we handle field-level permissions (i.e. I have permission to see the file object, but not who updated that object)? We got into debates on the side of API design patterns — do we expose more information but make the API harder to consume? Or do we expose less but make it really easy to write against? We also got into debates on how we wanted our service to function in the future and how that would affect the API. For example, we eventually want to have real folder objects for each user’s root folder. While we don’t have that today, we can do something to make it appear that way to outside users by exposing something like ‘user-root-{userId}’ as the folder id for the root folder. However, what would happen then when I request that folder with a `GET /folders/user-root-{userId}`? Does it return a 400? Does it return an object with some hard coded fields? What about if a user tries to PUT to it? What would we want the long term behavior to be? What would the migration path look like from here to there? If we’re unable to deprecate whatever we expose now, will we be okay with that living in conjunction with the long term solution?
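
To show how much an innocent-looking id format forces these decisions, here is a sketch that recognizes the synthetic root id and makes the open question an explicit policy. The user id pattern, the `RootGetPolicy` options, the status codes chosen for each branch, and the response fields are all assumptions for illustration; the post deliberately leaves the real answer open.

```python
import re
from enum import Enum
from typing import Optional

# Assumes user ids are alphanumeric; the real format may differ.
USER_ROOT_ID = re.compile(r"user-root-(?P<user_id>[A-Za-z0-9]+)")


def parse_user_root_id(folder_id: str) -> Optional[str]:
    """Return the user id if folder_id is a synthetic root id, else None."""
    match = USER_ROOT_ID.fullmatch(folder_id)
    return match.group("user_id") if match else None


class RootGetPolicy(Enum):
    REJECT = "reject"          # answer with a 400
    SYNTHESIZE = "synthesize"  # answer with a hard-coded object


def get_folder(folder_id: str, real_folders: dict, policy: RootGetPolicy):
    """Return (status_code, body) for a hypothetical GET /folders/{folder_id}."""
    user_id = parse_user_root_id(folder_id)
    if user_id is None:
        folder = real_folders.get(folder_id)
        return (200, folder) if folder is not None else (404, None)
    if policy is RootGetPolicy.REJECT:
        return (400, None)
    return (200, {"id": folder_id, "owner_id": user_id})
```

Whichever policy is chosen becomes behavior that callers may come to rely on, which is why the migration path to real root folder objects matters before anything is exposed.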

After all of these considerations, debates and work, we now have our first API designed and have started building it. I will be the first to admit that we’re likely far from done. This will be an evolving process and we will almost certainly add more fields in the future. However, I feel confident that we have put ourselves in a good position to make many of the changes we want in the future without tying ourselves to existing problems more than we have to. Building an API or even a simple service often seems like a very easy task, and the actual building is. However, making sure to build the best service or best API is very difficult and requires a lot of careful consideration.
