Microservices, monoliths and laser nail guns: Etsy tech boss on finding the right focus
John Allspaw has a long history on the web from Salon to Friendster to Flickr. For the past five years, he has been at popular e-commerce site Etsy, where he’s now VP of operations and infrastructure. Among software architect types, Etsy has developed a reputation as maintaining a monolithic application architecture in an era of microservices, often leaving Allspaw playing the role of contrarian.
In this edited version of a recent discussion I had with Allspaw, he talks about why the focus on labels is a distraction and how Etsy has been able to evolve its technology to keep pace with its business. The company now serves millions of buyers and sellers, hosts tens of millions of items for sale, and is making hundreds of millions of dollars in revenue.
Call Etsy’s architecture what you will, but it works
SCALE: I’m fairly ignorant about the Etsy architecture, but the one thing I do know about it is that it seems to be the shining example of the last remaining, large-scale, monolithic architecture. Is that an accurate assessment?
JOHN ALLSPAW: I think that is reasonably inaccurate. I think the issue is that the definition of a monolith and the definition of a microservice is pretty underspecified, and I’m willing to bet that if you were to find a hundred people in the field and you all looked at the exact same architecture diagram, you would get different answers as to whether something was monolithic or a service.
There was a talk at the Velocity Conference a couple of weeks ago in Santa Clara where that exact exercise was done.
Either Etsy is a monolith or it is a shining example of what service-oriented architecture looks like and/or microservices, depending on who you ask. Frankly, I find the beginning assumptions around this particular topic to be a massive distraction. I have differing opinions than, say, Adrian Cockcroft on the topic.
Can you elaborate on that point about why it’s a distraction?
There are a couple things. One, it unnecessarily paints architecture into binary camps. I think that anybody who’s at all worked with scale and anybody who’s evolved systems to scale knows that architectures are context-specific.
For example, a good friend of mine runs and has run an electronic trading exchange. You could imagine his goals and constraints when designing an electronic trading exchange are very different than, say, Facebook. Facebook might be very different architecturally because they have different constraints than Amazon. And Amazon might be different than even Etsy.
When you have a conversation that unnecessarily paints the discussion as, “Are you microservices or are you a monolith?” then it wipes away all of the context-specificity, which means you have no real way of talking in specifics.
“There are other languages at play at Etsy, but as far as the guts of the API and the web application, it is largely PHP. There’s a huge advantage in that, a massive advantage.”
The reason why I think it’s a distraction is because — and I see this all the time — when somebody says, “Well, we take a microservices architecture view, and here are the reasons why we do it,” you’ve got to get pretty far into the conversation before you actually see how it’s built. You’ll have them whiteboard it or show you a diagram — not a conceptual diagram, an actual diagram with specifics — then you look back and you say, “That is not actually the diagram of what I would imagine something microservices are.”
How much did you talk about the architecture and how much did you really just repair your shared misunderstanding of definitions? I guess what I’m saying is that it’s too abstract. It’s fine if you want to talk about an approach, but it’s under-specified. The advantages and disadvantages of microservices, and the advantages and disadvantages of a monolithic architecture — even if you can agree on what definitions look like — are so context-specific that I’d almost rather just get out from underneath the abstractions and talk specifics from the get-go.
It sounds like you’d rather focus on the work being done than how it’s being done.
I’m saying that usually this topic is around architectural choices. It’s quite reasonable and fine to put those within the context of organizational decisions, which is great. Usually this is where people will trot out Conway’s Law to justify going in one direction versus another. This is good and fine, just that there’s a lot more to the story than just those two words.
OK, but what is it about Etsy’s architecture that makes people look at it and say, “Oh, this is a monolithic architecture”? And whatever it is, why is it the right architectural choice for Etsy’s business?
One of the things that you’ll see us talk about in a number of places is that we want to exploit the advantages of having a relatively finite number of well-known tools. For example, is PHP the right language for the majority of the web app? Almost certainly not. Can we find places where a different language would be more optimal? Absolutely.
It’s that nuance of knowing that the advantages of being more optimal do not outweigh the advantages of using the same language a lot. There are other languages at play at Etsy, but as far as the guts of the API and the web application, it is largely PHP. There’s a huge advantage in that, a massive advantage.
In the same way, there’s a massive advantage in using a default data store, MySQL. Is it the right data store for storing, say, favorites or listings data? Is it the right data store for storing shop statistics? Almost certainly not. But the advantages of storing it in a shared — we would call it sharded, federated, distributed — data store outweigh that.
If the default is MySQL, then we get to reap all of these benefits. It means that if an engineer goes from one team to another, they’re not going to have to relearn everything.
If you want to integrate the feature you’re working on with a feature that somebody else is working on, then you can inspect, you can reason about other people’s code because it’s a lot of the same programming patterns. The data is stored in roughly the same place.
I think you can make the argument that Google, with its use of BigTable and its own data stores, does the exact same thing. I don’t think that you would say, “I can’t believe how monolithic Google is.”
“If you think of the evolution of a technical infrastructure of a growing web property, then there are these identifiable episodes in the evolution. You’re either making a relational database work in a distributed way, or you’re not.”
Understanding, not avoiding, the cutting edge
You mentioned MySQL. Facebook and other companies get this reputation of being the champions of scalable MySQL, but Etsy is not a small service. Has it been a chore to scale your database layer as the company grows?
There’s a couple of ways to look at it. If you think of the evolution of a technical infrastructure of a growing web property, then there are these identifiable episodes in the evolution. You’re either making a relational database work in a distributed way, or you’re not. To be fair, at a high level, we don’t take a much different approach to federating data across many MySQL servers than does Facebook.
That’s more about the architectural pattern than it is about any of the tech. … For example, instead of having all of Etsy in one database server, usually the next thing people do is, “All right, let’s take our favorites, listings and user profiles. Then we’ll get one database for user profiles, one database for favorites, and one database for listings.”
That functional partitioning also has an expiration date on it. Then you have to take the leap to make it so that we’re going to store the majority of Etsy across many, many machines and make it so we can balance data between them.
None of that is really MySQL-specific. There are some tips and tricks, certainly. You’re going to want backups to work the way you expect them to. You have to do extra work in the application, like finding the database server that has the data you’re looking for.
But, again, that’s really just an architectural pattern. At that point you’re using the database as a reasonably done data store, which is what you want, because you want it to be really good at being stable. You want it to be really good at being reliable. All the other levers you’ll be putting in your application anyway.
It could be any other database really. There’s nothing special about MySQL.
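The application-side lookup he describes, finding the database server that has the data you’re looking for, can be sketched roughly like this. This is an illustrative Python sketch, not Etsy’s code; the host names, the `SHARDS` table and the `lookup_shard` helper are all hypothetical:

```python
# Illustrative sketch only (not Etsy's actual code): the application keeps
# a map from shard number to MySQL host, and computes which shard holds a
# given user's rows before issuing the query.

SHARDS = {
    0: "db-shard-01.example.internal",
    1: "db-shard-02.example.internal",
    2: "db-shard-03.example.internal",
    3: "db-shard-04.example.internal",
}

def lookup_shard(user_id: int) -> str:
    """Return the MySQL host holding this user's data.

    A simple modulo scheme is shown for brevity; a production system
    would more likely use an index table, so that rows can be rebalanced
    between hosts without changing every key's mapping.
    """
    return SHARDS[user_id % len(SHARDS)]

print(lookup_shard(42))  # user 42 maps to shard 2
```

The index-table variant is what makes it possible to “balance data between them,” as described above, since a modulo scheme pins every key to a host until the shard count changes.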
“We want to plan for a world where stuff breaks all the time. And we want to make it so that when things break they matter a lot less.”
You mentioned you’re using this set of well-known tools that can handle a wide variety of stuff. But what are examples of what you might call new or cutting-edge or next-generation stuff that Etsy’s using?
I’ll be even more descriptive about it. I would say that we want to prefer a small number of well-known tools. …
If I find myself trying to unscrew a screw with the end of a hammer, then it’s probably time for me to think, “Well, this effort is not going to be worth it. I’m going to need a screwdriver.” Having said that, it also doesn’t mean that I’ve got 1,000 hammers — one for marble, one for balsa wood, one for plaster.
It’s not like an edict that says, “These are the blessed tools and everything else is forbidden.”
Instead, what we do, process-wise or largely culturally, is identify use cases that are departures from the norm. An engineer says, “Here’s this problem. I don’t think I can solve it with PHP, MySQL and Linux … or Hadoop or Lucene or whatever.
“Here’s what I tried. I tried to use those things, and here’s where they fell down and I don’t think they’re good. I really don’t want to use anything new, at least without any good reason.
“So, everybody, my peers in engineering, does anybody else have any good ideas? I think I’ve landed on this new piece of software. I just want to make sure before I keep going with this that everybody knows that this is a thing that we’re all willing to get good at.”
“I would rather have … carpenters because they’re really passionate about solving hard problems, given the choice between them and those candidates who say, ‘I don’t care what I build. I just need to use the laser nail gun.’”
Redis — and this was a number of years ago — was one of those departures. Elasticsearch has been one of those departures. Sharded Solr is one. About half of our search is in Solr, half of it is in Elasticsearch. There’s some various storage engines that are a part of MySQL that were departures.
The thing is, when you pull something shiny and new off the shelf, there can be operational overhead. If it breaks and you’re the only one who knows how it works, then it probably wasn’t a great technical choice. It can be a really good technical choice if you’re planning for an optimal future. We don’t want to plan for an optimal future.
We want to plan for a world where stuff breaks all the time. And we want to make it so that when things break they matter a lot less, that they’re not critical. That they break and we can fix them and we can adapt and be resilient.
One of the ways that we do that is taking a critical-thinking look at the choices that we make. We don’t want to have choices made by the very well-meaning, well-intentioned, but very enthusiastic engineer who didn’t think everything through. No single engineer is going to think of all the contingencies. That’s why we want to take a much more diverse look.
Then when we say, “Alright, this is the thing. Redis — we’re going to use it. Here’s where we’re going to use it. Here’s where we’re going to get good.” Then we’re actually going to get good at it, which means that we’ve got a lot more confidence.
The one thing I keep hearing is that when it comes to hiring, people like to know they’ll get to work on new things. Does it affect who Etsy can hire if prospects don’t think, “Yes, I’ll be developing in Golang in the next three months!”
Sort of. I’d put it this way: I personally would take the same approach if I were hiring carpenters to build a house. I want the carpenters to be psyched to get on the job because of what the design of the house is and the challenges. We’ve got to build this museum on the edge of the cliff.
I would rather have those carpenters because they’re really passionate about solving hard problems, given the choice between them and those candidates who say, “I don’t care what I build. I just need to use the laser nail gun. I don’t care if it’s an outhouse. I don’t care if it’s a barn.”
Those engineers will have a lot more cognitive space. They’ll also have a lot more focus of attention on solving the problems, not on a particular chit. The song matters more than the guitar.
But there’s nothing that says you’re not going to work with tools that are going to be great for solving particular problems. In fact, we write a lot of our own tools because we can’t find the tools that really fit our use case.
As it turns out, there’s a lot of really hard problems here — incredibly hard engineering problems that actually don’t have anything to do with the tools. They’re just hard problems.
The more well-known one that we’ve been talking about recently is recommendations. We’re not a regular e-commerce site. We’ve got millions and millions and millions of unique things as opposed to a very small number of unique categories.
It’s like one long tail.
It’s all tail, basically. We could say that. Our data science and engineering teams … they don’t want to spend more time messing with their tools than they need to because they want to solve the problems. How do you suggest something for somebody to buy when there’s only one of those things in the world? That sort of thing.
“Like a lot of companies that decide to remain on bare metal, we just have the staff to make those efficiencies. We just exploit being on bare metal very, very well.”
Speed matters, so hardware matters
What does the core infrastructure at Etsy look like, in terms of what the site is built on? I recall reading that you run largely on your own hardware.
At a high level, we have mostly, but not all, on-premise server capacity. We still use the colocation facilities and that sort of thing. Having said that, we are a heavy user of Amazon’s S3 for the storing of a long tail of images because that’s easy. We’ve used various tricks in order to insulate us from issues that Amazon might have or that S3 might have.
Largely, we get our own servers. We do have a very small but quite, quite efficient datacenter team. We do what we can to make sure that deployment and provisioning are as automated as they can be, in all of the safe ways.
We’ve done a number of blog posts around our hardware choices. We’ve done a couple of blog posts around our big data stack.
That’s one thing worth mentioning. What a lot of folks — us included — did when it became a thing was use Elastic MapReduce in AWS for a lot of analytics jobs. But then over time we effectively outgrew it. We were going to be loads more efficient by having it in-house on a new cluster. That’s been solidified for a couple of years now. We’re reasonably happy with that. It works, and it’s quite good.
Like a lot of companies that decide to remain on bare metal, we just have the staff to make those efficiencies. We just exploit being on bare metal very, very well. We’re quite good about making sure that workloads are appropriate.
Remember, we’re not a news website. The use case of spinning instances up and spinning instances down in the cloud is not really a use case that we have — largely because we’ve been around this long. We can reap the benefits of bare metal a lot more than having to move our infrastructure to the cloud.
When you talk about efficiencies, are those in terms of cost, performance or control?
Largely it’s about being faster. We can eke out more. It helps to have the author of PHP working at the company, I’ll give you that. We can handle on a per-node basis a lot more than going with a general purpose infrastructure-as-a-service provider.
Periodically we’ll do the math and try to figure out, from a financial standpoint, whether it would be more efficient to go to the cloud. Again, I guess because our workloads are reasonably predictable, we just do some plain old capacity planning, and it works.
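The “do the math” comparison is simple when workloads are predictable: with no burst capacity to provision for, the steady-state costs can be compared directly. A back-of-the-envelope sketch, with entirely made-up numbers (none of these figures are Etsy’s):

```python
# Hypothetical cost comparison for a predictable, steady-state workload.
# Every number here is invented for illustration.

servers = 200
owned_cost_per_server_month = 250.0   # amortized hardware + colo + power
cloud_cost_per_server_month = 400.0   # comparable on-demand instance

owned_monthly = servers * owned_cost_per_server_month
cloud_monthly = servers * cloud_cost_per_server_month

print(f"owned: ${owned_monthly:,.0f}/mo, cloud: ${cloud_monthly:,.0f}/mo")
# With no need to spin capacity up and down, the comparison really is
# just this multiplication, plus the staff cost of running bare metal.
```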
Why business problems dictate technology choices
Compare Etsy to some of the other places you’ve worked. For example, I’ve noticed that on a regular basis, Etsy updates its stats and talks about its uptime for the year and page-load speeds and all that. Does this have something to do with the user expectations at Etsy?
The answer, I would say, is that it’s different in some really exciting ways.
Let’s take Flickr. There are a couple of differences that change the expectations of the person using the site. In the case of Flickr, you’ve got content producers — people who take photos. Then you have the people who are looking at the photos. That overlap is a lot closer to 1:1 than it might be at Etsy if you were to look through the lens of a buyer and through the lens of a seller.
If you’re a seller on Etsy, uploading photos and listing things for sale involves a whole bunch of activities that are very specific and need to just work. As far as fault tolerance is concerned, it’s better to fail closed than to fail open.
What do I mean by that? If I’m listing a table for sale, there’s a whole bunch that goes in that: I’m going to give you the description; I’m going to give you the title; I’m going to tell you about the materials; I’m going to tell you about where I ship; I’m going to give you the policies around what I have from my store and that sort of thing.
This is very different than “I’m going to upload a photo.” When you upload a photo, it either works or it doesn’t. When you upload a photo on Flickr you don’t have to review it. You’ve uploaded it, and it can’t go away the way a listing can; no one buys it so that it’s suddenly no longer there.
In order to make sure that there’s an acceptable experience, especially in the face of degradation, you take a different approach with graceful degradation. At Flickr, if something was broken — let’s say favoriting was broken — we would turn off favoriting in order to save the rest of the site. You just couldn’t favorite. We did that a lot. Because consumers and producers are largely the same, we could do those things evenly.
Whereas on Etsy we have to think about things a little bit differently. If something is wrong with listing things, buying things and searching things could totally be fine because we’ve decoupled those things specifically. Maybe sellers can’t upload new listings or renew the listings if something’s broken, but no one knows that except for the sellers.
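The pattern described here, turning off a single feature to save the rest of the site, is commonly implemented with feature flags. A minimal hypothetical sketch; the flag names and helpers are illustrative, not Flickr’s or Etsy’s actual tooling:

```python
# Hypothetical feature-flag sketch of graceful degradation: each risky
# feature sits behind a flag, so operators can switch one feature off
# (favoriting, new-listing uploads) without touching the rest of the site.

FLAGS = {
    "favoriting": True,
    "listing_upload": True,
    "search": True,
}

def feature_enabled(name: str) -> bool:
    # Unknown flags default to off: the feature fails closed.
    return FLAGS.get(name, False)

def render_favorite_button(item_id: int) -> str:
    if not feature_enabled("favoriting"):
        return ""  # feature degraded: hide the button, page still works
    return f'<button data-item="{item_id}">Favorite</button>'

# During an incident, an operator flips the flag off:
FLAGS["favoriting"] = False
print(render_favorite_button(7))
```

Because listing, buying and searching sit behind separate flags, disabling one degrades only that feature, which is the decoupling Allspaw describes for Etsy’s seller-facing flows.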
I think the answer to your question is that we will architect for failure cases differently at Etsy than you would at Flickr, or maybe even other companies. It’s an over-simplification to say that it’s because money is involved, but that certainly plays into it.
“We want infrastructure and code to be adaptive. Not adaptive in an artificial intelligence singularity way, but adaptive in that it’s malleable. We’ll be able to bend it to our will very easily.”
If you were to head off to your next job tomorrow and got to build the tech infrastructure from scratch, what would you do?
There’s at least two ways I would answer that. The first one is not likely to be very satisfying: it depends, of course, on where I would go. If I went to work at a trading exchange, I might take a very, very different approach.
If I were to work at a different e-commerce company, which I can’t imagine doing, there’s a huge amount of architectural patterns that I would simply use again because they are battle-tested, they are well-worn. It is unlikely that from an architectural pattern, they would look significantly different than what we’ve built here, which is more probably about our critical-thinking approaches than any given technology.
The absolute one thing that I would do exactly the same is this: in five-plus years now at Etsy, I am still overjoyed to see that the approaches we take to software development — to operations and security and all of these things — are truly human-centric. That is to say that we don’t simply punt on solving hard problems. We don’t believe in the search for a magic algorithm.
Algorithms and things like machine learning and deep learning, these are all quite important, but they are only tools in a box. The thing that I would do at any other company is to write software for people to reason about.
I’ll paint a stark picture: Given a piece of code that is absolutely optimal — it is faster than any other piece of code that does the exact same thing on earth, only no one knows how it works except the author — versus a piece of code that was written by someone in order for it to be augmented, modified, adapted, flexible and all of these things, I would take the latter.
We want infrastructure and code to be adaptive. Not adaptive in an artificial intelligence singularity way, but adaptive in that it’s malleable. We’ll be able to bend it to our will very easily.
That’s the thing I’ve learned. That’s what I would take away. Sorry for being vague, but I think it’s hugely important.