CTO to CTO: Werner Vogels and Don Neufeld
AWS Startup Spotlight
I recently got an email from Amazon CTO Werner Vogels asking if I’d be open to a Q&A. He explained:
“We are starting a collection on Medium with stories, experiences, and lessons learned from startups building on AWS…I think it’s fitting to start this collection on Medium by featuring Medium — itself a startup using AWS.”
What follows is a Q&A between Werner and me about the technical side of Medium.
First, how did you find your way to Medium? And how did your experience at Ohai and Sony prepare you for building Medium?
This is a long story, but I think it’s relevant. It starts when I moved to America from Canada to join Sony as an engineer in May of 2000. I had wanted to make video games ever since I was a kid and had built some game projects on the side during and after college. After working in web and desktop software development for a few years, I finally decided to try game development as a career. I had exactly one interview and was hired by Sony Online Entertainment to work on PlanetSide out of their St. Louis office. PlanetSide was one of the only massively multiplayer, first-person shooters ever created, and in retrospect, it was a huge trial by fire for me and the entire team. These were the early days of Massively Multiplayer Online games (MMOs), when companies were trying out a lot of different types of games to see what worked in this new format. There was very little known about how to do this well, and we had to figure things out on our own.
As a new engineer, I did a bit of everything. We needed a UI, so I built that. We needed a physics engine, so I wrote that, decided it was terrible, threw it out, and integrated a third-party one. Later, I took over the gameplay engine and owned the server architecture. It was an intense three-year experience, and we all worked extremely hard just to get the project out the door. There were moments when it almost didn’t happen.
After PlanetSide shipped, the team relocated to San Diego, and I worked on a short-lived project code-named “Guns for Hire,” for which we adapted the PlanetSide engine to a modern-day, urban setting. On this project, one of the back-end problems I faced was building a matchmaking system that could filter a large data set of available players, then match and stream them to potentially millions of clients in real time. I also wanted this to be something that any engineer could plug into and use easily. I spent a few weeks building a data sync framework using some meta-programming techniques I was excited to try after reading Modern C++ Design. The resulting framework did a great job, but I found that it was hard on the compiler and tool chain. It was a promising start, but there were clearly a lot of problems with the approach. The error messages were universally inscrutable, and debugging was challenging. I often had to drop to a disassembly view during development to understand what was happening, because the IDE couldn’t map the instruction pointer to source code through the layers of macro and template expansion. Type safety was lost once you got deep into the system. While the framework did eventually ship as part of a different game, I was ultimately dissatisfied with its usability. This effort would prove extremely important to my later work; eventually I’d get a chance to revisit these problems.
After Guns for Hire wrapped up, I moved to EverQuest II, because I preferred role playing games and wanted to work on one. In addition to my managerial responsibilities, I continued coding with a focus on the player guild and matchmaking systems. However, as I was building these systems, I noticed that I was once again spending a lot of time writing manual interprocess synchronization code. I had inherited a high-level architecture that repeatedly got in the way of rapid development. I missed my data sync framework and saw applications for it everywhere.
After three years of working on EverQuest II, I decided to strike out on my own and left Sony to co-found a gaming startup that eventually became Ohai. After some initial searching, we decided the opportunity we wanted to pursue was to bring the engaging experience of interacting with others in a synchronous environment from MMOs to the newly forming social gaming space. This was a challenging problem. Building a multiplayer game engine capable of supporting an MMO’s game rules is hard, and doing it with the small engineering team we had was going to be tough.
To be able to tackle this challenge, I leveraged my thinking about data sync architectures in a new way. I put the next generation of my thinking about distributed systems at the core of Ohai’s development process: a code generator that understood what it was building. It ingested a declarative language that described the game objects, replication rules, network connections, and processes. It knew what data needed to move where and wrote all that code automatically. If you can imagine something like Thrift or Protocol Buffers being a small fraction of your code generator, that would capture some of the scale. Because it was a static code generator, there were none of the previous problems with toolchain compatibility, and I could support any language (the final version of Ohai’s framework used four languages between clients, servers, game editors and toolchain). This development process proved agile to iterate on, and our designers were happy with the flexibility the engineering team provided for them.
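To give a rough sense of the idea, here is a minimal sketch in Python. It is not Ohai's generator: the spec format, the `replicate_to` field, and the `generate_sync_class` name are all invented for illustration. A declarative description of a game object goes in, and source code for a class that queues each field change for the right peers comes out.

```python
# Hypothetical declarative spec: each field lists the peers it replicates to.
PLAYER_SPEC = {
    "name": "Player",
    "fields": {
        "health": {"replicate_to": ["client", "zone_server"]},
        "inventory_seed": {"replicate_to": ["zone_server"]},
    },
}


def generate_sync_class(spec):
    """Return Python source for a class that queues field updates per peer."""
    lines = [f"class {spec['name']}:"]
    lines.append("    def __init__(self):")
    lines.append("        self.outbox = []  # (peer, field, value) tuples to send")
    for field in spec["fields"]:
        lines.append(f"        self._{field} = 0")
    for field, info in spec["fields"].items():
        lines.append("")
        lines.append("    @property")
        lines.append(f"    def {field}(self):")
        lines.append(f"        return self._{field}")
        lines.append("")
        lines.append(f"    @{field}.setter")
        lines.append(f"    def {field}(self, value):")
        lines.append(f"        self._{field} = value")
        # The generator, not the game programmer, decides where data moves.
        for peer in info["replicate_to"]:
            lines.append(f"        self.outbox.append(({peer!r}, {field!r}, value))")
    return "\n".join(lines) + "\n"


source = generate_sync_class(PLAYER_SPEC)
namespace = {}
exec(source, namespace)  # a real generator would write source files instead
player = namespace["Player"]()
player.health = 75
print(player.outbox)
```

The point of the sketch is the division of labor: the person describing `Player` never writes synchronization code, and changing a replication rule means editing one line of the spec and regenerating.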
Unfortunately, our platform was designed to build synchronous multiplayer experiences, and, with almost no exceptions, those proved unsuccessful on the Facebook platform.
I had built an amazing solution…
to the wrong problem.
It was time to try something new. I was introduced to Ev through my co-founder Susan and joined what became Medium in July of 2011.
So, how did this experience prepare me for building Medium? Ohai gave me an understanding of the whole organization, the ability to look beyond the engineering team and tech stack. At Sony, my focus had been exclusively engineering, and I saw everything in those terms. At Ohai, I had taken the engineering organization to its limit, and it wasn’t enough. I had to develop a broader perspective to be able to answer questions like: was what we were asking the engineering team to do even appropriate? I also came to see more clearly how our engineering choices enabled the larger organization to maneuver and succeed, or not, at various points in its lifecycle. Coming from a purely technical background, this was a big change.
Between your experience at Ohai and now at Medium, what have you learned about being the CTO of a startup?
There have been two big takeaways that changed how I work. The first is to look deeply into the stack of implicit assumptions I’m working with. It’s often the unspoken assumptions that are the most important ones. The second flows from the first: focus less on building the right thing and more on how we’re going to meet our immediate needs.
There are two implicit assumptions that underpin the desire to build things the right way. First, that it’s even possible to know what “the right way” looks like at that point in time. Early in building a startup it’s unlikely you know much about what the future holds, which makes that a risky thing to assume. Your biggest risk is that you’re in an information vacuum in terms of what the market thinks of your product. Your most important job is getting the company out of that vacuum. The second assumption is that if you make an investment, you will receive the future benefits it generates. The assumption here is that in the future, the company will still be doing something sufficiently related to the investment such that any benefits will make a difference.
To make this more concrete, let’s revisit some of the decisions I made at Ohai. At the time I left Sony, I had been building long-lived software systems for about 10 years. This experience informed my thinking about how software should be built. I believed in building for the long haul, automating early, and keeping the amount of technical debt low. My opinions on software development processes translated directly into my technology decisions, specifically the decision to invest so heavily in a vertically integrated code generator. In some ways this decision was good; it reduced complexity in certain layers, made design changes very easy, and enabled a small team to build a compelling game experience. But in the end the hidden downside of all this investment caught up with us. Optimizing our development process to support building a synchronous game experience made it harder for us to see that we were building the wrong product for the market.
Can you tell a bit about the technology organization at Medium?
Engineering at Medium is approaching thirty people right now, and growth is accelerating. When hiring, we look for engineers who are curious, aware, resolute and empathic. The results speak for themselves in our team and in our culture.
Our workflow is pretty strict and designed to optimize for a mix of team learning and codebase health. We require that master always be in good health, and we deploy and test it continuously. All work (even bug fixes) happens on short-lived feature branches. Code review is required before your pull request may be merged into master. We consider this a standard best practice not just because it helps catch and correct problems but because that feedback cycle between committer and reviewer(s) is so important for learning. I’m often surprised by how many companies I talk to that don’t make code review a strict requirement. On the nitty-gritty side we prefer commits be squashed into larger logical commits for history legibility. We’ve built some tooling on top of git to help with all of this.
We’re just getting to the size where it makes sense to start differentiating the engineering team. The team is mostly full-stack engineers who build both front-end and back-end components. Each engineer does most of her work as part of a cross-functional team consisting of engineers, designers, analysts, and others. Our process is highly collaborative and dynamic, with new teams spinning up or winding down every couple of weeks and people frequently moving between teams. How, then, do we keep the codebase in good shape if everyone is moving around? We have caretaker roles for each area who are responsible for the code quality in their domain and are automatically CC’d on all pull requests that touch their subtrees.
On the operations side of things, our philosophy is that on-call duties should be handled by applications engineers. Operations engineers are available to consult or as backup. Our philosophy is simple:
If you wrote it then it’s your job to support it.
This approach was chosen because our collective prior experience at other companies demonstrated that having Operations Engineers handle on-call duties created a misalignment of incentives between Operations and Engineering. The engineers making technical decisions did not experience the impact of their decisions, which caused the relationship between Operations and Engineering to degrade, sometimes to the point of confrontation. By having applications engineers handle on-call, we’ve been able to align incentives and work together more effectively.
At Medium, we use Holacracy as our system of organization, and it’s working quite well for us. Working together as a team, we’ve created a hierarchy of roles and policies that reflect how we can all do our best work. Holacracy’s governance process allows all of us to iterate on our structure in the same way we iterate on our software. For example, last week an engineer proposed we change how we manage our open source efforts, and this morning I proposed we move that role to a higher level of the organization to reflect its increasingly holistic nature.
In sharing these disparate aspects of our culture, I hope you’ll get a sense of how we’re a group of people mindfully working to create the kind of workplace and team we aspire to be. In ways big and small, we’re an ongoing work in progress and deeply aware of it!
Can you share more about your code review process?
Code reviews happen via GitHub’s pull request comments. At a high level each code review has three primary parties: the submitter, a lead reviewer (either an Engineer or a Tech Lead), and some caretakers.
The submitter gets their code ready, then runs our custom pull-request git command script. The script first looks for markers the build system leaves which indicate you’ve run the appropriate tests. This is basically a safety check, but that data goes into the pull request so the reviewers can know that you ran key tests. The script then looks for any caretakers responsible for the area(s) of the codebase you’ve changed, and automatically alerts them that a change is pending review. The submitter then selects a lead reviewer. Once all these steps are complete, the request is submitted to GitHub.
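The script itself isn't reproduced here, but the steps above can be sketched roughly as follows. Everything in this Python sketch is an assumption made for illustration: the `CARETAKERS` mapping, the usernames, the file paths, and the function names are hypothetical, and the real tool talks to git and GitHub rather than returning a dictionary.

```python
# Hypothetical mapping from subtree prefix to caretaker usernames.
CARETAKERS = {
    "editor/": ["alice"],
    "server/images/": ["bob"],
    "server/": ["carol"],
}


def caretakers_for(changed_paths, caretakers=CARETAKERS):
    """Return the sorted caretakers whose subtrees contain a changed path."""
    notified = set()
    for path in changed_paths:
        for prefix, owners in caretakers.items():
            if path.startswith(prefix):
                notified.update(owners)
    return sorted(notified)


def build_pull_request(changed_paths, tests_marker_found, lead_reviewer):
    """Assemble the data a pull-request script might submit."""
    if not tests_marker_found:
        # Safety check: the build system leaves a marker once tests have run.
        raise RuntimeError("no test marker found; run the tests first")
    return {
        "tests_ran": True,  # recorded so reviewers can see tests were run
        "cc": caretakers_for(changed_paths),
        "lead_reviewer": lead_reviewer,
    }


request = build_pull_request(
    ["server/images/resize.js", "README.md"],
    tests_marker_found=True,
    lead_reviewer="dana",
)
print(request["cc"])  # ['bob', 'carol']
```

Note that a file under a nested subtree alerts the caretakers of every enclosing area, which is one plausible reading of "subtrees" rather than a confirmed detail.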
Once a pull request is submitted, the submitter has to wait for two things. The first is an OK from the automatic integration system, which will attempt to merge the changes and verify that a subset of tests passes against the resulting build. The second is a “Looks Good To Me,” or LGTM, from the lead reviewer. With a handful of exceptions for very trivial changes, all engineers must get an LGTM before they may merge. Because this is a blocker, engineers work hard to be responsive to any code review requests they receive, and some have gone so far as to offer an SLA.
Code reviews are public, and engineers who were not explicitly CC’d are also welcome to weigh in. A caretaker can comment on, and also block, a pull request in their area of the codebase at any time.
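Put together, the merge rule described above amounts to a small predicate. This is only a sketch in Python with hypothetical parameter names. In particular, it assumes that a caretaker block also overrides the trivial-change exception, which is a guess rather than something stated here.

```python
def can_merge(ci_ok, lead_lgtm, caretaker_blocks=(), trivial_exception=False):
    """Decide whether a pull request may merge under the rules sketched above."""
    if caretaker_blocks:
        return False  # assumption: any caretaker block stops the merge
    if not ci_ok:
        return False  # the integration system must merge and test successfully
    # An LGTM is required unless the change qualifies as very trivial.
    return lead_lgtm or trivial_exception


print(can_merge(ci_ok=True, lead_lgtm=True))                            # True
print(can_merge(ci_ok=True, lead_lgtm=False))                           # False
print(can_merge(ci_ok=True, lead_lgtm=True, caretaker_blocks=["bob"]))  # False
```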
We didn’t start out with this process; in fact, it’s taken over a year of evolution to get to this point. While it might sound a little involved, the tools take care of most of the complexity. We’re pretty happy with it.
When you walked to the drawing board to design Medium, did you have certain design principles in mind?
When building a startup, I believe you’re building a company first and a technology stack second, which is how I approached this problem. Some of our early choices in the tech stack were partly optimized for the people we believed these choices would attract. Getting the right people onboard is much more important to an early stage company than getting an incremental improvement in something like performance.
As an example, I’ll take Node.js, which we chose for a number of reasons in addition to its technical merits:
- It would allow our engineers to work in the same language on both the client and server. The traditional argument for this is that it allows each engineer to work in more of the codebase, reducing bottlenecks in the engineering team. It also allows engineers previously restricted to either domain to start learning the other, which we thought would be attractive to open-minded, curious engineers.
- It was and is a near bleeding-edge choice, which is attractive to a certain type of risk-taking engineer. We felt these were the right type of engineers for an early stage startup.
- It has a vibrant, open, and welcoming community. We want all our members to feel welcome regardless of their gender, ethnicity, sexual orientation, or any other factor.
As an aside, I’m heartened to see the Rust community encoding this by adopting a Code of Conduct and hope the Node.js community does the same.
- As a new ecosystem, it has lots of opportunities for people to contribute. Many of the engineers we seek to hire are attracted to the opportunity to contribute to a new open source community as part of their day job.
What were some of the key architectural decisions you had to make as you set out to build Medium?
Going with key/value stores over relational databases was a big one. We felt that given the scale we expected for the product, many of the advantages relational databases would give us would eventually have to be discarded as we grew and had to shard the DB. That was a cost we decided to absorb early on.
The technology backing the editor was another big decision. We chose to build on an Operational Transformation model to give us maximum flexibility to support potential future features such as server side undo, offline editing, multiple people working in the same document, and track changes.
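For readers unfamiliar with Operational Transformation, the core idea is that concurrent edits are rewritten ("transformed") against each other so that every participant converges on the same document. The Python sketch below is a textbook-style illustration, not Medium's editor: it handles only insertions, and the `Insert` type, `site` tie-breaker, and function names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int   # index in the document where text is inserted
    text: str
    site: int  # tie-breaker when two sites insert at the same position


def apply(doc, op):
    """Return the document with `op` applied."""
    return doc[: op.pos] + op.text + doc[op.pos :]


def transform(op, against):
    """Rewrite `op` so it can be applied after `against` has been applied."""
    shifts = against.pos < op.pos or (
        against.pos == op.pos and against.site < op.site
    )
    if shifts:
        return Insert(op.pos + len(against.text), op.text, op.site)
    return op


doc = "Hello world"
a = Insert(5, ",", site=1)   # one author adds a comma
b = Insert(11, "!", site=2)  # another author, concurrently, adds "!"

# Each side applies its own edit first, then the transformed remote edit.
via_a = apply(apply(doc, a), transform(b, a))
via_b = apply(apply(doc, b), transform(a, b))
print(via_a, "|", via_b)  # Hello, world! | Hello, world!
```

Both orders produce the same text, which is the property that makes features like multi-author editing and offline editing tractable. A production system also has to transform deletions and formatting changes and to handle longer histories.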
Did you get it right immediately, or were there things you needed to reconsider?
We certainly didn’t get things right the first time! One early example was that the Medium codebase was originally built on callbacks and later migrated to promises.
Another example of where we got it wrong is that early on, we went too far in embracing the schema-less nature of Amazon DynamoDB when we chose to put all data related to a high-level object such as a user into a single table. This proved troublesome because these different types of data had different access patterns, but they were now stored together. To correct this we added a schema system at the application layer, and now all rows in each DynamoDB table share the same schema. We have a few more tables and much more sanity.
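As an illustration of what a schema at the application layer can look like, here is a minimal Python sketch. The table names, attributes, and the `validate_row` helper are hypothetical and say nothing about Medium's actual schema system. The idea is simply that every write is checked against one declared shape per table before it reaches the store.

```python
# Hypothetical per-table schemas: attribute name -> required Python type.
SCHEMAS = {
    "users": {"user_id": str, "name": str},
    "user_sessions": {"user_id": str, "session_id": str, "expires_at": int},
}


def validate_row(table, row, schemas=SCHEMAS):
    """Raise ValueError unless `row` matches the table's declared schema exactly."""
    schema = schemas[table]
    missing = schema.keys() - row.keys()
    extra = row.keys() - schema.keys()
    if missing or extra:
        raise ValueError(
            f"{table}: missing={sorted(missing)} extra={sorted(extra)}"
        )
    for name, expected in schema.items():
        if not isinstance(row[name], expected):
            raise ValueError(f"{table}.{name}: expected {expected.__name__}")
    return row


# Session data gets its own table instead of riding along in the user row.
validate_row("users", {"user_id": "u1", "name": "Don"})
validate_row(
    "user_sessions",
    {"user_id": "u1", "session_id": "s1", "expires_at": 1700000000},
)
```

Splitting data this way means each table can be provisioned and queried according to its own access pattern.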
What technology challenges are unique to Medium?
I think we have some interesting technical challenges both in the editor and in the recommendation engine.
Our editor is always under development as we continue to push the dual envelopes of usability and visual quality on the web. On the web client, we’ve had to invent a unique blend of Canvas and DOM manipulation to get some of our visual effects to run fast on all browsers.
The recommendation engine needs to be particularly good for our native mobile app, where each step through the stream requires the user to physically swipe the display, making poor recommendations more costly.
What do you consider some of your biggest technical accomplishments so far?
I think our editor is just a terrific piece of work. You can learn more about it in a post we just released on the Medium Engineering Blog.
What are some of the (UI/UX) design choices for Medium that have most significantly impacted server side needs?
When most people look at Medium they see a service oriented around text, but it’s actually the image backend that’s one of the biggest server side components. We try hard to give everyone an experience that just works while also looking great. This means we have to do a lot of on-the-fly image resizing, cropping, and filtering for different devices.
What are some of the things you can do because of the cloud that would otherwise be difficult?
While it’s true that working on a cloud such as AWS helps us scale dynamically to handle our widely varying traffic, that’s just the most obvious answer. I’ve come to see the cloud as the source of a very different benefit: a better engineering team.
Using the cloud forces good engineering discipline on a team, and it’s good for the culture. As a simple example: treating the entire infrastructure as disposable forces us to make everything repeatable, which is core to healthy engineering practice.
Having access to cloud resources is also helpful culturally because it reduces barriers to getting things built. Rather than having gatekeepers and waiting on requisition orders, an engineer can get started working with an environment that mirrors production right away. This helps teams feel empowered and keeps momentum up.
You’ve chosen to use some of our managed services like Amazon DynamoDB and Amazon Redshift. How do you decide whether to build and operate your own stack on EC2 vs. using a managed service?
As a rule, we’d rather not operate things that aren’t giving us value that’s unique to our business, but sometimes features are lacking and we have to anyway.
I see both DynamoDB and Redshift as technologies that sit squarely in the “little to no advantage for us in running it ourselves” quadrant. We considered both DynamoDB and Cassandra at the time. DynamoDB won primarily on operational costs.
An example of where we’ve gone the other way would be Elastic Load Balancing (ELB), which we don’t use because there are features such as SPDY support that we want and ELB doesn’t offer.
What is on your short list that we should build to serve you better?
There are a lot of things we would love to have you guys build for us.
As a heavy user of DynamoDB, I can name a number of features we’d like to see added to that product. Most importantly, hot keys have always been a problem for DynamoDB, and we’d love to see more thinking on possible solutions. Secondly, being able to alter indexes on DynamoDB tables after creation time is also a big need. Also helpful would be cross-region replication and a better backup system.
In terms of new products AWS could offer, my top choice would be a distributed scheduling system such as Chronos.
Looking at aspects which cut across product offerings, one thing I wish AWS did more of was invest in their client APIs across multiple languages. At Medium we use essentially no Java, and that’s often the first language you support with new features. I’m glad AWS recently launched an official Node.js SDK, but as of right now there’s no official Go SDK.
What is your opinion on how startups should think about managed services vs. building their own stack?
I think there is a trap waiting for startup people here. In my experience, people who work at early stage startups have a mix of extreme optimism and confidence. This is systemic: it’s what allows them to self-select into such a high-risk endeavor.
When this optimism and confidence is applied to building technology, lots of things that are dangerous for the business can happen. It’s exceptionally easy to believe you’ll build something better than what the market is offering. And you’re very likely to be wrong about that.
As a startup, your most precious resource is the time and attention you can give to your core product. Outside of any core technical differentiators that are central to your business, any time you are considering building your own stuff, you need to be extremely critical of your own reasoning.
What is your philosophy on build vs. buy?
In case it isn’t already obvious: for most of my career I was one of those guys who built one of everything and at some level believed that what I built was better than what other people built. As I’ve grown more experienced I’m much more protective of my time, my team’s time and our ability to focus on what differentiates us from other products. My default is buy unless there’s a compelling counter-argument.
Any final words of advice for startup CTOs?
You have a big job, and if it’s your first time in the role you’re going to need some help. Ask people in the industry for thirty minutes of their time, at their office. Be flexible on timing. Do at least sixty minutes of prep and use your time with the other person extremely well: either come in knowing what you want to ask or admit you don’t know where you’re at. Ask for what you need. Take notes, or ask to record the conversation so you can take notes later. Later, follow up with an email telling them how you applied what you learned to your business.
I’ve found that if you’re serious about improving and show it, you’ll find many people are open to helping you. You’ll get a lot out of it, and you’ll build a solid contact very quickly.
Special thanks to Kate Lee, Emily Leathers, Katherine Fellows, Sho Kuwamoto, Dan Pupius, Tess Rinearson, Stephanie Yeung, and Xiao Ma for reading drafts of this.