GitHub: Scaling on Ruby, with a nomadic tech team

Source: GitHub

Sam Lambert joined GitHub in 2013 as the company’s first database administrator, and is now the company’s director of technology. In this interview, he discusses how the service — which now boasts more than 10 million users and 25 million projects — is able to keep on scaling with a relatively simple technology stack. He also talks about GitHub’s largely officeless workplace — about 60 percent of its employees work remotely, using a powerful homemade chatbot, called Hubot, to collaborate.


SCALE: I usually think of GitHub as more of a technology provider and less of a technology user, but that’s probably unfair. Can you walk me through the technology and philosophies that underpin GitHub?

SAM LAMBERT: We take a very Unix philosophy to how we develop software and services internally. We like to be continually proud of the simplicity of a lot of our infrastructure. We really do try and shy away from complexity and over-engineering. We like to make more pragmatic choices about how we work and what we work on.

For a long time, key bits of our infrastructure were strung together with shell scripts and simple scripting, and it’s surprisingly effective and still works very well for us.

What does that result in, in terms of your technology stack?

The core of what you see and use as a GitHub user is a Ruby on Rails application. It’s a seven-year-old app now, created by the founders when they started the company. That’s the core of the application, but obviously there’s a ton of Git in the stack. We have custom C daemons that do things like proxying Git requests and aggregating data.

MySQL is our core data store, which we use for storing all the data that powers the site as well as the metadata around the users. We also use Redis a little bit for some non-persistent caching, and things like memcached.

C, shell, Ruby: quite a simple, monolithic stack. We’re really not an overcomplex shop, and we don’t intend to drop in a new language for every small project.

We’ve got core Ruby committers that work for us, and that allows us to scale what we have and keep a pragmatic view on all our technology choices and try to keep our stack smaller. Really, there’s nothing much you can’t do with the stack that we’ve already chosen. To keep at this game and keep it moving, we just have to keep applying varied techniques to what we’ve got.

That’s somewhat ironic considering all the projects and experiments that are hosted on GitHub. Do you ever see new things and get tempted to change things up?

We certainly take a look at new technologies. Our employees have a large amount of freedom in what they do, and people will try all sorts of stuff and experiment. Often, it’s just to know why you’re not using a technology. You could look at something, understand why it’s interesting and what problems it’s trying to solve, then maybe take some of its approaches to extend what you’re already doing. Or maybe put it on the shelf for a little while as it matures.

But there is an interesting irony in that half the new projects in the world happen on GitHub and we tend to stick with a fairly conservative stack. Our CTO often jokes about when he interviewed me to join the company as the first DBA at GitHub. I actually said in my interview, “I’m really surprised to be sitting here. I assumed GitHub was using some sort of new, hip datastore.” Then as the interview process went along, it was revealed to me that this is actually a really pragmatic set of hackers that just hack on Ruby, hack on C and spend their time working on more interesting things using a more stable stack, rather than chasing after the latest shiny tech.


Keeping up with all that Git

What are the challenges that keep your team busy?

A lot of it is volume. Obviously, our user base is growing. We also have a very technical user base and they seem to manage to find ways to use the API in an obscure manner. Using a standard framework, there’s a lot of stuff that you don’t get to see the extremes of until it’s a large use case. There are a lot of patterns that Rails uses that are less optimal at a large scale. We might hit issues like that and have to rewrite certain bits of functionality.

Obviously, we also have a massive amount of Git. Scaling something like Git in a backend infrastructure is quite different. It’s not something that anyone else is trying to achieve. We’re actually on the frontier when it comes to scaling Git, the application itself, at our scale, which is fascinating.

We have an amazing team that works on that and works really hard to build in the extra functionality. We’re like a Git host in someone’s infrastructure, which means all sorts of work to balance public versus private repositories, and to make sure authentications and permissions are correct when users try to access code.

So Git is the most unique aspect of your technical operations?

Absolutely. We don’t want to be unique in any other sense than what we’re known for. That’s also something we’re really proud of. I often say to people, “Let’s only write bespoke elements of our architecture that make sense for a company that stores Git data.”

We don’t need to reinvent the wheel, we don’t need to write our own databases, we don’t need to start writing our own frameworks — because they’re all in domains that are usual. It’s a website, it’s web hosting. In the domains that are unusual, we fully embrace the need to write custom applications or build bespoke apps for that.

What’s the metric that drives engineering most in terms of what your team works on?

We’re quite obsessed with performance. We want to make sure the site is always performant and continually fast. For a Rails app, github.com is a really, really quick site and we have a motto that “It’s not shipped until it’s fast.”

In terms of a specific metric that we keep our eye on, it’s capacity for storing Git. That’s something that we have to continue to grow. As our usage spikes more and more, we’re starting to see more pressure on that kind of infrastructure. We have some really interesting projects being worked on at the moment that will let us keep scaling.

No cloud computing here

What does the underlying infrastructure for GitHub look like? Are you in the cloud or on local servers?

We host in our own datacenters. We actually have an amazing provisioning story. We basically can provision hardware like it was the cloud. We have a really small, but amazingly dedicated, physical infrastructure team, and they do phenomenal work in providing us these amazing services that we can use.

If I need a new host, I can basically tell our chatbot, Hubot, that I need X number of hosts of this class on these chassis, and it will just build them and deploy them back in minutes. We have this incredibly flat, flexible, but physical infrastructure. As someone who consumes that infrastructure, it’s phenomenal and to watch it working is brilliant.

It sounds like you’re avoiding the laborious and time-consuming procurement step often associated with physical gear.

We have some slack capacity for hardware, essentially. The physical infrastructure team will provision machines that are empty and ready to be provisioned by the teams that are going to use them. For example, the database infrastructure team can look at how many machines are in the pool of the class they need for databases, and essentially they’ve written their own Puppet roles and classifications about how those nodes work. Then they can just provision them themselves or just tag them so that it’s capacity for that team.
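The spare-pool idea described above can be sketched roughly in Python. This is only an illustration under stated assumptions: the names Host, SparePool, available and claim are hypothetical, the hardware class labels are invented, and the Puppet roles the interview mentions are not modeled at all.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Host:
    name: str
    hw_class: str               # hardware class label (illustrative, e.g. "db")
    team: Optional[str] = None  # None means still unclaimed spare capacity


@dataclass
class SparePool:
    hosts: List[Host] = field(default_factory=list)

    def available(self, hw_class: str) -> List[Host]:
        """Return unclaimed hosts of the requested hardware class."""
        return [h for h in self.hosts if h.hw_class == hw_class and h.team is None]

    def claim(self, hw_class: str, count: int, team: str) -> List[Host]:
        """Tag `count` spare hosts of a class as capacity for `team`."""
        free = self.available(hw_class)
        if len(free) < count:
            raise ValueError(
                f"only {len(free)} spare {hw_class!r} hosts, {count} requested"
            )
        claimed = free[:count]
        for host in claimed:
            host.team = team
        return claimed
```

A team would check available("db") to see how much spare capacity exists for its class, then call claim to tag what it needs. Running out of spare hosts raises an error rather than silently returning fewer machines.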

Are you scaling the server footprint on a regular basis, or is it a pretty controlled growth pattern at this point?

We keep it controlled in terms of how we order and how we provision. But usage of the site is trending up at an increasing rate. More and more companies in the world are realizing that they’re tech companies, so the usage growth of GitHub is just going up and up and up.

We’re handling it well, though. Quite often with scaling problems, they just come around the corner. They don’t just slowly, gradually appear. They come quickly and we tackle them as they happen, and we have some interesting use cases at times. People bring strange things to the site and they’ll reveal slight scaling problems, but we have an application we can develop on quickly and people that understand the domain well. We get over those problems fairly fast and deploy fixes and continue going.

Building a global engineering network

Are there issues that keep you up at night, or long-term concerns always in the back of your mind?

Defending our infrastructure is certainly something that we always think about. I wouldn’t say it keeps you up at night, but it’s something we certainly think about.

And scaling our organization. The bigger we get, the more engineers we need, but we need to keep that growth in line with our culture. I think hiring is a challenge that every tech company has — continuing to get good, talented employees from different backgrounds from around the world. But you’ve got to find people that have your same engineering values and that like to work on things similar to what you have and what you can offer.

I’m also concerned with continuing to embrace the distributed nature of the company. The company is 60 percent remote currently; I’m in England at the moment. I’ve traveled around the world, working from different places. That’s something that is completely possible, based on our culture and distributed nature.

That seems pretty unique. Are those 60 percent of employees working from home or from branch offices?

They’re working from home. You can work from anywhere. Last year I worked probably in five or six different cities around the world. Just working from my laptop wherever we decided to go. A month ago I was working from a cabin in the woods in Wisconsin.

I’ve got colleagues that don’t have a permanent location. They just fly from city to city and work from wherever. They’re just nomads and they’re all around the world. That’s something we can offer to people that’s baked into our culture.

There’s no requirement to work in any office. Our office is actually more of a social space. We have areas for people to meet and enjoy being together, but there’s no necessity to be together.

A colleague and I shipped a gigantic refactor of our backend, essentially transforming it from a monolithic environment to a distributed one. We had a lot of decisions to make, including a lot of refactoring and new patterns. In that entire project we never had a face-to-face conversation. We just worked together through chat and issuing pull requests, and we then met each other at the end of that project, about six months after he joined GitHub.

Again, that’s just the way we work. That’s the way our culture works. There’s no requirement to be physically located in order to be productive and do what we do.

Hubot to the rescue

You mentioned Hubot earlier as the provisioning tool, but is there more to it? It sounds key to how the company is able to function with such a distributed workforce.

Hubot can do basically everything in GitHub. You can ask Hubot where a specific member of staff is and it will show you where they are in the world or what floor they’re on in one of our offices, for example.

There are probably about 40 different provisioning commands. You can do a MySQL stack. You can do failovers, you can drop tables, you can back up tables, you can clone, you can run migrations, you can do everything. You can do mitigation of attacks.

Basically, everything you could ever possibly imagine to do in our infrastructure, you can do via Hubot. There are zero requirements to interface with any code. You can run it all through Hubot.
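To make the chat-driven workflow concrete, here is a minimal Python sketch of the general pattern: commands typed into a shared room are matched against registered patterns and dispatched to handlers. Hubot itself is scripted in CoffeeScript or JavaScript, so this is not its API. The ChatBot class, the command phrasing, and the reply strings are all invented for illustration.

```python
import re
from typing import Callable, List, Optional, Tuple

Handler = Callable[[re.Match], str]


class ChatBot:
    def __init__(self, name: str) -> None:
        self.name = name
        self._commands: List[Tuple[re.Pattern, Handler]] = []

    def respond(self, pattern: str) -> Callable[[Handler], Handler]:
        """Register a handler for messages addressed to the bot."""
        def register(handler: Handler) -> Handler:
            self._commands.append((re.compile(pattern, re.IGNORECASE), handler))
            return handler
        return register

    def receive(self, message: str) -> Optional[str]:
        """Return a reply, or None if the message is not addressed to the bot."""
        prefix = self.name + " "
        if not message.lower().startswith(prefix.lower()):
            return None
        text = message[len(prefix):].strip()
        for pattern, handler in self._commands:
            match = pattern.fullmatch(text)
            if match:
                return handler(match)
        return "unknown command: " + text


bot = ChatBot("hubot")


@bot.respond(r"provision (\d+) (\w+) hosts?")
def provision(match: re.Match) -> str:
    # Hypothetical command; a real handler would call provisioning tooling.
    count, hw_class = int(match.group(1)), match.group(2)
    return f"provisioning {count} host(s) of class {hw_class}"


@bot.respond(r"where is (\w+)")
def where_is(match: re.Match) -> str:
    # Hypothetical command; a real handler would query a staff directory.
    return f"looking up location for {match.group(1)}"
```

Because every command and reply passes through the same room, everyone present sees what was run and what came back, which is the shared context the interview goes on to describe.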

A humorous schematic for Hubot. Source: GitHub

So it’s a lot more than an automation engine …

Yeah, it’s automation, but it’s a lot more. It’s the context as well.

For a long, long time your onboarding was joining our chatroom and watching what other people were doing. Because we’re not physically located anywhere, when an issue comes up, you see the alert come into chat, you can start pulling up graphs that everyone in the room can see, and everyone can see what you’re looking at. I joined the company and I just idled in chat and just watched how people worked and what they did and I just learned that way.

I try to reflect back on how previous teams I’ve worked on would debug stuff. Everyone would be in their own terminal or on their own dashboard looking at graphs and then trying to awkwardly share them with each other, or paste terminal output, for example.

With chat, you just dive in and the context is all there. You’ve got this, basically, gigantic shared console for our company. For example, if you get an alert about a database failure and a couple of people jump in, you can see that it’s already been diagnosed and already been worked on. There’s no duplication of effort and the people that need to know start getting context directly. When we go into large incidents (touch wood they don’t happen often), we’re able to work together really collaboratively.

If we tweet via Hubot out to our status page or out to the updates page, people can double-check what we’re going to write.

It’s just a fully collaborative experience, and it’s something that more and more companies are taking up. You hear of massive companies integrating Hubot to handle these fantastic use cases. It’s just amazing to watch, really.

The old way of working, I just don’t think I could go back to anymore. I’m so used to being among all my colleagues and my team, collaborating on what we’re trying to do through chat. It’s a whole new way of working that adds so much and solves so many problems that I think a lot of traditional companies haven’t solved.