Andrew Fong spends an hour at the gym each day. It’s the one hour when he’s not carrying a pager.
For the past two years, he’s been building out the Dropbox site reliability team and setting up frameworks that will prevent that pager from ringing. The team has grown dramatically, not only in size but in perspective, since he started.
“We’re not firefighting or reactive,” he says. “We’re looking three, six, nine, twelve months down the road as to how we can improve the infrastructure.”
As the team has grown, Andrew says, so has its ability to make an impact.
“We’re able to go deep on certain areas and actually work on very fundamental challenges in the infrastructure, like provisioning, inventory management, traffic management.”
That depth of impact depends upon a blameless culture and a well-defined process for escalation if issues do arise. Andrew says he’s especially proud of the way his team has come together as it’s grown.
“They have each other’s backs,” he says.
In addition, the team is always looking for ways to reduce operational overhead.
“We’re trying to reduce the amount of time a human has to intervene in the system in order to fix it.”
Going forward, he’s focused on rallying the team around operational excellence, especially as the team continues to grow and SREs are embedded in other teams.
“That’s the thing I think is most important in the next set of steps,” he says, “to build a team that starts to imbibe that ethos across the company. That’s the value add that SRE brings to the rest of the organization.”
Andrew first got into SRE because of Linux. He’d always dabbled in computers, but while he was in college, he says, Linux exploded.
“You had to be good at SRE if you wanted to run your own machine.”
From there, he found a lot to love about SRE. For one thing, there’s the excitement of solving problems in real time. And in contrast to software development, you get to touch a lot more of the stack.
In SRE, he says, “you’re not necessarily just rooted in writing code. You’re worried about data center deployment and design. You’re worrying about network engineering. You’re worrying about other layers of the stack that you don’t necessarily get to touch as much on the software engineering side.”
After working in SRE at AOL and YouTube, he came to Dropbox for the opportunity to build large-scale infrastructure from the ground up.
“We’re building infrastructure all the way from the bare metal machines through the data center to the software stack that goes on top of it to the actual Dropbox application.”
In building out his team, he looks for the ability to stay calm under pressure, as well as a logical, process-oriented approach to problem solving.
From his perspective, the best SREs are able not only to see the big picture, but to understand what happens when one piece of the picture drops out.
“Being able to do that is very hard,” he says, “because you have to have this complete knowledge and understanding — not necessarily of the intricacies of each piece, but understanding the interdependencies.”
For engineers interested in SRE, Andrew has one piece of advice: “Go lower in the stack than you think you need to go.”
The Dropbox tech ops team is hiring. We’d love for you to join us.