Site Reliability Engineering — How to prepare for the interview

As an ever-growing company, Wildlife is one of the top 10 mobile game developers and publishers in the world, having released more than 60 games that have been played by billions of people around the globe.

To run that many games, we operate systems for game servers, chat rooms, clan features, in-game messaging, matchmaking, authentication, authorization, player support, and more, all running on top of our many Kubernetes clusters.

To support this global, interconnected gaming infrastructure, we're building a world-class SRE team that not only helps our engineers deploy highly scalable distributed systems but is also building an amazing cloud foundation to support the company's continued growth over the next few years.

Site Reliability Engineering at Wildlife

As we showed in a previous post, the cloud infrastructure that runs our games plays a crucial role in their perceived quality. One of the main objectives of the SRE team is to provide a fantastic experience for our customers.

For internal clients, such as Game Developers, Backend Developers, Data Scientists, Data Engineers, Game Artists, and others, we provide the necessary tools and automation to deploy our games in a fast, secure, and cost-effective way.

For our external clients, the players, we make sure they can play our games anytime, with reliable servers and with low latency connections.

What we expect from an SRE at Wildlife Studios

To meet the needs of these many different clients, we always look for engineers who are well versed in the skills you would expect an infrastructure engineer to have: an understanding of how an operating system works, solid knowledge of networking, security, and protocols, and expertise in common Infrastructure as Code tools, such as Terraform, Ansible, and Packer, to provide reliable cloud infrastructure.

In addition to that, we also look for people who are able to:

  • Develop in at least one programming language;

We understand that in the current Cloud-native world, we need to build a lot of tooling around major technologies like Kubernetes or HashiCorp’s Vault. Some examples are Flux Gitlab Controller and Helm Generate, tools developed by our SREs around FluxCD that enhance its capabilities and enable interactions with other tools we use internally.

As the job title suggests, reliability is an expertise that we expect from every SRE on the team. So it's important to know how to accurately monitor systems, how to choose appropriate SLOs and measure each SLI correctly, how to deploy scalable distributed applications, and, in case everything goes south, to have the proper troubleshooting and incident management skills to handle incidents in a calm, controlled, and pragmatic way.
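To make the SLO and SLI vocabulary concrete, here is a minimal sketch of an availability SLI and the error budget it implies. It assumes a simple request-based SLI (good requests divided by total requests); the names `SloReport` and `availability_report` are hypothetical and not part of any tool mentioned in this post.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float               # measured fraction of good requests in the window
    allowed_failures: float  # failures the SLO tolerates for this traffic volume
    budget_remaining: float  # fraction of the error budget left (negative if overspent)


def availability_report(good: int, total: int, slo_target: float) -> SloReport:
    """Compare a request-based availability SLI against an SLO target.

    Assumption: the SLI is simply good / total over one measurement window.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= good <= total:
        raise ValueError("good must be between 0 and total")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")

    sli = good / total
    allowed_failures = (1 - slo_target) * total
    actual_failures = total - good
    budget_remaining = 1 - actual_failures / allowed_failures
    return SloReport(sli, allowed_failures, budget_remaining)


if __name__ == "__main__":
    # One million requests, 500 failures, against a 99.9% availability target.
    report = availability_report(good=999_500, total=1_000_000, slo_target=0.999)
    print(f"SLI: {report.sli:.4%}, budget remaining: {report.budget_remaining:.1%}")
```

With a 99.9% target, one million requests allow roughly 1,000 failures, so 500 failures consume about half of the error budget. A negative `budget_remaining` would mean the SLO was missed for that window.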

We also don’t expect you to know every tool out there or master any specific programming language. Technology, and mainly infrastructure, is always changing and evolving, so we firmly believe that if you have the proper fundamentals allied with good problem-solving skills and are passionate about infrastructure, you will be able to adapt to any stack that we're running at the moment.

Finally, as a team, we're curious: we want to work with great technologies, and we want to test and research better approaches to everything we do. We want to solve problems that only happen at a global scale, and we want to automate everything to keep toil low, so we can focus on the fun stuff!

Step by Step of our selection process

Our interview process is based on three operating principles:

  • A series of structured, clearly defined steps because we believe this makes it more reliable as a predictor of future performance and more inclusive in the sense of not leaving much room for unconscious bias;

Small details can vary by location, hiring manager, and level of seniority, but overall we invest a lot of effort into making sure our hiring standards are consistent, given how fundamental this is to building a high-performance distributed organization.

  1. Recruitment Strategy
    Once we’ve defined a new job position, we advertise it on our page and on other job boards that we consider relevant. Of course, we also actively look for professionals in different networks. Once we find someone we wish to bring to the team, we get in touch and invite them to learn more about us and go through the recruiting process. This usually takes place on LinkedIn or GitHub, but other sources, such as referrals from our employees or candidates, are also common.

What we evaluate in each interview:

  • Team Interview. The first interview is done with a potential future teammate. You will solve a number of problems that test your incident management and troubleshooting skills, your automation and infrastructure as code skills, and also your coding skills, in collaboration with the interviewer.

After each interview, the candidate is scored against the predefined criteria that we use to evaluate each skill under assessment.

5. Offer
After the interviews, everyone involved meets to discuss the candidate's performance and decide whether we’ll extend a job offer. If approved, you'll be informed quickly. If you cannot respond immediately, we give you some time to take the offer home and think about it, hoping you’ll decide to join our team.

6. You’ve become a Wilder!
And that’s it! Candidates who accept the offer become new Wilders and are welcomed with open arms by our team. In this final stage, we’ll arrange all the details for the starting date, and once you’ve started, the onboarding phase will provide you with all the tools and context necessary to begin helping us tackle our biggest challenges!

How to prepare for the SRE interview

Generally, in the interviews, we're trying to assess your skills in specific areas. To show them properly, it’s a good idea to sharpen your communication skills. It’s important to present your ideas clearly, in a structured and cohesive way.

On the technical side, you can expect to be tested on the skills we discussed in this post, so it’s a good idea to review them. You can review your infrastructure skills by taking a look at guides like this one or the “Sysadmin interview questions”. Even though we don’t ask direct questions like those, they serve as a list of topics that you can study.

Posts like this one and the “SRE interview preparation guide” on GitHub also contain a good overview of the expected SRE skills and can serve as good guidance. If you’re more of a book person, Site Reliability Engineering and The Site Reliability Workbook are great reads for understanding the fundamentals of the SRE job from a higher-level view.

For the system design question specifically, there are really good guides online on how to prepare, for example the System Design Primer, Crack the System Design Interview, and this small system design course. They are full of examples and tips on how to approach system design interviews, which should be really helpful even if you’re already used to this type of interview.

On coding questions, we don’t ask questions that require deep algorithmic knowledge, so don’t expect to be asked to code the algorithm to invert a binary tree. Because of that, books like “Cracking the Coding Interview” are useful to teach you how to walk through a problem, but not necessarily for the problems themselves.

We do consider knowledge of algorithms useful in our day-to-day job, but our interview questions are much more focused on implementing small features or single functions of a bigger project. It’s expected that you can deliver working code in at least one programming language, following good coding practices.
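As a purely illustrative sketch of a task at that scale (this is not an actual interview question, and the log format and the function name `count_status_classes` are hypothetical), a single well-scoped function with input validation and a docstring might look like this:

```python
from collections import Counter
from typing import Iterable


def count_status_classes(lines: Iterable[str]) -> Counter:
    """Count HTTP responses per status class (2xx, 4xx, ...) in log lines.

    Assumes a hypothetical log format in which the status code is the last
    whitespace-separated field of each line. Blank or malformed lines are
    skipped rather than raising, so one bad line does not abort the count.
    """
    counts: Counter = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # blank line
        status = fields[-1]
        is_status_code = (
            len(status) == 3
            and status.isascii()
            and status.isdigit()
            and status[0] in "12345"
        )
        if is_status_code:
            counts[f"{status[0]}xx"] += 1
    return counts


if __name__ == "__main__":
    sample = ["GET /health 200", "GET /missing 404", "POST /login 201"]
    print(count_status_classes(sample))
```

The point of a sketch like this is less the algorithm than the habits around it: clear naming, a stated assumption about the input, and deliberate handling of malformed data.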

It’s also not a bad idea to get familiar with our company and its games. You can learn about our values on this page. Try downloading Tennis Clash, Zooba, or Sniper 3D. Look for them in the App Store, Google Play Store, and on our website. Also, read about our history, check some of our other technical blog posts here on Medium, or check some of our open-source projects on GitHub.

I hope this post was informative and interesting to you, and that it fulfilled its purpose of showcasing what we expect from an SRE at Wildlife and how our recruiting process works. If you want to join the team, check out our open positions!

Senior Software Engineer

Senior Site Reliability Engineer

Wildlife Studios Tech Blog


Wildlife Studios is building next-generation mobile games, and it takes a lot of data, innovation, and knowledge. Our tech people are here to share how they are building the best in class technology to improve people's life with fun and innovation.

Written by Douglas Quintanilha

Coffee lover SRE at Wildlife Studios
