I’m excited that you are interested in learning more about being an SRE. I am in an Engineering Leadership role at Dropbox where I work with SREs on our Databases and Magic Pocket Infrastructure Teams. We have over 500 petabytes of data, over 500 million customers and very small teams.
Krishelle and I collected this list of resources. Krishelle recently joined us at Dropbox after graduating from Hackbright Academy. Read more about Krishelle’s career change story and becoming an SRE over on Huff Post.
SREs are Software Engineers who specialize in reliability. SREs apply the principles of computer science and engineering to the design and development of computer systems: generally, large distributed ones.
Here are some handy links and resources to get you started learning more and picking up skills that will help you in a career as an SRE.
Don’t go it alone, join the community
There are many communities you can join on your journey to becoming an SRE. I recommend joining the following Slack Communities I’m also a member of:
- Chaos Engineering Slack Community: tinyurl.com/chaoseng
SRE @ Dropbox
- Site Reliability Engineering @ Dropbox https://youtu.be/ggizCjUCCqE
- How we have scaled Dropbox https://youtu.be/PE4gwstWhmc
- Go @ Dropbox https://youtu.be/JOx9enktnUM
- Database Monitoring @ Dropbox https://vimeo.com/173607649
- Dropbox Databases Infrastructure https://youtu.be/71VryWiEw2A
- Adventures in MySQL @ Dropbox https://youtu.be/xFoA5wWpl0s
- Bridging the Safety Gap from Scripts to Full Auto-Remediation https://www.usenix.org/conference/srecon16europe/program/presentation/mah
Git Fundamentals Tutorials
- Introduction to Git: Installation, Usage, and Branches https://www.digitalocean.com/community/tutorial_series/introduction-to-git-installation-usage-and-branches
Code Review Best Practice Tips
- Read the best code you can find in your company and check out our Dropbox Open Source projects, e.g. Marshal. You will learn a ton reading other people’s code.
- Small and frequent over large and rare. Many open source projects will reject your change if it’s over 100 lines. That’s a ton of code for a reviewer to read, check for issues, and suggest changes to. Keep your changes small and get feedback often. Some of the engineers on my team will often do 1–3 line diffs. http://blogs.atlassian.com/2010/03/code_review_in_agile_teams_part_ii/
- Why code reviews matter (and actually save time!) https://www.atlassian.com/agile/code-reviews
- How to use vim for advanced editing of plain text or code on a Virtual Private Server https://www.digitalocean.com/community/tutorials/how-to-use-vim-for-advanced-editing-of-plain-text-or-code-on-a-vps--2
- VIM adventures game http://vim-adventures.com/
Linux Fundamentals Tutorials
- Initial Server Setup with Ubuntu 16.04 https://www.digitalocean.com/community/tutorials/initial-server-setup-with-ubuntu-16-04
- Getting Started with Linux https://www.digitalocean.com/community/tutorial_series/getting-started-with-linux
- How to write a simple shell script on a VPS https://www.digitalocean.com/community/tutorials/how-to-write-a-simple-shell-script-on-a-vps
- SSH essentials https://www.digitalocean.com/community/tutorials/ssh-essentials-working-with-ssh-servers-clients-and-keys
- How to install and setup a local programming environment for Python https://www.digitalocean.com/community/tutorial_series/how-to-install-and-set-up-a-local-programming-environment-for-python-3
- How to crawl a web page with scrapy and Python https://www.digitalocean.com/community/tutorials/how-to-crawl-a-web-page-with-scrapy-and-python-3
- How To Create a Twitter app with Python https://www.digitalocean.com/community/tutorials/how-to-create-a-twitter-app
- How to make a simple calculator program in Python https://www.digitalocean.com/community/tutorials/how-to-make-a-simple-calculator-program-in-python-3
- How to install Go 1.6 on Ubuntu 16.04 https://www.digitalocean.com/community/tutorials/how-to-install-go-1-6-on-ubuntu-16-04
- How To Use Martini to Serve Go Applications Behind an Nginx Server on Ubuntu https://www.digitalocean.com/community/tutorials/how-to-use-martini-to-serve-go-applications-behind-an-nginx-server-on-ubuntu
- How to install Go and Revel on an Ubuntu Virtual Private Server https://www.digitalocean.com/community/tutorials/how-to-install-go-and-revel-on-an-ubuntu-13-04-x64-vps
- Write simple tests that you want to maintain in the future and that others would want to maintain too.
- Use Travis CI for your GitHub projects. Easily sync your GitHub projects with Travis CI and you’ll be testing your code in minutes. https://travis-ci.org/. Knowledge of Git version control and how to use GitHub are prerequisites for using Travis.
- Community-generated video for getting started: https://www.youtube.com/watch?v=BOIJjfFoRdc. It covers: 1) setting up a GitHub repo for your user (this can also be done per organization); 2) setting up a repo to use Travis CI on GitHub; 3) turning on repositories on travis-ci.org; 4) setting your user token for Travis CI on GitHub; 5) adding the .travis.yml configuration file in the root of your GitHub repository; 6) running `git add`, `git commit`, `git push` and `git merge` to see the magic happen!
- Here are some Python-specific community-generated articles https://www.smartfile.com/blog/testing-python-with-travis-ci/ https://www.newfies-dialer.org/continuous-integration-with-travis-ci/
- Now the documentation: https://docs.travis-ci.com/user/for-beginners
- Python specific configuration to use in your .travis.yml https://docs.travis-ci.com/user/languages/python/
- MySQL-specific configuration to add in your .travis.yml https://docs.travis-ci.com/user/database-setup/#MySQL
- Nginx addons for webserver http://www.koszek.com/blog/2015/11/01/nginx-on-travis-ci/
- Finally, the entire build lifecycle and how to customize to your heart’s content https://docs.travis-ci.com/user/customizing-the-build/
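As a minimal sketch of the kind of simple, maintainable test described in the list above, here is a small Python `unittest` example. The `parse_size` helper and its behaviour are hypothetical, invented purely for illustration; the point is only that each test is short, named for what it checks, and easy for someone else to pick up.

```python
import unittest


def parse_size(text):
    """Convert a size string such as '5 MB' into bytes (hypothetical helper)."""
    units = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}
    number, unit = text.split()
    if unit not in units:
        raise ValueError(f"unknown unit: {unit}")
    return int(number) * units[unit]


class ParseSizeTest(unittest.TestCase):
    def test_megabytes(self):
        # One behaviour per test keeps failures easy to read.
        self.assertEqual(parse_size("5 MB"), 5 * 1024 ** 2)

    def test_unknown_unit_is_rejected(self):
        with self.assertRaises(ValueError):
            parse_size("5 XB")
```

Saved in a file such as `test_parse_size.py` (an assumed name), this can be run locally with `python -m unittest`, and a CI service can run the same command on every push.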
Databases Tutorials (MySQL & Percona)
MySQL is used by Dropbox, Facebook, Slack, Google and many more. Most of them, Dropbox included, don’t use only MySQL, but it is widely used and therefore useful to understand.
- A basic MySQL Tutorial https://www.digitalocean.com/community/tutorials/a-basic-mysql-tutorial
- How to set up replication (primary-replica) https://www.digitalocean.com/community/tutorials/how-to-set-up-master-slave-replication-in-mysql
- How to set up replication (primary-primary) https://www.digitalocean.com/community/tutorials/how-to-set-up-mysql-master-master-replication
- How To Install a Fresh Percona Server or Replace MySQL https://www.digitalocean.com/community/tutorials/how-to-install-a-fresh-percona-server-or-replace-mysql
- How To Create Hot Backups of MySQL Databases with Percona XtraBackup on Ubuntu 14.04 https://www.digitalocean.com/community/tutorials/how-to-create-hot-backups-of-mysql-databases-with-percona-xtrabackup-on-ubuntu-14-04
Want to try a NoSQL Database?
The Netflix member experience is offered to 83+ million global members and delivered using thousands of microservices. Netflix uses Cassandra; you can read more about it here:
- NoSQL at Netflix http://techblog.netflix.com/2011/01/nosql-at-netflix.html
- Benchmarking High Performance I/O with SSD for Cassandra on AWS http://techblog.netflix.com/2012/07/benchmarking-high-performance-io-with.html
- Netflix created their own benchmarking tool for cloud data stores. https://github.com/Netflix/ndbench
Now it’s time to set up your own Cassandra cluster using cloud infrastructure!
- How to install Cassandra and run a single node cluster on Ubuntu 14.04 https://www.digitalocean.com/community/tutorials/how-to-install-cassandra-and-run-a-single-node-cluster-on-ubuntu-14-04
- How to run a multi-node cluster database with Cassandra on Ubuntu 14.04 https://www.digitalocean.com/community/tutorials/how-to-run-a-multi-node-cluster-database-with-cassandra-on-ubuntu-14-04
Production Web Application Tutorials
- Building for production web applications: this 6-part tutorial will show you how to build out a multi-server production application setup from scratch. The final setup will be supported by backups, monitoring, and centralized logging systems, which will help you ensure that you will be able to detect problems and recover from them. The ultimate goal of this series is to build on standalone system administration concepts, and introduce you to some of the practical considerations of creating a production server setup. https://www.digitalocean.com/community/tutorial_series/building-for-production-web-applications
- How To Set Up an Apache, MySQL, and Python (LAMP) Server Without Frameworks on Ubuntu 14.04 https://www.digitalocean.com/community/tutorials/how-to-set-up-an-apache-mysql-and-python-lamp-server-without-frameworks-on-ubuntu-14-04
- An Introduction to Networking Terminology, Interfaces, and Protocols https://www.digitalocean.com/community/tutorials/an-introduction-to-networking-terminology-interfaces-and-protocols
Distributed Systems Resources
- Introduction to Distributed System Design http://www.hpcs.cs.tsukuba.ac.jp/~tatebe/lecture/h23/dsys/dsd-tutorial.html
- Blog on building bigger, faster, more reliable websites http://highscalability.com/
Monitoring and Logging
- Building for Production: Web Applications — Centralized Logging https://www.digitalocean.com/community/tutorials/building-for-production-web-applications-centralized-logging
- How To Gather Infrastructure Metrics with Packetbeat and ELK on Ubuntu 16.04 https://www.digitalocean.com/community/tutorials/how-to-gather-infrastructure-metrics-with-packetbeat-and-elk-on-ubuntu-16-04
- How To Install Nagios 4 and Monitor Your Servers on Ubuntu 14.04 https://www.digitalocean.com/community/tutorials/how-to-install-nagios-4-and-monitor-your-servers-on-ubuntu-14-04
- How To Use Icinga To Monitor Your Servers and Services On Ubuntu 14.04 https://www.digitalocean.com/community/tutorials/how-to-use-icinga-to-monitor-your-servers-and-services-on-ubuntu-14-04
- “How long will it take you to build this?” is a very hard question when you are starting out, and it stays hard even with years of experience. Measure the time you actually spend on tasks to learn your engineering pace; as you improve, you will see that pace change. https://www.atlassian.com/agile/estimation
- Improve your project estimation skills. Good estimation helps you accurately calculate the “time invested” for each project. Read The Effective Engineer to learn more.
- Allow buffer room for the unknown in the schedule. Take into account competing work obligations, holidays, illnesses, etc. The longer a project, the higher the probability that some of these will occur.
- Define measurable milestones. Clear milestones can alert us as to whether we’re on track or falling behind. Use them as opportunities to revise our estimates.
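The estimation advice above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample hours, the function names, and the idea of scaling a new estimate by your historical actual-to-estimated ratio plus a fixed buffer are all hypothetical, not a method taken from the linked resources.

```python
def pace_ratio(history):
    """Return total actual hours divided by total estimated hours."""
    estimated = sum(est for est, _ in history)
    actual = sum(act for _, act in history)
    return actual / estimated


def buffered_estimate(raw_hours, history, buffer=0.2):
    """Scale a raw estimate by past pace, then add buffer for the unknown."""
    return raw_hours * pace_ratio(history) * (1 + buffer)


# Hypothetical (estimated_hours, actual_hours) pairs from past tasks.
history = [(4, 6), (8, 10), (2, 2)]

print(f"pace ratio: {pace_ratio(history):.2f}")
print(f"buffered estimate for a 10h task: {buffered_estimate(10, history):.1f}h")
```

A ratio above 1 means past tasks took longer than estimated. Recomputing it at each milestone is one way to revise the estimate as the project moves along.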
Prepping for Tech Interviews
- Deploy your project on multiple services (Amazon AWS, DigitalOcean, etc)
- Get the GitHub Student Backpack if you are able to, it has tons of savings: https://education.github.com/pack
- Rewrite your project in a new framework/language or try switching to a different database
- Check out http://girlgeekacademy.com/ for upcoming events
- Do a Databases Interview Preparation Course: https://www.go1.com/#!/course/databases-interview-preparation/980537
- Go to a hackathon like She Hacks
- Whiteboard through coding questions in “Cracking The Coding Interview” with friends and deep dive into your projects or practice troubleshooting examples.
Recommended Books & Papers
- Site Reliability Engineering http://shop.oreilly.com/product/0636920041528.do
- Cracking the Tech Career https://www.amazon.com/Cracking-Tech-Career-Insider-Microsoft/dp/1118968085
- Cracking the Coding Interview https://www.amazon.com/Cracking-Coding-Interview-Programming-Questions/dp/098478280X
- The Effective Engineer http://www.theeffectiveengineer.com/
- The Go Programming Language https://www.amazon.com/Programming-Language-Addison-Wesley-Professional-Computing/dp/0134190440
- Automate the Boring Stuff with Python https://automatetheboringstuff.com/
- TCP/IP Guide http://shop.oreilly.com/product/9781593270476.do
- Understanding Linux Network Internals http://shop.oreilly.com/product/9780596002558.do
- Security for web developers http://shop.oreilly.com/product/0636920041429.do
- How Complex Systems Fail http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf
- Ladies who Linux https://www.meetup.com/Ladies-Who-Linux-Bay-Area/
- Women Who Go www.meetup.com/Women-Who-Go/
- PyLadies https://www.meetup.com/PyLadiesSF/
The first SRE
Margaret Hamilton, working on the Apollo program on loan from MIT, had all the significant traits of the first SRE. In her own words, “part of the culture was to learn from everyone and everything, including from that which one would least expect” (Site Reliability Engineering: How Google Runs Production Systems).
Best of luck on this exciting journey!
Tammy & Krishelle