Learning about Site Reliability Engineering with the #100daysofSRE Challenge

I will post something I have learned almost every day until the 100-day challenge ends. #SRE #100daysofSRE

Shanto Roy
Apr 11, 2023
#100daysofSRE challenge #100daysoflearning | Site Reliability Engineering | SRE Topics, aspects, roles, and responsibilities
Photo created using Canva

I will join a company as a Site Reliability Engineer intern during the summer of 2023. I have previous experience in Sysadmin and DevOps roles. However, even though SRE is similar to those roles in some ways, I found that there are also some key differences.

Also, I found very few resources that actually cover the topics and aspects of an SRE role. So, I decided to add some value by writing about those aspects while learning more about them myself.

There are a lot of challenges like #100daysofcode, #100daysofsecurity, and others. I really like this format for two reasons: first, a public commitment will oblige me to complete the challenge; second, it will add some value for me as well as for the community.

So, today, I asked ChatGPT to give me a topic plan for the next 100 days, and the list below is what I will cover. Since I want to avoid flooding Medium with this long challenge, I will publish the daily posts on my own blog and add the links to the items on the list here so that you can navigate to them. I will post a weekly summary on Medium, though!

Update: I found that many of the suggestions from ChatGPT overlap with one another. So, I decided to skip the overlapping topics (some can also be merged into one), and I will add more technical content along the way. For now, I am adding TBA placeholders at the end of the list.

You can also find the contents in this GitHub repository.

Contents for the Challenge

  1. Introduction to Site Reliability Engineering
  2. History of SRE and its Evolution
  3. SLAs, SLOs, and SLIs — understanding the metrics of reliability
  4. Chaos engineering and the benefits of breaking things on purpose
  5. Automation and its Role in SRE
  6. Incident management and response for SRE
  7. Effective communication during incidents
  8. Root cause analysis and post-incident reviews
  9. Monitoring and observability
  10. Grafana vs Splunk for Monitoring and Observability
  11. Logging and log analysis
  12. Alerting and notification strategies
  13. Capacity planning and management
  14. Load testing and stress testing
  15. Disaster recovery planning and testing
  16. High availability and redundancy strategies
  17. Failover and failback procedures
  18. Performance optimization and tuning
  19. Bottleneck analysis and resolution
  20. Troubleshooting techniques and strategies
  21. Continuous integration and deployment
  22. DevOps and SRE — similarities and differences
  23. Security and SRE — best practices and considerations
  24. Cloud infrastructure and SRE
  25. Infrastructure as code and SRE
  26. Kubernetes and SRE
  27. Service meshes and SRE
  28. Serverless computing and SRE
  29. Multi-cloud and hybrid cloud strategies for SRE
  30. Building resilience into systems and applications
  31. Scalability and elasticity in modern systems
  32. Microservices and SRE
  33. Event-driven architectures and SRE
  34. Distributed systems and SRE
  35. Fault tolerance and reliability in distributed systems
  36. CAP theorem and its implications for distributed systems
  37. Consistency models and distributed systems
  38. Replication and consensus algorithms in distributed systems
  39. Network reliability and resilience
  40. Network partitioning and the impact on reliability
  41. Network congestion and its impact on reliability
  42. Network latency and its impact on reliability
  43. Network topologies and their impact on reliability
  44. DNS and SRE — considerations and best practices
  45. CDN and SRE — considerations and best practices
  46. Storage and SRE — considerations and best practices
  47. Backup and restore strategies for data reliability
  48. Data consistency and durability in modern systems
  49. Database reliability and high availability
  50. Replication strategies and techniques for databases
  51. Database sharding and its impact on reliability
  52. CAP theorem and databases
  53. Monitoring and observability for databases
  54. Caching and SRE — considerations and best practices
  55. Load balancing and SRE — considerations and best practices
  56. API reliability and best practices
  57. Mobile app reliability and best practices
  58. Web app reliability and best practices
  59. Microservices reliability and best practices
  60. Cloud-native app reliability and best practices
  61. IoT device reliability and best practices
  62. Real-time systems reliability and best practices
  63. Machine learning model reliability and best practices
  64. Kubernetes reliability and best practices
  65. Service mesh reliability and best practices
  66. Multi-cloud reliability and best practices
  67. Hybrid cloud reliability and best practices
  68. Serverless reliability and best practices
  69. Monitoring and alerting for reliability
  70. Capacity planning and management for reliability
  71. Incident management and response for reliability
  72. Post-incident analysis and review for reliability
  73. Configuration management for reliability
  74. Change management for reliability
  75. Testing and verification for reliability
  76. Rollback and recovery for reliability
  77. Continuous integration and deployment for reliability
  78. Infrastructure as code for reliability
  79. Automation
  80. Continuous improvement and innovation for reliability
  81. Team building and culture for reliability
  82. Hiring and retaining SRE talent
  83. Managing remote SRE teams
  84. Stakeholder management for reliability
  85. Budgeting and cost optimization for reliability
  86. Metrics and reporting for reliability
  87. Communication and collaboration for reliability
  88. Influencing and leading without authority for reliability
  89. Developing an SRE career path
  90. SRE certifications and training programs
  91. SRE community and resources
  92. Emerging trends in SRE
  93. Building a Culture of Reliability in Organizations
  94. SRE maturity model and assessment
  95. SRE implementation and adoption strategies
  96. SRE and business value, customer experience, and competitive advantage
  97. TBA…
  98. TBA…
  99. TBA…
  100. TBA…
  101. Reflections on the 100-day challenge and next steps

Concluding Remarks

Well, that’s quite a long list. I hope I will not miss any days unless there is a good reason for it. And I hope you guys will be there throughout my journey and wish me the best.

Have a great day! Cheers!!!


Shanto Roy

I write about Cyber Security, Python, DevOps/SRE, Research, AI, and travel. 💻 Tech blog👉 shantoroy.com ✈️ Travel Blog👉 digitalnomadgoals.com