Learning about Site Reliability Engineering with the #100daysofSRE Challenge

I will post something I have learned almost every day until the 100 days of the challenge end. #SRE #100daysofSRE

Apr 11, 2023

#100daysofSRE challenge #100daysoflearning | Site Reliability Engineering | SRE Topics, aspects, roles, and responsibilities — Photo created using Canva

I will join a company as a Site Reliability Engineer intern in the summer of 2023. I have previous experience in Sysadmin and DevOps roles. Although SRE is similar to both in some ways, I found that there are also some key differences.

There are also very few resources that actually cover the topics and aspects of an SRE role. So, I decided to work through those aspects one at a time, adding value for others while learning more myself.

There are a lot of challenges like #100daysofcode, #100daysofsecurity, and others. I really like this format for two reasons: first, it obliges me to complete the challenge; second, it adds value for me as well as for the community.

So, today, I asked ChatGPT to give me a topic plan for the next 100 days, and the list below is what I will cover. Since I want to avoid flooding Medium with this long challenge, I will publish the posts on my own blog and link them from the items on the list so that you can navigate to them. I will post a weekly summary on Medium, though!

Update: I found that many of the suggestions from ChatGPT overlap. So, I decided to skip the overlapping topics (some can also be merged into one), and I will add more technical content along the way. For now, I have added TBA placeholders at the end of the list.

You can also find the contents in this GitHub repository.

Contents for the Challenge

Introduction to Site Reliability Engineering
History of SRE and its evolution
SLAs, SLOs, and SLIs — understanding the metrics of reliability
Chaos engineering and the benefits of breaking things on purpose
Automation and its role in SRE
Incident management and response for SRE
Effective communication during incidents
Root cause analysis and post-incident reviews
Monitoring and observability
Grafana vs Splunk for Monitoring and Observability
Logging and log analysis
Alerting and notification strategies
Capacity planning and management
Load testing and stress testing
Disaster recovery planning and testing
High availability and redundancy strategies
Failover and failback procedures
Performance optimization and tuning
Bottleneck analysis and resolution
Troubleshooting techniques and strategies
Continuous integration and deployment
DevOps and SRE — similarities and differences
Security and SRE — best practices and considerations
Cloud infrastructure and SRE
Infrastructure as code and SRE
Kubernetes and SRE
Service meshes and SRE
Serverless computing and SRE
Multi-cloud and hybrid cloud strategies for SRE
Building resilience into systems and applications
Scalability and elasticity in modern systems
Microservices and SRE
Event-driven architectures and SRE
Distributed systems and SRE
Fault tolerance and reliability in distributed systems
CAP theorem and its implications for distributed systems
Consistency models and distributed systems
Replication and consensus algorithms in distributed systems
Network reliability and resilience
Network partitioning and the impact on reliability
Network congestion and its impact on reliability
Network latency and its impact on reliability
Network topologies and their impact on reliability
DNS and SRE — considerations and best practices
CDN and SRE — considerations and best practices
Storage and SRE — considerations and best practices
Backup and restore strategies for data reliability
Data consistency and durability in modern systems
Database reliability and high availability
Replication strategies and techniques for databases
Database sharding and its impact on reliability
CAP theorem and databases
Monitoring and observability for databases
Caching and SRE — considerations and best practices
Load balancing and SRE — considerations and best practices
API reliability and best practices
Mobile app reliability and best practices
Web app reliability and best practices
Microservices reliability and best practices
Cloud-native app reliability and best practices
IoT device reliability and best practices
Real-time systems reliability and best practices
Machine learning model reliability and best practices
Kubernetes reliability and best practices
Service mesh reliability and best practices
Multi-cloud reliability and best practices
Hybrid cloud reliability and best practices
Serverless reliability and best practices
Monitoring and alerting for reliability
Capacity planning and management for reliability
Incident management and response for reliability
Post-incident analysis and review for reliability
Configuration management for reliability
Change management for reliability
Testing and verification for reliability
Rollback and recovery for reliability
Continuous integration and deployment for reliability
Infrastructure as code for reliability
Automation
Continuous improvement and innovation for reliability
Team building and culture for reliability
Hiring and retaining SRE talent
Managing remote SRE teams
Stakeholder management for reliability
Budgeting and cost optimization for reliability
Metrics and reporting for reliability
Communication and collaboration for reliability
Influencing and leading without authority for reliability
Developing an SRE career path
SRE certifications and training programs
SRE community and resources
Emerging trends in SRE
Building a culture of reliability in organizations
SRE maturity model and assessment
SRE implementation and adoption strategies
SRE and business value, customer experience, and competitive advantage
TBA…
TBA…
TBA…
TBA…
Reflections on the 100-day challenge and next steps

Concluding Remarks

Well, that’s quite a long list. I hope I will not miss any days unless there is a good reason for it. And I hope you will be there throughout my journey and wish me the best.

Have a great day! Cheers!!!

Written by Shanto Roy