Learning about Site Reliability Engineering with the #100daysofSRE Challenge
I will post something I have learned almost every day until the 100-day challenge ends. #SRE #100daysofSRE
I will join a company as a Site Reliability Engineer intern during the summer of 2023. I have previous experience in Sysadmin and DevOps roles. However, even though SRE is similar to them in some ways, I found that there are also some key differences.
Also, there are very few resources that actually cover the topics and aspects of an SRE role. So I decided to add value by writing about some of those aspects as I learn more about the role.
There are a lot of challenges like #100daysofcode, #100daysofsecurity, and others. I really like this format for two reasons: first, it commits me to completing the challenge; second, it adds value for me as well as for the community.
So today I asked ChatGPT to give me a topic plan, and here is the list of 100 topics I will cover over the next 100 days. Since I want to avoid flooding Medium with this long challenge, I will publish the posts on my own blog and link them from the items on this list so that you can navigate to them. I will post a weekly summary on Medium, though!
Update: I found that many of ChatGPT's suggestions overlap. So I decided to skip the overlapping topics (some can also be merged into one), and I will add more technical content along the way. For now, I have added TBA placeholders at the end of the list.
You can also find the contents in this GitHub repository.
Contents for the Challenge
- Introduction to Site Reliability Engineering
- History of SRE and its Evolution
- SLAs, SLOs, and SLIs — understanding the metrics of reliability
- Chaos engineering and the benefits of breaking things on purpose
- Automation and its Role in SRE
- Incident management and response for SRE
- Effective communication during incidents
- Root cause analysis and post-incident reviews
- Monitoring and observability
- Grafana vs Splunk for Monitoring and Observability
- Logging and log analysis
- Alerting and notification strategies
- Capacity planning and management
- Load testing and stress testing
- Disaster recovery planning and testing
- High availability and redundancy strategies
- Failover and failback procedures
- Performance optimization and tuning
- Bottleneck analysis and resolution
- Troubleshooting techniques and strategies
- Continuous integration and deployment
- DevOps and SRE — similarities and differences
- Security and SRE — best practices and considerations
- Cloud infrastructure and SRE
- Infrastructure as code and SRE
- Kubernetes and SRE
- Service meshes and SRE
- Serverless computing and SRE
- Multi-cloud and hybrid cloud strategies for SRE
- Building resilience into systems and applications
- Scalability and elasticity in modern systems
- Microservices and SRE
- Event-driven architectures and SRE
- Distributed systems and SRE
- Fault tolerance and reliability in distributed systems
- CAP theorem and its implications for distributed systems
- Consistency models and distributed systems
- Replication and consensus algorithms in distributed systems
- Network reliability and resilience
- Network partitioning and the impact on reliability
- Network congestion and its impact on reliability
- Network latency and its impact on reliability
- Network topologies and their impact on reliability
- DNS and SRE — considerations and best practices
- CDN and SRE — considerations and best practices
- Storage and SRE — considerations and best practices
- Backup and restore strategies for data reliability
- Data consistency and durability in modern systems
- Database reliability and high availability
- Replication strategies and techniques for databases
- Database sharding and its impact on reliability
- CAP theorem and databases
- Monitoring and observability for databases
- Caching and SRE — considerations and best practices
- Load balancing and SRE — considerations and best practices
- API reliability and best practices
- Mobile app reliability and best practices
- Web app reliability and best practices
- Microservices reliability and best practices
- Cloud-native app reliability and best practices
- IoT device reliability and best practices
- Real-time systems reliability and best practices
- Machine learning model reliability and best practices
- Kubernetes reliability and best practices
- Service mesh reliability and best practices
- Multi-cloud reliability and best practices
- Hybrid cloud reliability and best practices
- Serverless reliability and best practices
- Monitoring and alerting for reliability
- Capacity planning and management for reliability
- Incident management and response for reliability
- Post-incident analysis and review for reliability
- Configuration management for reliability
- Change management for reliability
- Testing and verification for reliability
- Rollback and recovery for reliability
- Continuous integration and deployment for reliability
- Infrastructure as code for reliability
- Automation
- Continuous improvement and innovation for reliability
- Team building and culture for reliability
- Hiring and retaining SRE talent
- Managing remote SRE teams
- Stakeholder management for reliability
- Budgeting and cost optimization for reliability
- Metrics and reporting for reliability
- Communication and collaboration for reliability
- Influencing and leading without authority for reliability
- Developing an SRE career path
- SRE certifications and training programs
- SRE community and resources
- Emerging trends in SRE
- Building a Culture of Reliability in Organizations
- SRE maturity model and assessment
- SRE implementation and adoption strategies
- SRE and business value, customer experience, and competitive advantage
- TBA…
- TBA…
- TBA…
- TBA…
- Reflections on the 100-day challenge and next steps
Concluding Remarks
Well, that’s quite a long list. I hope I will not miss any days unless there is a good reason. And I hope you guys will be there throughout my journey and wish me the best.
Have a great day! Cheers!!!