Learning about Site Reliability Engineering with the #100daysofSRE Challenge
I will post something I have learned almost every day until the 100-day challenge ends. #SRE #100daysofSRE
I currently work as a Site Reliability Engineer. Very few resources cover the full range of topics in an SRE role, so I decided to add some value by writing about those topics while learning more about them myself.
There are many challenges of this kind, such as #100daysofcode, #100daysofsecurity, and others. I like this format for two reasons: first, it obliges me to complete the challenge; second, it adds value for me as well as for the community.
To avoid flooding Medium with this long challenge, I will publish the posts on my own blog and will probably link each item on the list below to its post so that you can navigate them from here.
You can find the actual content and links in this GitHub repository.
Probable Topics for the Challenge
- Introduction to Site Reliability Engineering
- History of SRE and its evolution
- SLAs, SLOs, and SLIs — understanding the metrics of reliability
- Chaos engineering and the benefits of breaking things on purpose
- Automation and its role in SRE
- Incident management and response for SRE
- Effective communication during incidents
- Root cause analysis and post-incident reviews
- Monitoring and observability
- Grafana vs Splunk for monitoring and observability
- Logging and log analysis
- Alerting and notification strategies
- Capacity planning and management
- Load testing and stress testing
- Disaster recovery planning and testing
- High availability and redundancy strategies
- Failover and failback procedures
- Performance optimization and tuning
- Bottleneck analysis and resolution
- Troubleshooting techniques and strategies
- Continuous integration and deployment
- DevOps — similarities and differences
- Security — best practices and considerations
- Cloud infrastructure
- Infrastructure as code
- Kubernetes
- Service meshes
- Serverless computing
- Multi-cloud and hybrid cloud strategies
- Building resilience into systems and applications
- Scalability and elasticity in modern systems
- Microservices
- Event-driven architectures
- Distributed systems
- Fault tolerance and reliability in distributed systems
- CAP theorem and its implications for distributed systems
- Consistency models and distributed systems
- Replication and consensus algorithms in distributed systems
- Network reliability and resilience
- Network partitioning and the impact on reliability
- Network congestion and its impact on reliability
- Network latency and its impact on reliability
- Network topologies and their impact on reliability
- DNS — considerations and best practices
- CDN — considerations and best practices
- Storage — considerations and best practices
- Backup and restore strategies for data reliability
- Data consistency and durability in modern systems
- Database reliability and high availability
- Replication strategies and techniques for databases
- Database sharding and its impact on reliability
- CAP theorem and databases
- Monitoring and observability for databases
- Caching — considerations and best practices
- Load balancing — considerations and best practices
- API reliability and best practices
- Mobile app reliability and best practices
- Web app reliability and best practices
- Microservices reliability and best practices
- Cloud-native app reliability and best practices
- IoT device reliability and best practices
- Real-time systems reliability and best practices
- Machine learning model reliability and best practices
- Kubernetes reliability and best practices
- Service mesh reliability and best practices
- Multi-cloud reliability and best practices
- Hybrid cloud reliability and best practices
- Serverless reliability and best practices
- Monitoring and alerting for reliability
- Capacity planning and management for reliability
- Incident management and response for reliability
- Post-incident analysis and review for reliability
- Configuration management for reliability
- Change management for reliability
- Testing and verification for reliability
- Rollback and recovery for reliability
- Automation
- Hiring and retaining SRE talent
- Managing remote SRE teams
- Budgeting and cost optimization for reliability
- Metrics and reporting for reliability
- Developing an SRE career path
- SRE certifications and training programs
- SRE community and resources
- Emerging trends in SRE
- Building a culture of reliability in organizations
Concluding Remarks
Well, that’s quite a long list. I hope not to miss any days unless there is a good reason, and I hope you will follow along throughout my journey and wish me the best.
Have a great day! Cheers!!!