Big Data Platform Migration from EMR to EMR on EKS — Success Story
Processing terabytes of data within an hour, alongside other complex workloads, gets difficult and complicated quickly. Even with EMR, it became quite challenging. At this scale we faced issues with:
- Efficiency
- Operability
- Observability
Scalability:
- We process more than a TB of data within an hour, which covers 200+ tables and 500+ KPIs (data products) used for day-to-day business decision-making.
- Maximum EMR concurrency is 200, so processing a large enough volume of data within the required SLAs on a single cluster seemed possible. When we wanted more concurrency, the solution that made sense to us was one cluster per domain. However, every cluster needs maintenance, and that effort multiplied as we created more clusters. Even if the maintenance cost is bearable at times, what about scaling and changing instance types? In our experience, most AWS instance types are not compatible with an EMR cluster. The instance family that mattered most to us was memory-optimized, for Spark and Hudi streaming.
- AWS EMR auto-scaling didn’t work for most of our use cases. Upscaling takes a long time to bring resources up when they are needed, and in the meantime our SLAs were violated.
- On EKS, on the other hand, we are free to use almost all instance types. It took some trial and error, but it now works smoothly for us. Beyond instance choice, autoscaling on EKS is far better than with EMR on EC2, and the best part of EKS for us is Karpenter.
Reliability
- Processing and managing big data means bigger costs, and platform and big data engineers are also responsible for building and running cost-optimized solutions. After a few brainstorming and hair-pulling sessions, we observed that there were a few non-peak hours during which our data stakeholders did not need fresh data. Since the load on the clusters was lower in those hours, we created a policy on EKS to downscale the cluster automatically. This saved us money without disturbing business use cases.
- Even with fast enough autoscaling using Karpenter with EMR on EKS, we wanted to save more money to get a better return on investment (ROI). We figured out that we could use Spot instances in EKS. These are much cheaper instances that run on spare AWS capacity and can be reclaimed at short notice, unlike on-demand instances, which are not reclaimed once they are running.
- With EMR on EKS, your compute resources can be shared between your Apache Spark applications and your other Kubernetes applications. Resources are allocated and removed on demand to eliminate over-provisioning or under-utilization of these resources, enabling you to lower costs, as you only pay for the resources you use.
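The Spot idea above could be wired into a Spark job roughly as follows. This is a minimal sketch under stated assumptions, not necessarily the exact setup described here: it assumes Spark 3.3 or later (for the driver- and executor-specific node selector settings) and nodes provisioned by Karpenter, which labels them with `karpenter.sh/capacity-type`. Keeping the driver on on-demand capacity is a common pattern that we assume for illustration, and the helper names `spot_placement_conf` and `to_submit_parameters` are hypothetical.

```python
from typing import Dict

# Karpenter's well-known node label; its values are "spot" and "on-demand".
CAPACITY_LABEL = "karpenter.sh/capacity-type"


def spot_placement_conf(executors_on_spot: bool = True) -> Dict[str, str]:
    """Build Spark-on-Kubernetes settings that steer pods by capacity type.

    The driver is pinned to on-demand nodes (assumption: losing the driver
    fails the whole job), while executors may run on cheaper Spot nodes.
    """
    executor_capacity = "spot" if executors_on_spot else "on-demand"
    return {
        f"spark.kubernetes.driver.node.selector.{CAPACITY_LABEL}": "on-demand",
        f"spark.kubernetes.executor.node.selector.{CAPACITY_LABEL}": executor_capacity,
    }


def to_submit_parameters(conf: Dict[str, str]) -> str:
    """Render a config dict as a `--conf key=value` string for spark-submit."""
    return " ".join(f"--conf {key}={value}" for key, value in sorted(conf.items()))


if __name__ == "__main__":
    print(to_submit_parameters(spot_placement_conf()))
```

The resulting string can be appended to the Spark submit parameters of a job run. Executors that lose their Spot node are rescheduled by Spark, which is why only executors are placed on Spot capacity in this sketch.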
Logs Monitoring and Observability
Another important point is the Spark History Server; here are some differences:
- With EMR, it was available only during the cluster execution.
- With Kubernetes, it is launched with S3 persistence, so the historical data remains available after the cluster terminates.
- In the EMR scenario, the monitoring solution had some caveats regarding workflow complexity that created friction between teams because of the lack of standardization.
- With the Kubernetes solution, our Platform Engineering team at Bazaar provides an opinionated workflow that makes it easy to add Prometheus-compatible metrics and visualize them on Grafana dashboards.
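To make the two observability points concrete, here is a minimal sketch of the Spark settings that typically sit behind them: event logs written to S3 so that a Spark History Server can read them after the job is gone, and Spark's built-in Prometheus endpoints. It is an illustration under assumptions, not the workflow our platform team ships. The function name `observability_conf` and the bucket path in the usage example are hypothetical; the configuration keys are standard Spark 3.x settings.

```python
from typing import Dict


def observability_conf(event_log_dir: str) -> Dict[str, str]:
    """Return Spark settings for persistent event logs and Prometheus metrics.

    `event_log_dir` must be an S3 location so the logs outlive the pods
    that wrote them (the bucket itself is an assumption of the caller).
    """
    if not event_log_dir.startswith(("s3://", "s3a://")):
        raise ValueError("event_log_dir should be an S3 URI, e.g. s3://bucket/prefix/")
    return {
        # Persist event logs; a History Server pointed at the same
        # location can replay them after the job has finished.
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": event_log_dir,
        # Expose executor metrics in Prometheus format on the driver UI.
        "spark.ui.prometheus.enabled": "true",
        # Expose the metrics system through the Prometheus servlet sink.
        "spark.metrics.conf.*.sink.prometheusServlet.class":
            "org.apache.spark.metrics.sink.PrometheusServlet",
        "spark.metrics.conf.*.sink.prometheusServlet.path": "/metrics/prometheus",
    }


if __name__ == "__main__":
    # Hypothetical bucket name, for illustration only.
    for key, value in observability_conf("s3://example-bucket/spark-events/").items():
        print(f"--conf {key}={value}")
```

With settings like these, a Prometheus server can scrape the driver, and Grafana dashboards can be built on top of the scraped metrics.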
This article is part of a series of posts about EMR on EKS and will be followed by another part with a tutorial on Apache Hudi: EMR on EKS.
Disclaimer:
Bazaar Technologies believes in sharing knowledge and freedom of expression, and it encourages its colleagues and friends to share knowledge, experiences and opinions in written form on its Medium publication, in the hope that some people across the globe might find the content helpful. However, the content shared in this post and other posts on its Medium publication mostly describes and highlights the opinions and experiences of the author, which might or might not be the actual and official perspective of Bazaar Technologies.