Optimizing DynamoDB WCU utilization in EMR-Hive

Axel Springer Tech
3 min read · Mar 11, 2020


By Somasundaram Sekar

User engagement score

Upday’s passion for providing high-quality, personalized content to its users means continuously monitoring user–content interactions, such as swipes, clicks, and time spent on the content itself, and converting those into user engagement scores, which can then be used to improve the UX by surfacing content more relevant to the user.

For us, the engineering team, this means generating roughly 3 billion engagement scores every day and stashing them in a highly scalable, highly available, low-latency store. So DynamoDB it was.

The Data pipeline

Engagement score Airflow DAG

Our data flows seamlessly through the various parts of our systems, coordinated by an Apache Airflow-based data pipeline.

The user engagement score pipeline generates the scores and also writes them to the DynamoDB table.
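To make the flow concrete, here is a minimal sketch of how such a daily run could be orchestrated as an Airflow DAG. The DAG id, task ids, and callables below are hypothetical stand-ins, not our actual pipeline code.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical stubs: in the real pipeline these would raise the table's
# WCU, submit the Hive step to the EMR cluster, and lower the WCU again.
def bump_wcu():
    pass

def write_scores_via_hive():
    pass

def reset_wcu():
    pass

with DAG(
    dag_id="engagement_score_writer",  # hypothetical name
    start_date=datetime(2020, 3, 1),
    schedule_interval="@daily",        # the write is batched, once a day
    catchup=False,
) as dag:
    bump = PythonOperator(task_id="bump_wcu", python_callable=bump_wcu)
    write = PythonOperator(task_id="write_scores", python_callable=write_scores_via_hive)
    reset = PythonOperator(task_id="reset_wcu", python_callable=reset_wcu)

    bump >> write >> reset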

The read workload on the table is predictable, as you can see from the ConsumedReadCapacityUnits/second CloudWatch graph. Hence we use provisioned Read Capacity Units (RCU), sized to also cover the occasional bursts in reads.

Predictable Read Capacity Units consumption

The write workload, though, is batched and happens only once a day, and we needed predictability in both cost and performance. The pipeline writes to the DynamoDB table from an EMR cluster running an Apache Hive application, using the emr-dynamodb-connector.

To have better control over the write throughput, we bump the table’s provisioned Write Capacity Units (WCU) to 30,000 before we run the Hive query and, upon completion, change it back to 1 WCU.
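Changing provisioned throughput is a single UpdateTable call. A minimal sketch with boto3, assuming a placeholder table name and keeping the current RCU untouched:

import boto3

dynamodb = boto3.client("dynamodb")
TABLE = "user-engagement-scores"  # hypothetical table name

def set_write_capacity(wcu: int) -> None:
    # Read the current RCU so only the WCU changes.
    desc = dynamodb.describe_table(TableName=TABLE)
    rcu = desc["Table"]["ProvisionedThroughput"]["ReadCapacityUnits"]
    dynamodb.update_table(
        TableName=TABLE,
        ProvisionedThroughput={
            "ReadCapacityUnits": rcu,
            "WriteCapacityUnits": wcu,
        },
    )
    # Wait until the table is ACTIVE again before proceeding.
    dynamodb.get_waiter("table_exists").wait(TableName=TABLE)

set_write_capacity(30000)  # before the Hive query
# ... run the Hive job ...
set_write_capacity(1)      # after completion

Note that DynamoDB limits how many times per day provisioned capacity can be decreased, which is not a concern for a once-a-day batch write.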

The Problem

All was well until we started noticing that the provisioned capacity was underutilized at times and the job was taking twice as long to complete.

Underutilized Write capacity units

Upon analyzing the Hive logs, it was evident that the issue was caused by idle mappers: Tez’s “split grouping” had grouped the input files into fewer splits, reducing the number of records written to the table in parallel and, consequently, the consumed WCU.

Tez and the split count

With Amazon EMR release 5.x, Tez is the default execution engine for Hive instead of MapReduce. Tez provides improved performance for most workflows.

Tez, among other parameters, uses tez.grouping.max-size and tez.grouping.min-size to calculate the splits and, in turn, the parallelism.

This was the root cause of the WCU underutilization in our case: the gzipped CSV files were grouped into fewer splits than the number of available map slots, limiting the parallelism.
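To see why this starves the mappers, here is a simplified back-of-the-envelope model of size-based split grouping. This is an illustration, not Tez’s actual grouping algorithm; the constants are the documented tez.grouping.* defaults, and the file sizes are made up for the example.

# Simplified model: greedily pack files into grouped splits of up to
# tez.grouping.max-size bytes (the real algorithm also honors a minimum
# group size and the available cluster parallelism).
MIN_SIZE = 50 * 1024 * 1024    # tez.grouping.min-size default (unused here)
MAX_SIZE = 1024 * 1024 * 1024  # tez.grouping.max-size default

def grouped_split_count(file_sizes, target=MAX_SIZE):
    splits, current = 0, 0
    for size in file_sizes:
        if current and current + size > target:
            splits += 1
            current = 0
        current += size
    return splits + (1 if current else 0)

# e.g. 300 gzipped CSVs of ~10 MB each (~3 GB in total):
print(grouped_split_count([10 * 1024 * 1024] * 300))  # -> 3 splits

With roughly 3 GB of input collapsing into a handful of splits, only that many mappers write to DynamoDB concurrently, no matter how many map slots the cluster has.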

Solution

The solution was to control the parallelism (i.e., the number of mappers in this case) and to set the split count much higher than the number of available mappers for proper WCU utilization.

We noticed that utilization peaked when the number of mappers was around 30, so we set the number of mappers to 30 and tez.grouping.split-count to the number of files in the S3 location, which is always far higher than 30, at least 10 times more.

-- Hive configuration properties
SET tez.grouping.split-count = [file count];
SET mapreduce.job.maps = 30;
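The file count itself can be looked up just before the query is submitted. A small sketch with boto3; the bucket and prefix are hypothetical placeholders:

import boto3

s3 = boto3.client("s3")

def count_files(bucket: str, prefix: str) -> int:
    # Page through the listing; KeyCount is the number of keys per page.
    paginator = s3.get_paginator("list_objects_v2")
    return sum(
        page.get("KeyCount", 0)
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
    )

split_count = count_files("upday-scores", "engagement/dt=2020-03-11/")
# Plugged into the Hive session as:
#   SET tez.grouping.split-count = <split_count>;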

Summary

With the above configuration in place, the utilization of WCU improved significantly and remained consistent, as evident in the CloudWatch graph below.

Better utilization of WCU

Though Tez’s default grouping behavior optimizes the number of map tasks for better resource utilization, it turned counterproductive in this particular case because we needed higher throughput while writing to DynamoDB, and the configuration properties available to tune this behavior came in very handy.

This story was originally published on the upday devs blog.
