Supporting Multiple Time Zones on Hive with Single Data Source

Mert Tunç
Published in Udemy Tech Blog
Jul 20, 2020
The same “present” being represented with different times using different time zones

At Udemy, most internal analytics dashboards rely on data residing in Pacific Time (PT) partitioned Hive tables. However, data collected from our event tracking system is partitioned based on Coordinated Universal Time (UTC). To enable efficient analysis of these datasets while avoiding both complicated joins and unnecessary data duplication, we wrote custom code to create a “view” of the UTC data as PT within Hive.

In this blog post, I will share our experience supporting multiple partitioning schemes on Hive while keeping a single copy of the data in our data lake.

What is the Event Tracking System?

The Event Tracking framework at Udemy is a scalable and highly available event ingestion system for collecting user behavioral data. The overall data flow consists of three stages:

  1. Data is prepared and sent to the system via the provided client libraries.
  2. The Event Tracking System consists of four microservices that process the data collected from clients; data is transferred between these services through Kafka.
  3. Data is synced to AWS S3, and Hive metadata is updated for long-term storage and batch processing.

High-Level View of Data Storages of Event Tracking

In this blog post, the focus will be on the third stage of the system.

The journey of the data through the Event Tracking System after arrival and processing consists of two main steps: syncing Kafka with AWS S3 and registering the new data partitions in the Hive external tables.

Event Tracking uses Kafka as its real-time source of truth for data. At the end of the journey on Kafka, events are copied from Kafka to AWS S3 by an open-source component named Kafka Connect. We use hourly UTC partitions while storing data on S3. Hourly partitions reduce query time by limiting the range of data to the relevant partitions, thus reducing the amount of I/O. To support this, Kafka Connect is configured accordingly, and the Hive tables are created with four partition columns: year, month, day, and hour.
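
As a rough sketch, such a table could be defined as follows; the database, table, bucket, storage format, and columns other than receiveTime are illustrative rather than our actual definitions:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS events.page_view (
  receiveTime TIMESTAMP,  -- stored as a UNIX timestamp, rendered in the cluster time zone on read
  userId      BIGINT,
  payload     STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING, hour STRING)
STORED AS PARQUET
LOCATION 's3://data-lake/events/page_view/';
```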

When the data for a new partition appears on S3, Hive external tables require these new partitions to be registered in the table metadata. For this purpose, a simple workflow that issues the partition-registration queries to Hive is executed hourly. After that operation, newly arrived data can be queried from Hive.
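
A minimal sketch of such a registration query, assuming the hypothetical table above and S3 prefixes that follow Hive’s key=value layout:

```sql
-- Register the partition for the UTC hour 2020-05-09 00:00
ALTER TABLE events.page_view
ADD IF NOT EXISTS PARTITION (year = '2020', month = '05', day = '09', hour = '00');
```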

Problems

For context, this is how the Event Tracking System handles time information:

  • All timestamp fields are stored as UNIX timestamps. UNIX timestamp fields can be easily converted to different time zones (see the example after this list). Note that the four partition columns (i.e., year, month, day, and hour) are stored as string fields instead of UNIX timestamp fields, since they do not store timestamps themselves but rather discretized components of one.
  • Kafka Connect creates hourly S3 prefixes for the data while syncing Kafka with the data lake on S3. Kafka Connect is configured to use hourly UTC partitioning while writing data to our data lake, so the source-of-truth data is stored UTC-partitioned. Using UTC is beneficial because it is the industry standard and is not affected by daylight saving time.
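
To illustrate the conversion, here is a quick sketch using Hive’s built-in functions, assuming a cluster whose local time zone is PT (the literal is 2020-05-09 00:00:00 UTC):

```sql
SELECT
  from_unixtime(1588982400)                                           AS cluster_local,  -- 2020-05-08 17:00:00 on a PT cluster
  to_utc_timestamp(from_unixtime(1588982400), 'America/Los_Angeles')  AS utc_time;       -- 2020-05-09 00:00:00
```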

Udemy’s Hive cluster is configured for the PT time zone

The first problem arises when we use this data in a downstream system that uses a different time zone, such as Udemy’s Hive cluster.

Hive shows the UNIX time fields in its local time, which is PT in our case.

Consider the following SQL:
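
Something along these lines, using the hypothetical table from above and filtering on the UTC partition columns:

```sql
SELECT receiveTime
FROM events.page_view
WHERE year = '2020' AND month = '05' AND day = '09' AND hour = '00'
LIMIT 1;
```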

It returns:
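
Because the cluster renders timestamps in PT, the output looks roughly like this (the exact value is illustrative):

```
receivetime
2020-05-08 17:00:12
```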

This looks odd: we query the 9th of May, but the returned data appears to belong to the 8th of May. The reason for this behavior is that while the partitions are in UTC, Hive renders the UNIX timestamp data inside the time fields in its local time zone, which is PT. Filtering on the receiveTime field instead of the partition columns would avoid the mismatch, but it would greatly increase query time because partition pruning would be lost. Stakeholders frequently asked the Event Tracking team about this inconsistency between the timestamp fields and the queried partition time.

Many data sets at Udemy are partitioned using PT

This discrepancy makes Event Tracking data hard to join with the PT-partitioned data.

Consider the following SQL to get data for 2020-04-09 in PT:
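
A sketch of what that query has to look like, again with the hypothetical table from above; since April falls in PDT (UTC-7), the PT day 2020-04-09 spans the UTC hours from 07:00 on the 9th to 06:59 on the 10th:

```sql
SELECT receiveTime
FROM events.page_view
WHERE (year = '2020' AND month = '04' AND day = '09' AND hour >= '07')
   OR (year = '2020' AND month = '04' AND day = '10' AND hour < '07');
```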

We needed to manually adjust UTC partitions to match with PT.

This is far from ideal since it is both counterintuitive and cumbersome to require explicit hourly filtering whenever one wants to join event tracking data with other tables. The above query is also particularly fragile, since one would need to account for Daylight Saving Time by changing the hour offset dynamically depending on the time of year (7 hours under PDT, 8 under PST). This kind of behavior makes it necessary to support PT-partitioned data.

So we wanted to support PT-partitioned tables as well as UTC ones to enable other teams and improve downstream code quality by eliminating the need for custom time handling in each workflow separately. We also wanted to keep the UTC-partitioned tables to be prepared for an eventual migration to UTC over the long run.

Possible Solutions to support PT partitions

The first and most straightforward solution is to have another connector definition on Kafka Connect. Connectors are logical entities that state the source and destination locations of data and the related configurations for syncing the data between those two data stores. One of the configurations that can be utilized is the `timezone` parameter, which we can use to create the data on AWS S3 in PT partitions. In this solution, since we would sync the data to S3 twice, both the storage we use and the load on Kafka Connect would be doubled.
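
For illustration, a second connector could look roughly like the following, using the time-based partitioner settings of the Confluent S3 sink connector (the connector name, topic, and bucket are made up):

```
name=event-tracking-s3-sink-pt
connector.class=io.confluent.connect.s3.S3SinkConnector
topics=page_view
s3.bucket.name=data-lake-pt
partitioner.class=io.confluent.connect.storage.partitioner.TimeBasedPartitioner
partition.duration.ms=3600000
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH
locale=en-US
timezone=America/Los_Angeles
```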

This solution has other issues as well, such as losing a clear source-of-truth definition, since there would be two copies of the same data, partitioned differently, on S3 and in Hive.

The second solution is to create a new set of external Hive tables and manually map the Hive partitions to S3 prefixes, taking into account the time difference between PT and UTC for the time the data belongs to. This has to be handled by a custom workflow, since the time difference between PT and UTC changes twice a year.

Hive allows a query that manually states which of its partitions map to which S3 folders:
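
A sketch of such a statement, assuming a hypothetical PT-partitioned table events_pt.page_view backed by the same UTC S3 prefixes:

```sql
-- PT partition 2020-04-09 00:00 (PDT, UTC-7) lives in the UTC 07:00 S3 prefix
ALTER TABLE events_pt.page_view
ADD IF NOT EXISTS PARTITION (year = '2020', month = '04', day = '09', hour = '00')
LOCATION 's3://data-lake/events/page_view/year=2020/month=04/day=09/hour=07/';
```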

We can use this ability to map S3 folders with an offset, changing the offset twice a year. The challenging part of this solution is handling the offset changes against the UTC data.

We decided to go with the second approach because the first solution requires a high amount of data duplication.

Implementation of the Second Solution

We already had a workflow that runs hourly and creates Hive partitions linked to S3 prefixes.

We modified this workflow to also support PT partition creation:

For the times without timezone shifts

  • The Hive table time zone became a parameter. The workflow is now given the partition time zone of the target Hive table and acts accordingly, so all possible time zones are supported.
  • For partitions that do not involve a PST to PDT or PDT to PST shift (all partitions in a year except for two hours), we can easily calculate the PT offset and shift the UTC S3 data by that offset to determine the Hive partition time, as sketched below.
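
A minimal sketch of that mapping in Python, using the standard zoneinfo library (the function name is made up):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

def local_partition_for_utc_hour(utc_hour: datetime, tz_name: str = "America/Los_Angeles") -> datetime:
    """Map a UTC S3 partition hour to the corresponding local (e.g. PT) Hive partition hour."""
    return utc_hour.replace(tzinfo=timezone.utc).astimezone(ZoneInfo(tz_name))

# The UTC prefix .../day=09/hour=07/ maps to the PT partition 2020-04-09 00:00 (PDT, UTC-7) ...
print(local_partition_for_utc_hour(datetime(2020, 4, 9, 7)))   # 2020-04-09 00:00:00-07:00
# ... while in winter (PST, UTC-8) the same UTC hour maps one hour earlier.
print(local_partition_for_utc_hour(datetime(2020, 12, 9, 7)))  # 2020-12-08 23:00:00-08:00
```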

Handling the two shifting hours every year

PT alternates between two time zones through the year: PST (Pacific Standard Time) and PDT (Pacific Daylight Time). We need to handle the two special hours where a shift occurs: the PST to PDT shift and the PDT to PST shift. For that, the workflow detects whether the change is currently happening and acts accordingly.
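
One way to detect the shift, as a sketch, is to compare the local UTC offsets of a given UTC hour and the next one (again using zoneinfo; the function name is made up):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional
from zoneinfo import ZoneInfo

PT = ZoneInfo("America/Los_Angeles")

def shift_kind(utc_hour: datetime) -> Optional[str]:
    """Classify the transition between this UTC hour and the next one in PT."""
    start = utc_hour.replace(tzinfo=timezone.utc).astimezone(PT)
    after = (utc_hour.replace(tzinfo=timezone.utc) + timedelta(hours=1)).astimezone(PT)
    delta = after.utcoffset() - start.utcoffset()
    if delta > timedelta(0):
        return "spring_forward"  # PST -> PDT: one local hour is skipped
    if delta < timedelta(0):
        return "fall_back"       # PDT -> PST: one local hour repeats, so two UTC hours collide
    return None                  # no shift during this hour

print(shift_kind(datetime(2020, 11, 1, 8)))  # fall_back: 08:00 and 09:00 UTC both map to 1 a.m. PT
print(shift_kind(datetime(2020, 3, 8, 9)))   # spring_forward: 2 a.m. PT is skipped
```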

PST to PDT Timezone Shift

Handling PST to PDT is easy since the shift does not cause a collision in the mapping: around the PST to PDT shift, every PT hour corresponds to a unique UTC hour.

What happens is that, in Hive, there will be no partition for 2 a.m., which is expected since 2 a.m. does not exist in PT on the PST to PDT shift day. Since there is no collision in the UTC to PT mapping here, we can simply state the locations of the data to Hive for each hour in PT, as sketched below.
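
For example, on a shift day such as 2020-03-08, the registrations around the skipped hour would look roughly like this (the table and bucket names are the same hypothetical ones as above):

```sql
-- 1 a.m. PST on the shift day maps to 09:00 UTC
ALTER TABLE events_pt.page_view
ADD IF NOT EXISTS PARTITION (year = '2020', month = '03', day = '08', hour = '01')
LOCATION 's3://data-lake/events/page_view/year=2020/month=03/day=08/hour=09/';

-- 2 a.m. PT does not exist on this day, so no partition is created for hour = '02'

-- 3 a.m. PDT maps to 10:00 UTC
ALTER TABLE events_pt.page_view
ADD IF NOT EXISTS PARTITION (year = '2020', month = '03', day = '08', hour = '03')
LOCATION 's3://data-lake/events/page_view/year=2020/month=03/day=08/hour=10/';
```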

PDT to PST Timezone Shift

Handling PDT to PST is a bit trickier, since the mapping from UTC to PT is not one-to-one during this shift. 1 a.m. happens twice in PT (once in PDT, once in PST), which would require registering two separate external locations in Hive for the 1 a.m. partition, and Hive does not support that.

Hive only supports a single external data location prefix per partition. To overcome this, we detect the hour when the PDT to PST shift is occurring, and at that hour only, we copy the data of the two colliding UTC partitions (8 a.m. and 9 a.m. UTC, which both map to 1 a.m. PT) into one folder so that there is a single data location. Once we have a unified copy of those two hours, we register the partition in Hive pointing at the newly copied content, so Hive scans two hours of data when asked to query the hour of the shift.

A simplified sketch of the described operation looks like the following (bucket, prefix, and table names are illustrative, and error handling is omitted):
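
```python
from datetime import datetime

import boto3

# All bucket names, prefixes, and table names below are illustrative.
BUCKET = "data-lake"
UTC_PREFIX = "events/page_view/year={0:%Y}/month={0:%m}/day={0:%d}/hour={0:%H}/"
MERGED_PREFIX = "events/page_view_pt_merged/{0:%Y%m%d%H}/"


def merge_fall_back_hours(first_utc: datetime, second_utc: datetime, local_hour: datetime) -> str:
    """Copy the two colliding UTC hours (e.g. 08:00 and 09:00 UTC) into one S3 prefix
    and return the DDL that points the repeated local hour (1 a.m. PT) at it."""
    s3 = boto3.client("s3")
    target = MERGED_PREFIX.format(local_hour)
    for utc_hour in (first_utc, second_utc):
        source = UTC_PREFIX.format(utc_hour)
        for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=source):
            for obj in page.get("Contents", []):
                # Kafka Connect file names encode topic, partition, and offset,
                # so basenames from the two hours do not collide.
                s3.copy_object(
                    Bucket=BUCKET,
                    CopySource={"Bucket": BUCKET, "Key": obj["Key"]},
                    Key=target + obj["Key"].rsplit("/", 1)[-1],
                )
    return (
        "ALTER TABLE events_pt.page_view ADD IF NOT EXISTS "
        f"PARTITION (year='{local_hour:%Y}', month='{local_hour:%m}', "
        f"day='{local_hour:%d}', hour='{local_hour:%H}') "
        f"LOCATION 's3://{BUCKET}/{target}'"
    )


# 08:00 and 09:00 UTC on 2020-11-01 both map to 1 a.m. PT
ddl = merge_fall_back_hours(datetime(2020, 11, 1, 8), datetime(2020, 11, 1, 9), datetime(2020, 11, 1, 1))
print(ddl)
```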

Note that the actual implementation is a bit more complicated, since we need to wait for the hour to end before creating the unified S3 folder.

It is a legacy decision that many existing datasets at Udemy are partitioned by the PT time zone. As described, this causes many inconsistencies, mainly because the offset of that time zone changes twice a year. The long-term plan is to store and serve all data using UTC partitions, which will be the ultimate solution to the described problems and many others.

Result

A diagram for the overall journey of data

This approach helped us prevent duplication of data at the storage level and avoid the extra computational resources that a second Kafka to S3 sync would have required in the data duplication scenario. We have been able to support PT partitions without changing the UTC partitioning of the source data. In this way, Event Tracking data will also be ready for the company-wide migration to UTC partitioning, which is the ultimate solution to these problems.
