Modularising Hive Queries

Jay Green-Stevens
The Hotels.com Technology Blog
Feb 26, 2018

As anyone who works with Hive and HQL (or even SQL) will know, monolithic queries can often become very long and quite tedious to read. I have heard horror stories of encounters with queries containing hundreds, even thousands, of lines of code. So, the Big Data Platform team at Hotels.com decided to find a way to make these large, monolithic queries easier to read, maintain, and test.

Below is an illustrative sketch of a typical monolithic query containing a number of nested queries (which for brevity’s sake have been replaced with ellipses):
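-- Skeleton only: each ellipsis stands for a full query in its own right
SELECT ...
FROM (
  SELECT ...
  FROM (
    SELECT ...
    FROM (
      ...
    ) innermost
    WHERE ...
  ) middle
  GROUP BY ...
) outermost
WHERE ...
ORDER BY ...;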

Queries like this are hard to follow, as each nested query must first be understood within the context of its outer query.

Monolithic queries also become problematic when it comes to testing, as the only check that can be made is on the overall output for some known input, which makes the source of errors difficult to pinpoint. So, to unit test this query effectively it needs to be modularised: broken down into its separate sub-queries in a way that allows each to be tested on its own.

Hive provides a number of ways to modularise queries but, as we discovered, it is not necessarily clear which is best. So, as part of my industrial placement at Hotels.com, I was tasked with implementing a number of these modularisation methods and comparing them to determine which was the most favourable.

Modularisation Approaches

Five different approaches were considered: variable substitution, HPL/SQL, chained tables (both regular and temporary), and views.

When attempting to implement variable substitution we discovered that the version of Hive in use did not support it properly; instead, the substitution had to be performed in Bash as a workaround. This approach was deemed unnecessarily complex and harder to maintain, so it was not pursued any further, though it may well be worth revisiting with a newer version of Hive.

HPL/SQL turned out not to be particularly useful for modularisation either, as it lacks the features required to “pipeline” queries, that is, to allow data to flow from one procedure to another. HPL/SQL was therefore also discounted, and the investigation moved on to chained tables.

Chained Tables

Chained tables are the most frequently used modularisation approach and can be implemented using either regular or temporary tables. Both variants use intermediate tables to store the results of nested queries; the key difference is that temporary tables require less maintenance of the associated storage, since the tables and their data are cleaned up automatically. Each nested query is removed and replaced with a reference to the relevant intermediate table. The diagram below illustrates how to modularise a query containing a single nested query in this manner.

Query modularisation using chained tables.

These intermediate tables break the query down, making it easier to follow, and each nested query can now be tested independently.
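As a minimal sketch of the pattern (the table names here are illustrative), a query with a single nested sub-query becomes two statements:

-- 1. Materialise the nested query into an intermediate table
--    (CREATE TEMPORARY TABLE for the temporary-table variant)
CREATE TABLE inner_result AS
SELECT ...;

-- 2. The outer query now references the intermediate table
CREATE TABLE final_result AS
SELECT ...
FROM inner_result;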

However, creating the tables does increase I/O on the data storage system, and it also limits the optimisation opportunities available to the query planner: Hive cannot see the entire query, only the smaller sub-queries, and so can make only limited decisions based on these pieces.

Views

Views break down monolithic queries by encapsulating each nested query inside a view and replacing the original statement with a reference to the associated view. The diagram below shows how views would be used to modularise a nested query.

Query modularisation using views.

Views are used in much the same way as chained tables, in that an intermediary breaks up the large query and can be tested independently. The key difference is that a view encapsulates the query itself rather than its results. In addition, as fewer tables are created (compared to chained tables) there is less I/O on the data storage system and less associated storage.
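The same sketch as before, using a view (names again illustrative), encapsulates the nested query without materialising its results:

-- The nested query becomes a view; no data is written out
CREATE VIEW inner_result_view AS
SELECT ...;

-- The outer query references the view, and Hive expands the view's
-- definition into the overall query when it plans execution
CREATE TABLE final_result AS
SELECT ...
FROM inner_result_view;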

Implementation

For this investigation, the approaches were implemented against a fairly typical use case: analysing a fictitious web log containing information about visitors. The goal is to differentiate page views by “normal” users from those by crawlers. A table containing the IP addresses of these crawlers can be used to filter them out of the web hits table, allowing a count of user hits for each page in the log to be produced. Below is an example of the contents of the fictitious “Crawler” and “Web_hit_log” tables.

Tables used for implementation

The “Crawler” table stores each crawler’s name and IP address, and whether or not it is active. The “Web_hit_log” table stores the instant at which a particular URL was accessed and the IP address the hit is associated with.
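For reference, the two tables might be declared as follows; the column names match those used in the query below, but the column types are assumptions:

-- Column types are assumed for illustration
CREATE TABLE crawler (
  name        STRING,
  ip_address  STRING,
  active      BOOLEAN
);

CREATE TABLE web_hit_log (
  hit_instant TIMESTAMP,
  url         STRING,
  ip_address  STRING
);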

Below is the monolithic query used in the implementation for the approaches:

CREATE TABLE page_view_agg
AS
SELECT hits.url AS page_url, COUNT(*) AS page_count
FROM (
  SELECT w.url
  FROM web_hit_log w
  WHERE w.ip_address NOT IN (
    SELECT c.ip_address
    FROM crawler c
    WHERE c.active = true
  )
  AND w.hit_instant BETWEEN '${start_date}' AND '${end_date}'
) hits
GROUP BY hits.url
ORDER BY page_count DESC;

Although fairly simple, the query can easily be modularised into three distinct parts:

  • Retrieve the IP addresses of all the active crawlers
  • Select the URLs which have not been accessed by an active crawler, i.e. the URLs which have been accessed by users
  • Produce a count of all the hits per web page

Using the modularisation techniques discussed above, the queries can now be separated out into their own files.

Each approach can be implemented in almost exactly the same way, the key difference being the CREATE clause: the chained-table approaches use CREATE TABLE ... and CREATE TEMPORARY TABLE ..., while the view approach uses CREATE VIEW .... The implementation of the view approach is shown below.
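In outline (the file names and the user_hits view name are illustrative; active_crawlers is the name relied upon by the tests further down):

-- active_crawlers.sql: the IP addresses of all the active crawlers
CREATE VIEW active_crawlers AS
SELECT c.ip_address
FROM crawler c
WHERE c.active = true;

-- user_hits.sql: the URLs accessed by users rather than crawlers
CREATE VIEW user_hits AS
SELECT w.url
FROM web_hit_log w
WHERE w.ip_address NOT IN (SELECT ip_address FROM active_crawlers)
  AND w.hit_instant BETWEEN '${start_date}' AND '${end_date}';

-- page_view_agg.sql: the count of hits per web page
CREATE TABLE page_view_agg AS
SELECT url AS page_url, COUNT(*) AS page_count
FROM user_hits
GROUP BY url
ORDER BY page_count DESC;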

Testing

As mentioned, a big issue with monolithic queries is that only the final output can be tested. For example, when testing the monolithic query one can only check the number of results returned for the entire query, and what those results are. The code fragment below shows what the JUnit and HiveRunner unit tests for the monolithic query would look like.

@Test
public void checkCount() {
    List<String> result = hiveShell.executeQuery("select * from page_view_agg");
    assertThat(result.size(), is(2));
}

@Test
public void checkAggregate() {
    List<String> result = hiveShell.executeQuery("select * from page_view_agg");
    assertThat(result.get(0), is("http://www.hotels.com/not-obvious"));
    assertThat(result.get(1), is("http://www.hotels.com/obvious"));
}

Modularising the query makes testing more effective as it allows for each file containing a nested query to be tested separately. The queries can then be run against temporary tables containing dummy data, with assertions performed on the output to ensure the query fragments are working as expected.

This enables many more test cases to be created and targeted at specific conditions; for example, more granular tests can now be written, as shown below (the implementation details are replaced with ellipses for brevity’s sake).

@Test
public void checkActiveCrawlerFilter() {
    List<String> result = hiveShell.executeQuery("SELECT * FROM active_crawlers");
    assertThat(result, is(Arrays.asList("192.168.0.123")));
}

@Test
public void checkWebHitCrawlerFilter() { ... }

@Test
public void checkWebHitDateFilter() { ... }

@Test
public void checkPageViewCount() { ... }

@Test
public void checkPageViewOrder() { ... }

Splitting up the query and testing it in this way also means any problems can be easily pinpointed and fixed.

The separate files can be tested further using the HiveRunner SOURCE command, which runs the files one after another. This checks that the output is correctly pipelined from one query to the next, and ensures the overall output of the modularised queries is equivalent to that produced by the monolithic query.
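For example, assuming the file names from the sketch above, a short driver script can source each file in dependency order:

-- Runs each modularised file in turn, so later queries can
-- reference the views and table created by earlier ones
SOURCE active_crawlers.sql;
SOURCE user_hits.sql;
SOURCE page_view_agg.sql;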

Comparison of Modularisation Approaches

The first comparison to make between the approaches is of their execution plans, which show the jobs Hive performs to execute the query. The plans are produced using the EXPLAIN command and allow the total work the query does during execution to be calculated and compared: the bigger the bar on the graph, the more work the query is doing. For the purposes of this experiment only the most resource-intensive tasks have been included on the graph.
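For reference, a plan is obtained by prefixing a statement with EXPLAIN; for example, for the final aggregation step (using the illustrative user_hits view from earlier):

-- Prints the plan rather than executing the query
EXPLAIN
SELECT url AS page_url, COUNT(*) AS page_count
FROM user_hits
GROUP BY url
ORDER BY page_count DESC;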

Comparison of execution plans

As the graph shows, views require fewer jobs to perform the same query than the table-based approaches.

It is also worth noting that views produce a single execution plan for the three files containing separate queries, allowing Hive to optimise the entire query. In comparison, the chained-table approaches produce a separate execution plan for each file, restricting the optimisation opportunities as outlined earlier.

Another important comparison to consider is the average execution time. For this experiment, the steps required for each approach, including the monolithic query, were executed multiple times on an EMR cluster with ~100GB of data. The graph below shows the results averaged over 5 runs: the bigger the bar on the graph, the longer the average execution time for the approach.

Comparison of average execution time

The results of this comparison tie in with the previous graph: the chained-table approaches perform the most work and, as expected, take the longest to execute.

It is worth noting that views appeared to run faster than the monolithic query, but this is most likely due to variations in processor, disk, and other cluster resources over the limited number of runs performed.

Conclusion

This investigation found that modularising large monolithic queries is highly beneficial for understanding and maintaining existing queries, and provides more effective testing opportunities. In addition, certain modularisation approaches have no detrimental effect on the execution of the query or its running time.

So, for these reasons, the Big Data Platform team recommends implementing modularisation where possible, and doing so using views.
