Look-up table maintenance in Hive

Kirill Pavlov
Sep 17, 2016

A look-up table is a translation table used to enrich and extend base data. Such tables are very common, especially in data warehousing (schema normalization) and business analytics. They are usually updated manually, and developers are constantly looking for ways to simplify their maintenance.

This article shows how to work with look-up tables in Hive using National Hockey League open data.

Data

Let’s consider the following example: the base table contains the San Jose Sharks’ 2016–2017 NHL season schedule. Each record holds the game date, the start time, and a description of the competitor. The initial CSV file with the schedule has the following format:

[Screenshot: sample rows of the raw schedule CSV]

Each NHL team belongs to one of four divisions; San Jose, for example, is in the Pacific. Teams within the same division play each other more often than teams from different divisions. Our goal is to analyze how many games are played against the teams of each of the four divisions.

The base table with the schedule is already in a database, but it lacks information about divisions. If the data were small, one could add the necessary information with a SQL expression of the form “if team == xxx then division = yyy”. However, this is not the case here: there are 30 teams in the NHL, and nobody would write such a big query.

A better way is to manually create a small table with the team-to-division mapping and join it with the original data. The remaining question is maintenance: even small data needs to be updated from time to time. One of the best tools for editing look-up data is Excel, because it is available nearly everywhere and many people know how to work with it. The data is exported from Excel to CSV and uploaded to Hadoop. That is the general idea; let’s talk about the details.

Schedule data in Hadoop

The data is taken from the official NHL site and uploaded to Hadoop as a CSV file. I read it with the spark-csv library. To extract dates from strings and to extract the actual competitor name, I use the following code:
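Here is a minimal PySpark sketch of that step. The raw column names (DATE, OPPONENT), the MM/dd/yyyy date format, and the “vs ”/“@ ” opponent prefix are assumptions about the CSV layout; adjust them to the actual header:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pyspark.sql.functions as F

sc = SparkContext()
sqlContext = SQLContext(sc)

# read the raw schedule CSV with the spark-csv package
# (spark-csv must be on the classpath, e.g. via --packages)
raw = (sqlContext.read
       .format("com.databricks.spark.csv")
       .option("header", "true")
       .load("/data/nhl/schedule.csv"))  # assumed HDFS path

schedule = raw.select(
    # parse e.g. "10/12/2016" into a proper date
    F.to_date(F.unix_timestamp(F.col("DATE"), "MM/dd/yyyy")
               .cast("timestamp")).alias("date"),
    # strip the "vs "/"@ " prefix, keeping only the competitor name
    F.regexp_replace(F.col("OPPONENT"), "^(vs |@ )", "").alias("competitor"),
)
```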

So, the schedule DataFrame has two columns: date and competitor.

Look-up table

To add division information to each team, we create a table teams.xlsx in Excel (data taken from Wikipedia):

[Screenshot: teams.xlsx with Team, Division, and Conference columns]

Then the exported teams.csv needs to be uploaded to Hadoop, as shown in the workflow below:

[Diagram: workflow from teams.xlsx through CSV export to Hadoop]

The next step is to register the look-up data in Hive. This keeps the schema persistent, so a data update will not change it. One could read the look-up table with spark-csv, as we did with the base table, but then every single read would require proper type casts whenever the schema is not inferred correctly.

Since the data is already stored in Hadoop, there is no need to copy it into Hive. An external table will do:
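A sketch of the DDL, issued here through a HiveContext (it could equally be run in the Hive CLI); the database name and the HDFS path are assumptions, while the columns follow the teams.csv header:

```python
from pyspark.sql import HiveContext

hc = HiveContext(sc)
hc.sql("""
CREATE EXTERNAL TABLE nhl.teams_csv (
  team       STRING,
  division   STRING,
  conference STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs:///data/nhl/teams'
TBLPROPERTIES ("skip.header.line.count"="1")
""")
```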

Note that:

  • the LOCATION field should contain the full file address in Hadoop,
  • the LOCATION address is a folder with a single CSV file in it,
  • the header row should be skipped via table properties (“skip.header.line.count”=“1”).

This external table works fine in Hive.

But it does not work with the HiveContext in Spark: for some reason it keeps the header row (Team, Division, Conference):
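For example, using the hc HiveContext from above:

```python
# Spark's Hive reader ignores skip.header.line.count here, so the
# first row returned is the header itself: Team | Division | Conference
hc.sql("SELECT * FROM nhl.teams_csv").show(3)
```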

One way to fix this is to create a Hive view which filters the data. Based on my observations, data warehouse techniques are making their way into Hive (which is, by the way, a data warehouse over unstructured data). One of them is to keep source tables as they are and to read data only through views; this simplifies later updates to the data structure. For the same reason, I advocate Hive views even when they implement very simple logic, as in our case:
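A sketch of such a view; filtering on the literal header value ‘Team’ is the simplest way to drop the stray row:

```python
hc.sql("""
CREATE VIEW nhl.teams AS
SELECT team, division, conference
FROM nhl.teams_csv
WHERE team != 'Team'  -- drop the header row that HiveContext keeps
""")
```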

Now the HiveContext reads the data without the header, and the issue is fixed.

Join data with look-up table

At this point it should be easy to join the data and get the result. However, the schedule and look-up tables use different team names: the base table contains either the city or the team nickname. To make all of the code examples work, here is a fix: add a short_name column to the look-up table, holding either the first word, the last word, or the first two words of the full team name.
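A sketch of this fix in PySpark, taking only the first word as an illustration; in practice the short name differs per team (e.g. “San Jose” vs “Sharks”) and is easiest to curate by hand in the Excel file:

```python
import pyspark.sql.functions as F

# illustrative only: use the first word of the full name as short_name
teams = hc.table("nhl.teams").withColumn(
    "short_name", F.split(F.col("team"), " ").getItem(0)
)
```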

Now short_name is almost the same as competitor, and one needs to do a “fuzzy” join (check out join expressions for more information). We join based on the Levenshtein distance, which is 0 for identical strings and small for nearly identical ones. There is no golden rule for such a join: the initial distance threshold should be big enough to find a fuzzy look-up match, yet small enough not to Cartesian-join both tables. The resulting table can still contain several look-up teams for one original team; to filter out wrong matches, we use a window function that keeps only the match with the lowest Levenshtein distance.
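A sketch of the fuzzy join using the schedule and teams DataFrames from the previous steps; the threshold of 5 is an assumed value to be tuned:

```python
import pyspark.sql.functions as F
from pyspark.sql import Window

# fuzzy join: keep pairs whose names are at most 4 edits apart
joined = schedule.join(
    teams,
    F.levenshtein(schedule["competitor"], teams["short_name"]) < 5,
)

# for every game, keep only the closest look-up match
w = Window.partitionBy("date", "competitor").orderBy(
    F.levenshtein(F.col("competitor"), F.col("short_name"))
)
enriched = (joined
            .withColumn("rn", F.row_number().over(w))
            .filter(F.col("rn") == 1)
            .drop("rn"))
```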

Finally, the schedule data is enriched, and one can compute the result:
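For instance, counting games per division on the enriched DataFrame from the previous step:

```python
enriched.groupBy("division").count().orderBy(F.desc("count")).show()
```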

As expected, San Jose plays more games against the other teams from the Pacific division, but there are also slightly more games against Central division teams. Did you know that before?

Conclusion

This tutorial has shown how to deal with look-up tables in Hadoop and Hive, including the header row fix.
