A look-up table is a translation table intended to enrich and extend base data. Such tables are very common, especially in data warehousing (schema normalization) and business analytics. They are usually updated manually, and developers are constantly looking for ways to simplify their maintenance.
This article shows how to work with look-up tables in Hive using National Hockey League open data.
Let’s consider the following example: the base table contains the San Jose Sharks 2016–2017 NHL season schedule. Each record holds the game date, the start time, and a description of the competitor. The initial CSV file with the schedule has the following format:
Each team in the NHL belongs to one of four divisions; for example, San Jose is in the Pacific Division. Teams within the same division play each other more often than teams from different divisions. Our goal is to analyze how many games are played against teams from each of the four divisions.
The base table with the schedule is already located in a database, but it has no information about divisions. If the data were small, one could use a SQL expression to add the necessary information: “if team == xxx then division = yyy”. However, this is not the case here: there are 30 teams in the NHL, and nobody would want to write such a big query.
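For illustration, such a hard-coded mapping would be one giant CASE expression. The table and column names below are assumptions, and only three of the thirty branches are shown:

```sql
SELECT competitor,
       CASE
         WHEN competitor = 'Anaheim Ducks'      THEN 'Pacific'
         WHEN competitor = 'Calgary Flames'     THEN 'Pacific'
         WHEN competitor = 'Chicago Blackhawks' THEN 'Central'
         -- ... 27 more branches, one per NHL team
       END AS division
FROM schedule;
```

Every roster change or league realignment would mean editing the query itself, which is exactly the maintenance burden we want to avoid.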
The way to go is to manually create a small table with the team-to-division mapping and join it with the original data. The other question is maintenance: even small data needs to be updated from time to time. One of the best tools for look-up data manipulation is Excel, because it is available nearly everywhere and a lot of people know how to work with it. Data from Excel is exported to CSV and uploaded to Hadoop. That is the general idea; let’s talk about the details.
Schedule data in Hadoop
The data is taken from the official NHL site and uploaded to Hadoop as a CSV file. To read it, I use the spark-csv library. To extract dates from strings and extract the actual competitor name, I use the following code:
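The original parsing code is not reproduced here, so as a sketch, the row-level logic could look like the following plain-Python function. The date format and the “vs./at” subject layout are assumptions about the CSV, not the actual NHL file format; in Spark this logic would typically run inside a UDF or a map over the DataFrame read by spark-csv.

```python
from datetime import datetime

def parse_row(date_str, subject):
    """Parse one schedule row into (game date, competitor name).

    Assumptions (the real CSV layout is not shown in the article):
    - dates come as MM/DD/YYYY strings,
    - the subject looks like "Sharks vs. Anaheim Ducks" (home game)
      or "Sharks at Anaheim Ducks" (away game).
    """
    game_date = datetime.strptime(date_str, "%m/%d/%Y").date()
    # Keep only the competitor part of the subject.
    for sep in (" vs. ", " at "):
        if sep in subject:
            competitor = subject.split(sep, 1)[1]
            break
    else:
        competitor = subject  # fall back to the raw value
    return game_date, competitor
```

With a Spark DataFrame, the same function could be registered as a UDF and applied to the raw columns to produce the two columns used below.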
So, the schedule DataFrame has two columns: date and competitor.
To add division information to each team, we create a table teams.xlsx in Excel (based on Wikipedia):
Then the exported teams.csv needs to be uploaded to Hadoop, as shown in the workflow below:
The next step is to add the look-up data to Hive. This ensures that the schema is persistent, so a data update will not change it. One could read the look-up table with spark-csv, as we did with the base table, but then every single read would require proper type casts whenever the schema is not inferred correctly.
Since the data is already stored in Hadoop, there is no need to copy it into Hive. An external table will work:
- the LOCATION field should contain the full address of the data in Hadoop,
- the LOCATION address points to a folder with a single CSV file in it,
- the data header should be skipped via table properties ("skip.header.line.count"="1").
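Putting the points above together, a minimal sketch of the DDL might look like this. The table name, column names, and HDFS path are assumptions for illustration:

```sql
CREATE EXTERNAL TABLE teams (
  team       STRING,
  division   STRING,
  conference STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/hive/lookup/teams'  -- hypothetical folder holding the single teams.csv
TBLPROPERTIES ("skip.header.line.count"="1");
```

Because the table is EXTERNAL, dropping it later removes only the metadata, not the CSV file, so the look-up data can be refreshed simply by replacing the file in that folder.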
This external table works fine in Hive, but it does not work with HiveContext in Spark. For some reason, Spark keeps the header row (Team, Division, Conference):
One way to fix this is to create a Hive view which filters the data. Based on my observations, data warehouse techniques are penetrating into Hive (which is an unstructured data warehouse, by the way). One of them is to maintain source tables as they are and read the data only through views, which simplifies data structure updates. For the same reason, I advocate Hive views even when they implement simple logic, as in our case:
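A minimal sketch of such a view, assuming the external table and view names used here, filters the stray header row by its literal value:

```sql
CREATE VIEW teams_v AS
SELECT team, division, conference
FROM teams
WHERE team <> 'Team';  -- drop the header row that HiveContext keeps
```

All downstream reads then go through teams_v instead of the raw table.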
Now HiveContext reads the data without the header, so the issue is fixed.
Join data with look-up table
At this point, it should be easy to join the data and get the result. However, the schedule and look-up tables have different team names: the base table has either the city name or the team name. To make the code examples work, here is a fix: add a short_name column to the look-up table, containing the first, the last, or the first two words of the full team name.
Now short_name is almost the same as competitor, so we need a “fuzzy” join (check out join expressions for more information). We will join based on Levenshtein distance, which is 0 for identical strings and small for nearly identical ones. There is no golden rule for such a join: the distance threshold should be big enough to find a fuzzy look-up match, yet small enough not to turn the operation into a Cartesian join of both tables. The resulting table could still have several look-up teams for one original team. To filter out wrong matches, we use a window function that keeps only the match with the lowest Levenshtein distance.
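To make the matching logic concrete, here is a plain-Python sketch of the fuzzy join. The sample team names and the MAX_DIST threshold are illustrative only; on the Spark side one would use the built-in levenshtein function and a window over the original team instead of the explicit min() below.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Hypothetical slices of both tables: (full team name, short_name, division).
lookup = [("Anaheim Ducks",     "Anaheim",     "Pacific"),
          ("Los Angeles Kings", "Los Angeles", "Pacific"),
          ("Vancouver Canucks", "Vancouver",   "Pacific")]

MAX_DIST = 8  # big enough to find a match, small enough to avoid a Cartesian join

def fuzzy_match(competitor):
    # Join on distance <= MAX_DIST, then keep only the closest match,
    # mirroring what the window function does on the Spark side.
    candidates = []
    for team, short, division in lookup:
        d = levenshtein(competitor, short)
        if d <= MAX_DIST:
            candidates.append((d, team, division))
    return min(candidates)[1:] if candidates else None
```

For example, fuzzy_match("Anaheim") resolves to the Ducks via an exact short_name hit, while fuzzy_match("Vancouver Canucks") still resolves correctly even though only "Vancouver" is stored as the short name.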
Finally, the schedule data is enriched, and the result can be computed:
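The final aggregation is then a simple group-by; as a sketch, assuming the joined result is exposed as a hypothetical schedule_enriched table or view:

```sql
SELECT division, COUNT(*) AS games
FROM schedule_enriched
GROUP BY division
ORDER BY games DESC;
```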
As expected, San Jose plays more games against the other teams from the Pacific Division, but there are also slightly more games against Central Division teams. Did you know that before?
This tutorial has shown how to deal with look-up tables in Hadoop and Hive, including a fix for the header-row issue.