Using HBase to store and analyze NFL play-by-play data

Karthik Kumar
Feb 2, 2014


In celebration of Super Bowl Sunday, this post will examine how HBase can be used to model NFL play-by-play data.

We will be using data from the 2002 season through the 2012 season. The source of the data is CSV files provided by the awesome people at Advanced NFL Stats. Each file contains data for a single season, and the format of the data is as follows:

gameid|qtr|min|sec|off|def|down|togo|ydline|description|offscore|defscore|season

The gameid column contains the date of the game and the two teams competing. The description column contains a short natural-language blurb about a given play. The rest of the columns are self-explanatory. Here is a snippet of a sample file from the 2012 season (which also happens to contain Robert Griffin III's first plays in the NFL):

20120909_WAS@NO,1,-5,0,WAS,NO,,,69,B.Cundiff kicks 65 yards from WAS 35 to end zone Touchback.,0,0,2012
20120909_WAS@NO,1,60,0,NO,WAS,1,10,80,(15:00) D.Brees pass incomplete short left to M.Colston.,0,0,2012
20120909_WAS@NO,1,59,57,NO,WAS,2,10,80,(14:57) D.Brees pass incomplete short left to L.Moore [A.Carriker].,0,0,2012
20120909_WAS@NO,1,59,52,NO,WAS,3,10,80,(14:52) (Shotgun) D.Brees pass incomplete short middle (S.Bowen).,0,0,2012
20120909_WAS@NO,1,59,47,NO,WAS,4,10,80,(14:47) T.Morstead punts 59 yards to WAS 21 Center-J.Drescher. B.Banks pushed ob at WAS 32 for 11 yards (S.Shanle).,0,0,2012
20120909_WAS@NO,1,59,34,WAS,NO,1,10,68,(14:34) (Shotgun) R.Griffin pass short left to P.Garcon to WAS 32 for no gain (C.White). Pass -4 YAC 4,0,0,2012
20120909_WAS@NO,1,58,58,WAS,NO,2,10,68,(13:58) (Shotgun) R.Griffin right end to WAS 44 for 12 yards (S.Shanle).,0,0,2012
20120909_WAS@NO,1,58,25,WAS,NO,1,10,56,(13:25) (Shotgun) R.Griffin pass short left to P.Garcon to NO 44 for 12 yards (R.Harper). Pass -2 YAC 14,0,0,2012
20120909_WAS@NO,1,57,44,WAS,NO,1,10,44,(12:44) (Shotgun) R.Griffin pass short right to P.Garcon to NO 35 for 9 yards (C.Lofton; C.White).,0,0,2012

Why HBase?

In a previous post, we looked at HBase's data model. One of the killer features built into HBase is the ability to store versioned data, so data that spans a time series can be modeled easily. NFL play-by-play data fits nicely with this model, since each play of a game can be considered a time slice that captures the state of the game at that instant. This means that we can replace the timestamp field with the play number: an integer from 1 to n, where n is the total number of plays in the game.

Another reason for using HBase is its innate ability to scale to large amounts of data. We will be storing data for 11 seasons, each with 269 games (regular season and playoffs) and around 175 plays per game. That totals around 517,825 plays (11 × 269 × 175). This may not seem like much, but consider that if we start adding data from games before 2002 and keep adding data through the current season, we will soon be glad we chose a scalable data store like HBase.

Schema

We won’t spend too much time discussing HBase schema decisions; the HBase documentation does a good job of explaining the different schema design considerations. We will be using the gameid (see above) as the row key, which allows us to store the data for each game in a single row. We will use a single column family called ‘play’. Our columns will be the different attributes of a single play (for example: yard line, offense score). As discussed before, we will use the play number as the timestamp for each column. In short: the row key is the gameid, the column family is ‘play’, the column qualifiers are the play attributes, and the version timestamp is the play number.
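To make this concrete, here is roughly how a few cells from the WAS@NO game above would be laid out under this schema (values abbreviated):

row key            column              version (play #)   value
20120909_WAS@NO    play:off            1                  WAS
20120909_WAS@NO    play:off            2                  NO
20120909_WAS@NO    play:description    1                  B.Cundiff kicks 65 yards...
20120909_WAS@NO    play:description    2                  (15:00) D.Brees pass incomplete...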

Table creation & Data import

Now that we have looked at the data we will be storing and the schema we will follow, let’s finally create our table, and put some data in it. There are several APIs and libraries for different languages that allow us to interact with HBase. We will mostly be using the HBase shell, which is a JRuby IRB-based shell that allows us to run simple data management and manipulation commands. To import data from the CSV files into HBase, we will use JRuby, which is the “Ruby Language on the JVM”.

Let’s start by creating the table. The command below creates a new table called ‘nflplays’ with a single column family called ‘play’, specifying that we want to store ALL_VERSIONS (2147483647) of a column. This may seem like overkill, since a game should never really have more than ~300 plays, but I am doing this to illustrate how we can use constants from HConstants when defining table characteristics.

create 'nflplays', {NAME=> 'play', VERSIONS=>org.apache.hadoop.hbase.HConstants::ALL_VERSIONS}
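If you would rather create the table programmatically, a rough Java equivalent might look like the sketch below; this is against the 0.94-era client API (HBaseAdmin, HTableDescriptor) that the rest of this post uses, not a verbatim part of the original workflow:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateTable {
    public static void main(String[] args) throws IOException {
        Configuration config = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(config);

        // Same as the shell command: one 'play' family keeping every version
        HTableDescriptor desc = new HTableDescriptor("nflplays");
        HColumnDescriptor family = new HColumnDescriptor("play");
        family.setMaxVersions(HConstants.ALL_VERSIONS);
        desc.addFamily(family);

        admin.createTable(desc);
        admin.close();
    }
}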

Next, we use the describe command to verify the table we just created. We can see a few of the characteristics that this table has; the ‘ENABLED’ column on the right shows that the table is enabled and ready for use.

hbase(main):002:0> describe 'nflplays'
DESCRIPTION ENABLED
'nflplays', {NAME => 'play', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '2147483647', COMPRESSION => ' true
NONE', MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false', ENCODE_ON_DISK => 'true', B
LOCKCACHE => 'true'}
1 row(s) in 1.5100 seconds

Now that we have our table created, we can finally start importing the data from the CSV files into HBase. For this, I’ve written up this crude script in JRuby that goes through each line and parses out the different elements. Here it is:

include Java
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.client.Put

# Convert each argument into a Java byte array
def jbytes(*args)
  args.map { |arg| arg.to_s.to_java_bytes }
end

# Split one CSV line into a map of column name => value
def parse_row(row)
  map = {}
  values = row.split(',')
  map['gameid'] = values[0]
  map['qtr'] = values[1]
  map['min'] = values[2]
  map['sec'] = values[3]
  map['off'] = values[4]
  map['def'] = values[5]
  map['down'] = values[6]
  map['togo'] = values[7]
  map['ydline'] = values[8]
  # The CSV files contain some weird characters in the "Game Over" plays;
  # this little hack avoids parsing those descriptions
  map['description'] = (map['qtr'].to_i >= 4 and map['min'].to_i == 0 and map['down'].empty? and map['togo'].empty?) ? 'Game Over' : values[9]
  map['offscore'] = values[10]
  map['defscore'] = values[11]
  map['season'] = values[12]
  map
end

def put_into_hbase(document, play_number)
  # @hbase is provided by the HBase shell this script runs in
  table = HTable.new(@hbase.configuration, 'nflplays')
  table.setAutoFlush(false) # buffer Puts in a client-side write buffer
  document.each do |key, value|
    next if value.empty?
    rowkey = document['gameid'].to_java_bytes
    ts = play_number # the play number doubles as the version timestamp

    p = Put.new(rowkey, ts)
    p.add(*jbytes('play', key, value))
    table.put(p)

    puts "#{play_number}:#{key}:#{value}"
  end
  table.flushCommits() # force the buffered Puts out to the server
end

count = 1
seen_before = {}
File.open('2012.csv', 'r:utf-8').each_line do |row|
  data = parse_row(row.strip)
  # reset the play counter whenever a new game starts
  if !seen_before.has_key?(data['gameid'])
    count = 1
    seen_before[data['gameid']] = true
  end

  put_into_hbase(data, count)
  count += 1
end
exit

A few things to notice:

  • With JRuby, we can import classes from Java using familiar import statements
  • Notice that each record is inserted with a Put object.
  • We set autoFlush to false; this buffers Puts in a client-side write buffer, and we force the buffer to flush with flushCommits(). See the documentation for more information. The sketch below shows the same pattern in plain Java.
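Here is a minimal sketch of that buffered-write pattern in plain Java, again assuming the 0.94-era HTable API (the 2 MB buffer size is just an illustrative choice):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedPuts {
    public static void main(String[] args) throws IOException {
        Configuration config = HBaseConfiguration.create();
        HTable table = new HTable(config, "nflplays");

        table.setAutoFlush(false);                 // buffer Puts client-side
        table.setWriteBufferSize(2 * 1024 * 1024); // e.g. a 2 MB buffer

        // Version 1 of the 'off' column for one game: play number as timestamp
        Put p = new Put(Bytes.toBytes("20120909_WAS@NO"), 1L);
        p.add(Bytes.toBytes("play"), Bytes.toBytes("off"), Bytes.toBytes("WAS"));
        table.put(p);

        table.flushCommits();                      // force the buffered writes out
        table.close();
    }
}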

After running this script, we can view some of the data that has been inserted into HBase. First, we can count the number of rows we have inserted. (I was running HBase on a micro EC2 instance and did not want to strain my already scarce resources, so I only imported the first 8 games from 2012.)

hbase(main):009:0> count 'nflplays' 
8 row(s) in 0.0260 seconds

We can scan all the rows that we have imported using the ‘scan’ command in the HBase shell. Here is a snippet of the output I get:

ROW                                                          COLUMN+CELL                                                                                                                                                                     
20120909_STL@DET column=play:def, timestamp=158, value=DET
20120909_STL@DET column=play:defscore, timestamp=158, value=27
20120909_STL@DET column=play:description, timestamp=158, value=Game Over
20120909_STL@DET column=play:down, timestamp=157, value=1
20120909_STL@DET column=play:gameid, timestamp=158, value=20120909_STL@DET
20120909_STL@DET column=play:min, timestamp=158, value=0
20120909_STL@DET column=play:off, timestamp=158, value=STL
20120909_STL@DET column=play:offscore, timestamp=158, value=23
20120909_STL@DET column=play:qtr, timestamp=158, value=4
20120909_STL@DET column=play:season, timestamp=158, value=2012
20120909_STL@DET column=play:sec, timestamp=158, value=2
20120909_STL@DET column=play:togo, timestamp=157, value=10
20120909_STL@DET column=play:ydline, timestamp=158, value=61
20120909_WAS@NO column=play:def, timestamp=190, value=WAS
20120909_WAS@NO column=play:defscore, timestamp=190, value=40
20120909_WAS@NO column=play:description, timestamp=190, value=Game Over
20120909_WAS@NO column=play:down, timestamp=189, value=2
20120909_WAS@NO column=play:gameid, timestamp=190, value=20120909_WAS@NO
20120909_WAS@NO column=play:min, timestamp=190, value=0
20120909_WAS@NO column=play:off, timestamp=190, value=NO
20120909_WAS@NO column=play:offscore, timestamp=190, value=32
20120909_WAS@NO column=play:qtr, timestamp=190, value=4
20120909_WAS@NO column=play:season, timestamp=190, value=2012
20120909_WAS@NO column=play:sec, timestamp=190, value=1
20120909_WAS@NO column=play:togo, timestamp=189, value=10
20120909_WAS@NO column=play:ydline, timestamp=190, value=39

Here we see the last play of two games. Since I didn’t specify how many versions to return, the scan only returns the most recent version of each column (the last play of each game). We also see each column for a single row. Also notice that the timestamp here is the play number (see above for why this is the case). If we wanted to see only the play descriptions for a single game, we can fetch that single row with get (note that a scan’s stop row is exclusive, so scanning from a row key to that same row key would return nothing):

get 'nflplays', '20120909_STL@DET', {COLUMN => 'play:description', VERSIONS => 200}
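The same query from Java is a Get with setMaxVersions(); here is a minimal sketch against the same 0.94-era API (cells come back newest-first, so the last play prints first):

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class GameDescriptions {
    public static void main(String[] args) throws IOException {
        Configuration config = HBaseConfiguration.create();
        HTable table = new HTable(config, "nflplays");

        Get get = new Get(Bytes.toBytes("20120909_STL@DET"));
        get.addColumn(Bytes.toBytes("play"), Bytes.toBytes("description"));
        get.setMaxVersions(); // return every stored version, not just the latest

        Result result = table.get(get);
        List<KeyValue> plays = result.getColumn(Bytes.toBytes("play"),
                Bytes.toBytes("description"));
        for (KeyValue kv : plays) {
            // getTimestamp() is the play number under our schema
            System.out.println(kv.getTimestamp() + ": " + Bytes.toString(kv.getValue()));
        }
        table.close();
    }
}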

Let’s take a look at the Washington vs. New Orleans game: notice that some columns (like ‘togo’) only have 189 versions, whereas the total number of plays is 190. This indicates that for play 190, we did not have a value for ‘togo’, so no cell was stored at that version. This illustrates the ‘sparse’ property we looked at in the previous post.
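You can verify this from Java as well; a small sketch, again assuming the 0.94-era API (a Get pinned to exactly version 190 should come back empty for ‘togo’, since no cell was ever written there):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class SparseCheck {
    public static void main(String[] args) throws IOException {
        Configuration config = HBaseConfiguration.create();
        HTable table = new HTable(config, "nflplays");

        // Ask for 'togo' at exactly version 190 (the final "Game Over" play)
        Get get = new Get(Bytes.toBytes("20120909_WAS@NO"));
        get.addColumn(Bytes.toBytes("play"), Bytes.toBytes("togo"));
        get.setTimeStamp(190L);

        Result result = table.get(get);
        // Nothing was stored at this version, so the result is empty
        System.out.println("togo at play 190 present? " + !result.isEmpty());
        table.close();
    }
}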

Analyzing the data

Now that we have our data in HBase, we can run a few queries over it. Here is a simple Java program that scans the table for all plays that contain the word ‘TOUCHDOWN’ in the description. You can imagine how we can extend this program to create more complex queries using different Filters; a small example follows the program below.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.RegexStringComparator;
import org.apache.hadoop.hbase.filter.ValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class Touchdowns {
    public static void main(String[] args) throws IOException {
        Configuration config = HBaseConfiguration.create();
        HTable table = new HTable(config, "nflplays");

        // Scan every version of the description column, i.e. every play
        Scan s = new Scan();
        s.setMaxVersions();
        s.addColumn(Bytes.toBytes("play"), Bytes.toBytes("description"));

        // Keep only the cells whose value matches the regex
        Filter tdFilter = new ValueFilter(CompareFilter.CompareOp.EQUAL,
                new RegexStringComparator(".*TOUCHDOWN.*"));
        s.setFilter(tdFilter);

        ResultScanner scanner = table.getScanner(s);
        try {
            for (Result result : scanner) {
                for (KeyValue kv : result.raw()) {
                    System.out.println("KV: " + kv + ", Value: " + Bytes.toString(kv.getValue()));
                }
            }
        } finally {
            scanner.close();
        }
    }
}
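As one example of a more complex query, multiple filters can be combined with a FilterList. Here is a minimal sketch that keeps only descriptions mentioning both ‘R.Griffin’ and ‘TOUCHDOWN’ (the regexes and class name are just illustrative; the scan setup mirrors the program above):

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.RegexStringComparator;
import org.apache.hadoop.hbase.filter.ValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class GriffinTouchdowns {
    public static void main(String[] args) throws IOException {
        Configuration config = HBaseConfiguration.create();
        HTable table = new HTable(config, "nflplays");

        Scan s = new Scan();
        s.setMaxVersions();
        s.addColumn(Bytes.toBytes("play"), Bytes.toBytes("description"));

        // A cell passes only if it matches ALL filters in the list
        Filter griffin = new ValueFilter(CompareFilter.CompareOp.EQUAL,
                new RegexStringComparator(".*R\\.Griffin.*"));
        Filter touchdown = new ValueFilter(CompareFilter.CompareOp.EQUAL,
                new RegexStringComparator(".*TOUCHDOWN.*"));
        s.setFilter(new FilterList(FilterList.Operator.MUST_PASS_ALL,
                Arrays.asList(griffin, touchdown)));

        ResultScanner scanner = table.getScanner(s);
        try {
            for (Result result : scanner) {
                for (KeyValue kv : result.raw()) {
                    System.out.println(Bytes.toString(kv.getValue()));
                }
            }
        } finally {
            scanner.close();
        }
    }
}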
