How to parse a billion User Agents [using PySpark]

Mahadir Ahmad
Nov 3 · 7 min read

When your organization practices collecting (almost) everything first and processing it later, there comes a time when you need to analyze records that have been sitting in storage for a long while. The records can reach multiple terabytes with billions of entries. This is where the traditional approach to data processing hits its limits.

Recently I needed to analyze a considerably large clickstream dataset, with queries related to device and browser. As a Python developer, my first step was to find a package that can parse the user agents in the clickstream and return the device, OS, browser, and related information. I found the right tool for the job in this library: Python User Agents.

The next step was to find out how fast this package can parse user agents, so I could estimate the time it would take on a billion records. I used Jupyter to prototype my steps (this ran on an EC2 ml.t2.large, should it matter). With a 10k sample combined with the timeit magic, I could roughly get the average iterations per second.

The average is 699.30 iterations per second, but I'll round it up to 700. Let's do some quick math if I were to parse a billion entries:

Dividing a billion entries by 700 per second gives roughly 16.5 days on a single process, so I would need about 49 parallel executions to complete the parsing within 8 hours. That's only part of the story: I would also need to code a scheduler and test it for reliability. Give or take, it might cost me a few days to finally see the result. Using an existing tool that solves this problem would be much more desirable.
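The back-of-envelope math above can be written out explicitly:

```python
# Back-of-envelope estimate: how long does 1B entries take at ~700 parses/s,
# and how many parallel workers would finish within 8 hours?
entries = 1_000_000_000
rate = 700                       # parses per second, from the benchmark

single_process_days = entries / rate / 86_400
print(f"{single_process_days:.1f} days on a single process")  # ~16.5

target_seconds = 8 * 3600        # an 8-hour window
workers = entries / (rate * target_seconds)
print(f"{workers:.1f} parallel executions needed")            # ~49.6
```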


Apache Spark

Fortunately, we can accomplish this task without breaking a sweat by using Apache Spark. In AWS, Apache Spark is available through the Elastic MapReduce (EMR) service. You can spin up an on-demand cluster from EMR and get distributed computation without building anything from scratch. For the cluster configuration, we can use task nodes only (we'll use S3 as the HDFS storage). If this is a recurring operation, you can use the steps option when launching the cluster to automatically run the script and terminate once it completes.

But for our case, we'll take a quicker route: submitting our PySpark parsing application (let's call it an ETL job) through the spark-submit CLI, with some munging in a Zeppelin notebook.

Before we code our ETL, let's take a step back: is there another way to optimize the job? Parsing a user agent is a costly operation, but a join is relatively cheap. If we parse only the unique user agents and perform the join separately, would it be faster? Let's find out the total number of unique entries.

Wow, only 8M unique records out of 1B, or roughly 0.9%. We could theoretically speed up the process by 115x. If you look at the Zeppelin notebook above, the data is already partitioned and in Parquet format. Partitioning with the right numbers and reading from columnar storage naturally speeds things up as well. We'll touch on partitioning later in this article, but if you're interested in learning more, I highly recommend this article on partitioning in Apache Spark.

Now we're going to write the unique user agents to a new file so our ETL job can read them directly from Parquet. Notice that I also include a count aggregation; it might come in handy if we decide to keep only user agents above a certain count threshold.

The descriptive statistics from the count above show that half of the entries occur fewer than 2 times. Depending on the scenario, you may want to filter out entries occurring at most twice, or at the extreme, anything occurring fewer than 844 times. Doing so would cut the total number of entries to parse by half or by 100x respectively. But for the sake of this article, which clearly says 1B user agents, I'll leave the data as it is; otherwise the final results wouldn't add up to 1B, right?


Pyspark UDF

Now that we've optimized our operation, let's quickly prototype a simple user agent parser in a Zeppelin notebook. To do so, we need to understand the concept of a PySpark user-defined function (UDF). I highly recommend reading through this article to understand the concept first before proceeding.

Copy our prototype ETL script above to the EMR master server (SSH into it):

$ ssh emr
$ nano test.py
$ spark-submit test.py

Oops! We encountered a 'user_agents' module-not-found error. The reason is that we didn't install the package before submitting. You might be wondering by now how to install the parser package on our cluster: if the cluster has 10 task nodes, do we need to log in to each and every server and run pip install? The answer is yes and no; both approaches can solve the dependency problem, but the elegant way is to zip the pip modules into a file and submit it together with our ETL script. Remember the task nodes we configured as spot instances? AWS might shut them down at any time, so the first option is not something we should go after.

# Let’s package our module into a zip file
$ pip-3.6 install pyyaml ua-parser user-agents -t dependencies
$ cd dependencies
$ zip -r ../dependencies.zip .
$ cd ..
# you may choose to move this file to s3 if you need to submit job via livy
# optional
# aws s3 cp dependencies.zip s3://involve-data-science/emr/dependencies.zip

With the --py-files option we can now submit the PySpark script together with its dependencies.

$ spark-submit --py-files dependencies.zip test.py


By now, it should be pretty straightforward to code our final ETL.

The only part I want to comment on is the repartition. If you remember, the number of partitions for the unique user agent dataframe is 25. I first tried this default number and noticed only 8 executors were used. But with 36 vCores I can theoretically distribute across at least 36 executors. Hence I used a 3x multiplier, bringing the new partition count to 108. In practice, in the test I ran, this cut the time to complete the task from 56 minutes to only 31 minutes.

At this stage, you have a lookup file and can join it with any PySpark dataframe that has a user agent column, at minimal cost. For the sake of completeness, I'll demonstrate an example joining it with sales data.

Joining 5M sales records to our lookup dataframe, filtering for mobile devices, and aggregating the amounts took only 47 seconds. You can now inform the business department that the top mobile operating system (OS) is Android with RM 3B in sales, followed by iOS.

We can do much more analysis with this lookup, but if you're a data engineer, you can now hand the next stage over to a data analyst or data scientist.


Cost Analysis

Congratulations if you've managed to read this far; I don't think the final part would be complete without a cost calculation. Let's quickly estimate how much I spent on AWS; after all, we need to explain whether reducing the time from 16 days to a couple of hours was justified.

In terms of cost, we only use S3 for HDFS storage and EMR for compute. For storage, I transformed the original data of 1B rows (~1TB) into gzip-compressed Parquet files with a new size of 180GB.

All the tasks above should complete within 3 hours, but in the worst case it may take longer once we account for testing and troubleshooting, so let's round it up to 8 hours, a common office working day.

The cost structure for EMR can be found here. As for the spot instances, the price was $0.0654 per hour for m5.xlarge.

# EMR fee (10 instances) + on-demand master node + 9 spot task nodes, 8 hours each
($0.048 * 10 * 8) + ($0.24 * 1 * 8) + ($0.0654 * 9 * 8) = $10.47

As for S3, I'll round up to 200GB to account for the new lookup files and probably some failed tests. The pricing can be found here. To calculate the usage for 8 hours we need to convert to GB-Months.

200 GB * 8 hours / 744 hours-per-month = 2.15 GB-Month
2.15 GB-Month * $0.024 = $0.05

Overall, our cost should be no higher than $10.52 ($10.47 + $0.05). Taking the optimistic estimate of 3 hours, the cost drops to only $3.95, or RM 16.20.

All price calculations are based on the Singapore region. I've prepared a spreadsheet for you to play around with the hours.
