URL parsing in Hive

Published in

MiQ Tech and Analytics

4 min readSep 24, 2019

With the advancement of digital technology in the 21st century, website experience has become a cornerstone of the success of organizations across the globe irrespective of the nature of business they are embedded in. An online search is generally the first touchpoint for customers looking to make purchases or know more about the product. Web engagement is being considered as key indicators to gauge customer loyalty or purchase propensity.

Be it boosting sales, enhancing brand awareness, reducing customer churns or planning spends across multiple advertising channels. An in-depth understanding of user engagement over the digital platform remains a pre-requisite to achieve these key business objectives.

Among many ways of gaining an insight into the user’s digital behavior, analyzing information retrieved from a structured URL is an easy and effective method of understanding customer preferences. Converting the website URL data into a readable & analyzable format hence becomes the opening wedge to the solution.

This article aims to help you understand the process of URL parsing & cleaning for a sample website interaction data.

Businesses generally look to identify patterns in user interactions on the website which can help them identify the profit centers and deal breakers on their domains. Some commonly posed questions to identify these are:

Which section and subsection on the website gather high to witness the highest user engagement in terms of page visits, sessions, active times, bounce rates?
What is the general user translation on the site? What is the most followed path on the site by converters Vs the path translated by abandoners?

In order to answer these questions, the first stage would be to gather the database in any tool you are comfortable with. In our case, all the analyses below can be accomplished using the Hive programming language itself.

A step-by-step approach to Parsing

The first step for URL parsing is to perform URL decoding, that is, to convert all the URL ASCII converted special character values into a readable format. The below function can help decode all such values:

“select reflect(“java.net.URLDecoder”, “decode”, url) as url1 from table”

The above statement will convert –

“https://www.google.com%2Fsearch%3Fq%3Dhello%2Bworld”

“https://www.google.com/search?q=hello+world”

Once the decoding of the URL fields has been consummated, we can proceed with the parsing of the URL. But before parsing a URL it is important to understand the constituting elements of a URL.

Example URL — “https://www.toyotagm.in/dodge/caravan/ctv 20?year=2019;type=leather;”

The URL contains three main parts –

Host — www.toyotagm.in

Path — /dodge/caravan/ctv 20

Query — year=2019; type=leather;

Using a “parse_url ()” function we can distinctly extract the different URL features.

“select parse_url(URL_column,’HOST’) AS URL_HOST, parse_url(URL_column,’PATH’) AS URL_PATH , parse_url(URL_column,’QUERY’) AS URL_QUERY FROM table;”

Each of these elements may contain multiple levels of information which might require further segregation. For example, in the path field in our URL Dodge is the brand, caravan ctv 20 is the model, 2019 is the car model year and leather is the car interior type.

To extract all of this from the URL these sublevel information “split ()” and “regexp_replace ()” function can be used.

Now first we’ll use “parse_url ()” to extract path and query and later use the “split ()” function to get the specific element.

Usage :split(parse_url(URL_column,’PATH/QUERY’),’DELIMITER’) [ELEMENT POSITION IN THE SPLIT ARRAY]

“Select split(parse_url(URL_column,’PATH’),’/’) [0] AS car_brand from table”

So, the above will work in two steps first, it’ll extract the path of the URL and later split by “/” and select the first element. Similarly, we can use “regexp_replace (URL_column,’/’,’ ‘)” and replace “/” with space and get the car model name.

Similarly, elements can be extracted from the query part of the URL, regexp_extract can be used.

Usage: regexp_extract(URL_column,’regular expression pattern’’,relative position)

Example: To extract year the command will be as follows:

“select regexp_extract(http_referer,’(year.*?)(\;)’,1) as year from table”

A similar approach can be adapted to identify the various sections and custom search queries conducted by the user during a web session on a domain.

Subsequently to identify the time spent on the website and the number of pages visited by a user the above data must be sessionized. User identifiers and timestamps corresponding to the different user interactions are essential prerequisites for the process of sessionization. If the user and temporal information are available, data can be sessionized in Hive using the following code snippet:

“SELECT *

, CONCAT(user_identifier,

CONCAT(‘_’,

SUM(new_session) OVER (PARTITION BY user_id ORDER BY dt)

)

) AS session_id

FROM (

SELECT *

, CASE

WHEN UNIX_TIMESTAMP(dt)

- LAG (UNIX_TIMESTAMP(dt))

OVER (PARTITION BY user_id ORDER BY dt) >= 30 * 60

THEN 1

ELSE 0

END AS new_session

FROM toydata_hdfs

) s1;”

Finally, to analyze the path traveled by the users on the website, “parse_url ()” and “split()” functions are again brought into play to identify the different sections and subsections on the site. User and sectional data can then be ordered by timestamp and concatenated using “CONCAT_WS ()”.

“select CONCAT_WS(‘>’,collect_set(dt)) from table”. This will collapse multiple rows into one with values separated by “>”.

These pieces of information can be used to get a deeper understanding of the user behavior on the website which can then be employed for planning and optimization purposes.

URL parsing in Hive

A step-by-step approach to Parsing

Written by Shrey Pachisia