How I spent my summer at Xandr: Building a Data Analytics Pipeline

Harichandana
Published in Xandr-Tech
4 min read · Sep 4, 2020

The summer after the third semester of my graduate studies, I was offered an internship at Xandr that I was excited about and had been eagerly waiting for. Although the internship was virtual due to the pandemic, I had a great learning experience and never felt alone.

During the first week of the internship, we had several onboarding sessions that helped me gain a deeper understanding of various aspects of the Ad Tech industry and become familiar with the culture at Xandr. During the second week, I started working with the Identity team and was introduced to my project.

In the modern era, a single individual can have dozens or even hundreds of digital identifiers, ranging from mobile devices and tablets to social media accounts, online shopping carts, and third-party cookies, each of which uniquely identifies the individual to some service provider. The ability of advertisers to link these disparate identities together and understand the consumer journey across screens and devices enables even more effective consumer targeting strategies.

The Identity team primarily works on building Xandr’s Identity Platform Architecture (IPA), which enables processing and understanding the relationships across millions of identities collected from partners and persisted into an identity graph database. Based on these relationships, groups of identities that are valuable to our advertising partners are created. One such group is a “household”, which represents a group of identities belonging to people who all live in the same home.

My project focused on identifying meaningful graph metrics to measure and track how the graph changes over time. This is vital for preserving graph quality and ensuring that future data and algorithm changes do not unexpectedly and drastically modify graph properties. The objectives of the project were to identify these metrics, design algorithms to compute them, write jobs to collect and store them, and create visualizations.

Three primary graph metrics were identified as useful for tracking the identity graph:

  1. Count of operational identifiers that drifted with respect to a synthetic household compared with the previous day, broken down by the type of operational identifier.
  2. Count of operational identifiers that belong to a single synthetic household (“orphaned identifiers”) on a given day, broken down by the type of operational identifier.
  3. Household size distribution (the count of households with each household size). A sketch of how such metrics might be computed follows the schema figure below.

Schema of the Data Source on Hadoop stored as Hive tables
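
To make these metrics concrete, here is a minimal PySpark sketch of how the household size distribution and the orphaned-identifier count might be computed. The table name comes from the pipeline description below, but the column names (household_id, identifier_id, identifier_type) are illustrative assumptions, since the actual schema is internal to Xandr.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("graph-metrics-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Column names here are assumptions for illustration; the real schema
# is shown in the figure above.
groups = spark.table("export_wm_synthetic_groups_pb")

# Household size = number of distinct operational identifiers per
# synthetic household.
sizes = (groups.groupBy("household_id")
         .agg(F.countDistinct("identifier_id").alias("size")))

# Metric 3: household size distribution.
size_distribution = sizes.groupBy("size").count().orderBy("size")

# Metric 2 (one possible reading): identifiers that are the sole member
# of their synthetic household, counted by identifier type.
orphans = (sizes.filter(F.col("size") == 1)
           .join(groups, "household_id")
           .groupBy("identifier_type")
           .count())

size_distribution.show()
orphans.show()
```

Computing a snapshot like this once per day and comparing successive days is also what makes the drift metric possible.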

Using Presto queries, I performed analyses and gained insights from the TigerGraph data stored on Hadoop on an ad-hoc basis. However, this process is very manual, and a robust analytics pipeline can enable more programmatic and complex analyses. We decided to build a pipeline for the analysis of the TigerGraph data stored on HDFS as Hive tables. One such ad-hoc query is sketched below.
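
As an illustration, an ad-hoc analysis of this kind might look like the following, shown here through the open-source pyhive client; the host and the column names are assumptions, not Xandr's actual setup.

```python
# Sketch of an ad-hoc Presto query of the kind used before the pipeline
# existed, run through the open-source pyhive client. The host and the
# column names are illustrative assumptions.
from pyhive import presto

conn = presto.connect(host="presto.example.internal", port=8080,
                      catalog="hive", schema="default")
cursor = conn.cursor()

# Count distinct operational identifiers by type, one manual run at a time.
cursor.execute("""
    SELECT identifier_type,
           COUNT(DISTINCT identifier_id) AS num_identifiers
    FROM export_wm_synthetic_groups_pb
    GROUP BY identifier_type
    ORDER BY num_identifiers DESC
""")
for identifier_type, num_identifiers in cursor.fetchall():
    print(identifier_type, num_identifiers)
```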

A figure of the pipeline

The end-to-end workflow is as follows:

  1. A DPaaS job runs daily and loads the data from the export_wm_synthetic_groups_pb table into a DataFrame.
  2. Operations are run against this DataFrame to collect the relevant metrics.
  3. The metrics are published to the data science-cia_metrics MySQL table.
  4. The MySQL table data is parsed and used to create visualizations in Power BI. (A condensed sketch of steps 1 through 3 follows below.)

Output visualizations using Power BI
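
Here is a condensed, hypothetical sketch of what steps 1 through 3 could look like inside the daily job. The DPaaS scheduling wrapper is omitted, and the MySQL connection details and the column layout of the metrics table are assumptions for illustration.

```python
# Hypothetical sketch of the daily job (steps 1-3). Assumes the MySQL
# JDBC driver is on the Spark classpath; host, database, and credential
# handling are illustrative, not Xandr's actual configuration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("cia-metrics-daily")
         .enableHiveSupport()
         .getOrCreate())

# Step 1: load the exported synthetic-groups table into a DataFrame.
groups = spark.table("export_wm_synthetic_groups_pb")

# Step 2: run operations against the DataFrame to collect a metric,
# here the household size distribution tagged with the run date.
metrics = (groups
           .groupBy("household_id")
           .agg(F.countDistinct("identifier_id").alias("household_size"))
           .groupBy("household_size")
           .count()
           .withColumnRenamed("count", "num_households")
           .withColumn("metric_date", F.current_date()))

# Step 3: publish the metrics to the MySQL table that Power BI reads.
(metrics.write
 .format("jdbc")
 .option("url", "jdbc:mysql://mysql.example.internal:3306/data_science")
 .option("dbtable", "cia_metrics")
 .option("user", "metrics_writer")
 .option("password", "<from a secret store>")
 .mode("append")
 .save())
```

Appending a dated row set each day, rather than overwriting, is what lets Power BI plot how each metric moves over time.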

While working on my project, I faced a number of challenges; for example, I initially did not know what a DPaaS job was or how it works. Thanks to the help of my team members, I was able to overcome such challenges quickly. Working on a real-world project at Xandr gave me the opportunity to understand the whole development and deployment process of an application, as well as how Agile methodologies, CI/CD with Jenkins, and container tools like Docker and Kubernetes can be leveraged to deliver a product quickly and effectively.

Apart from working on the project, I took part in the intern Console Challenge, which gave me a chance to put myself in the shoes of a trader and learn how the Invest DSP works. I attended various teach-and-learn sessions focused on public speaking, networking, predictive analytics, and TigerGraph training, all of which helped me strengthen both my professional and technical skills. We also had several fun game nights, and I enjoyed meeting new people. In particular, I really appreciated how Xandr’s teach-and-learn sessions allowed me to stop worrying and ask questions comfortably.

My experience at Xandr was both exciting and challenging. The internship was a great learning opportunity and pushed me to learn many new technologies. During the internship, I got the chance to meet many wonderful people, discuss my career path, and listen to their experiences and suggestions, all of which gave me a clearer perspective on my future career opportunities.
