
At zeotap, a large amount of structured and unstructured data from data partners is transformed and converted into easily queryable, standardized data sets, which are fanned out to a variety of destinations. These transformation jobs use Apache Spark as the distributed computing framework, and a fair share of them are batch processing jobs. Batch processing is the non-continuous processing of data in discrete chunks. Much of the batch processing at zeotap runs at a fixed frequency, with input data accumulating between runs.
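As a rough illustration of this batch model (the real jobs run Spark over S3; the file names, timestamps and helper below are hypothetical, not zeotap's actual code), each run can select just the input files that accumulated since the previous run:

```python
import datetime as dt

def files_for_batch(files, last_run, now):
    """Select input files that arrived since the previous batch run.

    `files` maps file name -> arrival timestamp; a real job would list
    an S3 prefix instead of taking a dict.
    """
    return sorted(name for name, arrived in files.items()
                  if last_run < arrived <= now)

files = {
    "partner_a/2020-01-01.csv": dt.datetime(2020, 1, 1, 2, 0),
    "partner_a/2020-01-02.csv": dt.datetime(2020, 1, 2, 2, 0),
    "partner_b/2020-01-02.csv": dt.datetime(2020, 1, 2, 5, 0),
}
last_run = dt.datetime(2020, 1, 1, 6, 0)
now = dt.datetime(2020, 1, 2, 6, 0)
print(files_for_batch(files, last_run, now))
# ['partner_a/2020-01-02.csv', 'partner_b/2020-01-02.csv']
```

Everything outside the `(last_run, now]` window is left for the next run, which is what distinguishes this style of processing from a continuous streaming job.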

We currently use two different types of systems to launch and manage these Spark jobs. …



Writing tests is analogous to tasting your meal before you serve it. The importance of unit testing is understood at all levels of programming, but it is too often ignored by UI developers. This post briefly describes how you can become a better frontend engineer by incorporating these key unit-testing concepts into your code.

Overview

1. Importance of Unit testing

2. Sample App

3. Conclusion

Importance of Unit testing

Writing unit tests can seem like overhead when you could just test the functionality by using it. For the times you face this dilemma, keep these few points in…
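To make the idea concrete (the article's own sample app is not shown in this excerpt, so the helper below is hypothetical, and it is sketched in Python rather than a frontend stack; the pattern is the same in any language): write one small test per behaviour, with a name that states the intent.

```python
def discount(total, code):
    """Hypothetical cart helper: apply a discount code to an order total."""
    rates = {"SAVE10": 0.10, "SAVE20": 0.20}
    if code not in rates:
        return total
    return round(total * (1 - rates[code]), 2)

# pytest-style unit tests: one behaviour per test, names describe intent
def test_known_code_reduces_total():
    assert discount(100.0, "SAVE10") == 90.0

def test_unknown_code_is_ignored():
    assert discount(100.0, "BOGUS") == 100.0

def test_result_is_rounded_to_cents():
    assert discount(99.99, "SAVE20") == 79.99

for t in (test_known_code_reduces_total,
          test_unknown_code_is_ignored,
          test_result_is_rounded_to_cents):
    t()  # a real project would let a test runner discover these
```

When the pricing logic later changes, these three lines of intent survive as a safety net, which is exactly the "tasting before serving" the article argues for.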



Monitoring is one of the critical yet under-appreciated aspects of infrastructure. Nobody cares when your servers and applications are running fine, but as soon as your website is down or your application is not responding, the first question that comes to mind is "Why is there no monitoring enabled for this piece?"

For starters, the purposes of monitoring fall broadly into the categories below:



In the previous post, we learnt why and how to transform our real-world, high-dimensional categorical data into a lower-dimensional continuous space of 'Embeddings'. In this post, we shall learn two ways to generate samples from these embeddings.

PART II : Generating samples

The idea of a sampling method is to pick just enough points to capture the variance in the data, so that the sample closely represents the input. The two techniques described below take embeddings of the training data as input and output samples. …
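This excerpt does not name the two techniques, so the sketch below is generic and illustrative only, not the article's methods: a uniform random baseline, and a greedy coverage sampler that spreads picks across the embedding space, which is one simple way to "best explain the variance".

```python
import math
import random

def random_sample(embeddings, k, seed=0):
    """Baseline: uniform random sample of k embedding vectors."""
    rng = random.Random(seed)
    return rng.sample(embeddings, k)

def farthest_point_sample(embeddings, k):
    """Greedy coverage sampling: repeatedly pick the point farthest from
    the points chosen so far, so samples spread over the embedding space."""
    chosen = [embeddings[0]]
    while len(chosen) < k:
        nxt = max(embeddings,
                  key=lambda p: min(math.dist(p, c) for c in chosen))
        chosen.append(nxt)
    return chosen

# Two tight clusters of toy 2-D embeddings: coverage sampling picks one
# representative from each, while a random sample may miss a cluster.
embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.1)]
print(farthest_point_sample(embeddings, 2))
# [(0.0, 0.0), (5.0, 5.1)]
```

The contrast between the two functions is the whole point: uniform sampling matches the input distribution in expectation, while coverage sampling trades that for guaranteed spread.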



In a real-world scenario, we often come across data with high sparsity and a large number of features with high categorical dimensionality. Because of this complexity, defining a distribution for the data is a challenging task. As the distribution is not a standard one, basic sampling techniques cannot be applied here.

We overcome this challenge by transforming this complex data into a lower dimensional continuous space called Embeddings, followed by sampling techniques that approximate this data distribution.

Part I: Problems and solutions for real-world data

Part II: Generating samples

Why should I be interested in dataset samples?



At zeotap, we use AWS for all our cloud processing needs. By early 2018, we found that our AWS expenditure had been increasing at an alarming rate, with a CAGR of 8.05%.

This was worrisome. For almost every startup, cost is one of the top priorities in infrastructure setup (this is where cloud platforms are saviors), and every large expense saved contributes directly to a startup's bottom line. It is a big problem if we cannot attribute the spend and analyze its worthiness, and that's when we decided the problem had to be tackled. After closely analyzing it and implementing a solution, we were able to reduce our expenditure by 50%. …
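The actual solution is not described in this excerpt; as a hedged sketch of the attribution idea only, grouping cost line items by a cost-allocation tag might look like this (the rows, tag names and helper are hypothetical, while real input would come from the AWS Cost & Usage Report):

```python
from collections import defaultdict

# Hypothetical cost-report rows: (service, tags, usd).
line_items = [
    ("EMR", {"team": "data-eng", "env": "prod"}, 1200.0),
    ("EMR", {"team": "data-eng", "env": "dev"}, 300.0),
    ("Redshift", {"team": "analytics", "env": "prod"}, 900.0),
    ("S3", {}, 150.0),  # untagged spend cannot be attributed
]

def spend_by_tag(items, tag):
    """Total spend per value of a cost-allocation tag."""
    totals = defaultdict(float)
    for service, tags, usd in items:
        totals[tags.get(tag, "untagged")] += usd
    return dict(totals)

print(spend_by_tag(line_items, "team"))
# {'data-eng': 1500.0, 'analytics': 900.0, 'untagged': 150.0}
```

Once every resource carries a tag, the "untagged" bucket shrinks to zero and each line of spend has an owner who can judge its worthiness.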



I joined the company in April last year, into a team of data engineers. The team consisted of 3 interns and 3 data engineers collaborating through the version control system Git. We didn't have much of a structure in place to peer review the code that went into production. We did use pull requests on GitHub at the time, but our git repositories had all the issues an early-stage startup might have. We added inessential files into the repositories (such as .DS_Store) …



At Zeotap we source data from countless enterprise partners, which we then refine, blend and convert into high-performing data segments used for targeting. Last year the number of profiles under management grew 10X, from 250M to 2.5Bn.

An introduction to our data pipeline was given here — Oozie! It's pouring data!. In this article we will further elaborate on our profile store, its purposes and our choice of tech. Our profile store is where every data feed finds its place. It is the heart of our advertising intelligence.

Key Terms

Profile — all attributes of a user ranging from demographic to interest and intent.



zeotap makes large scale, deterministic data assets easily available within the digital advertising ecosystem and other industries. Our data engineering team transforms unstructured raw data that we receive from a multitude of data partners into structured readily queryable data which can then be monetized through different distribution channels. In this article, I will begin with a high-level introduction to our data pipeline and elaborate on how we use Apache Oozie as a distributed controller for our data platform with some customisations to make it work for our use case.

Our whole data platform runs on Amazon Web Services (AWS), making heavy use of a diverse set of tools and services provided by AWS. We receive data from various data partners every day in designated S3 buckets in pre-decided formats. We then process it on AWS Elastic MapReduce (EMR) using Apache Spark through various stages in the pipeline. The processed data is uploaded to Amazon Redshift clusters (an OLAP MPP database) and made available for efficient querying by different teams as well as by various internal APIs that we have built on top of it. All of this is scheduled and monitored through Apache Oozie, which comes with Amazon's EMR service. …
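The article's actual workflows are not shown in this excerpt; as a hedged sketch of the shape such a definition takes, a minimal Oozie workflow launching a single Spark action might look like the following (the workflow name, class and jar path are hypothetical):

```xml
<workflow-app name="daily-transform" xmlns="uri:oozie:workflow:0.5">
    <start to="spark-transform"/>
    <action name="spark-transform">
        <spark xmlns="uri:oozie:spark-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn-cluster</master>
            <name>TransformJob</name>
            <class>com.example.Transform</class>
            <jar>${nameNode}/apps/transform.jar</jar>
        </spark>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Transform failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

In practice a coordinator app wraps a workflow like this to trigger it on a schedule and on input-data availability, which is what makes Oozie suitable as the pipeline's distributed controller.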



Today

Ieva leads zeotap's business development efforts in North America. As part of this role, Ieva and the team engage and test dozens of data partners every week against zeotap's strict high-quality data standards. Her day-to-day responsibilities also involve leading our identity business globally and working on other strategic business development opportunities.

Background

Ieva joined zeotap in the winter of 2015 and moved to the company's headquarters in Berlin from the ancient halls of the University of St. Andrews, Scotland, where she finished her Master's degree in Finance. Ieva set foot in St. Andrews with an ambition to pursue a career in investment banking, but it was the University's entrepreneurship professor who inspired her to move into the growing data and mobile advertising industry. …

About

zeotap

Zeotap is a Customer Intelligence Platform that helps brands understand customers and predict behavior.
