
Igor Elbert on Data Science and Analytics at Gilt.com
Igor Elbert, Distinguished Data Scientist at leading online fashion retailer Gilt.com, talks about the company’s recent data projects, challenges and approach to ETL.
How is your team structured? How does it work with the rest of the organization?
I lead a small data science team embedded in a larger data team. Our charter is to find insights in the data Gilt collects. We are looking for trends that help us spot best practices, make better buying and selling decisions, and understand our customers.
We closely collaborate with other teams within Gilt that deal with personalization and marketing analytics.
What are your most important challenges related to data?
The main challenges are data quality and availability. For example, I would love to have richer metadata on the products we sell — it would help with demand prediction, personalization, etc.
What type of data architecture does Gilt use?
We use Aster from Teradata for our data warehouse. While most of Gilt’s services and production databases are in the cloud, the data warehouse runs on hosted hardware.

How much of your team’s time is spent on “data preparation” versus analyzing data and drawing insights?
Fortunately, our data team includes dedicated engineers and a homegrown framework that allows very quick integration of data sources. Still, my team spends about 50–60% of its time understanding, transforming and cleaning the data. A couple of years ago this percentage was higher, around 80%, but now we can often reuse data preparation queries from previous projects. Some aspects of the data preparation have become part of our ELT process, which also reduces the effort.
What do you believe are the main factors that cause data preparation to be such a pain point?
The main factor is that data preparation is an inherently complicated problem. We need to understand all the concepts the data represents and the interrelationships between them. Somebody needs to make a judgement call about the quality of the data, detect and fix multiple issues, and think about a meaningful way to deal with missing values, missing data points, outliers, and more. That takes time, and some aspects are hard to automate.
Are you using ETL tools for your data preparation?
We have a homegrown ELT (not ETL) framework, which allows us to quickly bring data into our data warehouse, clean it and make it available for analysis.
How much data do you collect, process and act on, on a regular basis?
We collect every click and every event, so the data volume is substantial.
What are some interesting projects you’ve worked on recently?
One of our recent projects involved finding optimum prices for products in a sale based on predicted demand. Now the algorithm has become a trusted partner for our merchandisers.
They specify a combination of goals, like revenue, margin, sell-through, etc., as well as other constraints. Then, the algorithm suggests the best combination of prices. It considers hundreds of product and sale attributes to predict expected demand for all possible prices. Most of the products it deals with are new, but we have a rich history of similar products so we are able to fit an accurate model.
The data preparation is done in Aster using SQL and SQL-MapReduce. The modeling, scoring and optimization is done in R. Most of the tasks are run in parallel on workers from our database cluster.
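To make the shape of that pipeline concrete, here is a minimal R sketch of the modeling and price-scoring step. It is an illustration only, not Gilt’s actual code: the column names, the Poisson demand model, the price grid and the sell-through constraint are all assumptions.

    # Minimal illustrative sketch, not Gilt's pipeline. Assumptions: column
    # names, a Poisson GLM as the demand model, and a simple revenue objective.
    library(dplyr)

    # Toy stand-in for the historical sales of comparable products that would
    # normally be prepared in the warehouse.
    set.seed(42)
    history <- data.frame(
      price      = runif(500, 20, 100),
      brand_tier = sample(c("premium", "mid", "value"), 500, replace = TRUE),
      season     = sample(c("spring", "fall"), 500, replace = TRUE)
    )
    history$units_sold <- rpois(500, lambda = exp(5 - 0.03 * history$price))

    # Demand model: units sold as a function of price and product attributes.
    demand_model <- glm(units_sold ~ price + brand_tier + season,
                        family = poisson(link = "log"), data = history)

    # Score a grid of candidate prices for a new product and pick the price
    # that maximizes expected revenue, subject to a minimum expected demand.
    candidates <- data.frame(price = seq(20, 100, by = 1),
                             brand_tier = "mid", season = "spring")
    candidates$expected_units   <- predict(demand_model, candidates, type = "response")
    candidates$expected_revenue <- candidates$price * candidates$expected_units

    best_price <- candidates %>%
      filter(expected_units >= 10) %>%   # hypothetical sell-through floor
      slice_max(expected_revenue, n = 1)
    print(best_price)

In the setup described above, scoring of this kind would run in parallel across workers on the database cluster rather than on a single machine.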

What were your most important considerations when designing the solution and architecture to implement it on?
Our Aster database was a good match for the project. It allows us to quickly process large amounts of data, run hundreds of ad-hoc queries during the discovery stage, and hide the complexity of the solution by presenting the result of a SQL-MapReduce job as a table. This way, we use SQL for what SQL does best — data manipulation — and we use R for modeling, scoring and optimization.
What were the criteria you applied when choosing the solution?
Aster is our MPP database of choice at the moment. It is our default choice for large-scale data manipulations. R was picked because of the multitude of available packages including interfaces to popular modeling and optimization libraries. R’s dplyr package, in particular, is a pleasure to work with.
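As a generic illustration of why dplyr is pleasant to work with (toy data, not taken from the interview), a typical aggregation reads almost like the question being asked:

    library(dplyr)

    # Toy sale data: summarize units and revenue by category,
    # highest-revenue categories first.
    sales <- data.frame(
      category = c("shoes", "shoes", "dresses", "dresses"),
      units    = c(3, 5, 2, 7),
      price    = c(40, 35, 120, 90)
    )

    sales %>%
      mutate(revenue = units * price) %>%
      group_by(category) %>%
      summarise(total_units = sum(units), total_revenue = sum(revenue)) %>%
      arrange(desc(total_revenue))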
Were there any mistakes you encountered which you can warn us not to repeat?
Design for change. The business rules, constraints and goals you started with will change before you’ve even finished coding. Business is very dynamic — a good solution should accommodate that.

What’s the best advice you can give to someone facing a similar use case?
Enable business users to be as self-sufficient as possible. They should be able to tweak the parameters and get results.
What changes have you seen take place in the job requirements of a data / BI pro since you first started in this field?
We’re expected to handle both technical and business aspects like never before. At the same time, a new breed of technically savvy business users has emerged. It raises the bar for collaboration, as well as the requirements for data quality.
What are the most important tools for companies to implement if they want to succeed in fulfilling the data missions of tomorrow?
Every successful tool needs to advance the idea of “data democratization” — making data easily available to everyone who makes decisions or answers questions that data can inform.
Next, tools need to make data-driven decision-making possible for companies that are currently only “data-aware” at best.
Then, as a next step after BI, practical machine learning should become more readily available to a wider audience.