Scaling the data infrastructure to support your growing company

Louai Ghalia
Published in ekoEngineering · Jul 15, 2020
(Illustration by Itai Raveh)

All companies start small

When you’re a small company, you have small data needs. You have a few stakeholders, like management, product, or support, who need to make business decisions based on data. The process includes tracking events in your software, leveraging popular pixels such as Facebook’s or Google’s, creating dashboards, and using them to gain insights to drive those business decisions.

Luckily, many existing SaaS tools allow you to quickly tackle each of those steps without putting too much thought and effort into setup, scale, and integration. We started our data-driven journey with Mixpanel, which suited our needs until the business began to grow and required more flexibility and capabilities from our data infrastructure.

<diagram of phase 1 of our data infrastructure> (Illustration by Itai Raveh)

Starting to scale

As your company starts to scale, more challenges accumulate in the realm of data.

From the business aspect, you have more stakeholders, partnerships, and business limitations. Your clients might also need to use or gain insights from the data you collect.

From the technological aspect, as the amount of data grows, so does the difficulty of storing, retrieving, and querying it. Luckily, most 3rd party services are well equipped to handle this type of scale. The downside is the high cost of scaling through these external providers. In addition, the insights you can get will be limited by the capabilities and paradigms of the service you choose.

When we started to scale our data operation at eko, we first had to tackle the fact that all of our data was living on external services instead of being hosted by us, which restricted our options. We wanted maximum flexibility over how to run queries on our data warehouse and avoid any limitations set by the external service provider for viewing and analyzing data.

At eko, we prefer to put effort into developing software that directly powers our core business domain, Choice Driven Experiences, rather than pieces of code that don’t deal directly with it. This is why our next step was selecting a scalable, managed infrastructure that lets us own the data and control how it is collected, enriched, validated, and stored, without having to build and operate all of that plumbing ourselves.

We ended up going with Amazon’s Kinesis Firehose to stream data into Redshift, and developed a pipeline in which a JS-based client-side tracker sends event data to a cloud-based system, which then feeds Redshift via Firehose and S3. We achieved our goals by connecting and integrating proven services. This scalable system handled 200,000 events a day with ease, thanks to the underlying services doing the heavy lifting, allowing us to focus on developing the technology backbone for our core business.
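To give a feel for how thin our own code could stay in this phase, here’s a minimal sketch of the kind of server-side relay described above, assuming a hypothetical delivery stream name and event shape (this isn’t our production code):

```js
// Minimal sketch: receive a tracked event and push it to a Kinesis Firehose delivery stream,
// which buffers to S3 and loads into Redshift. Stream name and event shape are hypothetical.
const AWS = require('aws-sdk');

const firehose = new AWS.Firehose({ region: 'us-east-1' });

async function forwardEvent(event) {
  await firehose
    .putRecord({
      DeliveryStreamName: 'events-to-redshift', // hypothetical delivery stream
      // Newline-delimited JSON keeps records separable when Redshift COPYs them from S3.
      Record: { Data: JSON.stringify(event) + '\n' },
    })
    .promise();
}

// Example usage: forward a single tracked event.
forwardEvent({ name: 'video_play', sessionId: 'anon-123', ts: Date.now() }).catch(console.error);
```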

Extending the system with external tools

As our business grew, our requirements for data evolved, so our data infrastructure needed to evolve as well.

Growth meant more systems, processes, and stakeholders, each with their own set of tools and data sources. To collect and analyze all these different sets of data, we required tools that provide more flexibility than the mere basics. We ended up using Stitch, an ETL service that ships with multiple integrations and offers a lot of versatility. It gave us a way to rapidly move data from one source to another and get cross-system insights.

Getting raw data into the data warehouse is just the first step. To analyze it and get real insights, it has to be processed and transformed. That came in the form of dbt, which allowed us to construct an abstraction layer over the raw data.
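As a rough illustration of what that abstraction layer looks like, a dbt model is just a SQL select that dbt materializes as a view or table on top of the raw events. The source, model, and column names below are hypothetical, not our actual models:

```sql
-- models/decision_events.sql (hypothetical model name)
-- A minimal dbt model sketch: clean and rename raw tracker events into an
-- analyst-friendly table. Source and column names are illustrative only.
select
    event_id,
    user_id,
    event_name,
    sent_at::timestamp as sent_at
from {{ source('raw', 'events') }}
where event_name is not null
```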

From the data consumer perspective, we chose Mode Analytics as our dashboarding tool. Among its features are the ability to run SQL alongside scripting languages like Python and R, and to present graphs and visuals, which we can then export and embed in our internal CMS. We looked at several alternatives that presented better graphs but offered less powerful SQL or scripting interpreters.

<diagram of phase 2 of our data infrastructure> (Illustration by Itai Raveh)

This phase of the infrastructure was scalable and served us quite well for a while, but the time came to make another evolutionary leap in our data infrastructure.

Playing with snow

From the start, one of the requirements was getting real-time insights from viewers watching our content. Redshift wasn’t fast enough to accommodate this, so we had to develop an additional solution.

It’s a pivotal moment when your organization doesn’t just use data for itself and its decision-makers but provides data solutions to its customers. This often means your service must also comply with rules and regulations such as GDPR and COPPA, which brings another set of requirements and limitations. We had to overcome these challenges when developing the platform and technology for the Walmart Toy Lab, an online interactive experience that invites kids to test this year’s most-wanted toys.

The General Data Protection Regulation applies to anyone collecting data from EU citizens. It has two broad categories of compliance: data protection, which means you have to keep user data safe from unauthorized access, and data privacy, which empowers users to decide for themselves who processes their data and to what end.

The Children’s Online Privacy Protection Act is a law that protects the privacy of children under the age of 13 in the US. COPPA establishes stringent guidelines for online businesses and considers everything from an IP address to geolocation to photographs to be personal information.

To support these challenges, we required flexibility. Besides its limitations in speed, our second caveat with Redshift was its rigid columnar schema, which limited the shape of the data we could insert. One of our interim solutions was a generic spillover column that stored any extra data collected, but obviously this was not ideal (more on that later).

With the company changing strategy to become more data-driven, it was clear that we had to refactor the pipeline and store our data in a warehouse that offered better solutions and UX to our data analysts.

This led to our current architecture, based on Snowplow for pipelines and Snowflake as a DW. (Fun fact: while both products carry the same arctic theme, the companies behind them are unrelated!) Together, they offer an open-source-based solution with a solid architecture, giving us control over the customization of various system components and several benefits over our previous pipeline.

<diagram of phase 3 of our data infrastructure> (Illustration by Itai Raveh)

The things we like about Snowplow:

Separate collection and analysis

The Snowplow paradigm consists of six loosely-coupled sub-systems connected by five standardized data protocols/formats, which simplifies the task of designing and implementing data pipelines for specific use-cases. The use of strict, structured event data that is validated on the backend and later enriched forces us to be more disciplined and thoughtful about the data we’re collecting and the pipeline it passes through.

While we did have client-side data validation in earlier iterations of our data infrastructure, data analysis concepts sometimes got mixed into data collection practices. Snowplow’s mental model solves this by enforcing a strict separation between data collection (which only includes known state) and the properties added during enrichment and analysis. It also provides flexibility in the structure of the data you want to store, thanks to the ability to customize the JSON schemas.

(Illustration by Itai Raveh)

Flexibility with events

Events are stored in near real time, whether they’re “good” or “bad” (“good” and “bad” referring to whether they pass validation against their Iglu schemas). Snowplow lets you store the invalid events in a “bad raw events” store (in our case, hosted on Elasticsearch). The combination of speed and flexibility in storage allows us to identify potential data/schema-related bugs in real time.

This means that on the one hand, you get the benefit that comes with saving all events, and on the other, you enjoy the comfort of knowing that your data is sound, as events are validated against Iglu schemas. This makes people happy on both ends of the pipeline (collection and analysis).

Elasticsearch offers a way to run search queries efficiently and with good results, which comes in handy when sifting through the enormous number of tracked events. Using regexes and wildcards, one can easily query events whose name starts with a certain prefix or that come from a certain application.
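As a rough sketch of what such a lookup can look like (the index name, field name, and Elasticsearch endpoint below are hypothetical, not our actual setup), a wildcard query finds every stored event whose name starts with a given prefix:

```js
// Sketch: search a hypothetical events index for events whose name starts with "video_".
// Index, field, and endpoint are illustrative; the wildcard-query technique is the point.
async function findEventsByPrefix(prefix) {
  const response = await fetch('http://localhost:9200/bad-events/_search', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      query: {
        wildcard: { 'event_name.keyword': { value: `${prefix}*` } },
      },
    }),
  });
  const body = await response.json();
  return body.hits.hits; // matching documents
}

findEventsByPrefix('video_').then((hits) => console.log(hits.length));
```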

You can define multiple contexts for each event, and each context automatically translates to a column in your DW that can then be queried. New contexts can be added further down the road, for maximum flexibility.

Here’s an example that demonstrates this: let’s say a user makes a decision during the narrative of one of eko’s interactive experiences. We want to track this choice event along with properties related to the decision, so we define a “decision context” that holds these properties, described by an Iglu schema.
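Such a schema could look roughly like this (the vendor, property names, and types are illustrative, not the exact production schema):

```json
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Illustrative sketch of a decision context; field names are hypothetical",
  "self": {
    "vendor": "com.eko",
    "name": "decision_context",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "nodeId": { "type": "string" },
    "choiceLabel": { "type": "string" },
    "timeToDecideMs": { "type": "integer", "minimum": 0 }
  },
  "required": ["nodeId", "choiceLabel"],
  "additionalProperties": false
}
```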

At a later point, we figure out that for every such event we also want to store information about the project that contains the decision, in addition to data about the decision event itself, so we define a “project context” schema.
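Again, roughly (and with illustrative names):

```json
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Illustrative sketch of a project context; field names are hypothetical",
  "self": {
    "vendor": "com.eko",
    "name": "project_context",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "projectId": { "type": "string" },
    "projectName": { "type": "string" }
  },
  "required": ["projectId"],
  "additionalProperties": false
}
```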

And so we track an event to the Snowplow collector using the two separate schemas.
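Here’s a sketch of what that call can look like with the Snowplow JavaScript tracker (v2-style sp.js API); the schema URIs and values follow the hypothetical schemas above rather than real production identifiers:

```js
// Track a self-describing "choice" event with both contexts attached.
// Schema URIs, vendor, and values are illustrative, matching the sketches above.
window.snowplow(
  'trackSelfDescribingEvent',
  {
    schema: 'iglu:com.eko/choice_event/jsonschema/1-0-0', // hypothetical event schema
    data: { experienceId: 'abc-123' },
  },
  [
    {
      // Decision context, validated against the first schema above.
      schema: 'iglu:com.eko/decision_context/jsonschema/1-0-0',
      data: { nodeId: 'node-42', choiceLabel: 'take the left path', timeToDecideMs: 1800 },
    },
    {
      // Project context, validated against the second schema above.
      schema: 'iglu:com.eko/project_context/jsonschema/1-0-0',
      data: { projectId: 'proj-7', projectName: 'Toy Lab' },
    },
  ]
);
```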

Snowplow is open-source

This means we have a proven, industry-standard solution that we can extend and fine-tune to fit our use case. We can mix and match different entities in the Snowplow ecosystem to have full control over the topology and configuration of our pipelines. Being open source, it also gives us the opportunity to contribute back, and we even forked and altered a few repos to better match our work environment.

Snowplow’s thriving and active community means 3rd party tools are available for testing and debugging the data pipeline. One of the more useful ones we’ve used is an open-source Chrome extension that listens to the events fired from the Snowplow tracker to the Snowplow collector.

And as for Snowflake — it too makes us happy in several ways:

  • With Redshift, we were limited by the structure of our data and had to create a generic column and use it as a “dumpster” for all new attributes — this made it harder to add data and perform queries. Snowflake lets you easily store JSON objects as fields (our contexts end up as such columns) and query them. You can even query nested fields inside the JSON (see the sketch after this list).
  • The ability to separate the storage and computation units. It’s up to you to decide how to scale each of them and with which cloud service. This is in contrast to something like Redshift, which, as an AWS product, can only run on AWS machines (such as EC2).
  • You can abstract away the query-computation infrastructure by creating virtual warehouses, or clusters, for query execution and data loading. A neat feature is that Snowflake automatically suspends them when you don’t need them, to save money (and energy).
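To make the first and last points concrete, here’s a rough Snowflake sketch (table, column, and warehouse names are hypothetical, not our actual setup): querying a nested field inside a JSON column, and creating a virtual warehouse that suspends itself when idle.

```sql
-- Hypothetical names throughout; a sketch rather than our production setup.

-- Query a nested field inside a JSON (VARIANT) column directly:
select
    event_id,
    decision_context:choiceLabel::string as choice_label  -- drill into the JSON
from events
where decision_context:nodeId::string = 'node-42';

-- Create a virtual warehouse that suspends itself after 60 seconds of inactivity:
create warehouse if not exists analytics_wh
    warehouse_size = 'XSMALL'
    auto_suspend = 60
    auto_resume = true;
```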

Snowplow and Snowflake have another big plus for us: complete control over which cloud services to use. Each component (the collector, the stream, the DW, the computation unit, and so on) can be hosted where we choose; the streaming layer, for example, can be backed by Amazon Kinesis, Google Pub/Sub, Apache Kafka, or NSQ. With complete control over our data and its infrastructure, we can make sure we comply with the technical and legal needs of large-scale clients (for example, the implications of being COPPA-compliant, as described above).

Looking ahead

We’re happy with our current infrastructure, but as you know, business and technology are constantly changing, shifting, and evolving, which means we are always evaluating and questioning our paradigms and solutions.

As we grow, so does the number of our business partners, as well as the number of marketing and advertising channels we use. This means our data is spread among multiple sources. One way to approach this is to run ETLs that collect from these separate sources and store the results, structured, in our DW.

Another approach we’re considering for the future is, instead of running and managing these ETLs, to throw everything, unstructured, into a data lake. Once data of all sorts and shapes is stored, we can structure it into schemas for a specific analytics view, rather than worry about the structure of events at the data-loading phase. With this schemaless approach, our analysts can structure the data in whatever way best suits our different needs, which will make it easier to make data-driven business decisions. After all, that’s what data is all about!

My name is Louai, and I’m part of the eko Engineering dev team. Check out my Developer Spotlight interview if you’d like to get to know me and my work a little better. And, if you’re like us — love “big data” but think the term is an overused buzzword — we should talk!
