Building a fit-for-purpose modern data stack.
Welcome to the first of many blog posts from the Data & Analytics team at Zip. We look forward to sharing our experiences, successes, and failures in the data space and hearing back from you on the topics we share.
It seems right that our first post as a data and analytics team introduces our data platform and the journey over the last 18 months of transitioning from a fairly standard lakehouse implementation (S3, custom scripts and AWS Athena) to a modern data stack (dbt, Fivetran, Airbyte, Snowflake, Census, Snowplow, Airflow). While the platform machinery has changed fairly extensively, our consumption layer has remained relatively constant throughout, with data scientists primarily using Databricks and our analytics team producing BI reporting in Tableau.
I hope you enjoy this write-up by our Senior Manager of Data Products, Moss Pauly.
Tal Bergman — Director, Data and Analytics
With the new year well underway, I’ve been reflecting on this journey from old to new. We were lucky enough to have a great team to navigate the many decisions needed, as well as access to a fantastic data community here in Sydney, Australia. Even with a supportive community and an internet worth of reading material, I would have loved an in-depth article that went into the decision-making process of building a modern data stack and selecting components.
With that in mind, here’s the article I wish I’d had access to at the start of our journey.
Start with people’s experiences
Chances are that no matter what problem you’re looking at, you’re not the first. One of the most valuable things you can do is learn from those who have tackled the problem before you and reach out to your peers.
What do you like about your stack and what would you do differently next time?
I can’t remember how many times I asked this question throughout this journey. It must have been pushing upwards of 30 across a lot of different companies and people. When you’re early on in the process, you cannot ask this question enough. Generally speaking, people will be far more diplomatic in writing than in conversation, and we found that in order to really get a deep understanding of their experiences with different tools, nothing beat having a chat over a beer or two. We found pretty quickly in these chats that there were common callouts that really helped guide our decision-making.
Another great resource for understanding people’s experience with different components of the stack is talking to vendors who integrate against them, so kick off conversations with a number of the providers you’re considering right at the start.
Everyone I’ve talked to in the modern data stack space is really friendly and passionate about data and the challenges around it. If you’re talking to someone about egress, ask them what data warehouse most of their customers are using and what trends they’re seeing. If you’re talking to someone about a data warehouse, ask them what they’re seeing for transformations or egress. These vendors have a bird’s-eye view of the landscape, and that’s really valuable to tap into.
Decision framework and process
We didn’t want to make decisions about the components of this stack lightly; rigour is really important here. In this section, I’ll cover how we evaluated decisions and some additional considerations.
The biggest benefit of the modern data stack is tight integration, with each component solving its own domain excellently. Decisions become easier as you lock in more components, because you know exactly what each new piece needs to integrate with. At the start you don’t have this, so you need a clear view of the problem spaces you’re solving and the players you might consider. As a starting point, we considered the following:
- Event Collection
- Data Ingress
- Data Warehousing
- Data Transformation
- Data Egress
Cost scalability was a key consideration for us. We’ve been burnt by event volumes before, so we went into cost scalability with our eyes wide open. We evaluated options in the following way:
- SaaS that can migrate to open source is a massive plus. This means that we can reduce time to value initially and always have an option to control costs if we need.
- Any paid component is evaluated at 1x, 2x and 4x expected volumes. This gives us an idea of the economies of scale on offer.
- All decisions are made after we’ve done a POC, got our hands dirty and actually played with it. Some things are great on paper but the workflows can be sub-optimal.
- Anything we want to deploy and manage ourselves has to run on a container platform and slot in with our operational tooling.
- Look at the supporting community for each tool. The larger and more accessible it is, the better.
- Listen for genuine delight when people talk about a tool, and follow your gut.
Lastly, document your decisions thoroughly. You’ve probably spent weeks researching, comparing and testing options, so this is in your best interests:
- Capture the options you considered with the pros and cons for each, and a clear articulation of your recommendation. It helps to clarify, compare, and take decision-makers on the journey.
- You (or someone else) may need to revisit a decision in the future, and it helps to have a detailed record of the context that was available at the time.
- This work will likely end in you asking your business for investment, and an in-depth written case is much more likely to get you the money than a strong verbal suggestion.
- If you have a process in your organisation for socialising and endorsing strategic decisions, use it. If you don’t, set up something lightweight that involves key stakeholders (and budget approvers).
Decision #1 — How are we going to transform our data?
We started here because we wanted to nail how we store business logic and transformations to build data models in a scalable, future-proof way. We didn’t want this decision bounded by constraints on our data warehouse, ingress, or egress compatibility.
We had the following core requirements:
- The tooling needed to support DAG-style dependencies an arbitrary number of layers deep.
- It needed to be SQL-based.
- We needed to tie documentation (e.g. metadata) directly to models so that we could check they stayed in sync as part of our PR process.
- The ability to capitalise on existing models for third-party data sources would be a nice-to-have.
Dependency management and documentation were both significant pain points of our existing transformation stack, which was essentially Airflow-orchestrated SQL running through Athena on an S3 data lake. While we looked at a few options in this space, dbt was the clear winner here. The sentiment around it is fantastic, and that became clear very quickly in conversations we had with other dbt customers.
Following the framework above, we ran a POC, documented the outcome and our recommendation to proceed with dbt, and then formalised the decision with our stakeholders.
After 18 months and with 1000+ models in production, we’re pretty happy with how dbt’s going. We’re currently on dbt Cloud, and the new hosted IDE is a game changer. We hit scaling issues pretty badly with our project size initially, but these are mostly resolved now.
Top tips for setting yourself up for success with dbt:
- Enforce documentation and test requirements from the start with dbt_meta_testing.
- Read GitLab’s dbt guidebook; it’s an incredible resource.
- Spend the time to go deep on macros and incremental models. They’re both insanely powerful; a minimal sketch of both follows below.
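To make that last tip concrete, here’s a minimal sketch of an incremental model that uses a macro; the model, source, column names, and the cents_to_dollars macro are all illustrative rather than taken from our actual project.

```sql
-- macros/cents_to_dollars.sql: a hypothetical convenience macro
{% macro cents_to_dollars(column_name) %}
    ({{ column_name }} / 100.0)::number(16, 2)
{% endmacro %}

-- models/marts/fct_orders.sql: illustrative model, source, and column names
{{ config(materialized='incremental', unique_key='order_id') }}

select
    order_id,
    customer_id,
    {{ cents_to_dollars('order_total_cents') }} as order_total,
    updated_at
from {{ ref('stg_orders') }}

{% if is_incremental() %}
-- on incremental runs, only process rows that changed since the last build of this table
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```

The ref() calls are what give you DAG dependencies any number of layers deep for free, and the is_incremental() block is what keeps model runs cheap once tables get large.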
Decision #2 — Data storage and compute
The second component that we looked at was the data warehouse itself — the compute and storage that dbt runs on. Given that we have a data science community operating in Databricks, we were keen on evaluating Delta Lake as an architecture alongside Snowflake as a data warehouse. We did extensive POCs on both, as we knew that this was not a decision we’d reverse easily and that it’d have an enormous impact on the quality of life of all data practitioners at Zip over the coming years.
There have been some entertaining fights in the data space of late about the performance of these solutions. As general advice, I’d say performance should be at the bottom of your list when it comes to making a decision. The reality is that they’re all fast; who really cares about basis points of speed compared to the quality of life of working with the tool day to day?
The main components we evaluated here were:
- If you want to run a quick query, what’s the time to result? (Opening the tooling, navigating it if required, waiting for a cluster to spin up, etc.)
- What granularity of cost visibility can we have easily?
- How well written is the documentation and how easy is it to find answers to questions?
- How well does it integrate with dbt?
- What would the management impost be on the small team responsible for operating and maintaining the platform?
- How easy will it be to hire people with previous experience on this or bring people up to speed from scratch on this?
- How easily can we manage PII redaction in this stack to protect our customers’ privacy?
After evaluating all of the above, we settled on Snowflake as the right choice for us. One of the pieces of advice I got from talking with someone at a company that switched about a year before us was:
There’s so much that really just works in it. Looking back, it was close at the time, but we’re really glad we made the decision that we did.
After using Snowflake for 18 months, I really couldn’t agree more with this. This is echoed throughout the data community at Zip.
See below for our high-level data warehouse architecture:
Top tips for setting yourself up for success with Snowflake:
- Spend the time to really consider your security model and work through how it can be applied with role inheritance.
- If you go with dbt as well as Snowflake, look at and leverage dbt_snow_mask to dynamically mask all PII from the start.
- Zero-copy cloning is amazing for enabling you to test model changes on a production clone. Use and abuse it; a quick sketch of cloning and role inheritance follows below.
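To make the role inheritance and cloning tips concrete, here’s a rough sketch; the role, database, and object names are illustrative rather than our actual setup.

```sql
-- Role inheritance: build small functional roles and grant them up the hierarchy (names are illustrative)
create role if not exists reporter;
create role if not exists transformer;

grant usage  on database analytics                to role reporter;
grant usage  on all schemas in database analytics to role reporter;
grant select on all tables  in database analytics to role reporter;

grant role reporter    to role transformer;  -- transformer inherits everything reporter can do
grant role transformer to role sysadmin;     -- keep the hierarchy rolling up to sysadmin

-- Zero-copy cloning: a writable copy of production in seconds, with no storage cost until data diverges
create database analytics_dev clone analytics;
```

A clone like analytics_dev is what lets you run a full dbt build of a proposed change against production-scale data without touching production itself.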
Decision #3 — Event Collection
Our previous stack’s inability to deal with the volume of raw events coming in was one of the catalysts for kicking off our journey in the first place. We decided to look in detail at Segment, RudderStack, and Snowplow. My partner in crime Adam Schmidt (Principal Engineer at Zip) and I had very strong opinions about what we needed in an event collection platform, having both worked fairly extensively in this space in the past.
- We did not want a solution where we’d constantly need discussions about what we can and can’t track due to cost.
- We must be able to enforce schema compliance on events.
- The solution must be scalable to billions of events a month and must be able to handle sudden spikes in volume gracefully.
- We wanted to have first-party ownership of our events.
We decided to do a POC of Snowplow due to a strong preference for open-source solutions in this space, and Adam deployed it to production. After extensive evaluation, we decided we were happy to take on the overhead of managing this infrastructure ourselves in exchange for not having a SaaS cost tied to events per month. This is the first decision I’ll outright say is not for everyone:
- Snowplow is incredibly flexible, but that’s a blessing and a curse. There’s a bit of a learning curve in understanding how to configure and optimise the pipeline, and you need to have well-supported streaming infrastructure (in our case Kafka) at your disposal.
- Our platform and tooling are mature, and we have a reasonably sized engineering team. If you’re a small-ish organisation, start with the cloud offering.
- We’ve had to find some creative ways to provide tooling for developing/testing/troubleshooting event implementations.
In addition to deploying the pipeline, there are a few somewhat special things about the way we integrated Snowplow that I’d strongly recommend implementing, regardless of how you collect events.
Top tips for instrumenting any event collection framework:
- Traceability of every event collected back to a specific system and team. We do this in Snowplow by assigning an app_id to each service. Any event that comes through with an app_id not registered in our system gets rejected, and every event is enriched with a service_name and team_owner based on a lookup against the app_id. This means we can trace every single event in our platform to a team owner. Even though we have 20+ services sending us page_view events, we know exactly which team owns each and every one that comes through (see the sketch after this list).
- Abstract the integration to enable centralised standardisation. We built out zip_analytics.js which is an opinionated abstraction library on the Snowplow SDK as well as a number of other analytics libraries. This makes our life so much easier than trying to coordinate 20+ teams to make a config change in their tracking implementation, and forces a predictable consistency in how event data arrives on the platform. This has already paid dividends with us rolling out a number of tweaks to how we track events over a very significant number of integrations with almost no coordination work needed.
- Ensure that the data exiting the pipeline can be loaded into your warehouse with minimal fuss, especially in the early stages where you’re iterating and extending your event models frequently. The loader should be as dynamic as possible to reduce engineering effort in adding and modifying events/contexts.
- Spend a bit of time up-front ironing out your principles for event namespaces, event names, and schema conventions. Use your most popular customer feature as a real-world test bed for how you intend to work with events going forward.
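On the traceability point above: we apply the app_id lookup inside the Snowplow pipeline itself, but the same idea can be sketched as a dbt model joining raw events onto a registry seed. Table, column, and seed names here are illustrative, not our actual implementation.

```sql
-- models/staging/stg_events_with_ownership.sql: illustrative sketch of the app_id lookup
with registry as (
    -- seeds/app_id_registry.csv with columns: app_id, service_name, team_owner
    select * from {{ ref('app_id_registry') }}
)

select
    e.event_id,
    e.app_id,
    e.event_name,
    e.collector_tstamp,
    r.service_name,
    r.team_owner
from {{ source('snowplow', 'events') }} as e
inner join registry as r
    on e.app_id = r.app_id  -- the inner join drops any event with an unregistered app_id
```

However you implement it, the important part is that ownership is resolved automatically from the app_id rather than relying on every team to tag their own events correctly.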
Decision #4 — Third-Party Data Ingestion
We deliberately evaluated solutions for first-party and third-party data separately due to the following reasons:
- Our appetite for an acceptable cost per data point is different for first-party and third-party data due to the massive difference in volume between the two.
- Data modelling can theoretically be a solved problem for a lot of third-party data sources; this is not the case for first-party data.
If you’ve been in the data space for a bit, you’ve likely had to reverse-engineer the data model of a third-party feed. Maybe this wasn’t too bad (shoutout to Twilio’s data structures ❤) or maybe you spent months of your life trying to model a Salesforce data feed (you have my sympathy). We were keen to capitalise on standard dbt models for third-party data sets where possible.
Fivetran was active in this space, so we decided to do a POC with them. We evaluated a few other options on paper, but Fivetran came out ahead. This decision focussed on:
- Coverage of our core third-party data feeds.
- The shape of landed data structures.
- Documentation of data structures.
- Prepackaged dbt transform availability.
On all of the above, we were happy with how Fivetran performed during our POC.
Recently, our CIO wanted to query Twilio data and pinged me about the Twilio table structure while I was getting coffee. I was able to send back a link to Fivetran’s Twilio ERD about 5 seconds later that fully explained everything. Well-documented third-party integrations are really valuable.
Top tips for Fivetran:
- For IDE performance reasons, we pulled out some of the prepackaged Fivetran dbt transforms we’d initially added. I suspect this isn’t as much of a problem anymore with dbt’s new IDE though. Moving forward, our sensible default is “add them when you need them, not when you can”.
- Use Fivetran to send its own usage logs to your warehouse. It gives you a nice set of tables for monitoring consumption and usage centrally in your BI tool of choice; a sketch of the kind of query this enables is below.
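As a sketch of what that monitoring can look like, here’s the kind of query we mean. The exact schema the Fivetran log connector lands, and the table and column names below, vary by connector and package version, so treat them strictly as placeholders and check the documentation for your setup.

```sql
-- Placeholder table and column names: adjust to whatever the Fivetran log connector creates in your warehouse
select
    connector_name,
    date_trunc('month', measured_date) as usage_month,
    sum(monthly_active_rows)           as total_monthly_active_rows
from fivetran_log.active_volume        -- placeholder schema.table
group by 1, 2
order by usage_month desc, total_monthly_active_rows desc;
```

Pointing a simple dashboard at a query like this makes it obvious which connectors are driving consumption before the invoice does.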
Decision #5 — First-Party Data Ingestion
At Zip, we have a pretty significant amount of first-party data. It’s highly distributed across a core monolithic SQL Server database, Kafka event streams, and the Postgres databases and DynamoDB tables that back our microservice architecture. We needed a scalable way to rapidly ingest all of these on an hourly basis, and the cost of pushing this volume of data through Fivetran wasn’t feasible for us, so we started looking elsewhere.
Similar to paying for event ingestion, we really wanted to avoid having conversations about what we do and don’t ingest based on cost. We looked again to the open-source community and discovered Airbyte, a relatively young open-source project that looked promising.
Our requirements were as follows:
- Support SQL Server
- Support CDC on SQL Server
- Support Postgres
- Support S3 Ingestion
- Snowflake Destination Support
- Cost Efficient
- Sub 1-hour integration of a new data source and/or tables
- Connector-based with a rich collection of pre-canned connectors, and the ability to build our own if we wanted to.
Airbyte fitted these requirements pretty well, so we started our POC, again deploying straight into our production environment. This journey was a little bumpier than some of our others, with one of our data engineers submitting over 10 PRs to the project to remediate issues with the MSSQL connector. The Airbyte community and core maintainers are really active and supportive, so getting the changes merged was not a problem.
That aside, we’re really happy with where we landed. Ingestion is quick, it’s stable, and it meets our needs. We’re doing hourly ingestion from a number of databases big and small and we haven’t needed to touch this much after the initial setup.
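One prerequisite worth calling out if you go down the same path: log-based CDC has to be switched on at the source before any connector can read changes from SQL Server. A minimal sketch, with illustrative database, schema, and table names, looks like this:

```sql
-- Run in the source database (illustrative names); enabling CDC on the database needs sysadmin
use zip_core;
exec sys.sp_cdc_enable_db;

-- Enable CDC per table you want replicated (needs db_owner)
exec sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'orders',
    @role_name     = null;  -- null means no gating role; consider locking this down in production
```

SQL Server Agent also needs to be running for the capture and cleanup jobs, which is easy to miss on a freshly provisioned instance.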
Decision #6 — Data Egress
We started a conversation with Census in this space back in the early days when we were looking at Snowflake and dbt, although it’s one of the most recent additions to our data stack.
We waited until we had a number of use cases we could solve with the tool before making the business case and getting it across the line. We then did a full POC on both Grouparoo and Census. Grouparoo is an open-source project recently acquired by Airbyte; given our existing experience with that project and our preference for open-source solutions, it made sense to see if it would work for us. While Grouparoo initially looked promising, we decided we weren’t comfortable with its maturity. The issues were primarily connection instability during scaling and the methodology for existing-record checks in the Braze connector.
I can’t wait to see what Airbyte does with Grouparoo in the near future, but it didn’t fit our needs at the time of evaluation. A lot of our use cases for data egress revolve around shipping data into Braze, our CRM platform.
The key aspects we were evaluating are below:
- Sub-3-hour updating of 4 million records in Braze
- Robust updating of only records that have changed (Braze charges per data point updated so this is a must for cost efficiency)
- Friendly UI to empower product and marketing teams to control what data ships where.
We were delighted with Census from the start and decided to move forward with them, a decision we’re really happy with. Since then, we have started doing some creative work with their segment builder to empower our product and marketing teams, which I’m sure we’ll share in a future blog post.
Top tips here:
- Triggering Census syncs on completion of dbt jobs is your friend for seamless integration.
- Entity models have been released recently. These are game-changing for enabling non-technical people to create segments that would normally require complex joins across a large number of data sources.
What’s next?
The above forms the core of the data platform at Zip.
However, I view this as really just the entry ticket into being a modern data-led company. We’re currently well underway with a host of next-phase activities built on a strong foundation:
- Enabling our data science teams with Feast as a feature store
- Visualising model performance using Streamlit applications
- Evaluating data catalogue options to move data trust and data definitions from being predominantly tribal knowledge to being explicitly documented and easily accessible
- Monitoring and alerting for data SLA breaches, data profiling and core funnel performance across 40k merchants.
I’m proud to have been part of the journey of modernising the data stack at Zip, and couldn’t have done it without the outstanding effort of the data engineering team (special shout out to Adam Schmidt, Kiran Reddy, Nick Heron, Thomas Gao, Alex Crawford, Frank Fernando and Sivakumar Ramaswamy), data analytics team, and data science teams. Additionally, this would not have been possible without the data leadership at Zip (Tal Bergman — Director of Data and Analytics and Adam Finger — CIO) recognising the opportunity in this space and backing our journey into it.