The startup data stack starter pack (2020)
I advise a lot of people on how to build out their data stack, from tiny startups to enterprise companies that are moving to the cloud or away from legacy solutions. There are many choices out there, and navigating them all can be tricky. Here's a breakdown of your options, their trade-offs and pricing, some points to consider when making your decision, and some personal thoughts on the options.
My background
I’m CTO and co-founder at Dataform. I was previously an engineer at Google, where I spent most of my six years building big data pipelines with internal tools similar to what is now Apache Beam. Dataform is a data modelling platform for cloud data warehouses, and while it's only one small part of the overall data stack, it's often the glue that ties many things together. As a result, we spend a lot of time talking about overall data architecture with customers and prospective clients.
Product recommendation methodology
It’s impossible for me to give a completely fair trial to every product in this space. In general, I’ve chosen to highlight products that:
- Have generally high adoption and awareness amongst startups
- That we generally hear our customers speak highly of
- Fit into the ELT model of a data stack
- Innovative new products that may not tick the above boxes, but that I personally believe are worth a mention
Where I have significant experience with a product, I’ll let you know and provide more detail on why. Similarly in one or two cases I’ve shared my reasons for not recommending them.
Overview
There is a prevailing model of a data stack that we consistently see the world moving toward, which is probably best summed up by this diagram. This is an ELT (extract, load, transform) architecture, as opposed to a more traditional ETL architecture, and it can support companies of all sizes (perhaps with the exception of extremely large enterprises).
Event data collection
How do you collect event data from across all of your different applications (web, mobile, backend services) and send it to other systems or your data warehouse?
Conceptually straightforward, so not much to say here! Event based analytics is usually the easiest place to start and most off the shelf solutions are built around this.
Tracking everything that you want to use for analytics as events avoids needing to join in other data sources at analysis time, and lends itself well to product analytics, where the ordering of events matters.
Data integration
How do you move data between databases and services? There is some overlap with collection here. Typically you need to move data between various places such as:
- SaaS services > Data warehouse
- Production DBs > Data warehouse
- Event collection > Data warehouse / SaaS tools / CRMs
- Data warehouse > SaaS tools / CRMs
For the rest of the article we’ll consider these as two different data integration problems:
- Data integration to the warehouse
- Data integration to other SaaS products
Data warehouses
Where you move all your data to so you can query it together.
A lot about data warehousing has changed over the last 10 years; data warehouses now scale to unprecedented levels. Before Snowflake and BigQuery, organizations with truly massive data would have avoided them due to limited scale and instead opted for systems such as Apache Spark, Dataflow, or Hadoop MapReduce.
Warehouses and SQL have many benefits, and the scalability limits are (mostly) gone. Additionally, with the rise of engineering-inspired data modelling tools (such as Dataform), it's possible to manipulate data via SQL in a well tested, reproducible way.
We’ve written about this change if you’d like more information on why we think the shift towards SQL based warehousing is the right one and how it can help you move quickly, especially as a startup!
Data modelling
How do you actually transform data from many different sources into a set of clean, well tested data sets?
ELT introduces a new problem: you end up with a data warehouse full of messy datasets from your newly set up data integration tools and no idea how to use them. This is where data modelling comes in, and if you are building a stack with a data warehouse at the center, it needs to be addressed.
Data visualization & analytics
Once you sort out all of the above, how do you actually use that data to answer business questions or do advanced analytics?
For any company, and particularly for startups, understanding how your users use your products, how much time they spend in your app, and what your signup and activation rates look like is obviously important.
I’ll mostly cover the business intelligence and product analytics side of this, and avoid more advanced ML and data science applications as this usually comes afterward.
Do you need a data warehouse?
Once you hit a certain level of complexity, or need complete control over how you join and mutate data before sending it to other systems, you probably want to move to a model where the data warehouse becomes your source of truth for business data. This gives you the most power: once all your data is there, you can do pretty much anything with it.
For very early stage startups, I recommend that you avoid this initially. Off the shelf product analytics tools will provide you with the insights you need without the extra work during the early stages.
In this model, you typically require just two components:
- Data collection (events)
- Visualization / analytics
A concrete example of this would be something like:
Segment > Mixpanel / Amplitude / Heap / Google Analytics
Using a product like Segment gives you the flexibility to move that data to your warehouse in the future, if not immediately.
As this is probably something you’ll need to do at some point, I’d recommend using something that can do this from day one; moving from an all-in-one tool such as Google Analytics to something custom built around a warehouse can be difficult or expensive (Google will charge you a lot of money to move your raw GA data into BigQuery).
Data collection
Data collection products allow you to track events from various apps, and capture user activity, pageviews, clicks, sessions etc. This is not about collecting data from e.g. your production DB (see data integration).
Segment
The market leader, with some good out-of-the-box data integration solutions.
Segment is great, and we use it ourselves. It's more than just data collection: their somewhat open-source analytics.js provides unified APIs for tracking events in pretty much any system.
Events from Segment can be sent to your warehouse, but can also be sent straight to other systems, for example Google/Twitter/Facebook/Quora Ads and most major CRMs.
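For a sense of what tracking looks like, here's a minimal, hedged sketch of server-side tracking with Segment's analytics-python library (the client-side analytics.js API is similar). The write key, user IDs, event names and properties below are placeholders, not real values.

```python
# Minimal sketch of server-side event tracking with Segment's
# analytics-python library. All IDs, keys and event names are placeholders.
import analytics

analytics.write_key = "YOUR_SEGMENT_WRITE_KEY"  # placeholder

# Tie events to a known user...
analytics.identify("user_123", {
    "email": "jane@example.com",
    "plan": "free",
})

# ...then track the actions they take. Segment fans these events out to
# your warehouse and any connected destinations (Mixpanel, ads tools, CRMs).
analytics.track("user_123", "Signed Up", {
    "source": "landing_page",
    "experiment_variant": "b",
})

# Events are sent asynchronously; flush before the process exits.
analytics.flush()
```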
Segment can get really expensive really quickly, particularly for B2C companies with high numbers of monthly tracked users. However, as their core APIs are open-source, it’s possible to migrate to your own infrastructure.
RudderStack is definitely worth mentioning here: an open-source, host-it-yourself alternative to Segment that uses the same open-source APIs. When your Segment costs start to exceed an engineering salary, that's probably the time to consider such alternatives.
Snowplow
Event data collection done right, and it’s open-source.
Snowplow excels at event data collection, period. It lacks Segment's out-of-the-box data integration solutions, but arguably has a much richer feature set when it comes to actual event tracking, such as validation of event schemas.
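To illustrate the idea behind schema validation (using the generic Python jsonschema library as a stand-in, not Snowplow's own Iglu tooling), a malformed event can be rejected before it ever reaches your warehouse. The schema and event below are made-up examples.

```python
# Illustration of event schema validation using the generic jsonschema
# library. The schema and event are invented for this example.
from jsonschema import validate, ValidationError

signup_schema = {
    "type": "object",
    "properties": {
        "user_id": {"type": "string"},
        "plan": {"type": "string", "enum": ["free", "pro", "enterprise"]},
        "referrer": {"type": "string"},
    },
    "required": ["user_id", "plan"],
    "additionalProperties": False,
}

event = {"user_id": "user_123", "plan": "pro"}

try:
    validate(instance=event, schema=signup_schema)
    print("event is valid, safe to load")
except ValidationError as err:
    # Bad events get rejected (or routed to a dead-letter queue)
    # instead of silently polluting your warehouse.
    print(f"rejected event: {err.message}")
```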
It’s open-source, so you can run and manage it yourself if you want to.
Snowplow won’t help you push data to your CRM or other SaaS products, but there are other options here we discuss below.
All in one solutions
Segment and Snowplow are primarily designed for data collection. There are other tools that will help you collect data too, but they are part of what I’ll call “all in one” data analytics packages. I won’t cover these here, but have mentioned them below as part of Data visualization and analytics.
Data integration to the warehouse
Data integration products allow you to move data from one source to another. This section covers products that help you move data to your warehouse.
Some of these products also act as data transformation tools (traditional ETL). We don’t recommend this approach, preferring a more software engineering (SQL / code based) approach that will scale as you grow out your data team (see data modelling). As a result, our recommendations are for data integration products that are designed with an ELT model in mind.
These products move data from other data sources such as CRMs, Stripe, most popular databases (Mongo, MySQL, Postgres etc) into your warehouse, so you have everything in one place and can join it all together and perform advanced analytics queries on the results.
Stitch
The best option for early startups, teams who want to write their own integrations, or those who value open-source.
Recently acquired by Talend, though I've seen little impact from that so far — for better or worse. We use them ourselves, and they serve the majority of our integration needs very well.
Simple self-service onboarding and reasonable usage-based pricing. The core is open-source, so you can write your own Singer taps and run them yourself.
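As a rough idea of what writing your own tap involves, here's a minimal sketch using the singer-python library. The "users" stream and its fields are invented for illustration; a real tap would pull records from an API or database and handle incremental state properly.

```python
# Minimal sketch of a custom Singer tap using the singer-python library.
# The stream name, schema and records are made-up examples.
import singer

schema = {
    "properties": {
        "id": {"type": "integer"},
        "email": {"type": "string"},
        "created_at": {"type": "string", "format": "date-time"},
    }
}

# Declare the stream and its primary key...
singer.write_schema("users", schema, ["id"])

# ...emit records (in a real tap these would come from an API or DB)...
singer.write_records("users", [
    {"id": 1, "email": "jane@example.com", "created_at": "2020-01-01T00:00:00Z"},
    {"id": 2, "email": "joe@example.com", "created_at": "2020-01-02T00:00:00Z"},
])

# ...and persist state so the next run can pick up incrementally.
singer.write_state({"users": {"last_created_at": "2020-01-02T00:00:00Z"}})
```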
Fivetran
The best option for those who are willing to pay a little more, or with certain data sources.
Fivetran historically had a more enterprise sales model; they recently changed to variable/volume-based pricing, will build adapters for you if you pay them, and are moving toward a self-service model. My understanding is that Fivetran still nets out as a bit more expensive than Stitch, but they certainly build extremely high quality integrations.
Fivetran makes considerable efforts to normalize data coming from source systems into a more friendly format, whereas Stitch integrations are arguably a bit less intelligent.
Data integration to SaaS products
There is another aspect of data integration: how do you get data to your SaaS services rather than from them? For example, showing recent activity or orders in your CRM to help your sales or support teams.
As mentioned above, Segment (and RudderStack) have support for this. Segment also recently added support for transforming data before sending it, but this is only available on some customer tiers and is somewhat limited in what it can do.
Typically I’ve seen the majority of our customers (including ourselves) building custom solutions here, using products like Zapier, AWS Lambda, Google Cloud Functions, or Pub/Sub-like setups where data can be transformed and sent elsewhere, either from sources directly or via the warehouse.
We’ve written about how we do this ourselves in this blog post — sending data from BigQuery to Intercom — and this general approach could be applied to most destinations, or built with other tools like Zapier too.
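As a rough illustration (not our actual implementation), a warehouse-to-SaaS sync can be as simple as a small scheduled job that reads a modelled table and posts rows to the destination's HTTP API. The table, query and endpoint below are placeholders.

```python
# Hedged sketch of a small warehouse-to-SaaS sync job, e.g. run as a
# scheduled Cloud Function. Table names and the destination URL are
# illustrative placeholders.
import requests
from google.cloud import bigquery

client = bigquery.Client()

# Read a modelled table from the warehouse...
rows = client.query("""
    SELECT user_id, email, seven_day_active_days
    FROM analytics.user_engagement   -- placeholder table
""").result()

# ...and push each row to the destination's HTTP API.
for row in rows:
    requests.post(
        "https://api.example-crm.com/contacts",  # placeholder endpoint
        json={
            "external_id": row["user_id"],
            "email": row["email"],
            "custom_attributes": {"active_days_7d": row["seven_day_active_days"]},
        },
        timeout=10,
    )
```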
Census
The only dedicated warehouse-to-SaaS integration tool (that I know of) on the market.
While we haven’t used it ourselves, Census has a dedicated solution for this which fits in well with the rest of the ELT architecture proposed in this post. It’s definitely worth checking out if you are thinking of transitioning from what e.g. Segment provides out of the box to something more custom.
Data warehouses
There’s a lot to discuss about the options here, but right now there are really only two market leaders that I would recommend. I’ve worked directly with all of the options below myself, as well as speaking to many customers who use them, so I've provided slightly stronger opinions on which ones I think are best.
When it comes to warehouse pricing, what seems to matter in practice is the cost of compute, not storage. Storage is generally cheap and not the first thing to start hurting.
BigQuery
The best option for early stage startups, or enterprises that are willing to adopt Google Cloud, are OK with a more self-service experience, and don't have custom requirements around security or e.g. running on premise.
Pricing is pay-as-you-go; for startups it will likely be a while before you incur any cost on BigQuery, thanks to a free tier that allows you to process up to 1TB/month at no cost.
In my opinion, having written a lot of SQL for all warehouses here, BigQuery has the best SQL experience. BigQuery’s standard SQL is elegant and powerful, and they are rolling out improvements continuously.
Unprecedented, on demand scale. BigQuery scales extremely well, and with products such as BI Engine, can provide blazing fast query performance.
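If you're curious how the on-demand pricing plays out in practice, BigQuery can estimate a query's cost before you run it. Below is a minimal sketch using the google-cloud-bigquery Python client with a dry run; the dataset and table names are placeholders.

```python
# Sanity-check BigQuery's on-demand pricing: a dry run reports how many
# bytes a query would scan without actually running it. Table names are
# placeholders.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT user_id, event_name FROM analytics.events "
    "WHERE DATE(timestamp) = '2020-05-01'",
    job_config=job_config,
)

gb_scanned = job.total_bytes_processed / 1e9
print(f"This query would scan ~{gb_scanned:.2f} GB")
# On-demand pricing is charged per TB scanned, with the first 1 TB/month free.
```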
Snowflake
The best choice for enterprises with custom requirements such as SSO or on-premise deployment, or who are tied in to AWS/Azure. Also a good option for startups thanks to a partially pay-as-you-go, on-demand pricing model.
Snowflake separates storage and compute, like BigQuery, meaning it has the capacity for unbounded enterprise scale unlike e.g. Redshift.
The pricing model is hybrid pay-as-you-go: you pay for resources per minute, but clusters can be automatically suspended when inactive.
Supports semi-structured JSON data (like BigQuery) where e.g. Redshift does not, and the SQL experience is generally nice, with a couple of quirks to get used to.
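To make the "pay only while running" point concrete, here's a hedged sketch (using the snowflake-connector-python library, with placeholder credentials and warehouse name) of creating a virtual warehouse that suspends itself after a minute of inactivity and resumes automatically on the next query.

```python
# Hedged sketch of configuring Snowflake's auto-suspend behaviour.
# Connection parameters and the warehouse name are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account",    # placeholder
    user="your_user",          # placeholder
    password="your_password",  # placeholder
)

conn.cursor().execute("""
    CREATE WAREHOUSE IF NOT EXISTS analytics_wh
      WAREHOUSE_SIZE = 'XSMALL'
      AUTO_SUSPEND = 60        -- seconds of inactivity before suspending
      AUTO_RESUME = TRUE
""")
```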
Redshift
The choice if you are heavily invested in AWS and don’t want to add something new to your stack.
I’ll be frank here, I would not personally recommend Redshift unless you have no other choice, but it’s so popular it needs to be included. Redshift was one of the first “modern” warehouses to become generally available, but it’s built on a foundation that is, in my opinion, fundamentally limiting. While Amazon is working to correct some of these issues, some fixes still appear to be a long way off.
- Limited support for working with unstructured data, and a limited SQL dialect based on Postgres.
- Requires much more management than Snowflake or BigQuery which are more hands off operationally.
- Has scale limits, although some new features coming out address this.
Other mentions
- Azure Synapse SQL Data Warehouse — similar issues to Redshift; great if you already know how to work with SQL Server and its variants, but can get very expensive due to limited on-demand pricing options.
- Presto / Athena — powerful, distributed queries, but not a general purpose warehouse, and as a result hard to operationalize. It’s not easy to create new datasets with Athena.
Data modelling
It’s impossible for me to write an impartial comparison of the options here as this is the product vertical Dataform is in, so I’ll refrain from doing so. However, as this is a fairly new part of the stack that arises from the shift to ELT, I’ll explain a bit more about what it is and why we think it’s important.
Data modelling is what your data team probably spends 50% of their time doing: turning your raw data into reliable, tested, accurate and up to date assets that can power your company's analytics.
When data lands in your warehouse it’s usually a bit of a mess — you will have hundreds of source tables with different schema structures, data formats and types, primary keys, and so on. Writing a query to join this all together is tricky, especially if you have to do it every time you want to answer a question.
When data modelling is done right, you end up with a clear, well defined set of tables that encapsulate all of your business logic in a clean, well tested schema that can be consumed by visualization tools or sent to other applications.
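To make that concrete, here's a hedged illustration of a single modelled table: raw source tables joined and cleaned into one well defined table for the rest of the stack to consume. The dataset, table and column names are made up, and the SQL is run here via the BigQuery Python client, though a dedicated modelling tool (such as Dataform) would typically own this step, along with tests and scheduling.

```python
# Hedged illustration of a modelled table: raw sources joined into one
# clean table. All dataset/table/column names are invented examples.
from google.cloud import bigquery

client = bigquery.Client()

client.query("""
    CREATE OR REPLACE TABLE analytics.users AS
    SELECT
      u.id              AS user_id,
      LOWER(u.email)    AS email,
      c.plan            AS plan,
      COUNT(e.event_id) AS events_last_30d
    FROM raw_app_db.users u
    LEFT JOIN raw_stripe.customers c ON c.user_id = u.id
    LEFT JOIN raw_segment.tracks   e ON e.user_id = u.id
      AND e.timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    GROUP BY 1, 2, 3
""").result()
```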
Data visualization and analytics
There are a lot of options here too, and your mileage may vary. I’ve tried to summarize the position each product occupies in the space where I can, but I'm generally less able to provide strong opinions here as there are so many.
For our startup customers building a stack around a data warehouse, we probably mostly see: Looker, Metabase, Google Data Studio, Chartio, Tableau.
Despite the number of options, it's fairly easy to experiment and switch between them, particularly if you maintain a good semantic data modelling layer in your warehouse or use tools like Segment for the out-of-the-box solutions.
Most of these products fit into a few different categories:
- Chart builders — select some dimensions, choose a chart type, customize the visualizations, put them in some dashboards (where they will eventually break, go out of date, and never be updated).
- Self service BI solutions — tell these tools about how your data is structured and how to interpret it, and they’ll try to make it easy for anyone to quickly answer questions.
- Out of the box product analytics — generally these tools run on event data, have an opinionated schema and aren’t easily customized or generic, but they do what they do well and are generally self-service.
Looker
The (expensive) market leader. Best in class self-service, fully customizable BI.
Looker users seem to have nothing but praise for the platform. It stands out because of a few things:
- It’s designed to help you deliver a self-service portal to your entire company, enabling anyone to answer business questions.
- While you can build charts in Looker, that’s not what you’re supposed to do. You teach Looker how to understand your data, and it makes answering questions a breeze.
- It adopts engineering best practices such as version control. Your data team can collaborate using git based workflows, allowing you to scale to even hundreds of analysts.
Looker requires an investment of both time and money, but what you get out at the end is something few other solutions provide.
Metabase
The open-source self service BI leader.
Adopts the Looker concept of modelling data to make answering questions easy, though it doesn't adopt a git-based workflow. It's open source, and you'll need to run it yourself on your own infrastructure. It seems to be a favourite amongst engineers.
If you want a Looker-like experience but the price tag is too much, then this is probably your next best bet.
Like Looker, makes it easy for non-SQL users to answer questions without relying on the data team.
Data Studio
A powerful and free chart builder.
Primarily a chart builder, but you can make it work for self-service dashboards up to a point. While it's worth a mention, it's unlikely to serve as your primary BI portal as your team grows, but it's a great place to start if you are already in the Google stack.
Tableau
The chart builder incumbent, extremely powerful, loved by many.
Tableau has limited data modelling capabilities, but it is extremely powerful and able to build a huge range of visualizations, dashboards, and so on. For better or for worse, you can do pretty much anything with it.
If you don’t have existing Tableau experience then this probably isn’t the place to start; consider Metabase, Looker, or Chartio instead.
Chartio
Reasonably priced, self service BI and chart building.
Chartio makes it easy to build SQL queries without actually writing SQL. This makes it great for putting data in the hands of your whole team, and while it has some data modelling capabilities, it's not quite in the same camp as Looker.
You can build complex SQL pipelines (multiple joins etc.) through a UI, which can be great for those who aren’t as comfortable writing SQL themselves.
Redash
Open source chart builder, recently acquired by Databricks.
One of the first tools we used ourselves. Redash is open source and works with modern warehouses, but it's primarily a chart builder; while it may serve you well at first, you'll probably spend a lot of time fixing queries unless you invest heavily in a data modelling tool too.
Heap / Mixpanel / Amplitude / Indicative
Out of the box event based customer and product analytics.
I’ve put all these together as they all do a similar thing. They each have pros and cons but ultimately exist to solve the same problem: send them events, and you'll be able to quickly answer questions about your users' behaviour and what actions they took in what order, and with some of these tools run experimentation, optimization or even personalization.
Think Google Analytics but on steroids, and with a focus on user behaviour and engagement.
If you want to start off warehouse-less, these are probably your best options to begin with, but consider tagging with Segment from day one.
Indicative is a little different to the others, in that it's primarily designed to be pointed at your data warehouse. If you want to transform event data in your warehouse and then analyse it in a UI, this is where Indicative shines, and we hear great things about it, but it's probably not the place to start if you want an all-in-one, out-of-the-box solution.
Google Analytics
The out of the box web and app analytics all-in-one incumbent, basically free.
Despite having more advanced, custom solutions in place now, we still send data to GA because there are some questions that are just a lot easier to answer with it. Acquisition for example — data is enriched with geo locations, UTM tags are automatically grouped and categorized.
The fundamental challenge with GA is that it doesn’t give you access to raw data. As we’ve said above, consider tagging with Segment instead to make sure you retain the raw event data.
The Microsoft / Azure stack
Worth a mention, Microsoft has their own entire set of solutions for most steps we’ve described above.
We see it less with startups, and it also doesn’t really follow the ELT model so I haven’t included it in the above. If you’re just starting out and want to get set up quickly, this probably isn’t the place to start.
You’ll probably only want to go down this route if you’re already heavily invested in Azure and Microsoft products as an organization.
As part of their offering, they have the following tools:
- Azure Data Factory — highly customizable, ETL-like workflows for data integration
- Azure Synapse SQL Data Warehouse — a more scalable, SQL Server-based warehouse with some support for on-demand pricing
- Power BI — a powerful data analytics product with some support for data modelling (though no Git-based workflows), and some support for warehouses such as BigQuery and Snowflake
Conclusion
Hopefully this overview of the product options and the core parts of the stack helps you and your team make a decision when setting up a data stack that can scale as your data team, and the complexity of the data problems you face, grow.
I believe that, regardless of the product options you choose above, following the ELT architecture outlined in this post should ensure you are able to cope with new requirements as your stack evolves, without hitting any major roadblocks.
Originally published at https://dataform.co.