Building an analytics stack from scratch

Kaustav Mitra · Published in DataSeries · Jul 17, 2019

It's been a while since I last wrote, so to kick things off again I wanted to talk about how we at Octopus went about building our analytics stack from scratch. If you are in a startup or a mid-sized organisation, you will probably have some of the same questions we had when we started building our stack, and maybe this post will help you define where to start.

Octopus is not a new organisation: we have been around for about 19 years and have grown to roughly 900 people. We primarily sold offline, tax-efficient investment products, but from 2016 onwards we started building FinTech products around P2P investments and cash management that relied on a digital, online user experience. Analytics, and being data-driven, therefore became a new problem to solve. We are an entrepreneurial org (in this company you pretty much craft your role, your team and your future like a mini-startup), and whenever we build something we hustle: we act like founders bootstrapped for cash, then as the function grows we double down, or if it doesn't we scrap it, just like a normal startup in a normal world.

So, when we started building our stack, we wanted to know who our customers were and how we could find a market fit for our analytics while keeping an eye on cost, or burn rate: don't spend too much money buying software, buy what is useful. Ultimately we were taking a data-as-a-product view.

We realised that our main users were going to be in product, marketing, customer success, operations and finance, and that they needed to be able to make the best possible data-informed decisions.

To do that, we had to get two distinct jobs done:

  • capture and consume data into our stack reliably
  • surface that data to the end user with the least friction

1 — The Start

Figure 1 — analytics stack — version 1

When we started, our analytics stack looked pretty much like Figure 1. We had web apps that connected to our core transaction databases on Google Cloud SQL (MySQL), we had Google Analytics on our web app, and we had a very basic implementation of Segment that nobody was using. All data analysis was done in Redash, which required advanced SQL knowledge for even basic queries, and that meant the average user relied on developers to extract data into spreadsheets for any analysis.

2 — Laying the foundations

The problem we were trying to solve quickly became driving adoption of analytics within the team. We were a young team and people were hungry to learn, but we needed the right technology foundations in place.

We started two initiatives in parallel:

  • events: we had product databases capturing transactions, but we were not capturing any events to build a complete picture of our users, so we started building and tracking our events through Segment.
  • visualisation: Redash was not helping business users become data-driven, so we looked to replace our visualisation layer so that internal users could self-serve.

Segment is a customer data platform (CDP): a data connectivity layer between web and mobile apps, which act as Segment sources, and many 3rd-party apps, which act as Segment destinations. Events are written using the Segment API within the code base of our apps, and these events then flow into the connected destinations, including our analytics data warehouse in BigQuery.

Within a few months we had most of our front-end and back-end user actions instrumented in Segment, triggering events when users took certain actions, which Segment then loaded into BigQuery. This meant that the capture and loading of event data into our analytics database was fully automated.
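
To make that concrete, here is roughly what a server-side event call looks like with Segment's Python library; the event name and properties below are invented for illustration, and the write key comes from the relevant Segment source.

```python
# pip install analytics-python
import analytics

analytics.write_key = "YOUR_SEGMENT_WRITE_KEY"  # per-source key from Segment

# Hypothetical event for illustration. Segment fans this out to every
# connected destination, including the BigQuery warehouse.
analytics.track(
    user_id="user_123",
    event="Investment Completed",
    properties={"product": "cash_account", "amount": 1000, "currency": "GBP"},
)

analytics.flush()  # the client batches events, so flush before the process exits
```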

Figure 2 — Laying the foundations

We released our mobile apps around the same time, implemented Segment events within them in the same way, and connected the mobile apps to 3rd-party destinations.

In parallel to figuring out the right 3rd-party tools for the different use cases, we upgraded our visualisation layer from Redash to ChartIO. There are tons of visualisation applications out there, but they mostly fall into one of two buckets: built for exploration (typically called visualisation apps) or built for consumption (typically called BI tools). I wanted something that was built for exploration, cheap to buy, and able to connect to multiple databases.

This left us with a few choices: ChartIO, Mode, Looker, Tableau, Superset, Periscope etc.

I won't go into the relative merits and demerits of each of the above, which is a post in itself. But our team size, our budget, the ability to start with a small number of licences, and the fact that our team felt comfortable with the interface during evaluation made us choose ChartIO.

Around the same time, we brought in a few 3rd-party tools for our developer, product management, and digital marketing teams:

Amplitude — For product managers to understand user behaviour, interaction funnels, user segments and user paths from events. Plus, Amplitude has a very generous free tier that we could readily hook up to Segment, empowering the product teams.

Firebase — On mobile, Google's Firebase provides a whole suite of features like analysing events, building custom user segments, and targeting transactional push notifications. Having Segment events meant we could start seeing the data in Firebase straight away with minimal dev time.

Advertising Pixels — We hooked up our events to Google AdWords, Facebook Ads and LinkedIn Ads, which took about 5 minutes: set up the ad accounts, grab your API key, and add them as Segment destinations. The digital marketing team could now build custom audiences for re-marketing and re-targeting, and understand which campaigns were performing, as we were able to attribute revenue-generating events to ad campaigns.

Google Analytics — We kept GA purely for understanding audience/visitor trends and demographics.

Looking back now, I would say the three pillars of our analytics stack became:

  1. BigQuery — a reliable, cheap data warehouse that scales to petabytes, has a clean Python API (see the short sketch after this list), and integrates with Google Sheets.
  2. Segment — a rock-solid foundation to manage our events, hook up 3rd-party apps for deeper analysis, integrate with our analytics data warehouse, and provide GDPR controls for privacy management.
  3. ChartIO — a visualisation layer that enables our whole team to be data-driven, with a smooth learning curve into SQL and the ability to build complex analyses using a visual tool and simple SQL.
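
To give a flavour of that Python API, here is a minimal sketch of querying event data in BigQuery with Google's client library; the project, dataset and table names are placeholders rather than our real ones.

```python
# pip install google-cloud-bigquery
from google.cloud import bigquery

client = bigquery.Client()  # uses application default credentials

# Placeholder table reference; Segment warehouses typically land tracked
# events in a `tracks` table alongside one table per event.
query = """
    SELECT event, COUNT(*) AS event_count
    FROM `my-project.segment_data.tracks`
    WHERE DATE(timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
    GROUP BY event
    ORDER BY event_count DESC
"""

for row in client.query(query).result():
    print(row.event, row.event_count)
```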

3 — The stack of today

With the main pillars of our stack in place, we started exploring what to add to meet the needs of the various users within our organisation.

Figure 3 — the stack as of today

Email Marketing — Our customer teams were using up to 3 different email marketing tools and every campaign was extremely manual. We implemented HubSpot as our single platform for email marketing, landing pages, and email automation.

Customer Support — We replaced Zendesk with HelpScout as our customer support tool: it integrated with HubSpot, had much cleaner analytics, let you build a fully functional knowledge base, was up to 70% cheaper than Zendesk, and had a rich and simple API.

Personalisation — Our operations and customer success team came up with the idea of providing helpful messages and popups to our users to improve the on-boarding experience and aid activation and retention. After looking at a few products out there, we settled on AppCues. They had a strong roadmap (check out their product-led growth mission here), the necessary privacy approvals in place, and an integration with Segment. We have also invested in Taplytics as our A/B testing platform.

Managed (E)xtraction in ETL — The data we were generating within HubSpot, Google AdWords, Facebook Ads, HelpScout, Stripe etc. had to be brought together into our analytics database. For that we leveraged Stitch, Segment, or our own homegrown solution for data extraction; it is fully automated and scheduled, and it means our team hardly spends any time on boring data extractions.
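
For the homegrown extractions, the pattern is simple: pull rows from a 3rd-party API and load them straight into a raw table in BigQuery. A minimal sketch, with an entirely made-up endpoint and table name, looks something like this:

```python
# pip install requests google-cloud-bigquery
import requests
from google.cloud import bigquery

client = bigquery.Client()

# Made-up endpoint and token for illustration; a real job would paginate
# and pull credentials from a secret store rather than hard-coding them.
response = requests.get(
    "https://api.example-tool.com/v1/tickets",
    headers={"Authorization": "Bearer MY_API_TOKEN"},
)
response.raise_for_status()
rows = response.json()["tickets"]

# Append the raw rows into a staging table in the warehouse.
job = client.load_table_from_json(
    rows,
    "my-project.raw_example_tool.tickets",
    job_config=bigquery.LoadJobConfig(
        autodetect=True,
        write_disposition="WRITE_APPEND",
    ),
)
job.result()  # block until the load job completes
```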

Higher-order metrics — The beauty of doing your job well in the data and analytics space is that you end up having time for more. Because we had empowered the different teams with a set of applications fit for purpose, we could start spending time understanding more complicated metrics of revenue, transactions and users across cohorts. For that we built our own Python application, code-named rogue-one, that calculates metrics from our raw data, runs sentiment analysis, does ML-based support ticket tagging and more, all orchestrated with Airflow DAGs. We are nowhere near done, and we now spend a lot of our time actively developing rogue-one.
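
To give an idea of the shape of rogue-one, here is a stripped-down sketch of an Airflow DAG; the DAG id, task names and function bodies are invented for illustration (and assume Airflow 2.x import paths), and the real pipelines are considerably bigger.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def compute_cohort_metrics():
    # placeholder: read raw data from BigQuery, compute revenue, transaction
    # and user metrics per cohort, and write the results back to the warehouse
    ...


def tag_support_tickets():
    # placeholder: run the ML model that tags incoming support tickets
    ...


with DAG(
    dag_id="rogue_one_daily",  # invented name
    start_date=datetime(2019, 7, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    metrics = PythonOperator(
        task_id="compute_cohort_metrics",
        python_callable=compute_cohort_metrics,
    )
    tagging = PythonOperator(
        task_id="tag_support_tickets",
        python_callable=tag_support_tickets,
    )

    metrics >> tagging  # arbitrary ordering, just to show a task dependency
```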

4 — Next steps

With all the building blocks in place, our analytics stack was scaling from a single small team to almost every part of the organisation. There were hundreds of users on ChartIO, many of them constantly building their own queries and dashboards. We had data flowing into BigQuery, our 3rd-party apps were nicely integrated, and then we realised we had one more problem to solve:

our data was modelled for power users, not for business users. Every time our data models changed, our business users struggled. We needed to find a way to humanise our data for business users and make sure they got consistent answers every time.

Figure 4 — the next steps

To make our analytics stack scale to hundreds of users within our organisation, we are now investing in Looker as our single source of truth for all analytics consumption and visualisation (ChartIO is no longer the right fit for us), and we are powering Looker with consistent data models, cleaned and pre-aggregated using dbt.

Looker has been exceptional at driving adoption of data across hundreds of users through a self-service BI platform built for business users. Also, because it uses LookML, you can build a single, consistent, version-controlled data layer between your users and the underlying data.

To build all of the underlying data models, we are using dbt to:

  • build big, complex denormalised models with pure SQL to minimise our use of PDTs in LookML; we think this will strike the right balance (more on that here),
  • generate documentation and lineage for our data (dbt docs is crazy cool, almost magic!), and
  • run tests on our data before we consume it in our analytics (a minimal sketch of these steps follows this list).
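
Those three bullets map directly onto the standard dbt CLI commands. A minimal sketch of running them in sequence from Python (in reality this sits behind a scheduler rather than a one-off script):

```python
import subprocess

# Build the models, test the data, then regenerate docs and lineage,
# matching the three steps listed above.
for command in (
    ["dbt", "run"],
    ["dbt", "test"],
    ["dbt", "docs", "generate"],
):
    subprocess.run(command, check=True)  # stop the pipeline if a step fails
```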

Conclusion

There is unfortunately no single correct way to build an analytics stack from nothing, and there are plenty of analytics tools on the market that could be used. But depending on which stage of your analytics journey you are at, you will find a particular set of applications that does the job, and those same applications may become obsolete later. Hence, you always need to think about analytics tools that play nicely with your existing stack, have great customer support and don't have a massive lock-in impact. Ask yourself: if the application is no longer fit for purpose in 6 months' time, how easy is it to switch out? What are the contract termination clauses? What is the product roadmap of the application? Don't buy an application that looks nice but is going to create a silo; use something your users are going to love. And keep an eye on the price (don't buy Mixpanel at $800/year when you can get away for free with Amplitude).

In any organisation, the analytics stack should continuously evolve; there is no one-size-fits-all, but there is always a starting point. Hopefully this post helps you define what that starting point could look like.
