Designing and building a reasonable & cost-effective modern data stack

Anas El Khaloui
Published in hipay-tech
4 min read · Oct 25, 2022

18 months ago, we decided to get ourselves some new data tools in order to:

  • Unlock new use cases
  • Minimize maintenance costs and increase code and data quality
  • Migrate to the cloud and simplify our tooling

“Mistakes made very early in an infrastructure project have a tremendous negative impact later on.”

As everyone is starting to notice, the modern data stack (and pretty much any cloud infrastructure, really) is awesome, but it can get costly if built without proper planning and care: cloud bills, licensing and other costs can grow fast and substantially increase a company’s operational expenses. That is an important point to consider in the current macroeconomic situation.

Source: Synergy Research Group

At HiPay, we are migrating our tech to the cloud using a FinOps-by-design approach (spoiler: everyone should).
Mistakes made very early in an infrastructure project have a tremendous negative impact later on.

Obviously, we decided not to go with a lift-and-shift approach.

A steam locomotive pulling horse carriages, an illustration of the “lift and shift” approach, a.k.a. putting the old stuff into the new thing (credits to Paul Graham)

Here’s a summary of the choices we made and the reasons behind them:

  • Cloud data warehouse: fully managed and serverless, BigQuery is the cornerstone of our data stack. Easier to manage than Redshift, cheaper than Snowflake and super versatile, it just works great! It also gets new features very often and integrates perfectly with GCP’s data services. BigQuery is one of the main reasons Google Cloud Platform is becoming the leading cloud for analytics.
  • Orchestration service: Airflow has been running at HiPay for close to 2 years now, and it has been adopted by the whole data community. It’s used to move data around and manage data pipelines. It’s pretty much the control panel of our data operations, showing all running pipelines and providing observability and alerting. We wrote a little series of stories on the topic.
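To make that concrete, here is a minimal sketch of what one of these DAGs can look like. The DAG name, schedule and bash commands are illustrative placeholders, not our actual pipelines: in practice each task delegates the work to an external tool and Airflow only orchestrates.

```python
# Minimal, illustrative Airflow DAG. Names, schedule and commands are
# placeholders: Airflow only orchestrates, the work happens elsewhere.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_payments_pipeline",   # hypothetical pipeline name
    schedule_interval="@daily",
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    extract_load = BashOperator(
        task_id="extract_load",
        bash_command="echo 'trigger the ingestion job here'",
    )
    transform = BashOperator(
        task_id="transform",
        bash_command="echo 'trigger the dbt transformation here'",
    )
    # Load first, then transform (ELT).
    extract_load >> transform
```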

“Make each program do one thing well” — Unix Philosophy

  • Data integration: Meltano is our final choice, after benchmarking it against Airbyte. Meltano is “pure” in many aspects and very elegant in its implementation. It also matches the philosophy of our existing infrastructure.
    –It’s CLI-first, and that makes it CI-compatible
    –It’s all code, and that makes everything version controllable
    –It’s just 100% Python, our main programming language
    –It doesn’t try to handle orchestration by itself; Airflow is there precisely to centralize data-related orchestration!
    –It doesn’t need a bunch of VMs or a Kubernetes cluster running 24/7 and sitting idle 90% of the time (those are pricey!). You call it in a Docker container when you need to batch-sync two data sources; it does the job and shuts down until you call it again (see the sketch after this list).
    –It’s free to use and open source, and you can develop your own connector quite easily in Python, thanks to a standard data exchange format
    –It’s super modular and composable, and has a large number of connectors. It’s based on Singer, just like Stitch from Talend
    We looked into Fivetran, but the pricing model, proprietary tech and lack of customizability were a no-go for the kind of data we manage (millions of rows a day…)
  • Data transformation: dbt, obviously. It’s dockerized and called by Airflow through GCP’s Cloud Run (a serverless service that runs Docker containers through API calls). The transformations themselves run on BigQuery.
  • Data discovery: Amundsen is what we use today; it’s pretty good, but we’re looking at DataHub’s much richer features and scope. We wrote about that too.
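To illustrate the “no always-on infrastructure” point above, here is a hedged sketch of one way to launch a Meltano sync from Airflow in a short-lived Docker container. The image tag, tap/target names, host value and task name are assumptions rather than our actual configuration, and exact operator parameters vary with the Docker provider version.

```python
# Illustrative only: runs a one-off Meltano sync in a container that exits
# when the job is done. Image, tap/target and environment values below are
# placeholders, not our actual setup.
from airflow.providers.docker.operators.docker import DockerOperator

meltano_sync = DockerOperator(
    task_id="meltano_sync_orders",                 # hypothetical task name
    image="meltano/meltano:latest",                # or a project-specific image
    command="run tap-postgres target-bigquery",    # Meltano CLI handles extract + load
    environment={"TAP_POSTGRES_HOST": "orders-db.internal"},  # placeholder host
    auto_remove=True,                              # container disappears once the sync ends
)
```

The dockerized dbt job is triggered in the same spirit, except that its container runs on Cloud Run rather than on a Docker host, as described above.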

“No computations are made on the Airflow machines in order to decouple orchestration logic and computing”

  • Compute: No computations are made on the Airflow machines, in order to decouple orchestration logic from computing. We use:
    –BigQuery to handle SQL data transformation workloads,
    –Dataflow (Apache Beam under the hood) for bigger and more complex data tasks (like training thousands of ML models or parsing huge amounts of logs). We use Dataflow to parallelize the execution of arbitrary Python code, without the limitations of Spark. Dataflow has automatic vertical and horizontal scaling, and you can run the same code in batch and streaming mode, which is quite impressive, really (a minimal sketch follows this list).
    –Cloud Run for pretty much everything else
  • Data visualisation: We went with Looker, but we use its data modeling language (LookML) as little as possible, and only for last-mile work, in order to avoid depending on a BI tool for data transformation.
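Here is the minimal Beam sketch mentioned in the Dataflow point above: arbitrary Python logic applied element by element, where the runner choice decides whether the same code executes locally or as an autoscaled Dataflow job. The bucket paths, project, region and log format are placeholders.

```python
# Minimal, illustrative Apache Beam pipeline: the same code runs locally
# (DirectRunner) or on Dataflow (DataflowRunner) just by changing options.
# Paths, project, region and the log format are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_log_line(line: str) -> dict:
    """Arbitrary Python logic, parallelized by Beam across workers."""
    method, path, status = line.split(" ")[:3]
    return {"method": method, "path": path, "status": int(status)}


options = PipelineOptions(
    runner="DataflowRunner",             # use "DirectRunner" to test locally
    project="my-gcp-project",            # placeholder
    region="europe-west1",               # placeholder
    temp_location="gs://my-bucket/tmp",  # placeholder
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadLogs" >> beam.io.ReadFromText("gs://my-bucket/logs/*.log")
        | "ParseLines" >> beam.Map(parse_log_line)
        | "KeepErrors" >> beam.Filter(lambda row: row["status"] >= 500)
        | "WriteResults" >> beam.io.WriteToText("gs://my-bucket/output/errors")
    )
```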

This stack is based on a combination of managed cloud services and open-source products, with no SaaS. It gives us full control and creates no dependency on proprietary software or formats. Billing is also way more predictable and simple (one single bill). The stack is built with minimalism and simplicity in mind, following the usual software engineering best practices.

All this is complemented by an automated, interactive dashboard for careful, company-wide cloud cost tracking. It displays cloud billing split by cloud service, team, project, etc.

Simple and minimal (French annotations).
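For a sense of how such a dashboard can be fed, here is a hedged sketch that reads GCP’s standard billing export in BigQuery and aggregates cost per service and project. The project, dataset and table names are placeholders, and a per-team split would typically rely on resource labels, which we omit here.

```python
# Illustrative sketch: aggregates cost per invoice month, service and project
# from GCP's billing export to BigQuery. Dataset/table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project

query = """
SELECT
  invoice.month        AS invoice_month,
  service.description  AS service,
  project.id           AS project_id,
  ROUND(SUM(cost), 2)  AS total_cost
FROM `my-gcp-project.billing.gcp_billing_export_v1_XXXXXX`  -- placeholder table
GROUP BY invoice_month, service, project_id
ORDER BY total_cost DESC
"""

for row in client.query(query).result():
    print(row.invoice_month, row.service, row.project_id, row.total_cost)
```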

This was a quick rundown of our thought and decision process, with an overview of the stack we’re building. We might go deeper in future posts.

Give us a little 👏 if you found this useful, or leave a comment if you want more focus put on specific aspects in the future 😉


Anas El Khaloui, Data Science Manager ~ anaselk.com
I like understanding how stuff works, among other things : )