How I Have Set Up a Cost-Effective Modern Data Stack for a Charity

A Look Back at a Data for Good Project

Marie Lestavel
21 min read · May 9, 2022
La Porte Bleue is a solidarity grocery based south of Paris

As you read the title, you may be wondering how the words “Modern Data Stack” and “Charity” ended up together. Chances are my non-tech readers are not even familiar with the term “Data Stack”, let alone “Modern Data Stack”. To make sure everyone is on the same page, a Data Stack can be defined as a set of technologies and services that organizations use to store, manage, and access data. If these two terms seem incompatible, it is because, in the collective imagination, a data stack is the territory of high-tech organizations, whereas charities are seen as low-tech ones.

According to the Discover Data Science organization, while 90% of nonprofits collect data, about half do not fully exploit it. Most of the time they lack the know-how to leverage this data but also generally lack the financial resources to allocate some budget to it. Yet, nonprofits can benefit from data: data reports are often the currency for receiving grants from public entities.

In this blog post, I will show you how I helped a charity known as “La Porte Bleue” by setting up a modern data stack that allows them to automate reporting, and thus unlock grants, but also to leverage their data for insights. I will also show you how cost does not have to be a barrier and how, by combining a set of cheap tips & tricks, it is possible to develop a cost-effective solution that does not entail any financial cost.

GitHub repository: https://github.com/Porte-Bleue/mongo-to-postgres-etl

What is La Porte Bleue?

Picture from La Porte Bleue Instagram account

To understand where my project is coming from, I first have to tell you about La Porte Bleue (translates to ‘The Blue Door’ in English), a solidarity grocery based in Malakoff, south of Paris. The purpose of this charity is to collect non-perishable foodstuffs with the aim of distributing them to people in need. Every Saturday, La Porte Bleue helps around 100 families in front of the parish church of Malakoff.

Life Before The Solution: Manual Data Entry on Paper

Paper file used before the implementation of the app for inventory tracking

Whether profit or non-profit, an organization cannot effectively coordinate its operations without implementing robust processes. Like any grocery shop, keeping track of the inventory is crucial for La Porte Bleue to work towards matching supply to demand. On top of that, the organization also has to keep a record of the families it is helping for the purpose of reporting on its activities internally (to board members) and externally (to public entities). Compliance with these reporting requirements is highly important since the payout of subsidies from public entities depends on it.

As a fledgling organization, La Porte Bleue started by putting in place mostly manual processes around inventory management and family registration. These processes were initially carried out on paper, but with growing demand (from about 10 families helped per week initially to more than a hundred as I write this), they became more and more of a headache and a burden to carry out.

The Game Changer: A Web App Digitizing Processes

At this point, the board of the organization came to my brother and asked whether he would be willing to code an app to digitize these processes and run them at scale. Within a few weeks, the web app was shipped and started being widely used across the organization. The back-end of the app naturally connects to a database that stores the data. From that moment on, the web app opened up new possibilities for analytics, since the organization could now leverage this data source to access the now-digitized data.

Architecture diagram of the web app

Data Automation: New Opportunities But Also New Challenges

With the introduction of the web app, data was automatically digitized, putting an end to manual data entry for good. Time was saved and data accuracy improved, since human error was largely removed from the equation. The data could now be extracted from the production database and, more importantly, it was richer than ever, including transactional data about inventory flows and visits as well as personal data about families and family members. Rapidly, there was a growing need among the organization’s various stakeholders to access this data and eventually turn it into insights.

Unfortunately, while data was broadly available, data access was not trivial. Not everyone knows how to interact with a database, and the new system definitely set a technical barrier to the data, posing significant hurdles to technical novices. Only the developer of the app knew how to query the database, and relying on him for every data request was not sustainable. Being a data analyst myself, it was a wonderful opportunity for me to use my data skills for a good cause and contribute in my own way to a project that I had been following closely for a while. So, naturally, I offered my help, and that’s where this project was born.

Needs for Reporting and Insights

Broadly speaking, what the organization needed was analytics. More specifically, there were 2 sets of needs:

  • Reporting: As a solidarity grocery store accountable to the Food Bank (aka Banque Alimentaire), the charity is required to report on its activities to this institution by means of a quarterly report containing key figures and information on the people helped as well as the volume of food distributed. One of the most urgent needs was therefore to put together this report for them by leveraging the analytics data.
  • Insights: In addition, there were general data analytics needs, from monitoring key organizational metrics to analyzing transactional data (inventory flows, visits, etc.) to gain insights from it.
Template of the Food Bank report to be completed quarterly

Non-Profits and Cost Constraints

Inevitably, the cost aspect came into play when I started collecting the requirements for the project. Since the charity has limited cash flow, money is spent carefully, and sticking to the budget is already a struggle for the organization. The instruction was simple: the non-profit should not spend any money on this project. The development of the web app had come with the same requirement, which was successfully met thanks to free cloud-based hosting services like Heroku.

In addition to this, time was also a constraint. Until I came up with a solution, the data remained hard to access, leaving the organization extremely dependent on the developer to query it. Consequently, the sooner, the better for the organization.

The Goal: Build a Cost-Effective Analytics Stack

From the needs and requirements I had collected, my goal was now becoming clearer: to build a cost-effective analytics stack that meets the charity’s reporting needs and helps the organization make more informed decisions.

Why cost-effective? While the project should not entail a financial cost, it would still have a non-negligible time and human capital cost and cannot be considered “free”.

A Minimum Viable Analytics Stack as a Start

Now that the project goal was set, the next question was: how to achieve it?

Ok but first, let's get this straight, what is an analytics stack?

“An analytics stack is an integrated system of applications that collect, combine, analyze, and realize the value of data.” — Source

An analytics stack can range from the simplest set-up to a full-on data stack. However, considering that data access had become a bottleneck across the organization, my idea was to get a bare-bones data solution up and running quickly, to break down those barriers to data access as soon as possible. For now, it was about focusing on what was critical and setting aside everything that was “nice to have”. What could I build relatively quickly and affordably to fix the most pressing pain points?

Discarded Solution #1: Analytics out-of-the-box of MongoDB

At its most basic level, an analytics stack is a link between raw data and business intelligence. In my case, there is a single source of data: the production database. The web app runs on a MongoDB database that handles the production workload but could, in principle, also support an analytics workload on top of it.

Let’s take a quick look at the size of the database:

db.runCommand({ dbStats: 1 })

The database contains about 18k objects, specifically documents, across all collections, and among the 19 collections, only 6 were actually needed for analytics.

show collections
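
For readers who prefer Python to the mongo shell, the same inspection can be done with PyMongo. Below is a minimal sketch; the connection string environment variable and the database name are placeholders, not the actual production values.

```python
import os

from pymongo import MongoClient

# Connect to the production MongoDB instance (connection string from an env var).
client = MongoClient(os.environ["MONGO_URI"])
db = client["porte-bleue"]  # hypothetical database name

# Equivalent of `db.runCommand({ dbStats: 1 })`: object and collection counts.
stats = db.command("dbstats")
print(stats["collections"], "collections /", stats["objects"], "documents")

# Equivalent of `show collections`: list all collection names.
print(db.list_collection_names())
```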

Since the data comes from only one source, containing a small set of data, the initial analytics setup could have been very simple:

  • Simply connecting a business intelligence tool to the production database
  • And serving up the data to stakeholders as reports or visualization
Analytics setup out-of-the-box of the MongoDB database

But here’s the thing: the web app runs on a NoSQL database. While running analytics directly on the MongoDB database would have been possible, both the analytics capabilities and the choice of reporting tools are limited, and I rapidly realized that it was not well suited to my case. In order to meet the reporting needs, the data requires transformation and modeling that would have been very complex to achieve directly on top of the MongoDB database.

In fact, analytics is generally seen as the territory of SQL databases. Not only does SQL make it easy to retrieve, filter, join, and aggregate relational data, but SQL is also my day-to-day playground at work and therefore my preferred language. Using MongoDB was a choice made by the developer of the app based solely on the engineering use case; since the analytics concern only came later, that database was a given for my project. From this point, I started evaluating options to run analytics with SQL despite having a source of unstructured data.

Discarded Solution #2: Data Virtualization with Trino

That’s when I eventually heard about data virtualization and Trino. Data virtualization describes an approach to integrating data from multiple, disparate sources into a holistic, logical view without having to copy or move the data physically. In simple terms, data remains in original sources while users can access and analyze it virtually via a middleware. And what if this single representation of data could be queried in SQL? This is precisely what Trino offers. Trino is a distributed SQL engine for running queries against various data sources. With Trino, you can query data where it lives without having to migrate it to a single location. Trino comes with a MongoDB connector allowing the use of MongoDB collections as tables in Trino. Ultimately, it means being able to query data from a MongoDB database in SQL. Trino can then connect to a Business Intelligence tool to build visual reports and make them accessible across the organization. As an open-source solution, Trino was also free to use.

Trino is a distributed SQL engine for running queries against various data sources

Since Trino seemed well suited to my use case, I gave the tool a small trial. Trino ships with a Docker image that does a lot of the setup necessary for Trino to run. With this, I was able to run SQL queries against the MongoDB database on my local machine with ease. Running Trino locally was easy; deploying it, however, comes with storage and compute costs that are generally not free. On top of this, Trino’s official Docker image is large, making it harder to host for free. I tried to deploy Trino on Heroku, but Heroku’s free tier has a memory limit that Trino exceeded. Trino was probably too heavy for my use case, so I decided to evaluate other options.
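
For illustration, here is roughly what that local trial looked like, queried from Python through the trino client library. This is a sketch that assumes a local Trino server with a MongoDB catalog already configured; the catalog, schema and table names are placeholders.

```python
import trino

# Connect to the Trino server started locally from the official Docker image.
conn = trino.dbapi.connect(
    host="localhost",
    port=8080,
    user="analytics",
    catalog="mongodb",    # hypothetical catalog backed by the MongoDB connector
    schema="porte_bleue", # hypothetical schema, i.e. the MongoDB database
)

# MongoDB collections are exposed as tables, so plain SQL just works.
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM families")
print(cur.fetchall())
```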

My Solution: Plug a Business Intelligence Tool into a Replica Database

Another solution was to set up a replica of my production database. It means copying my MongoDB data into a SQL database to then use the relational version of the data for analytics. While I was iterating on a new possible solution, I was actually moving towards the most common setup for an analytics stack.

An analytics system generally does three basic things:

  1. Collect, consolidate and store data into a central repository of data, usually a data warehouse.
  2. Transform or model the data so that it ends up nice and sparkly and ready for consumption in your data warehouse.
  3. Present and use the data, by extracting, visualizing, or pushing it to the different services or users that need it.

You must be thinking: all this seems very heavy and costly to implement, so what about the initial idea of building something quickly and affordably? Fortunately, as the technology matures, data stacks are becoming both easier and cheaper to set up. Modern data stacks are praised as low-cost platforms that are easy to set up and easy to use. The current model for a modern data stack generally consists of four key components:

  1. Data Ingestion solution
  2. Data Warehouse
  3. Transformation tool
  4. Business Intelligence tool

By moving towards a replica database plugged into a Business Intelligence tool, I was actually laying the groundwork for components 1, 2 & 4 of a modern data stack. Only the data processing (transforming and modeling data) would be performed within the Business Intelligence tool instead of a dedicated transformation tool.

Analytics setup consisting of a Business Intelligence tool plugged into a replica database

Cheap Tip #1: Set up a Free Postgres Database on Heroku

“A data warehouse is a central repository of information that can be analyzed to make more informed decisions.” — Source

Warehousing is essentially about storing the data in a central place. While cloud-based warehouses are incredibly cheap compared to on-premises solutions, they typically offer on-demand pricing models where you pay for what you use, and hence are not entirely free of charge. Since my project involves a small set of data and few analytical use cases, it made sense to start out with a regular SQL database as my data warehouse. With Heroku Postgres, it is very easy to set up a free PostgreSQL database. Even if the free plan has some limitations (an enforced limit of 10,000 rows), it was enough to get started, as the MongoDB database contained fewer than 10k documents (one document roughly equals one row in a relational database) at that time.
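
Once the add-on is provisioned, Heroku exposes the connection string through a DATABASE_URL config var. Here is a small sketch of how it can be consumed from Python; note that recent SQLAlchemy versions expect the postgresql:// scheme rather than the legacy postgres:// one provided by Heroku.

```python
import os

from sqlalchemy import create_engine

# Heroku injects the connection string as the DATABASE_URL config var.
database_url = os.environ["DATABASE_URL"]

# SQLAlchemy 1.4+ no longer accepts the legacy "postgres://" scheme.
database_url = database_url.replace("postgres://", "postgresql://", 1)

engine = create_engine(database_url)
```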

Postgres database Heroku dashboard

ETL process: Low-Code versus Manual ETL

Now that my dedicated analytics database was in place, the next step was to pull in the raw data from the production database. Data collection & integration is usually performed via a standardized process known as ETL.

“ETL is the process by which data is extracted from data sources that are not optimized for analytics, moved to a central host, and optimized for analytics.” — Source

While traditional ETL requires certain coding expertise, the rise of ingestion-as-a-service tools has made data collection more accessible to people with low technical expertise. No- to Low-code ETL tools such as Stitch, Airbyte or Fivetran are powerful cloud-based services to move data automatically into the destination of your choice.

These products provide pre-built connectors to common data sources, including MongoDB, enabling a plug-and-play transfer of your data. Unfortunately, none of the previously mentioned ETL platforms offer a free plan (or no longer do), and they typically charge based on the volume of data loaded.

Therefore I considered a more in-house solution: manually build the ETL process in Python. The idea is to write a custom Python script that serves two purposes:

  1. Extract data from my MongoDB database
  2. Load it to my SQL database

Manual ETL does require more technical skills, but one of the reasons Python is so popular is the large number of libraries available. Among them are powerful packages that let you perform ETL processes in only a few lines of code.

  • On the one hand, you have PyMongo which makes it easy to fetch data from MongoDB.
  • On the other hand, among others, SQLModel is an easy-to-use library to interact with SQL databases.
Manual ETL process with Python

Using these two packages, I managed to implement an ETL process in about 200 lines of code. By running one Python command, I am able to replicate my production database and turn unstructured data into structured data stored in a relational database.
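
To give a feel for the approach, here is a heavily trimmed-down sketch of such a script (the real one lives in the GitHub repository linked above); the model fields, collection names and environment variables are simplified placeholders.

```python
import os
from typing import Optional

from pymongo import MongoClient
from sqlmodel import Field, Session, SQLModel, create_engine


# Target table: SQLModel maps this class to a table in Postgres.
class Family(SQLModel, table=True):
    id: str = Field(primary_key=True)  # MongoDB _id stored as text
    name: Optional[str] = None
    members_count: Optional[int] = None


def run_etl() -> None:
    # Extract: read documents from the production MongoDB database.
    mongo = MongoClient(os.environ["MONGO_URI"])
    families = mongo["porte-bleue"]["families"].find()  # placeholder names

    # Load: write rows into the replica Postgres database on Heroku.
    url = os.environ["DATABASE_URL"].replace("postgres://", "postgresql://", 1)
    engine = create_engine(url)
    SQLModel.metadata.create_all(engine)

    with Session(engine) as session:
        for doc in families:
            # merge() inserts the row, or updates it if the primary key already exists.
            session.merge(Family(
                id=str(doc["_id"]),
                name=doc.get("name"),
                members_count=doc.get("membersCount"),
            ))
        session.commit()


if __name__ == "__main__":
    run_etl()
```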

Ever since I became a data analyst, SQL has been an essential part of my analytics toolbox. Now that the data was replicated in a SQL database, I could play with it with ease. From that moment on, I was able to easily answer the first data requests from the organization and quickly deliver value without having to wait for the project to be fully fleshed out.

Cheap Tip #2: Business Intelligence with Preset Cloud

Pulling data with a SQL query is one thing but presenting it in user-friendly views such as reports, dashboards and charts is something else and is known as Business Intelligence. From this point, adding a Business Intelligence layer to my data stack could allow me to provide better access to analytics across the organization and enable users to find the right data quickly.

Analytic databases are specifically designed to support business intelligence and, compared to NoSQL databases, there is also a much wider range of BI tools compatible with SQL databases. With cloud-based solutions, getting started with Business Intelligence is at anyone’s fingertips and becomes as simple as plug-and-play. Obviously, fully-hosted cloud services come at a cost and usually offer a free trial for a limited period of time (like Metabase Cloud).

However, among modern business intelligence tools, there is a new kid on the block offering a very attractive free plan: Preset Cloud. Launched in August 2021, Preset provides a managed service of Apache Superset, an open-source business intelligence software. With Preset Cloud’s free plan, the first 5 users are free forever and, in my case, it was more than enough to start with. In a matter of minutes, I was able to create a new Workspace for “La Porte Bleue”, connect Preset to my SQL database and start exploring & visualizing the data.

First exploration of raw data on Preset

From an MVP to an End-to-End Modern Data Stack

At this stage of the project, I had reached a first milestone: setting up a minimum viable analytics stack. All I had to do was manually run one simple Python command to refresh my analytics database with the latest data and keep my data sources on Preset up to date. Most of the charity’s activities take place over the weekend, whether it is collecting food or distributing it. Since there is generally no new data between two Saturdays, executing the script once a week was enough to keep my reporting on track.

However, as I was playing around with the data on Preset and getting more familiar with the set of data, I realized a few things:

  1. The raw data as it was loaded by my ETL process was not structured in a way that is conducive to data analytics. Many transformation steps in SQL were required in order to make it analysis-ready.
  2. The data was also often incomplete: some tables had missing data, which was due to incomplete information collected on families during registration. However, I had no clear visibility into the extent of the flaws.
  3. The raw table with the finest level of granularity (one of my 5 raw tables) was getting bigger and bigger and would soon reach the 10,000-row Heroku limit.

Fortunately, implementing a data transformation workflow could help me address these issues, bringing me to the next step of my project: adding a transformation and modeling layer to my data stack.

dbt as Data Transformation Layer

Data transformation is the stage that turns “raw” data into “refined” data — in other words, it makes data usable for analytics. Back in the day (in the 90s), when storage and bandwidth were incredibly expensive, data used to be transformed before being loaded into the data warehouse as part of a traditional ETL (Extract — Transform — Load) process. With the arrival of cloud data warehouses, storage and bandwidth costs were considerably reduced. As a result, loading untransformed data into the data warehouse was no longer a concern and the transformation step could easily happen in-warehouse. In modern data stacks, data transformation tools like dbt perform the T (Transform): essentially, they transform the data already loaded in the warehouse through data models written in SQL. Strictly speaking, my data pipeline should be referred to as ELT instead of ETL, since Transform comes after Load.

“dbt is a data transformation tool that enables data analysts and engineers to transform, test and document data in the cloud data warehouse.” — Source

Data Build Tool (https://www.getdbt.com/)

dbt is a familiar tool for me since it is part of the data stack of the company I work for. Setting up my dbt project and building it from scratch was truly easy and fun (I swear!). But above all, adding dbt as a data transformation layer had real benefits for my project:

  1. With dbt, I was able to standardize my data and put it into the right structure thanks to a set of interdependent data models. Instead of creating datasets within Preset from long SQL queries, with a lot of code duplicated across them (e.g. to unnest arrays), I could split my blocks of code into multiple models organized in an efficient way. No more long SQL queries in Preset: my Preset datasets were now pointing to my analysis-ready dbt models through a simple “select *”.
  2. The ability to test data in dbt to ensure data quality is also extremely powerful. Data testing brought visibility into data quality issues that were suspected but not yet quantified across my set of data. I was able to identify all the incomplete/missing data and share this with the organization so that they could correct it by collecting the missing information from the families.
  3. Finally, dbt allows you to create data models materialized as views, which are essentially “virtual tables” whose data is not physically stored in the database. You’ve caught on: I could create as many views as I wanted without adding a single row to my SQL database. dbt was a great ally in dealing with a database that has an enforced row limit, letting me persist data at my convenience. As previously mentioned, my operations table, containing one row for each stock operation performed in the app, would soon exceed, by itself, the 10,000-row limit enforced by the Heroku free plan. Thanks to dbt, I could create an incremental model aggregating the operations data at the product and daily level and store it in a table built incrementally: on each dbt run, dbt inserts new records and updates existing ones. So yes, this time the data is physically stored in the database, but since it is aggregated, the number of rows is considerably reduced. All I had to do was limit the extract of the “operations” data to a relative time period (let’s say the last 7 days) in my Python script, so that the operations raw table contains fewer rows but continues to feed my incremental model (see the sketch after this list).
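
As an illustration of that last point, here is a hedged sketch of how the extract of the operations data can be restricted to a rolling window in the Python script; the collection and timestamp field names are placeholders.

```python
import os
from datetime import datetime, timedelta, timezone

from pymongo import MongoClient

# Only extract stock operations from the last 7 days: older rows have already
# been folded into the incremental dbt model, so they can be dropped from the
# raw table to stay under Heroku's 10,000-row limit.
cutoff = datetime.now(timezone.utc) - timedelta(days=7)

mongo = MongoClient(os.environ["MONGO_URI"])
recent_operations = mongo["porte-bleue"]["operations"].find(
    {"createdAt": {"$gte": cutoff}}  # hypothetical timestamp field
)
```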

I could write a lot more about the benefits of dbt given how powerful a tool it is, but above all, it has the advantage of being completely free in my case.

Lineage Graph of ‘La Porte Bleue’ dbt project

Cheap Tip #3: Productionize dbt deployment with dbt Cloud

Since dbt is an open-source solution, running a dbt project from the Command Line Interface (CLI) is completely free. dbt also offers a hosted service named dbt Cloud that helps you productionize dbt deployments. dbt Cloud provides a variety of features, including a browser-based IDE, unlimited daily runs, job scheduling, GitHub integration, and logging & alerting. With dbt Cloud, there is one free-forever developer seat that a one-person data team like myself can definitely take advantage of. Thanks to dbt Cloud, I was able to fully automate the scheduling of production refreshes at the pace and frequency I desired.

dbt Cloud project dashboard

A Fully Automated Data Pipeline

The final milestone was to remove the human in the loop (i.e., myself manually running the Python script) in order to get to a fully automated data pipeline. How? By orchestrating the Extract & Load workflows. Data orchestration is the process of coordinating the execution and monitoring of these workflows. In simple words, it is about scheduling tasks to be executed sometime in the future.

Since my ETL process was coded in Python, it made sense for me to explore Python-based “Workflows-as-Code” orchestration tools. In this area, platforms like Prefect can make the entire process ridiculously easy and cheap. Prefect’s promise is simple: if Python can write it, Prefect can run it. With Prefect, a Python function can be turned into a Prefect Task simply by using a ‘@task’ decorator. Creating a Prefect workflow from my Python script was therefore incredibly straightforward. Prefect is the workflow engine, while Prefect Cloud is the command center for the workflows, which comes with a slick UI:

Prefect Cloud project dashboard
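
To give an idea of how little glue code this requires, here is a rough sketch using Prefect’s decorators (shown in the Prefect 2 style); run_etl stands in for the existing script described earlier.

```python
from prefect import flow, task

from etl import run_etl  # hypothetical module containing the existing ETL script


@task(retries=2)
def extract_and_load():
    # The whole Extract & Load step becomes a single Prefect task.
    run_etl()


@flow(name="porte-bleue-replication")
def replication_flow():
    extract_and_load()


if __name__ == "__main__":
    replication_flow()
```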

This part is free. But Prefect also has to run the tasks somewhere. Prefect has chosen a “bring your own infrastructure” model, so you need to pay for the compute required to run your tasks. For that reason, I am currently running Prefect with a local agent on my machine until I move to a serverless solution. Although Prefect was a good fit for my project, the computing costs are pushing me towards a more basic serverless approach. As a next step, I will deploy my Python code on AWS Lambda, a serverless computing service from AWS whose free tier does not expire after 12 months (unlike that of many other AWS services).
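
The move itself should be small. Here is a minimal sketch of what the Lambda entry point could look like, assuming the existing run_etl function is packaged with the deployment and invoked by a scheduled trigger:

```python
from etl import run_etl  # hypothetical module packaged with the Lambda deployment


def handler(event, context):
    # Invoked on a schedule (e.g. an EventBridge rule once a day).
    run_etl()
    return {"status": "ok"}
```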

Ok, But What Does the Final Result Look Like?

Here we are 🎉, I had just finalized the first version of my end-to-end modern data stack. In the end, my analytics stack can be represented by the following Data Stack Canva:

La Porte Bleue Data Stack Canva Diagram

As of now, my data pipeline is updating the raw data every day, while dbt, combined with dbt Cloud, does the data processing in order to refresh data models exposed as data sources in Preset Cloud.

Within Preset, the only SQL that is not version-controlled is the single line pointing to the entire table: Preset datasets are created as-is from the final dbt models.

My dbt models exposed as datasets in Preset

From a dataset, I have then built individual charts organized into a dashboard that makes up the final product for our end users. A dashboard is a collection of charts designed to form a narrative. As of this writing, I have 3 ready-to-use dashboards that meet the basic reporting requirements of the project:

  1. A first dashboard featuring all the key indicators required to generate the Food Bank’s quarterly report.
  2. A second dashboard enabling users to get different facets of information about families using metric-driven charts to convey quantitative data.
  3. Finally, a dashboard monitoring inventory inflows and outflows to convey insights on demand (which products are the most distributed?) and supply (which products are the most collected?) helping the organization make informed decisions in order to close the gap between supply and demand.
Overview of the Family Dashboard on Preset

It’s Only the v1!

I am very proud of what I have accomplished throughout this project. I have managed it in an agile way, where each iteration brought me closer to my end goal. I have been down some paths and sometimes had to backtrack (Trino, Prefect), but it was always full of learnings. Along the road, I have reached defined milestones, a key one being to deliver a minimum viable analytics stack, which I consider my v0. Now, I have come to a point where the result is satisfying enough to mark a stop and present it in this blog post as what I consider my cost-effective modern data stack v1. But the project does not end here; it’s only the beginning! There are many ways I could enhance my data stack. As a first next step, I plan on orchestrating the ETL process in a serverless way by deploying my Python script with AWS Lambda. I also want to make my data pipeline more efficient with incremental updates: instead of syncing entire datasets at every update (full syncs), deleting the data and rebuilding the entire table, I would like to capture only the newest data since the last sync.

On top of that, there are challenges ahead, one of which is keeping my data stack cost-effective within the limitations of the free plans of the tools used. The main challenge will be to stay under the 10,000-row limit of the SQL database as the data source grows and adds records. Currently, the volume is about 5k rows and therefore within the limit, but new data is generated every week, which leaves little leeway in the long run. Nonetheless, the challenges are exciting and I am looking forward to them: building projects is the best way to learn!


Written by Marie Lestavel

🇫🇷 Data Analyst Based in Berlin 🇩🇪
