Data Modeling Today: launching cost-effective analytics for ManyChat

Nikolay Golov
Published in Manychat Tech Blog · Aug 23, 2022

In my previous articles, I raised the question of whether data modeling is still necessary in the world of the modern data stack. A small practical case from the previous article showed that the benefits of data modeling exist, but they are not … that impressive.

In this article, I'd like to consider a rather big practical case: launching an analytical system for a rather complex business. And I think that here we start seeing some real benefits of data modeling in the world of the modern data stack. In a few words: with proper data modeling, it takes less time and money to add new features or new capabilities to your analytics. Your data will grow over time, even if your business is stable, but proper data modeling can help you avoid a proportional growth of costs. Let's look at the details.

Analytics gives you the possibility to ask questions about your business and get reliable answers and actionable insights.

ManyChat is a rapidly growing startup and a leader in the area of chat marketing. It provides its clients with a zero-code tool for creating chatbots; it is integrated with Facebook Messenger, SMS, email, WhatsApp, Instagram, and Telegram, and it lets businesses communicate with their customers via these platforms.

In 2019, ManyChat still used a dedicated PostgreSQL (AWS RDS) database as its main tool for product analytics. This database could answer some analytical queries, but its size was almost 2 TB, and the number of analytical questions it could answer was rapidly shrinking. Also in 2019, ManyChat used only the FB Messenger channel for all chatbots, but it planned to launch other channels: SMS, email, Instagram, WhatsApp, and Telegram.

Therefore, in 2019, ManyChat decided to rebuild its analytics system to get ready for new challenges:

  1. Business growth, data growth. Traffic was increasing, and more data about each user, contact, and conversation was being sent to analytics. Data volumes were set to exceed not just 2 TB, but 20 TB, and even 200 TB in the following years.
  2. Rising data complexity. Conversations through FB Messenger differ a lot from conversations through SMS, email, Instagram, WhatsApp, and Telegram. Each channel has its own specifics and its own unique entities with specific relations between them.
  3. Business model change. As a rapidly growing startup, ManyChat experimented a lot with its business model and tested various key metrics. Therefore, analytics couldn't rely on a fixed set of metrics: it had to be able to recalculate all the metrics, not just for the current period but also for previous periods.

So, long story short, ManyChat needed a bottomless analytical platform: one that could receive and store ever-increasing volumes of data and answer any analytical question without extra engineering effort. That meant a scalable, cloud-based analytical database able to process any ANSI SQL query. At that moment, ManyChat hired me to build an analytical data platform on top of such a database.

The tool selection process is a story for another time; in short, I chose Snowflake as the main analytical database, Tableau as the main BI tool, and pure Python with Redis (AWS ElastiCache) as the main data ingestion and transformation tools.

And here comes the question of data modeling. In a previous article, I showed that, technically, you can just store JSONs in heaps and direct all analytical queries at them. But as the data grows, the complexity of each analytical query grows with it. In Snowflake, that means you either wait longer or choose a bigger warehouse (Small to Medium, Medium to Large, etc.) and pay twice as much.
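To make the trade-off concrete, here is a minimal sketch in Snowflake SQL (the table and column names are illustrative, not ManyChat's actual schema). The first query digs the answer out of a heap of raw JSON; the second answers the same question from a narrow, normalized table:

```sql
-- Heap of raw JSON: assumes a table raw_events with a single VARIANT column.
-- Snowflake has to scan and parse every record to answer the question.
SELECT raw:bot_id::NUMBER   AS bot_id,
       COUNT(*)             AS messages_sent
FROM raw_events
WHERE raw:event_type::STRING = 'message_sent'
GROUP BY 1;

-- Normalized model: the same question touches one narrow, typed table.
SELECT bot_id,
       COUNT(*) AS messages_sent
FROM message_sent
GROUP BY bot_id;
```

Both queries return the same answer, but the second one scans far fewer bytes, and in Snowflake fewer bytes scanned translates directly into smaller warehouses and lower credit spend.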

Therefore, I decided to use normalization and apply some modern data modeling approaches. I call the result Anchor Modeling, but because it has some modifications (due to MPP limitations), you can also think of it as a more normalized version of Data Vault 1.0: satellites on links are prohibited, links to links are prohibited, historicity uses a single date, and each satellite stores just a single attribute, not many.
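In table terms, the approach looks roughly like this (a minimal sketch in Snowflake SQL; all names are illustrative, not the production schema):

```sql
-- Anchor: one table per entity, holding only the entity's surrogate key
-- plus load metadata.
CREATE TABLE anchor_bot (
    bot_id     NUMBER        NOT NULL,
    loaded_at  TIMESTAMP_NTZ NOT NULL
);

-- Attribute ("satellite"): one table per attribute, exactly one attribute
-- per table, historized with a single date column (no end date).
CREATE TABLE attr_bot_name (
    bot_id     NUMBER        NOT NULL,
    bot_name   STRING        NOT NULL,
    valid_from TIMESTAMP_NTZ NOT NULL
);

-- Tie ("link"): one table per relationship between entities; ties carry
-- no attributes of their own, and ties to ties are prohibited.
CREATE TABLE tie_bot_contact (
    bot_id     NUMBER        NOT NULL,
    contact_id NUMBER        NOT NULL,
    valid_from TIMESTAMP_NTZ NOT NULL
);
```

Each table is narrow, so analytical queries read only the columns they actually need, and adding a new attribute means adding one more table rather than migrating an existing one.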

After this decision, I started researching the ManyChat business model (by the way, it was my first working day there). I scheduled a meeting with one of the product owners and asked him, “Hey, what are you doing here?” And he told me a story: “We give our customers the ability to create chatbots to automate communication with their contacts. Inside bots, each customer can create broadcasts or flows of automation to send and receive messages to/from contacts via the available channels.” At that time, the only channel was FB Messenger, but SMS and email were on their way. It was more than enough: I got a list of nouns, or business entities (Customer, Bot, Contact, Flow, Channel, Message), as well as their relationships and attributes.

It looks rather simple, doesn’t it? From this step, everything gets even simpler:

  1. Each noun (entity) becomes a table.
  2. Each attribute and each relationship also becomes a table.
  3. Determine which source system is the source of each entity (the entity itself, its attributes, and its relationships).
  4. Automate the creation of the tables from steps 1 and 2, and automate their regular population with data from the source systems identified in step 3 (see the sketch below).
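For step 4, the population logic can be as simple as appending only changed values. Here is a hedged sketch of what such an incremental load might look like in Snowflake SQL (the staging table stage_bot_name and all other names are hypothetical; the real pipeline at ManyChat was driven by pure Python):

```sql
-- Assumes new values arrive in stage_bot_name(bot_id, bot_name, extracted_at).
-- Append a row to the attribute table only when the value is new or changed.
INSERT INTO attr_bot_name (bot_id, bot_name, valid_from)
SELECT s.bot_id, s.bot_name, s.extracted_at
FROM stage_bot_name AS s
LEFT JOIN (
    -- current name of each bot: the latest row per bot_id
    SELECT bot_id, bot_name
    FROM attr_bot_name
    QUALIFY ROW_NUMBER() OVER (PARTITION BY bot_id ORDER BY valid_from DESC) = 1
) AS cur
  ON cur.bot_id = s.bot_id
WHERE cur.bot_id IS NULL             -- brand-new bot
   OR cur.bot_name <> s.bot_name;    -- the name actually changed
```

Because every attribute table has the same three-column shape, a statement like this can be generated mechanically for each attribute, which is exactly what makes the automation in step 4 feasible.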

I understand that I’m skipping and simplifying many important steps, but I’d like to avoid making this article too long. So let me finish with the main quantitative results of those three years.

The first chart shows data growth. The volumes are in TB of raw data, measured before loading into Snowflake; this metric was chosen to negate the effects of data compression and of data duplication for Time Travel. The chart illustrates that ManyChat has a pretty healthy data growth rate, finishing 2021 with more than twice the data it finished 2020 with (155 TB vs. 65 TB).

More data, and more types of data, gave ManyChat the possibility to hire more analysts and run more and more analytical projects. The second chart shows the monthly count of analytical SQL queries, from analysts and Tableau BI, which grows steadily from month to month. Moreover, in August 2020, ManyChat had four full-time analysts; in June 2022, it had 20.

The third chart shows weekly spending of Snowflake credits. The absolute numbers are hidden, but the pattern is clear: in August 2020, credit spending stabilized and became almost constant. More data, more reports, more research … and the budget line is still flat.

In the next articles, I’ll try to explain how this was achieved, but the main idea, in my understanding, is clear: proper data architecture lets us forget about linear growth of spending (proportional to data volumes) and switch to almost constant expenses.

Nikolay Golov

Head of Data Platform at ManyChat. Data Engineer. Researcher of Data Modeling techniques (Anchor Modeling, Data Vault). Lecturer at Harbour Space University.