A Song of Data and Fire: Building Bnext Wall (Data Lake)

Ignacio Aranguren
Jun 10 · 11 min read

Article by Ignacio Aranguren co-authored with Jacobo Miralles

Big Data cuts deeper than swords

At the dawn of a modern revolution powered by Artificial Intelligence, Big Data is no longer a nebulous horizon but an indispensable ally for anyone aspiring to conquer the Iron Throne of the fin-tech business. Winter is (always) coming to the financial world, and those prepared to manage and analyse the ever-growing volume of data being collected will have better odds of succeeding on the battlefield. Big Data isn’t a pit, Big Data is a ladder.

Three Big Data mind-boggling facts

  • According to estimates, more data has been created in the past year than in the entire previous history of the human race.
  • Poor data can cost businesses 20%–35% of their operating revenue, and while 55% of professionals have no access to the right information at the precise moment they need it, there are still plenty of people who access confidential data on a daily basis.
  • At the moment, less than 0.5% of all potential data is analysed. A silence that raises more questions than answers.

And that’s what I do. I analyse, and I know things.

When you play the game of thrones, you win or you die. There is no middle ground. And as this new era gathers pace, several fin-tech startups aspire to be the unexpected heir to the throne, or will fail trying.

Fin-tech startups were born with technological DNA and claim to be flexible enough to change the current financial paradigm, by providing the essential tools to process the data while precisely identifying trends and adapting business strategies accordingly. Their focus is on intelligent micro-personalisation and on providing the end user with a non-intrusive customer experience.

On the other hand, fin-tech resources are more limited than those of their larger and wealthier competitors in Fin’s Landing (sorry, traditional banking, you are the Lannisters here), who have recently started to move their troops. And the future of the Seven Kingdoms is hanging by a thread.

At Bnext, we have gone through the adventure of transforming raw data into something meaningful, valuable and understandable. A process that is (or can be) as fearsome as the Long Night, and we want to share our insights with you. We hope you enjoy the read.

Before the GCP Wall

Every flight begins with a fall, and before we built our Wall in Google Cloud Platform (GCP), the Bnext Data Department was dark and full of terrors. We decided to build our Data Lake completely separate from our App Core Data servers running on Amazon Web Services (AWS).

Like most startups, we used MySQL as our Analytics Database.

We decided to go with Google because it is more SaaS-oriented than PaaS solutions that require system engineers. Google is definitely the Targaryen House of software developers. In less than an hour, we created the Bnext organisation and our first Project in GCP, and in a few minutes we were able to create a Cloud SQL instance. We entered with emptiness in our hands and came out with three dragons, thinking that most of the work was done. All we had to do was create a copy of our Core Database in Google and start exploiting the data, right?

Well, not quite. The Core Data Structure we had back then did not help at all. Some initial decisions had split our Core Data into two separate Databases. What was meant as a privacy measure ended up being a problem, with no real privacy advantage. (Please, some Red Wedding music here). The first of many issues we had to tackle, the first of many battles for the throne.

A SQL Database in Google and the Data Core in Amazon, what next?

Basically, our first approach was to develop an ETL process that took place each morning. By “taking place” we mean that we had to do it manually every morning, as we had not yet automated the process.

Our ETL process had the following steps:

  1. Data transfer using Navicat to copy the two main Databases from Amazon into Google. Let’s call these the “Baratheon” Database (the one with user account info) and the “Greyjoy” Database (the one with card transaction info).
  2. We then created a new Database, the “Stark” Database, with all the tables from both Baratheon and Greyjoy. One Database to rule them all.
  3. Once all the data had been transferred to the Stark Database, we ran a query that transformed the original tables into new tables that were easier to analyse. For instance, we created a new user table and added variables that could potentially be used for segmentation purposes, like “average monthly spend” or “acquisition channel” (a sketch of this kind of query follows the list below). This query took 5–10 minutes when we had 5K users. When we reached 50K users (think of the subsequent exponential growth in transactions), the process could take up to 2 hours, and each time it failed we had to restart it manually. This process ended up being like the White Walkers to the Seven Kingdoms: our biggest fear and rival.
  4. Once the data was correctly transformed into the proper SQL tables, it was time for visualisation.
  5. At first, we used Google Data Studio for simple dashboarding, but we soon started using Tableau for cohort analysis and more complex processes. We ended up pushing huge processing efforts onto our Tableau system, which became another issue, as Tableau Dashboards could take ages to load or even fail.
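
To give a flavour of step 3, here is a minimal sketch of the kind of transformation query we mean, written against the Stark Database. The table and column names (baratheon_users, greyjoy_transactions and so on) and the connection details are hypothetical placeholders, not our real schema.

```python
# Hypothetical sketch of the step-3 transformation: build an analysis-friendly
# users table with derived segmentation variables such as average monthly spend.
import mysql.connector  # assumes the MySQL Cloud SQL instance described above

conn = mysql.connector.connect(
    host="STARK_DB_HOST", user="analyst", password="***", database="stark"
)
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS analytics_users AS
    SELECT u.user_id,
           u.signup_date,
           u.acquisition_channel,
           AVG(t.monthly_spend) AS avg_monthly_spend
    FROM   baratheon_users u
    LEFT JOIN (
        SELECT user_id,
               DATE_FORMAT(created_at, '%Y-%m') AS month,
               SUM(amount)                      AS monthly_spend
        FROM   greyjoy_transactions
        GROUP  BY user_id, month
    ) t ON t.user_id = u.user_id
    GROUP  BY u.user_id, u.signup_date, u.acquisition_channel
""")
conn.commit()
cur.close()
conn.close()
```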

This process, briefly explained in five steps, ended up being a nightmare. For important reports we had to wake up at 5 a.m., as we could not risk not having the right numbers on time.

However, the freedom to make our own mistakes was all we ever wanted (that is why we work at a startup, after all), and although the battle was tough, the White Walkers’ ambush was finally defeated. And then came the GCP Wall. Once you’ve accepted your flaws, no one can use them against you. Let me now explain that in more detail.

The GCP Wall.

“Why is it that when one person builds a wall, the next person immediately needs to know what’s on the other side?” Let us quench that thirst for curiosity. There are almost infinite products available on the market for Big Data purposes. They all claim to be the rightful heir to the throne, to have the best tradeoff between managed services and flexibility, and to serve the people with honour and justice. However, we put ourselves in Sam’s shoes and, after extensive research, we finally chose Google Cloud to delegate the management of most of our data needs.


As with other successful startups such as Cabify, we placed our biggest bet on Google Cloud Platform. We chose Google’s BigQuery as a distributed SQL-like solution and Google’s Cloud Datastore for holding NoSQL data structures. For dashboarding, we continued using a Tableau connection.

For us, it was important to:

  • Do fast analysis — allowing deep & precise analysis to be executed quickly and efficiently.
  • Bring power to each department — understanding each department’s needs and providing them with the right tools.
  • Make data-driven decisions — as every decision must involve data at some level.
Bnext departments dynamics

In an endless world of possibilities, why choose BigQuery?

BigQuery is Google’s serverless, highly scalable data warehouse, designed to make data analysts more productive, with price-performance that is affordable for early-stage start-ups. Scalability is one click away, it allows information from multiple sources to be aggregated, and complex queries run efficiently in seconds.

By using BigQuery (BQ) as our data lake, we aimed to make our (future) analysts and other teams more productive by accelerating data extraction. Running queries in BQ takes seconds, whereas running the same query on the former structure took far longer. And if power resides where humans believe it resides, for start-ups time is definitely the most precious asset.
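
To illustrate the point, this is roughly what a quick ad-hoc analysis looks like with the official Python client for BigQuery. The project, dataset and field names are hypothetical.

```python
# A minimal sketch of an ad-hoc BigQuery analysis using google-cloud-bigquery.
# Project, dataset and column names are placeholders, not our real schema.
from google.cloud import bigquery

client = bigquery.Client(project="bnext-example-project")

sql = """
    SELECT acquisition_channel,
           COUNT(DISTINCT user_id)      AS users,
           ROUND(AVG(monthly_spend), 2) AS avg_monthly_spend
    FROM   `bnext-example-project.raw_tables.users`
    GROUP  BY acquisition_channel
    ORDER  BY users DESC
"""

for row in client.query(sql):  # the query runs server-side and streams results back
    print(row.acquisition_channel, row.users, row.avg_monthly_spend)
```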

At Bnext, data comes from several sources that constantly feed the warehouse for the battle. Below is a brief account of the data’s journey from beyond the Wildlings’ fields to the GCP Wall.

  • First, transactional data, users’ core info and acquisition attribution data are collected through an in-app SDK and by our backend system. They are then loaded into our Core Database in AWS after some API calls and some developer magic (Lord of Light, come to us!). Finally, the data is transferred to BQ via our ETL system, where the journey ends (a sketch of this last hop follows the list below). This last step runs on a Virtual Machine inside Google’s Compute Engine and is triggered automatically every day for the sake of our BI team. Yes, as you imagine, our IT team is the Red Witch.
  • Users’ core and transactional data is joined with app usage data coming from Firebase, a dragon from Google’s Targaryen House that fires all the data into our system with a single click.
  • Marketplace heterogeneous data (financial products data can be a nightmare) is more complex to manage and is collected in a NoSQL architecture in Cloud Datastore before it is streamed to BQ. A script (thanks again, Red Witch) allows us to structure the information from the Marketplace and plug it into raw tables for further data exploration.
  • Finally, other platforms such as MOCA (for geolocation) and Intercom (for customer interaction and communications) also nurture our data ecosystem via API.
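
As promised, here is a hedged sketch of that last hop: the kind of daily load script the Compute Engine VM could run. We simplify by assuming the extract from the AWS core database has already been dumped to a local CSV; the project, dataset and file names are hypothetical, not our actual setup.

```python
# Hypothetical daily load: pick up a CSV extract of the core database and
# load it into a raw BigQuery table, replacing yesterday's copy.
from google.cloud import bigquery

client = bigquery.Client(project="bnext-example-project")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # full daily refresh
)

with open("/data/exports/transactions_daily.csv", "rb") as source_file:
    load_job = client.load_table_from_file(
        source_file,
        "bnext-example-project.raw_tables.transactions",
        job_config=job_config,
    )

load_job.result()  # wait for completion; a cron entry on the VM triggers this script daily
```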

Our renewed BI Data architecture is shown below:

Bnext BI Data structure

How do we organise the data once it is in BigQuery?

Data is divided into five main datasets: Raw tables, Master tables, App usage tables, Stream tables and Cubes.

  • Raw tables contain data taken directly from the core AWS Database (remember the daily ETL process), which allows us to replicate the “Stark” Database that held all the tables with user and transactional info, along with attribution data coming from Adjust.
  • Master tables contain useful joins, with information that is not present in the core DB but is valuable for enriching our user profiling and transactional data. For example, knowing which postal code belongs to each province/city or which character belongs to each House (sorry for the mess here, but there is only one George R. R. Martin).
  • App usage tables are partitioned tables with hashed user IDs that come directly from Firebase, providing quantified information to the Product team (funnels, testing validation…).
  • Stream tables contain real-time data on the Bnext Marketplace.
  • Cubes, an Iron Dataset forged from the swords of other vanquished datasets, fused by dragon fire. That is, a dataset built from SQL queries that gather relevant info from all the different sources (a sketch of one such query follows this list). Cubes are key for the Bnext departments, as they provide valuable insights for relevant use cases such as monitoring spending patterns or feeding predictive churn analysis.
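
As mentioned, a Cube can be materialised with a single scheduled query that joins raw and master data into one analysis-ready table. The sketch below shows the idea; every project, dataset, table and column name is hypothetical.

```python
# Hypothetical materialisation of a "Cube": one query joins raw transaction
# and user data with a master table and writes an analysis-ready table.
from google.cloud import bigquery

client = bigquery.Client(project="bnext-example-project")

client.query("""
    CREATE OR REPLACE TABLE `bnext-example-project.cubes.spending_patterns` AS
    SELECT u.user_id,
           m.province,                                    -- master table: postal code -> province
           DATE_TRUNC(DATE(t.created_at), MONTH) AS month,
           SUM(t.amount)                         AS monthly_spend
    FROM   `bnext-example-project.raw_tables.transactions`    t
    JOIN   `bnext-example-project.raw_tables.users`           u ON u.user_id = t.user_id
    JOIN   `bnext-example-project.master_tables.postal_codes` m ON m.postal_code = u.postal_code
    GROUP  BY u.user_id, m.province, month
""").result()  # a scheduled job could refresh this Cube daily
```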

How do we use data to drive actions?

Tableau is our main BI solution for data visualisation and visual analytics. Its universality when it comes to connecting to any data source, and the depth of analysis that can be drawn from it, made us disregard other alternatives such as Google Data Studio, which we still use in specific cases. It was a hard decision to drop Google on this one, but the heart lies and the head plays tricks on us; only the eyes see true.

Tableau allowed us to democratise data in this friendless world of data, with endless columns, rows and horrible creatures. At Bnext, each department has access to its relevant data for visualisation and insights. The dashboards are constantly updated and improved through internal feedback and the good work of our Analysts.

Yet, to be the chosen one to sit on the Iron Throne, it is not enough to be merely reactive to your data. You need to make accurate predictions, to anticipate your opponents’ and allies’ movements. You need memory and you need the future. You need Bran the Broken, the Three-Eyed Raven.

From BI to AI: Searching for the Three-Eyed Raven.

So far, we have gone through the adventure of how we mastered the ability to measure and communicate data, which still forms the foundations of our BI department. But that’s only steel. Fairly good steel, but not enough to survive the digital revolution, where the White Walkers are always waiting for a comeback. We need Valyrian Steel, Dragonglass and a Three-Eyed Raven. We need machines capable of reigning in the realms of uncertainty and of automating complex processes. We need to get into our customers’ heads and understand how they behave when using our product. We need to offer the right product at the precise moment to the exact client. To go from a data-driven culture to an algorithm-driven culture.

We cannot tell you much about this last chapter, as we have just started on our way to find the Cave of the Three-Eyed Raven. Spoiler alert: current use cases under development in the Data department include churn risk profiling, propensity modelling to support engagement strategies, customer clustering based on lifetime value and financial product preferences, and transaction categorisation to detect spending patterns.
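
Purely as an illustration of where this is heading (and not a description of our actual models), a churn-risk prototype could be as simple as pulling features from a hypothetical Cube and fitting a scikit-learn classifier.

```python
# Illustrative churn-risk sketch: features come from a hypothetical Cube in
# BigQuery and feed a gradient boosting classifier. Names are placeholders.
from google.cloud import bigquery
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

client = bigquery.Client(project="bnext-example-project")

# to_dataframe() requires pandas to be installed
df = client.query("""
    SELECT avg_monthly_spend, app_sessions_30d, days_since_last_txn, churned
    FROM   `bnext-example-project.cubes.churn_features`
""").to_dataframe()

X, y = df.drop(columns="churned"), df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = GradientBoostingClassifier().fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```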

However, on this new quest we are pretty sure we will run into struggles. In the battles to come, we have the challenge to:

  • Find the right balance between success in project results and not losing the department’s long-term vision. To make an impact, with scalability.
  • Avoid creating complex plots (black boxes) that are not really understandable, or models that do cool stuff but are not aligned with the core business vision.
  • Remember that the ultimate purpose of a model is to provide better service. ML models degrade over time and should be retrained constantly.
  • Accept that once a project is deployed, data is no longer static but dynamic. To manage this whole new ecosystem, it is crucial to determine the model’s quality thresholds, team responsibilities and GDPR compliance.

If we manage to accomplish all of the above we will hopefully soon find ourselves saying: “I remember what it felt like to be a Stark Database, but I remember so much else now”.

Conclusions and learnings

Every story has an end, and while death is so terribly final, life is full of possibilities. Let’s now go through the conclusions and learnings of this Song of Data and Fire. We hope they are useful.

  • Big Data and AI are no longer on their way: they are here. The new technological paradigm is happening and is the perfect excuse to change the course of traditional financial business models. And fin-tech startups want to lead the race to the throne.
  • Trust third-party companies. The process of transforming raw data into something useful and user-friendly can be dark and full of terrors. Choose your first steps carefully, focusing on scalability, flexibility and UX. If you want to avoid nightmares, you might want to delegate the management of (most of) your data needs. In our case, Google and CloudFramework.io have helped us face despair by providing magical artefacts and experienced warriors.
  • We really recommend giving Google Cloud Platform services a try. Google’s BigQuery for SQL and Google’s Cloud Datastore for NoSQL data structures, connected to a powerful BI solution like Tableau, have worked for us.
  • Enrich your data with multiple sources as it flows through your system. Methodologies and systematic processes are important, but do not forget the business’s strategic vision or you may find yourself lost in a White Walkers’ Data Trap.
  • Artificial Intelligence and Machine Learning models should be designed to make the most proactive decisions possible. Survival in the digital revolution will be about understanding clients’ behaviour and providing the exact product in real time. Technology is not an end but a means to bring a data culture to the business and to support the company’s core strategy.