How to scale Data Management?

Maxime Thomas
Nov 30, 2017 · 12 min read
Collecting data is a serious matter: with no data, no good decisions (courtesy of Quotefancy)

When you start a business, the data you collect matters a great deal because it shapes your vision of your product and your business. Data gives you indirect feedback on how people use your platform and how they consume it. As your data team grows, you will face new challenges, from raw data and basic analysis to building dashboards for global governance.

It’s easy to pick a free analytics tool, implement it in your different clients and get a basic analysis of users’ browsing paths or volumes.

You will, however, have difficulties managing all that data:

  • collecting data from those free tools: it’s going to be expensive or limited;
  • normalizing everything: you need to script transformation macros or similar to enrich and fix your data;
  • dealing with several sources of data, as each of those free tools specializes in one kind of data collection (mobile devices only, ads only, etc.);
  • dealing with several destinations once you have collected data;
  • handling availability, consistency and freshness of the data.

In this article, I describe why it’s important to focus on the right level of data management for the needs, skills and maturity of your company. I also cover solutions and how to implement them, which should give you a good overview of the options currently available to address all those issues.


Big plan for Big Data

Transcript from an executive committee meeting:

We have a big plan for big data: we are going to hack the market, provide the best product to our users, and maximize income.

We are going to collect everything, from FB Connect to sensitive personal data, and exploit it in real time so that our platform reacts instantly, users are more engaged, and the conversion rate increases to 80%.

Reports from the last 5 years show that we have performed really well. We can compute any metric as we know the semantics of our data. We use this data to predict the trend for next year.

This is more or less the profession of faith of the digital entrepreneur / C-level when it comes to data. We know what we want; now we have to understand how to get it.

Some people may believe that a data-driven approach is overkill for a small company. That is exactly wrong. Data is the steering wheel of your car: without it you can’t drive, you just go wherever your car has decided.

In the following sections, I give keys to understanding the big picture behind the data-driven approach and the issues you may encounter while handling data in your company.

Company vital constraints

First of all, you have to understand the constraints a data project implies for the company, which are the first brake on data management:

  • Cost: it costs money. Being simple to implement does not make it free. You have to account for the dev / ops time spent designing and implementing the solution;
  • Availability: the solution must be available and should produce data analysis in a decent time frame, otherwise it is useless;
  • Scalability: the solution should handle larger and larger amounts of data.

Data is not a subject that can be handled on the side: it’s serious, bulky and sometimes tricky. So it really should be discussed with all the stakeholders and started as a real project (not as a feature tucked into the next Android release).

Product and Business data driven methods

The second step is to understand who is going to consume the data generated by the platform users. From my experience, there are two approaches:

  • Product teams: I want to know how my users consume the product, why one stopped at step 2 and why another completed the full path to the action; I want to know how I can better engage my users with my product;
  • Business teams: I want to know how to maximize gross market value and learn which commercial actions convert users more efficiently.

Those teams are basically trying to justify their existence with those metrics. Their requirements are not necessarily compatible, but the collected data should serve both. One way to consolidate both approaches is to design data generation around a consumption path, which is very similar to a classic e-commerce funnel but applied to each feature of your product.

Consumption path for a feature

This way, you can segment consumption to identify causes of success (nominal path) and causes of abandonment (abandon path), and finally arrive at the core product that earns both enthusiasm from users and satisfaction from customers.
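To make the consumption path concrete, here is a minimal sketch of a per-feature funnel. The step names, the `(user_id, step)` event shape and the `funnel` helper are assumptions made for illustration; they are not taken from any particular analytics tool.

```python
# Hypothetical ordered steps of one feature's consumption path.
STEPS = ["view", "start", "step_2", "complete"]


def funnel(events):
    """Count distinct users reaching each step and the users lost since the previous step.

    `events` is an iterable of (user_id, step) tuples.
    """
    reached = {step: set() for step in STEPS}
    for user_id, step in events:
        if step in reached:
            reached[step].add(user_id)

    report = []
    previous = None
    for step in STEPS:
        count = len(reached[step])
        lost = 0 if previous is None else previous - count
        report.append({"step": step, "users": count, "lost": lost})
        previous = count
    return report


events = [
    ("alice", "view"), ("alice", "start"), ("alice", "step_2"), ("alice", "complete"),
    ("bob", "view"), ("bob", "start"), ("bob", "step_2"),
    ("carol", "view"),
]
report = funnel(events)
for row in report:
    print(row)
```

The `lost` column points at the abandon path (here, users dropping after `view` and after `step_2`), while the last row measures the nominal path.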

State of the art in Big Data

There is a lot of literature on the subject, but one author, Nathan Marz, stands out as a precursor with his book Big Data: Principles and best practices of scalable real-time data systems.

Companies have evolved, and big data automation has brought some comfort to data consumption. Data freshness falls into two categories, hot and cold. Hot data is fresh out of the oven. Cold data, on the other hand, is something you put in the fridge for later consumption. This distinction is important because data collection may not differ between the two, but data aggregation, enrichment and composition should.

Nathan Marz distinguishes two processing layers:

  • the slow layer (the batch layer in Marz’s terminology), which is the classical data management of past decades (business intelligence, batches and so on);
  • the speed layer, which is a streaming approach to data management that gives direct access to data as it is ingested (think of the real-time view in Google Analytics).

This split also has an impact on the core data processes of the company.

The slow layer handles core business data related to money, users and acquisition: all the metrics a decent investor should look at before investing. This is the real stuff; it has to be rigorous and unalterable. It’s slow to implement.

The speed layer, by contrast, is the perfect field for building new metrics or observing specific behavior to detect a trend or confirm an intuition.

We then consider three stages for data:

  • data lake: raw (and wild) data; you can fish there and catch a wild fish;
  • data warehouse: the data has been industrially collected and conditioned but still carries little semantics; you take a frozen fish from the freezer;
  • data mart: the data is branded and comes with its story and extra content; you take a pack of Captain Iglo’s.

The data lake is the reference: it collects the pure data, without alteration. The other stages are entirely expendable, since they can be regenerated from it.
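The three stages can be sketched in a few lines. This is only an illustration under assumptions: the sample payloads, the field names and the `to_warehouse` / `to_mart` helpers are hypothetical, and a real pipeline would use proper storage rather than Python lists.

```python
import json
from collections import defaultdict

# Data lake: raw payloads kept exactly as received (hypothetical samples).
data_lake = [
    '{"user": "alice", "amount": "12.50", "country": "fr"}',
    '{"user": "bob", "amount": "7", "country": "FR"}',
    '{"user": "carol", "amount": "20.00", "country": "de"}',
    'not even json',
]


def to_warehouse(raw_lines):
    """Parse and normalize raw lines; skip what cannot be parsed."""
    rows = []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # the raw line is still available in the lake
        rows.append({
            "user": record["user"],
            "amount": float(record["amount"]),
            "country": record["country"].upper(),
        })
    return rows


def to_mart(rows):
    """Aggregate warehouse rows into a business-facing view: revenue per country."""
    revenue = defaultdict(float)
    for row in rows:
        revenue[row["country"]] += row["amount"]
    return dict(revenue)


warehouse = to_warehouse(data_lake)
mart = to_mart(warehouse)
print(mart)
```

Because the lake is never modified, both `warehouse` and `mart` can be thrown away and rebuilt whenever the normalization or aggregation rules change.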

Global platforms considerations

Finally, architect readers will agree that global platforms bring new scales of data volume and freshness, which make data solutions a bit more complex.

Data isolation matters

From a technical point of view, we should consider two kinds of data: the data that the product needs and without which it could not work, for example user evaluations; and the data that is built or copied for analytics and statistical purposes.

The point is to separate those data durably so that product requirements and data consumption do not interfere. Otherwise you end up, for example, with someone from the data management team asking to change the product because the data shown in the product is not the data that should be used on the analytics side.

That’s why data for analytics purposes should be copied and isolated.

Shifting paradigms

Last but not least, data management has evolved from a discrete approach to a continuous approach. I apologize to my dear mathematician readers for the loose semantics; I lack a better term for this.

Formerly, engineers took snapshots of data in the data source (SQL dumps, static files, and so on) to consolidate the state of the data at a given moment. This is quite efficient if 1/ you don’t have a big volume of data, 2/ you don’t have many sources, 3/ you don’t need to run complex queries on the data.

Discrete approach on data consolidation

However, if you consider the life cycle of the data, you will notice that those snapshots do not really represent the life of your data, only instant pictures of it at given moments. Your analysis has to fill the gap between two snapshots. And if your snapshots are far apart, you may lose crucial information about the trends.

That’s why the fully event-driven model appeared, so that data systems could rely on a more continuous approach. What matters most is not the state of the data (e.g. metric = 42) but its trend (metric is growing fast, +25%). The event-driven model gives you the granularity and the loose coupling needed to refine data in a modular way. It is therefore also well suited to the consumption path of Business and Product teams.

Continuous (aka more discrete with higher granularity) approach for data collection

The more accurate and generic you can be in the design of your events, the more information you will get for later analysis.
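As a rough illustration of why events carry more than snapshots, the sketch below replays a small event log to rebuild the state at any date and to derive a trend between two dates. The `(date, delta)` event shape and the helper names are assumptions chosen for the example.

```python
from datetime import date

# Hypothetical event log: each event changes the metric by `delta` on a given day.
events = [
    (date(2017, 11, 1), +32),
    (date(2017, 11, 8), +8),
    (date(2017, 11, 15), -3),
    (date(2017, 11, 22), +5),
]


def state_at(events, day):
    """Replay events up to and including `day` to rebuild the state (a snapshot)."""
    return sum(delta for when, delta in events if when <= day)


def growth(events, start, end):
    """Relative change of the metric between two dates, derived from the events."""
    before = state_at(events, start)
    after = state_at(events, end)
    if before == 0:
        raise ValueError("no baseline value at the start date")
    return (after - before) / before


final_state = state_at(events, date(2017, 11, 30))
first_week_growth = growth(events, date(2017, 11, 1), date(2017, 11, 8))
print(final_state, first_week_growth)
```

A single end-of-month snapshot would only report the final state; the event log also reveals the early growth and the later dip in between.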

This approach is fully compatible with the consumption path described above and gives more semantic weight to what you tag.

With classic tagging, events are collected at the end and not along the user journey, which results in a fat event
With extreme tagging, events are sent along the user journey and you have to recreate the previous fat event from the collected data
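Recreating the fat event from fine-grained events can be sketched as a simple merge. The keys `journey_id`, `step` and `ts`, the `purchase` step name and the `build_fat_event` helper are hypothetical; they only illustrate the idea.

```python
def build_fat_event(events):
    """Merge the small events of one journey into a single summary event.

    Each event is a dict with the hypothetical keys `journey_id`, `step` and
    `ts` (seconds), plus optional extra attributes collected at that step.
    """
    ordered = sorted(events, key=lambda e: e["ts"])
    fat = {
        "journey_id": ordered[0]["journey_id"],
        "steps": [e["step"] for e in ordered],
        "duration": ordered[-1]["ts"] - ordered[0]["ts"],
        "completed": ordered[-1]["step"] == "purchase",
    }
    for event in ordered:
        for key, value in event.items():
            if key not in ("journey_id", "step", "ts"):
                fat[key] = value  # attributes from later steps override earlier ones
    return fat


# Events may arrive out of order; they are sorted by timestamp before merging.
journey = [
    {"journey_id": "j1", "step": "search", "ts": 100, "query": "shoes"},
    {"journey_id": "j1", "step": "purchase", "ts": 190, "amount": 59.9},
    {"journey_id": "j1", "step": "add_to_cart", "ts": 130, "item": "sku-42"},
]
fat = build_fat_event(journey)
print(fat)
```

The reverse is not possible: a fat event sent only at the end of the journey cannot tell you where the users who never reached the end gave up.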

In this section, I present some solutions that are examples of common data architectures. The aim is to help you choose a solution according to the skills required from IT and Business teams and the effort needed to implement it.

Imponderables

Some aspects are real constraints on data collection:

  • tag plans and event implementation are a real cost for the company, and moreover developers hate doing that;
  • it takes a long time to test and prove that everything is correctly set up;
  • client technology, especially on mobile devices that are not always on wifi, may hamper data collection, so consider very early the use of a specific SDK or a tag manager.

Do it yourself

Creating a small tagging system is easy. You need:

  • A client library;
  • An endpoint to receive data;
  • A good storage solution that is queryable.

Tagging is simple: you just need an API and a DB

The client library is loaded and hooked to whatever you want to listen to, for example a user clicking on a button. When the user does that, you collect information about it (which user, which button) and build an event object.

The library sends events to the endpoint, which can be as simple as a REST API endpoint that stores the data in a relational database or a NoSQL system.
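Here is a minimal sketch of that flow, assuming SQLite as the storage and leaving out the HTTP layer: `receive` stands for what the endpoint would do with a JSON request body. The event fields, the table layout and the function names are all assumptions for illustration.

```python
import json
import sqlite3
import time
import uuid


def build_event(user_id, name, properties=None):
    """Client side: describe what happened (which user, which button)."""
    return {
        "id": str(uuid.uuid4()),
        "user_id": user_id,
        "name": name,
        "properties": properties or {},
        "sent_at": time.time(),
    }


def open_store(path=":memory:"):
    """Server side: a queryable store; the table layout is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "id TEXT PRIMARY KEY, user_id TEXT, name TEXT, "
        "properties TEXT, sent_at REAL)"
    )
    return conn


def receive(conn, body):
    """What the endpoint would do with a JSON request body: parse it and store it."""
    event = json.loads(body)
    conn.execute(
        "INSERT INTO events VALUES (?, ?, ?, ?, ?)",
        (event["id"], event["user_id"], event["name"],
         json.dumps(event["properties"]), event["sent_at"]),
    )
    conn.commit()


conn = open_store()
receive(conn, json.dumps(build_event("user-1", "click", {"button": "signup"})))
receive(conn, json.dumps(build_event("user-2", "click", {"button": "signup"})))
clicks = conn.execute(
    "SELECT COUNT(*) FROM events WHERE name = 'click'"
).fetchone()[0]
print(clicks)
```

Note that `receive` writes synchronously, one event per call, which is exactly the behavior that limits this approach once traffic grows.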

Pros

  • Easy to implement and to maintain;
  • It’s not expensive;
  • Does not require specific skills; a web developer can do it.

Cons

  • Not very suitable for high availability: when too many requests arrive at the endpoint, it takes time to write everything down and the client has to wait for the answers from this service;
  • Your storage will grow quickly as events arrive; you will have to drop some data if you want to keep the system effective;
  • If you add sources, you will have to code specific integrations in your system and deal with the various formats of the different providers, which also makes your storage grow;
  • It only suits a small team of business and product people; more than two people running complex queries will bring this solution down.

This is perfect for a startup that wants to start collecting data but does not have the resources.

Use SaaS solutions

This solution relies on a SaaS product that helps in several ways with data collection and provides:

  • SDKs to quickly implement their solution in your product;
  • Platform access and hosting so you can rely on efficient infrastructure;
  • Out-of-the-box integrations to plug in other systems.

Your client uses the SDK to fill in a pre-designed event. The SDK seamlessly sends events to the platform. The platform distributes data between sources and destinations.

Pros

  • Handles infrastructure complexity for you;
  • Saves you the time of building the integrations yourself;
  • Ensures you do not lose data;
  • Gives you a more or less unified format for your events;
  • Requires only a basic engineering skill set to implement;
  • Helps you test a lot of data-based marketing / advertising solutions without really integrating them.

Cons

  • Integrations on those systems are not always complete, so you have to check whether your source / destination system is fully supported or compliant;
  • It’s difficult to test before shipping to production;
  • It’s expensive, and the price scales with your users or your separate accounts (for example one account per country);
  • It does not fix the problem of data expansion.

This is perfect for a company that is starting to have real data management and a real data life cycle. The reasoning goes: not handling data will lead the company to bad decisions, and that is critical; however, it’s complex and we don’t have time for it, so we need to be quick and have data collection done by yesterday.

Integrate a cloud platform

The last solution is to fully integrate a cloud platform. I give here an example of what could be done on AWS, as it’s the one I know best, but you can definitely find equivalent solutions on other platforms.

API Gateway is configured to receive specific events from your clients. Kinesis Firehose is set up to split the stream of events in two: one for cold data (the slow layer) on S3, and the other into Redshift, AWS’s relational data warehouse. The transformation logic lives in Lambda functions.
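As a sketch of what such a transformation function could look like, the handler below follows the record contract Firehose uses for data transformation: each incoming record carries a `recordId` and base64-encoded `data`, and each returned record carries the same `recordId`, a `result` status and the re-encoded `data`. The normalization itself (lower-casing a `name` field) and the sample payloads are assumptions for illustration only.

```python
import base64
import json


def handler(event, context):
    """Firehose data-transformation handler: normalize each record."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical normalization: trim and lower-case the event name.
            payload["name"] = payload["name"].strip().lower()
            line = json.dumps(payload) + "\n"  # one JSON document per line
            data = base64.b64encode(line.encode("utf-8")).decode("utf-8")
            output.append({"recordId": record["recordId"], "result": "Ok", "data": data})
        except (ValueError, KeyError, TypeError, AttributeError):
            # Hand the record back untouched and flag it as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}


# Local invocation with hand-made records (no AWS account needed).
sample = {"records": [
    {"recordId": "1", "data": base64.b64encode(b'{"name": " Click "}').decode("utf-8")},
    {"recordId": "2", "data": base64.b64encode(b"not json").decode("utf-8")},
]}
result = handler(sample, None)
print(result)
```

Returning failed records with `ProcessingFailed` instead of raising keeps one malformed event from blocking the whole batch.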

Pros

  • Speed and slow collection layers are handled by API Gateway, Firehose and Lambda, which are serverless;
  • This solution is global;
  • S3 is the data lake, so it stores everything, while Redshift is the data warehouse and keeps only the last x months (for example). This solves the data expansion issue;
  • Completely elastic system that provides fault tolerance, high availability, high performance and durability;
  • You mostly pay as you go for the computing part and you can handle both slow and speed layers on a predictable budget;
  • Once the data is collected inside Redshift, you can dispatch it again to other external solutions and be sure they get the freshest data.

Cons

  • You need Cloud skills and a strong data management team to design and implement data analysis in the data mart;

This solution is for mature companies that have already mastered their data.


Conclusions

Data management can be complex to handle if you do not focus on what you actually need. The main trap is wanting all the data right now. It leads you to focus on data variety without considering whether the solution is designed to bear ever higher collected volumes. Start small, change often.

Contradictory requirements may be another problem, as you will have to deal with scope wars between business and product. Data isolation and the event approach should reconcile everybody around the table.

Data ingestion, with its different stages of industrial refining, is another serious concern, as you will spend a lot of time doing and undoing things around the data. The data lake is the foundation: you should start there, whatever the means, as all the other stages are generated from the collected data.

Whether you go for free tooling, DIY, SaaS or cloud solutions, depending on your needs and maturity, you should consider two aspects: security and availability.

As your business team connects a lot of data marketing and analysis tools on top of your data warehouse, a lot of data starts to leave the scope of the company. This is a major concern, as you hear about data leaks and breaches every day.

The other point, availability, can also be a struggle for the company. Hosting a big data solution is a real job, not something that should be done on the corner of a table.

Cloud providers have understood this and offer more and more PaaS solutions to ease this hosting within your own infrastructure, which helps keep security under your control. The sooner you can move to this, the better.

Feel free to comment and start the discussion about data management.

This story is published in The Startup, Medium’s largest entrepreneurship publication followed by 270,416+ people.
