What you need to know to create great data analysis platforms

Let’s face it, every company is becoming data driven whether they want to be or not. Now, I’m not going to get into the pros and cons of that, because, well, this isn’t what this post is about. It’s about you, a software-type person, having to build a data analytics platform for company consumption. We’ll touch on the main pieces of building one: gathering the data, ingest (ETL), storage, analysis, and retrieval. At each stage I’ll discuss higher-level design decisions and general concerns. Diving deeper into each section will be saved for additional posts. I’m also not going to get into the nitty-gritty of a fully functional data analysis platform because that would take a full book and not a series of posts. :)

Gathering the Data

The first slice of the pie often seems the easiest, but it can easily turn into the hardest. That is, what data are you going to use? There is so much data out there being created every day, but you want to use data that is relevant to your problem. So, looking at this problem, there are a few questions you’re going to want to ask yourself:

  1. What problem am I trying to solve and what data do I want or need?
  2. Where can I get this data?
  3. Does it require a subscription or is it free?
  4. How reliable is the source?
  5. What format is the data in?
  6. How often is this data refreshed / added to / modified?

There are more questions you could ask, but these will help drive the next stage of the platform: the ingest (ETL) pipeline. Now, taking each of these questions in turn, let’s explore what it means.

When attempting to figure out what you want to solve, it often helps to start at the other end of the data analytics platform and determine what questions you want answered. From this, determining the data you need becomes much easier. Once you’ve decided that a few pieces of data are needed, determine where you can get them from.

If the data is coming from an internal source, is there a cost associated with accessing it? This cost could be transfer fees from a cloud provider such as AWS to your local cluster, or it could be the time it takes to write an adapter to retrieve the data. If the source is a third party, another question arises: is the data free or is there a cost? If there is a cost, is it a subscription?

Next, regardless of what the data source is, you have to ask: is it reliable? “Reliable” can have many meanings in a computer system, but here we mean: is the data free from errors, and do you trust the source of the information? If the answer to both is yes, this is good data.

After deciding the source is relevant and needed, ask in what format the data arrives. Is it a CSV file, a JSON document, or a binary file requiring a specialized library to read it? This (and how the data arrives) is going to affect your ingest pipeline.
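One way to keep the format question contained is to hide it behind a small reader so the rest of the pipeline only ever sees plain records. Here’s a minimal sketch in Python, assuming files arrive with a meaningful extension; the example file name and record shape are hypothetical.

```python
# A minimal sketch of a format-aware reader. Assumes files arrive with a
# meaningful extension; the example file name below is hypothetical.
import csv
import json
from pathlib import Path

def read_records(path: Path):
    """Return a list of dict records regardless of the on-disk format."""
    if path.suffix == ".csv":
        with path.open(newline="") as f:
            return list(csv.DictReader(f))
    if path.suffix == ".json":
        with path.open() as f:
            return json.load(f)
    # A binary or proprietary format would need its specialized library here.
    raise ValueError(f"unsupported format: {path.suffix}")

# records = read_records(Path("widgets_2021-05-01.csv"))
```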

Lastly, how often is the data updated? Is the data coming as a real-time stream or as a batch download of multiple files on a regular basis? Based on this, is the frequency acceptable to you? Does the data need to be real-time, or can you live with a bit of staleness? Like the format of the data, this will affect the ingest pipeline.

There are other questions that could be asked about the data; however, the six discussed here are a good starting point. In the next section, we’ll discuss what questions you need to ask yourself when ingesting the data into your data analysis platform.

Ingesting the Data

The ingest pipeline, also called the ETL pipeline, is the area in the platform where raw data is cleaned, normalized, possibly enriched, and then inserted into storage. Unsurprisingly, we’ll discuss the ingest pipeline in its three stages: extract, transform, and load. However, there are cross-cutting concerns that apply to each stage:

  1. How long will it take to process?
  2. Do we save the data at each stage? Raw? Partially-processed? Load-ready?
  3. Can we parallelize?
  4. What can we monitor?

The first stage of ingesting the data is receiving (extracting) it. We need to figure out how to start the pipeline with the given data. Looking back, we already discussed how we were going to get the data we want, and that directly drives the extract phase. In many cases the choice is: do we write an adapter that pulls the data on a set schedule, or do we let a human (or another process) place data into a set location that the pipeline automatically begins processing once the data is seen there? Also, do we need to include any metadata with this data? Very little to no processing is done in this stage; if anything, I would only add metadata about the raw data to aid in processing later.
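To make the drop-location option concrete, here is a minimal sketch of an extractor that watches a directory and attaches basic metadata before handing files off. The inbox path, the metadata fields, and the process callback are all hypothetical placeholders, not part of any particular framework.

```python
# A minimal sketch of a drop-location extractor. The inbox path, the metadata
# fields, and the downstream process() callback are hypothetical placeholders.
import time
from datetime import datetime, timezone
from pathlib import Path

INBOX = Path("/data/inbox")  # hypothetical drop location
seen = set()

def extract_once(process):
    """Pick up any new files and hand them to the pipeline with basic metadata."""
    for path in sorted(INBOX.glob("*")):
        if path in seen or not path.is_file():
            continue
        metadata = {
            "source_file": path.name,
            "size_bytes": path.stat().st_size,
            "received_at": datetime.now(timezone.utc).isoformat(),
        }
        process(path, metadata)  # downstream transform stage goes here
        seen.add(path)

if __name__ == "__main__":
    # A scheduled pull would look similar, but fetch from an API or SFTP
    # endpoint on a timer instead of globbing a directory.
    while True:
        extract_once(lambda p, m: print("extracted", p, m))
        time.sleep(60)
```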

The extraction phase is often very quick compared to the transformation stage, though usually slower than the loading phase. In terms of saving the data, personally, I’d save the raw data so that if your ETL pipeline changes in the future, you can reload the raw data if you so choose. For parallelization, there might not be much you can do here. In some cases it might make sense to do a little processing; for example, if you’re receiving a very large file and it can be split, break it into a series of smaller files. However, make these choices based on the entire ETL pipeline and not just the extract phase. Finally, you want to be able to monitor your pipeline, and for this you need to consider what you are concerned with. If you can’t download data from a site or the data checksum is not correct, do you care? (You should.) Additionally, keep track of the times to download or receive the raw data so you can notice if there are problems with this stage.
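As a sketch of what that monitoring might look like, the snippet below times the receive step and checks a checksum against an expected value before anything moves downstream. The fetch callable and the logging destination are stand-ins for whatever download mechanism and metrics system you actually use.

```python
# A minimal sketch of extract-stage monitoring: time the receive step and
# verify a checksum. fetch() and the expected digest are hypothetical stand-ins.
import hashlib
import logging
import time
from pathlib import Path

log = logging.getLogger("ingest.extract")

def receive(fetch, dest: Path, expected_sha256: str) -> bool:
    start = time.monotonic()
    fetch(dest)  # hypothetical callable that downloads the raw file to dest
    log.info("download of %s took %.2fs", dest.name, time.monotonic() - start)
    digest = hashlib.sha256(dest.read_bytes()).hexdigest()
    if digest != expected_sha256:
        log.error("checksum mismatch for %s", dest.name)  # you should care
        return False
    return True
```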

The next stage is transformation. Here, it’s assumed we’ve got data staged and ready to be processed and refined into data that will load into our system. This stage could actually be many stages linked together to clean and process the data. In many cases the transformation could include steps such as: normalize, sort, remove duplicates, enrich with given metadata, and convert to CSV.
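As an illustration, a transformation built from small, composable steps might look like the sketch below. The field names (widget_id, factory, ts) and the enrichment lookup are hypothetical; the real rules would come from your own data.

```python
# A minimal sketch of chained transform steps: normalize, de-duplicate,
# enrich, and write load-ready CSV. Field names are hypothetical.
import csv
from pathlib import Path

def normalize(records):
    for r in records:
        yield {**r, "widget_id": r["widget_id"].strip().upper()}

def dedupe(records):
    seen = set()
    for r in records:
        key = (r["widget_id"], r["ts"])
        if key not in seen:
            seen.add(key)
            yield r

def enrich(records, factory_regions):
    for r in records:
        yield {**r, "factory_region": factory_regions.get(r["factory"], "unknown")}

def to_csv(records, out_path: Path):
    rows = list(records)
    if not rows:
        return
    with out_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

# to_csv(enrich(dedupe(normalize(raw_records)), {"F1": "EU"}), Path("load_ready.csv"))
```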

The transformation stage can often take a long time; however, it is also the best place for parallelization! In very few cases can the data not be transformed in parallel, so this is the time to get your speed gains. Additionally, depending on the processing you are doing, it might make sense to save partially transformed data so it can be used for other refinements in your pipeline. For instance, if you are applying some machine learning techniques, partially transformed data can be used as input to those functions for refinement. Monitoring the individual steps in the pipeline can also give you insight into areas for potential speed-up or alert you when there is an issue.
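When each staged file can be transformed independently, parallelization can be as simple as a process pool over the files, as in this sketch; the staging directory and the body of transform_file are hypothetical.

```python
# A minimal sketch of parallelizing the transform stage across independent
# files with a process pool. The staging directory is a hypothetical path.
from multiprocessing import Pool
from pathlib import Path

def transform_file(path: Path) -> Path:
    out = path.with_suffix(".load_ready.csv")
    # ...normalize / dedupe / enrich / write out, as sketched above...
    return out

if __name__ == "__main__":
    staged = sorted(Path("/data/staged").glob("*.csv"))
    with Pool() as pool:
        load_ready = pool.map(transform_file, staged)
    print(f"transformed {len(load_ready)} files")
```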

The final stage is load where the transformed data is loaded (unsurprisingly) into the final data store (or data stores). The work here is often very simple and requires pushing the converted data into the storage as quickly as possible. In many cases, the data stores have specific routines and methods to handle bulk loading of data.
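For example, if the store happened to be PostgreSQL, the bulk path is the COPY command; below is a hedged sketch using the psycopg2 driver. The table, columns, and connection string are hypothetical, and other stores have their own equivalents (LOAD DATA, sstable loaders, and so on).

```python
# A minimal sketch of a bulk load, assuming a PostgreSQL store and the
# psycopg2 driver. Table name, columns, and DSN are hypothetical.
import psycopg2

def bulk_load(csv_path: str, dsn: str = "dbname=widgets") -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur, open(csv_path) as f:
        cur.copy_expert(
            "COPY widget_events (widget_id, factory, ts, status) "
            "FROM STDIN WITH (FORMAT csv, HEADER true)",
            f,
        )  # the connection context manager commits on success
```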

As you’ve probably guessed the pattern by now, we’ll talk about the cross-cutting concerns of the load stage. This should be one of the quickest stages to run in your pipeline: the data is already formatted for loading, and the data stores should have an efficient loading technique. The load-ready data is something you could keep around if you want to easily reload it, but in general I would not store it, since you can simply rerun the pipeline to regenerate it. Loading should be a highly parallelizable operation as well. Metrics-wise, measure what you can: time to load, any errors found while loading, and the like. And now that we have data being loaded into the data stores, let’s talk about them.

Storage

Ahh, storage, you’ve done all this work to get your data here: the center of the system and the source of truth for all your work. The choices here can affect the ingest pipeline, analysis, and retrieval. The main questions (ignoring cost-based ones) that have to be asked here are:

  1. What are your analysis use cases?
  2. What are your retrieval use cases?
  3. How is the data going to be shaped (e.g. what is the data model)?
  4. How much data are you going to store?
  5. What properties are needed of the data storage (e.g. handling massive amounts of reads vs. writes)?

You might have noticed I arranged the questions to mention the analysis and retrieval use cases first. This is no accident. Understanding what you want to do with the data will give you better insight into how you should store it and how much you intend to store. So, let’s step into the two main use case questions: analysis and retrieval. The whole point of building a data analytics platform is to be able to ask questions about the data and gain insight into it. With that in mind, much like when you decided what information to ingest, think about what you want to ask of it. These scenarios might be:

  1. How does the number of widgets moving through the build process change based on time of day over the past year?
  2. How many widgets failed the quality check on a given date?
  3. At what rate were widgets shipped out to customers over eight hours last Friday?

In these three cases we’re looking at a few constraints: widgets, date and time range, and number of widgets total and failed. When thinking about what data you want to retrieve, it could be:

  1. Give me a list of destinations for widgets produced on Monday.
  2. Give me the total number of factories producing widget X one month ago.

Again, we’re looking at a few choice pieces of information: widgets, shipping destinations, factories, and time/date.

With these use cases, you can tell that the data important to your storage system is going to contain: widgets, destinations, factories, and timestamps. From here you can figure out what storage technologies are needed and what should be stored. In this case: is time-series data what you need? Or do you only care about aggregate values, in which case a standard RDBMS will work? Next, how much historical data do you need? Based on the queries, it appears about a year’s worth of historical data is required. From that, how much data is that going to be size-wise? If it’s in an RDBMS, how many rows and columns? What data types? If it’s in a storage system like Cassandra, again, what keys will you be using, and what data types? What are the rules for dropping data? Finally, given the use cases and what is being stored, what are the additional requirements on the data storage system? Do you need read replicas or is a single database enough? Do you need a large Cassandra ring? Again, there are many questions you can ask, but only you can truly decide what matters to you.
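To make the “how is the data shaped” question concrete, here is one possible shape for the widget example, sketched with the standard-library sqlite3 module purely so it’s self-contained. The table and column names are hypothetical, and the real choice of store (PostgreSQL, Cassandra, a time-series database) depends on the answers above.

```python
# One hypothetical shape for the widget data, using sqlite3 only so the
# sketch is self-contained; a production store would be chosen from the
# questions above (RDBMS, Cassandra, time-series database, ...).
import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS widget_events (
    widget_id   TEXT NOT NULL,
    factory     TEXT NOT NULL,
    destination TEXT,
    status      TEXT NOT NULL,  -- e.g. 'built', 'failed_qc', 'shipped'
    ts          TEXT NOT NULL   -- ISO-8601 timestamp
);
CREATE INDEX IF NOT EXISTS idx_widget_events_ts ON widget_events (ts);
""")
conn.close()
```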

Analysis

With data now stored in your system, it is time to do something with it. At this point, you might want to interface with a system such as Tableau, use Hive to query the data, perform a batch operation on the data using Hadoop, or possibly do all of the above. Additionally, there might be some code that must be written to allow external tools to gain access to the data. So, on this side the concerns are:

  1. Is the analysis ad hoc or a structured report?
  2. If the analysis is through an external tool, do integration layers need to be written?
  3. What is the process to take an ad hoc analysis and turn it into a structured report?

Now, the analysis. If the goal is to do ad hoc reporting, does the data need to be copied out of the main data store and into a temporary data store for analysis? If so, how is that going to be done? And how long will it take?

If you intend to do your analysis through an external tool, does the data store you use provide the correct interface for access? For instance, do you need a SQL driver on top of a key-value data store? If so, does it even make sense to store the data in a key-value store instead of a SQL store like PostgreSQL, MySQL, or Oracle?

Lastly, once an ad hoc analysis is found to be useful and you want to turn it into a structured report, do you have a process for moving from ad hoc to report? If so, what is it, and is it reliable?

Retrieval

Finally, the last aspect of data analytics platforms: retrieval of data for something other than analysis. This includes exporting data to other systems and powering things like dashboards. Some concerns here that go beyond the analysis stage:

  1. How much data do you expect to be accessed at once?
  2. If it’s for a dashboard or display, what data do you need?
  3. Is this data something you can pre-compute or is it going to be part of a complex query?

Since retrieval is really a follow-on to the analysis (the use cases overlap quite heavily), we’re focusing on the additional details that retrieval might have. First, how much data do you expect the average retrieval to be? Is it going to be a bulk export, or a more pointed query for only a small section of the data? If you’re using this data for a dashboard, should it be calculated beforehand and placed in an additional table that is easily queryable and doesn’t require a large amount of processing? Or is it something very simple, such as calculating a sum across a few rows? Thinking about these use cases will bring implementations into focus and allow you to make the right decision.
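As a sketch of the pre-compute option, the snippet below rolls daily counts into a small summary table on a schedule so the dashboard query stays cheap. It reuses the hypothetical widget_events table sketched in the storage section.

```python
# A minimal sketch of pre-computing dashboard data: roll daily counts into a
# summary table so the dashboard reads a tiny, cheap table. Reuses the
# hypothetical widget_events schema from the storage section.
import sqlite3

def refresh_daily_summary(db_path: str = "warehouse.db") -> None:
    with sqlite3.connect(db_path) as conn:
        conn.executescript("""
        CREATE TABLE IF NOT EXISTS daily_widget_counts (
            day    TEXT PRIMARY KEY,
            built  INTEGER,
            failed INTEGER
        );
        DELETE FROM daily_widget_counts;
        INSERT INTO daily_widget_counts (day, built, failed)
        SELECT substr(ts, 1, 10),
               SUM(status = 'built'),
               SUM(status = 'failed_qc')
        FROM widget_events
        GROUP BY substr(ts, 1, 10);
        """)

# Run this on a schedule (cron, your pipeline's scheduler, etc.) so the
# dashboard only ever queries daily_widget_counts.
```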

Conclusion

First, thanks for making it this far! It was a long read and much more than I was expecting to write when I started out.

I hope you’ve found the breakdown of a data analysis system worthwhile and the questions posed at each stage helpful. Each stage affects the others, so it’s hard to truly separate them. In my personal opinion, decisions made on the fringes (data ingest, analysis, and retrieval) will not adversely affect the overall system if they need to be revisited later. Conversely, decisions at the center (data storage) can often be shielded from decisions at the fringes, since the requirements oftentimes deal with converting data from one form to the next.