Prepare the data… before analyzing it!

Petru Buzulan
Published in Analytics Vidhya
8 min read · Jun 23, 2020

This post was born from the idea of giving guidelines on how to prepare data before analyzing it. Believe me, it is not a very common practice: many of you never even consider structuring the data before using it, because it seems useless and a waste of time. Trust me, doing it will bring you many benefits. During a project, for example, requests for new metrics, new analyses and so on, not initially foreseen, may arise mid-task. How would you handle such a scenario? It would be almost impossible if you don’t first create a clean and generic dataset, if you do not have an optimal flow of the data. There are no fixed steps or procedures to convert unstructured data into structured data, since every dataset is different, but there are some good approaches to reach that goal. In this post we will not talk about how data is collected, used or processed; instead, we will focus on the best practices to follow to make unstructured data analyzable. In detail, we will cover the following points:

  1. Single entry point
  2. Clean up the data
  3. Analyze the data
  4. Aggregations and reports

1. Single entry point

There are times when a dataset is already available to be analyzed, maybe from a NoSQL database, from a data source or even from a file (CSV/Excel), but very often you will instead have different entry points, both real-time and not, from different sources. RabbitMQ, Kafka, files and many more can be used to collect data, but what matters is that you find a way to make all these inputs converge in a single point (it doesn’t matter if you use a tool, a database or a file) before the analysis. Let’s move on with some examples:

Wrong (image 1)

In this example (image 1), we do not have a single entry point for our analysis part, which gives us no certainty that our data will arrive as desired. Even more so when these blocks are not managed by us but are microservices of other components: you can’t afford (and it is sometimes useless) to build a wrapper to standardize the data for each input source.

Right (image 2)

In this second example (image 2), instead, we have a single entry point: a final wrapper before our data analysis tool. This wrapper should be part of our internal development; we must have control of it.

Please note, the first example (image 1) is not completely wrong; in fact, our different sources can have their own wrappers, but the final one cannot be missing. In that case, we should have put one last wrapper to aggregate the data before analyzing it. All these wrappers are needed to create a specific and desired model.
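To make the idea concrete, here is a minimal sketch of such a final wrapper in Python. The sources, the message layouts and the `Event` model are all hypothetical examples, not a prescribed format; the only point is that every input, whatever its origin, is mapped to one common model before it reaches the analysis step.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict


@dataclass
class Event:
    """The common model every source is mapped to before analysis (hypothetical)."""
    source: str
    timestamp: datetime
    payload: Dict[str, Any]


def from_kafka(message: Dict[str, Any]) -> Event:
    # Assumed Kafka message layout: {"ts": "<ISO date>", "data": {...}}
    return Event("kafka", datetime.fromisoformat(message["ts"]), message["data"])


def from_csv_row(row: Dict[str, str]) -> Event:
    # Assumed CSV layout: a "date" column plus arbitrary other fields
    return Event("csv", datetime.fromisoformat(row.pop("date")), row)


def entry_point(raw_items):
    """Single entry point: whatever the source, only Event objects come out."""
    for source, item in raw_items:
        if source == "kafka":
            yield from_kafka(item)
        elif source == "csv":
            yield from_csv_row(item)
        else:
            raise ValueError(f"Unknown source: {source}")
```

Any new source only needs its own small mapping function; the analysis side keeps seeing a single, stable model.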

2. Clean up the data

All is set, we have all the data we need in one place; what now? Well, now the data cleaning phase must begin. Please note, this should be the only stage where we clean up the data! Modifying the dataset during the next steps is a highly discouraged practice. In particular, in this phase we should:

  1. Clean up the useless data
  2. Make sure all the datatypes are correct
  3. Normalize whenever possible
  4. Create the desired calculations and derived fields
  5. From unstructured to known elements

Cleaning up the useless data

This stage can be divided into two steps:

  1. Check for duplications and errors
  2. Remove all the unnecessary data

In the first step we need to make sure there are no errors or, in the worst case, duplicated data. Finding duplicates is very important because they could give us false-positive results during our analyses (it depends on the type of data, though: imagine analyzing logs, in that case we should not remove duplicates). In the second step we need to discard all the data that is not and won’t be needed for future analysis. This is a job with a lot of responsibility, considering there may be developments that will require data we discarded previously. It’s not easy to understand at first glance what kind of data we should discard, so be sure to have a general overview and double check before taking a decision. But it has to be done: this phase will save us a lot of disk space (database, files, etc…), will reduce further confusion and will save us from wasting time on data that brings no benefit.
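As a small illustration, here is how these two steps could look with pandas. The file name and the column names are placeholders for whatever your dataset actually contains.

```python
import pandas as pd

# Hypothetical raw export; file and column names are placeholders.
df = pd.read_csv("raw_events.csv")

# 1. Check for duplications and errors
print(df.duplicated().sum())          # how many fully duplicated rows exist
df = df.drop_duplicates()             # keep only the first occurrence
df = df.dropna(subset=["event_id"])   # rows without an id are unusable here

# 2. Remove all the unnecessary data
df = df.drop(columns=["debug_flag", "internal_note"], errors="ignore")
```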

Make sure all the datatypes are correct

This means that we need to check whether all the data that will be used is in a usable format. Remember that statistical methods can only be used with certain data types; therefore, we have to make sure that dates are dates, numbers are numbers, geospatial points are in a geo-data format and so on. Think about a column containing dates: as you well know, properly formatted dates are easier to process, and almost all of the tools/libraries you will use have operations on date-type fields. Another reason to make sure every field has its correct type is closely connected to our database (or files). Imagine we want to save our dataset into a database: in some databases each data type is optimised for storage space, so strings should be strings, chars should be chars and so on. Even analysis in Excel will be simpler if the types are correct. Also, data types tell us how to visualize the data, so do not neglect this phase.
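In pandas, for example, enforcing types could look like the sketch below; again, the file and column names are made up for illustration.

```python
import pandas as pd

df = pd.read_csv("clean_events.csv")   # hypothetical output of the previous step

# Dates should be dates, numbers should be numbers
df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Low-cardinality strings can be stored as categories to save space
df["country"] = df["country"].astype("category")

print(df.dtypes)   # verify every column ended up with the expected type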

Normalize whenever possible, always! (ref)

Normalization is a technique often applied as part of data preparation. The goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values.

As this statement explains, it is not always possible to normalize, but when it is possible, it will help us better understand the data. If you want to read more about normalization, I found this post (ref) really thorough.
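For instance, the common min-max scaling maps a numeric column onto the 0–1 range; a tiny sketch with toy numbers:

```python
import pandas as pd

df = pd.DataFrame({"revenue": [120.0, 950.0, 430.0, 77.0]})   # toy numbers

# Min-max scaling: every value ends up between 0 and 1,
# without distorting the differences in the ranges of values
col = df["revenue"]
df["revenue_norm"] = (col - col.min()) / (col.max() - col.min())
```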

Create the desired calculations and derived fields

In this last but not least stage we should “create” the data needed for our analysis. It is better to have all the calculated fields ready in advance rather than compute them on the fly. Think about a column holding dates as timestamps (therefore as numbers): you should not convert them to dates during the analysis. Another example might be needing to perform an operation on two or more columns (imagine a simple average, subtraction or addition), or, even more trivially, converting birth dates into ages. Once you perform these operations, you will always have them available; otherwise, you will have to repeat these calculations every time, which is not very convenient. So remember: calculate, add, remove and edit in this phase, not later!
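Both examples from this paragraph, the timestamp conversion and the birth-date-to-age derivation, could be precomputed like this (columns and values are invented for the sketch):

```python
import pandas as pd

df = pd.DataFrame({
    "event_ts": [1592900000, 1592986400],        # timestamps in seconds
    "birth_date": ["1990-05-01", "1984-11-23"],  # birth dates as strings
})

# Convert the numeric timestamp to a proper datetime once, not during the analysis
df["event_date"] = pd.to_datetime(df["event_ts"], unit="s")

# Derive the age from the birth date so it is already available later
today = pd.Timestamp.today()
df["age"] = (today - pd.to_datetime(df["birth_date"])).dt.days // 365
```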

From unstructured to known elements

This is the phase where the Unstructured -> Semi-Structured -> Structured conversion is finally done; in fact, at the end of this phase we should have structured data ready to be processed. It does not matter where we store it: relational databases, NoSQL databases and files are all valid options, it all depends on the type of data you have in your hands. To conclude, and to give a clear view of the data process from start to finish, this image explains the difference between the various phases of the data, from when it arrives to when we store it (ref):

3. Analyze the data

Finally, we are at the analytics part; this is where a data analyst shines. Tools have a central role here, because fortunately the industry today offers a variety of ready-to-use tools that allow us to analyze data in great detail. Many tools are integrated with the most used databases, others aren’t, but each does its job. Just to mention some alternatives to the much trusted (for some) Excel:

  • Power BI (provided by Microsoft)
  • Qlik Sense
  • Tableau

… or less professional, but still great (and/or free) analysis tools:

  • Kibana
  • Grafana
  • Knowi
  • Python (libraries like matplotlib, plotly, seaborn)
  • RStudio

The difference between the first group of tools and the second is huge if we are looking for something more professional: detailed charts, aesthetically pleasing dashboards and the possibility to create reports (we will talk about reports in chapter 4). On the contrary, sometimes we only need internal business analyses (that allow us to understand the progress of certain behaviors), or simply log analysis; this is where the second group is the more reasonable choice. As far as the analysis of the data itself is concerned, there is not much to say: it is up to the data analyst to choose what type of analysis to do, what data to compare, etc… therefore, you cannot speak of analysis if you do not know your data first. I prefer not to go into this, considering the post is not focused on data analysis itself, but if you want to know more about data analysis, take a read about Types of Data Analytics here (ref).
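If you go the Python route, even a few lines are enough to get a first look at the prepared dataset. This is only a toy sketch with made-up numbers, assuming the data has already been cleaned as described above:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical prepared dataset: one row per day with a numeric metric
df = pd.DataFrame({
    "day": pd.date_range("2020-01-01", periods=30, freq="D"),
    "orders": range(30),
})

df.plot(x="day", y="orders", kind="line", title="Orders per day")
plt.tight_layout()
plt.show()
```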

4. Aggregations and reports

We talked about structured data and the importance of saving our dataset at the end of the process. But imagine you have data that grows every day: you cannot store everything, otherwise sooner or later you will have serious problems with storage, hardware, etc… A good practice in this case is to divide the data into smaller datasets by performing categorization operations (per year, sector or element, for example). Always remember, classifying and splitting the data will be beneficial for future reference, in addition to having the data stored in the right “drawers”. It rarely happens that you need the entire dataset again; therefore, running analyses on narrower datasets will take less time and fewer resources. Another factor is reports: they are a key element in the business world. There are various types, from the most detailed to the simplest, from the most important (consider that many reports also have a legal value) to the least, and so on… but what matters is to have all these reports divided into categories, dates and customers if possible. Reports are another method of analysis; for example, going through past reports can bring to light losses, improvements or simply changes. Before concluding, I want to remind you that reports and analysis are two different concepts.
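A simple way to split by year, sticking with the pandas example (file and column names are again hypothetical), could look like this:

```python
import pandas as pd

df = pd.read_parquet("structured_events.parquet")   # hypothetical structured dataset
df["year"] = pd.to_datetime(df["event_date"]).dt.year

# Split the big dataset into one smaller file per year
for year, chunk in df.groupby("year"):
    chunk.to_parquet(f"events_{year}.parquet", index=False)
```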

Reporting: The process of organizing data into informational summaries in order to monitor how different areas of a business are performing.

Analysis: The process of exploring data and reports in order to extract meaningful insights, which can be used to better understand and improve business performance.

Conclusions

Understanding the correct steps to follow is a big plus if you are in the data world, especially if you are the one who makes the data available for a data analyst (not always the data engineer). I tried to touch on and explain the crucial points of the whole process, from unstructured to structured and analyzable data. I focused more on the architecture and on preparing the data for future analysis than on the analysis itself. To discourage you a little, I will tell you that you will never have a rigid scheme to follow; unfortunately or fortunately (it depends on you), we have to adapt from task to task, every time. I repeat: what you need to know are the basic concepts.

Follow me on:

GitHub @ github.com/pbuzulan

Linkedin @ linkedin.com/in/petru-buzulan

Thanks for reading!
