Data Visualization: Brief Overview & Best Practices

Why Visualize?

We are living in an age of data explosion: the world accumulates terabytes and petabytes of data every day, 7 days a week, 365 days a year. The processing power required to work through it is abundant, as Moore's law predicted, and we have developed advanced machine learning algorithms to find patterns and relationships in it. But even though our algorithms have evolved, our own cognition has not. We cannot merely glance at rows and columns of a spreadsheet peppered with numbers and derive meaning from them. Vision is the strongest of our senses; it is why movies provide a more immersive experience than books, and why we believe what we see more readily than what we have read or heard somewhere. So, when we are dealing with massive amounts of data, it becomes prudent to use tools that visualize the data effectively.

Data visualization is a representation of data in which the trend, pattern, or relationship we want to focus on is communicated clearly to the intended recipient(s). Multiple relationships may exist among the columns of a spreadsheet or among the keys of a key-value database. Depending on what information needs to be derived from the data, different forms of representation may be appropriate.

We have all been touched by data visualization in some way or another. The college presentation you created to show the flowchart of a process, the line chart on the news showing that air pollution has risen over the past decade, or the pie chart someone used at a company meeting to show which categories contribute to a certain metric and by what percentage: all of these are examples of data visualization. It can be as simple as plotting a linear regression over a data distribution or as complex as representing the wind speed pattern over a region of interest. But what happens when we are dealing with massive amounts of data? Visualizing it is no longer just a matter of "creating a chart"; that would only be true if the data contained no anomalies or errors, no "noise", if you will. That is rarely the case. So the task of creating an effective visualization involves all sorts of assumptions about the data itself, and care must be taken that the information we are trying to convey never gets drowned in the mesh of lines and colors within our visualization.

Over the course of my upcoming posts on data visualization, I will go through the components of understanding and creating an effective visualization. We will take up topics one at a time: data preparation, data types, types of charts and graphs, the elements of an effective visualization, and finally some of the common mistakes to avoid while creating your own. By the end of the series, you should be able to create a polished visualization as well as evaluate visualizations created by others.

Data Preparation

In my last article, I discussed the need for effective data visualization and listed some of the steps required to create a polished output. As with most problems, the first, preparatory part of the process is the most cumbersome. It is where we spend most of our time, and it is the most taxing activity, especially when dealing with huge amounts of data. Data preparation is essential because the primary assumption when working with data is that it is clean. Incorrect and noisy data need to be trimmed and repaired to satisfy schema validations and logical requirements if we want our visualizations to give us meaningful insight. Data preparation generally involves handling the following anomalies:

Missing data

How do you create a chart out of something that is not there? Simply put, if a 'Student' data set is missing values in the 'Gender' field for some pupils, you cannot use it to accurately display the gender split in a class. Data might be missing for a variety of reasons. It might not have been available at all when the data set was being compiled, or some values may have been corrupted in transit, resulting in a 'Null' being stored in place of the illegible fields. Missing data creates a bias when the visualization concerns the spectrum of values a field takes: the data set might be missing a possible value or an edge case altogether. For example, while measuring temperature variations at a geographic location, a sensor failing at extremely low temperatures could cause the lowest values to be missed, resulting in an incorrect 'lowest temperature' reading for that location.

There is only so much that can be done when the data itself isn't present. The most common strategy is to ignore the whole record where a 'crucial' field is empty. In less critical cases, particularly in a large data set with very few missing values, field-level calculations may proceed on the assumption that the few absent observations won't skew the visualization significantly. In some cases, an attribute mean is used to fill in Null values so that the records can still be used and calculations on those fields do not raise an exception.
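As a rough sketch of these two strategies in Python with pandas (the toy 'Student' records below are hypothetical):

```python
import pandas as pd
import numpy as np

# Hypothetical 'Student' records with some missing values
students = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Meera", "John"],
    "Gender": ["Female", None, "Male", None],
    "Score": [78, 85, np.nan, 90],
})

# Strategy 1: ignore records where a crucial field is empty
gender_split = students.dropna(subset=["Gender"])["Gender"].value_counts()

# Strategy 2: fill a numeric field with its attribute mean so that
# later calculations do not fail or silently return NaN
students["Score"] = students["Score"].fillna(students["Score"].mean())
```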

Errors & Outliers

Errors in values are rather easy to detect because they are bound by business logic pertaining to that field. For example, a mobile number field in India is essentially a 10-digit number, and a percentile field can't have a value above 100. Errors can be found by sweeping through the data and validating each field's values; for large data sets, this is done by automated scripts. When an error is found, the value may be edited to fall within the allowed limits, or, if necessary, the erroneous records are ignored while visualizing the data.
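A minimal validation sweep might look like the following sketch in Python with pandas (the field names and rules are illustrative, not from any particular system):

```python
import pandas as pd

records = pd.DataFrame({
    "Mobile": ["9876543210", "12345", "9123456780"],
    "Percentile": [88.5, 104.0, 99.9],
})

# Flag values that violate the business rules for each field
bad_mobile = ~records["Mobile"].str.fullmatch(r"\d{10}")
bad_percentile = ~records["Percentile"].between(0, 100)

# Either repair the flagged values or drop the offending rows before visualizing
clean = records[~(bad_mobile | bad_percentile)]
```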

Outliers are values that don't 'belong' to the general pattern in a field. For example, in a 'Student' dataset, the 'Age' values in a particular class may range between 12 and 15; a student record with an 'Age' of 24 would be an outlier. The data may be valid, and there may be a reason a relatively older student is enrolled in that class, but the record skews the visualization nevertheless. So, unless the focus is on detecting such cases, this record should be ignored. Keep in mind that an outlier may be a valid record or the result of a typo; treating it as erroneous or valid is a judgment call for the data analyst.
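One common way to flag such values is the interquartile-range (IQR) rule; a small sketch in Python with pandas, using hypothetical ages like those above:

```python
import pandas as pd

ages = pd.Series([12, 13, 13, 14, 15, 12, 24], name="Age")

# IQR rule: flag values far outside the bulk of the distribution
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
outliers = ages[(ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)]

# Here the 24-year-old record is flagged; whether to drop it or keep it
# is the analyst's call
```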

Inconsistent Values

This problem generally originates while merging two or more databases. A particular field may take different valid values in the two databases, and when we merge them we are stuck with correct yet inconsistent data. For example, in one database the 'Gender' field may be filled with either 'Male' or 'Female'; in another, the same field may hold 'M' and 'F'. When we merge the two sources, the common field (Gender, in this case) ends up with inconsistent values that imply the same logical meaning ('Male' and 'M'). In databases where validations are not enforced, the same problem can arise from inconsistent input, even when we are not merging data from two different sources.

The only precaution that can be taken here is to pay attention to fields that take values from a predefined set. Various visual data analysis tools can check consistency in one way or another. When this anomaly is found, there must be a 'transformation' step that maps each variant onto its valid counterpart, so that a single value exists for a single logical meaning.
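Such a transformation can be as small as a lookup table applied to the merged column; a sketch in Python with pandas (the value set is hypothetical):

```python
import pandas as pd

merged = pd.DataFrame({"Gender": ["Male", "F", "M", "Female", "F"]})

# Map every variant onto a single canonical value
canonical = {"M": "Male", "F": "Female", "Male": "Male", "Female": "Female"}
merged["Gender"] = merged["Gender"].map(canonical)

# Now 'Male' and 'M' count as one category in any chart built on this column
```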

Format Inconsistency

Consider a field that stores 'Employee ID' in a company's database. This field is generally a mix of letters and numbers, such as 'SW-12345' and 'HR-1202', but it might also be a purely numeric value, such as '12345'. It is good practice to store IDs as text because these fields are not meant for algebraic operations. But someone may, for some reason, decide to store the IDs as numbers, perhaps because they are assigned incrementally. Then, when merging data from two different sources, especially when creating joins in databases, the fields end up mismatched: the text '12345' won't match the number 12345, even though they look the same to the naked eye.

It is best to avoid such inconsistencies by following best practices, but if this does occur, the format must be corrected, either with a program or with the type-casting functionality available in most data wrangling tools. We will discuss data types in my next post to understand this topic in detail. The bottom line is that format inconsistencies must be avoided, as they have a long-lasting effect on your visualization.
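A typical fix is to cast both sides of the join to text before merging; a sketch in Python with pandas (the table and column names are hypothetical):

```python
import pandas as pd

# IDs stored as numbers in one source and as text in another
employees = pd.DataFrame({"EmployeeID": [12345, 67890]})
payroll = pd.DataFrame({"EmployeeID": ["12345", "67890"], "Salary": [50000, 60000]})

# Cast to text before joining, since IDs are never used arithmetically
employees["EmployeeID"] = employees["EmployeeID"].astype(str)
joined = employees.merge(payroll, on="EmployeeID", how="left")
```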

Data Redundancy

While joining data from two different sources, there may be a field that is common to both datasets but carries different labels. For example, a field storing an employee's phone number may be labelled 'Phone' in one and 'Contact_Number' in the other. Although the values stored in the two fields may be the same, the different labels mean that when we create a join, we end up with the same information in two columns of the joined dataset. This is called data redundancy. There is another kind of redundancy where, say, a 'Bill Amount' field is simply the sum of the products of 'Item Price' and 'Item Quantity'. Instead of copying such data across a large dataset, we might communicate the formula and let the destination machine regenerate the values at its end. This saves querying (read) time but increases computational complexity. Generally, if the data visualization solution relies on a database connected over the Internet, querying and refreshing a large dataset become the bottleneck.

In environments where big data needs to be queried and refreshed to update a visualization, redundant fields must be identified (there are mathematical correlation techniques to do just that), and a tactic should be chosen that yields the best system throughput. The implementation varies from situation to situation. A simple solution in the phone-number example above is to ignore one of the duplicate fields and use the other everywhere.
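A rough sketch of spotting a derivable column via correlation in Python with pandas (the order data is made up, and a simple per-row price × quantity stands in for the bill total):

```python
import pandas as pd

orders = pd.DataFrame({
    "Item Price": [10.0, 25.0, 7.5],
    "Item Quantity": [3, 1, 4],
    "Bill Amount": [30.0, 25.0, 30.0],
})

# A correlation close to 1.0 hints that the stored column is derivable
derived = orders["Item Price"] * orders["Item Quantity"]
print(derived.corr(orders["Bill Amount"]))

# One tactic: drop the stored column and recompute it only when needed
slim = orders.drop(columns=["Bill Amount"])
```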

Unbalanced Data set

Data sets are prone to being skewed, or unbalanced, toward one case over others. For example, in a student database the majority of students may be male, or most houses in a certain locality may be priced above a certain threshold. When the objective of the analysis is to build a general model, to understand causality within the data set, this becomes a problem. If we want to derive what drives the price of a house in a city but target a data set with records from only one locality, we are unlikely to get a good model or a good citywide representation of features versus price.

To find out whether a dataset is unbalanced, one may use histograms, means, medians, standard deviations and so on. It is also often a good idea to simply plot the data to visualize the relationship between two fields. When building prediction models from a very large dataset, one should consider an equal number of records from each category; in the house-pricing example above, an equal number of records per locality should be used for an unbiased citywide model, as sketched below.
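A quick balance check and per-category sampling might look like this sketch in Python with pandas (localities and prices are hypothetical):

```python
import pandas as pd

houses = pd.DataFrame({
    "Locality": ["North", "North", "North", "North", "South", "East"],
    "Price": [520, 480, 610, 550, 320, 410],
})

# Check how skewed the dataset is toward one locality
print(houses["Locality"].value_counts())

# One remedy: sample an equal number of records per category
n = houses["Locality"].value_counts().min()
balanced = houses.groupby("Locality").sample(n=n, random_state=0)
```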

These are some of the essential points to keep in mind when preprocessing data before proceeding with visualization. There are more advanced and mature mathematical models for detecting and removing some of the anomalies discussed here, and they are worth studying if you are a professional data analyst. A host of statistical tests are available, and tools such as R and MATLAB make use of them.

