
In my previous publication, I introduced you to my set of tools for data science, along with some tips to install them. Now that we are all set, I want to introduce you to some basics I learned from my different trials and numerous mistakes. I will go over the dos and don'ts of data science. I am no expert, but these are fundamentals that will help you better understand the process of establishing a solid dataset and finding significant results, rather than spending numerous hours explaining inconclusive results.
I have been there, and I don't advise it.
What I should not do
Analyzing data without a goal in mind
Do you ever go to an interview without a goal in mind? Did you get the job when you didn't set a goal? I can tell you that if you answered yes to both questions, you are really lucky or a real badass. Data science calls for the same approach. You cannot dive into your dataset and simply apply every data science technique you know hoping to find some results. This approach is too naive and has two main drawbacks:
- Inconclusive results: You have a high probability of finding results that are irrelevant or inconclusive to your analysis. You cannot even draw a conclusion, since you never raised a hypothesis to frame your observations.
- Time consuming: For a dataset with a small number of features and little need for normalization and cleaning, you will surely be fine. But as you may have guessed, most of the time you will deal with multivariate sets of features whose mere listing would give headaches to the bravest. If no planning or scoping is established, be sure to lose a lot of precious hours mingling with dirt you wanted to turn into gold, but that remained dirt.
Not cleaning your messy data

No data is perfectly clean until you clean it. Consider data as thrift shop clothes. They sure look good for their price (kaggle.com has many datasets for free), but they need to be cleaned before being used. When talking about cleaning, I refer to removing NaN values, empty cells, incoherent data, incomplete data and data that will skew your analysis. That is a lot to remove or replace, but it definitely improves the overall quality of your analysis. Refusing to clean your dataset leads to two problems (a quick sketch follows the list below):
- NaN and empty cells simply cannot be processed and will result in exceptions during your processing. You do not want to waste precious minutes because of an empty cell.
- Incoherent and incomplete data will always skew your analysis. If you are worthy of the glorious house of text mining, you will become familiar with the disdain we have for sentences that do not make sense and sentences that are either too long or too short. They generally skew your analysis and produce noise that will affect any clustering or regression analysis.
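To make this concrete, here is a minimal pandas sketch (the DataFrame and its sentence column are hypothetical, just for illustration): we drop NaN values and empty cells so processing does not crash, and we filter out sentences that are too short or too long so they do not add noise.
# Hypothetical example: df is a pandas DataFrame with a "sentence" text column
import pandas as pd

df = pd.DataFrame({"sentence": ["A clear and useful sentence for the analysis.", None, "", "Too short", "word " * 200]})
df = df.dropna()                              # NaN cells would raise exceptions later on
df = df[df["sentence"].str.strip() != ""]     # empty cells carry no information
lengths = df["sentence"].str.len()
df = df[(lengths >= 20) & (lengths <= 300)]   # arbitrary bounds for this sketch
print(df)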
Diving into data modeling

Whenever we receive a new dataset, we absolutely want those grandiose plots and curves to plaster our screen with their mighty colors. But halt there: not all curves are pleasant to everyone's eyes, and some are pure abominations. Plotting your data should be the conclusion of your analysis, not a standalone part of your observations. Plots are there to give meaning to the numerical results that were established. A simple example would be to plot a graph that is supposed to represent a linear regression and end up with a kid's doodle. Only interesting numerical results should be plotted, to give a better insight into their meaning. Plotting a scatter chart with a correlation coefficient of 0 is useless, but plotting one with a correlation coefficient really close to 1 is relevant. The same goes for clustering: before plotting your clusters, always refer to the results of your cluster validation. A cluster that does not validate is a bad cluster and hence should not be plotted. Do not be afraid of having nothing to plot. Sometimes it is better to say that the dataset is inconclusive than to show incoherent plots.
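As a small, hedged illustration of checking the number before plotting (the data here is randomly generated, not from the countries dataset): compute the Pearson correlation coefficient first and only draw the scatter chart if it is strong enough to mean something.
# Hedged sketch: only plot a scatter chart when the correlation is worth showing
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.9 * x + rng.normal(scale=0.2, size=100)   # intentionally correlated for the example
r = np.corrcoef(x, y)[0, 1]                     # Pearson correlation coefficient
if abs(r) > 0.8:                                # arbitrary threshold: close enough to 1 to be relevant
    plt.scatter(x, y)
    plt.title("r = %.2f" % r)
    plt.show()
else:
    print("Correlation of %.2f is inconclusive, nothing worth plotting" % r)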
Hardcoding
As surprising as it may be, most of the datasets you will work with come as CSV, XML or SQL. Python provides a wide set of tools to read those files and extract relevant values. Once those values have been gathered, you can have fun. But hold your horses and think about this principle of object-oriented programming I worship and love: reusability. Datasets are not all the same; nevertheless, they are matrices and all follow the same rules as matrices. The only variations are their features and the amount of data, and those matter more in the testing phase. It is not a must, but as a programmer, you should care: never hardcode features into your code. Having hardcoded features leads to three main problems (a short sketch of the parameterized alternative follows the list):
- Undermining reusability: The ability to apply similar processing to a new dataset is a great advantage if your work consists of analyzing data from different companies working in the same field. Don't work hard, work smart.
- Code correctness and feature discrepancies: If a particular feature changes in the system, such as a reformatting of a company's database or a slight change in the naming of the features, your whole code, or even your whole system, will be paralyzed by the difference between the hardcoded features and the new ones. This can cost a lot of time and also a lot of money. Don't work hard, work smart.
- It is ugly and unprofessional. Just… don’t do that.
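Below is a small, hedged sketch of what the alternative looks like (the column names and the summarize helper are illustrative, not part of the modules shown later): the features to analyze are passed in as parameters instead of being written inside the function, so the same code serves any dataset.
# Hedged sketch: the features come from a parameter (or a config file), never from the code itself
import pandas as pd

def summarize(dataset, features):
    # Works for any dataset and any list of numeric features
    return dataset[features].describe()

# Hypothetical usage: the same function serves different datasets without touching its code
countries = pd.DataFrame({"GDP": [100, 200, 300], "Industry": [0.2, 0.3, 0.4]})
print(summarize(countries, ["GDP", "Industry"]))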

Not understanding your features
Thanks to @blockchaindude for this relevant point I found while doing some research. Having a set of 200 features can be challenging, but at the end of the day you must be able to understand those features and know which ones have a higher business value or a greater real-life impact. As I said earlier, do not dive into your analysis without preparation. A beautiful analysis full of high-accuracy results can be awesome, but it becomes irrelevant if the data analyzed has no real-life impact. Let's suppose we have the following set of features: percentage of population with degrees, industrial demand, coastline length and GDP. It would not be wise to analyze the first three and leave the GDP to waste. Knowing the importance of each feature gives an orientation to the scheduling and the content of the data analysis. I advise you to read @blockchaindude's article at this link:
https://hackernoon.com/12-mistakes-that-data-scientists-make-and-how-to-avoid-them-2ddb26665c2d
Now that we know what not to do, let's do things.
Do it like you mean it
Find the right data
What would data science be without data? It would be science, but science still needs data. So let's find some. You can find useful datasets on kaggle.com. Some are already cleaned (but clean them anyway, you never know). There is a wide range of data available, and a lot of it comes in a reasonable size for a short analysis. In my case, I got a dataset called Countries of the World from Kaggle, a 13kb CSV file providing important information concerning more than 200 countries in the world. You can find it at this link:
https://www.kaggle.com/fernandol/countries-of-the-world
Once you find data, always think about what you will do with it. Some features are derived from one another (population, area, population density) and do not need to be analyzed together. Others are related in a meaningful way (Industry, Agriculture, GDP) and should be our center of interest when producing an analysis of the dataset.
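A quick, hedged way to spot these pairs (placeholder numbers, not the real dataset): look at the correlation matrix. Population, area and population density will show up as strongly related, so there is little point in analyzing them together.
# Hedged sketch: a correlation matrix exposes features that carry the same information
import pandas as pd

df = pd.DataFrame({
    "Population": [10, 50, 80, 5],
    "Area": [1.2, 5.5, 8.1, 0.6],
    "GDP": [0.5, 2.1, 3.0, 0.2],
})
df["PopDensity"] = df["Population"] / df["Area"]   # derived feature, redundant by construction
print(df.corr())   # pairs with a correlation close to 1 carry the same information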
Start with the good habits
As said earlier, always remember to parse and clean the dataset to remove incoherent and incomplete data before processing. Failure to do so will result in improper results. Here is a sample module used to parse the CSV file and clean the data before processing. This is not a perfect cleaning; the cleaning also depends on the type of data you are working with, hence the need to define your own cleaning functions:
# The goal of this file is to provide a cleaned CSV file
# This means that every record containing blank spaces will be neglected
# The new clean version of the file will be saved after completion of this first step
import pandas as pd

def readCSVfile(filename, columnNames, delimiter=",", header=0, lowmemory=False):
    # on_bad_lines="skip" replaces the deprecated error_bad_lines=False of older pandas versions
    data = pd.read_csv(filename, names=columnNames, delimiter=delimiter, header=header,
                       low_memory=lowmemory, on_bad_lines="skip")
    return data

# The next part consists in replacing the blank values in records by 0
# To do so we just have to go across the dataset and replace empty cells by the value 0
# Then we convert the columns to their appropriate numeric type where possible
def reformatDataset(dataset):
    for column in dataset.columns:
        dataset[column] = dataset[column].fillna(0)  # We first replace the different empty cells by 0s
    for index, row in dataset.iterrows():
        for column in dataset.columns:
            rowValue = str(row[column]).replace(",", ".")  # Then we replace the commas by periods for the
            try:                                           # English decimal representation
                rowValue = float(rowValue)
            except ValueError:
                pass
            dataset.at[index, column] = rowValue
    return dataset

# We want to group the results by region and return the newly arranged dataset
def rearrange_dataset(dataset, colname):
    # colname is a column name or a list of columns used for the sorting
    sorted_result = dataset.sort_values(by=colname)
    return sorted_result

# Now we rewrite the whole CSV file for future uses
def rewriteCSV(dataset):
    dataset.to_csv("files/clean_countryballs.csv")

Some important tools you might need for data mining are listed in my previous article:
https://medium.com/@tibakevin/my-simple-tool-set-for-data-science-62c9d2001b9b.
Make sure to take a look at them and get yourself set up for the analysis.
Get our data visualized
After understanding that some data have a certain correlation, it is important to provide a visualization of them. Here is the code used to give a representation of the data on a 2D bar chart.
# This bar chart will translate the correlation between the population density, the infant mortality and the deathrate
# The plot will use the population density for the X axis and the infant mortality and deathrate as the y axis
# Conveniently, this function also works for other 3-element comparisons and will be used for other representations
import numpy as np
import matplotlib.pyplot as plt
plt.rcdefaults()
from cleaning.csv_parsing import *

def getBarChartPlot(dataset, xValue, yValue, lineValue):
    xValues = getXObjectsDefinition(dataset, xValue)
    y_pos = np.arange(len(xValues))
    yValues = [row[yValue] for index, row in dataset.iterrows()]
    lineValues = [row[lineValue] for index, row in dataset.iterrows()] if lineValue != "" else []
    # We define the first subplot for the left side measurement
    f = plt.figure()
    ax1 = f.add_subplot(111)
    ax1.tick_params(axis='x', bottom=False, labelbottom=False)
    line1 = ax1.bar(y_pos, yValues, align='center', label=yValue)
    # We define the second subplot that will hold the right-hand value if present
    if lineValue != "":
        ax2 = f.add_subplot(111, sharex=ax1, frameon=False)
        line2 = ax2.plot(y_pos, lineValues, color='red', linewidth=0.5, label=lineValue)
        ax2.tick_params(axis='x', labelsize="small")
        ax2.yaxis.tick_right()
        ax2.yaxis.set_label_position("right")
        plt.legend((line1[0], line2[0]), (yValue, lineValue))  # The legend needs both series to exist
    plt.xticks(y_pos, xValues, fontsize=4, rotation='vertical')
    plt.title(str(yValue) + " | " + str(lineValue) + " : " + str(xValue))
    plt.savefig("../data_visualization/plot_" + str(xValue) + "_" + str(yValue) + "_" + str(lineValue) + ".png")
    plt.show()

def getXObjectsDefinition(dataset, xValue):
    # Here we associate each country with the value used on the X axis
    # The countries will be ordered by that value for this representation
    objectDefinition = []
    popDensitySet = rearrange_dataset(dataset, xValue)
    for index, country in popDensitySet.iterrows():
        value = str(country[xValue]) + " : " + str(country["Country"])
        objectDefinition.append(value)
    return objectDefinition

Then we proceed to our test file:
# This test is set for the visualization of a 3 elements comparison
# The main graph type used is the PDIMDR
from data_visualization.charts.PDIMDR import *
result = readCSVfile("../cleaning/files/countryball.csv", columns) # First we read our CSV file
data_cleaned = reformatDataset(result) # Then we clean and reformat it
data_ordered = rearrange_dataset(data_cleaned, ["Region","Country"]) # After that we rearrange considering Regions then Country
getBarChartPlot(data_ordered,"Industry","PopDensity","GDP") # Finally we print our graph
Another way of interpreting our data is to use clustering to associate elements that show similarity based on their features. Clustering is an unsupervised learning technique whose goal is to group elements based on their spatial position, density and linkage, to define new groupings that can be further interpreted. K-means is one of the simplest forms of clustering: it uses k centroids and iteratively assigns data points to each centroid based on distance, until the centroid displacement falls below a certain threshold. Here is the code set up for our k-means clustering:
# This clustering will be applied for 3D scatter plots in order to classify the data sets
# The number of clusters is independent of the number of elements.
# Nevertheless, supplemental information concerning the quality of the clustering will be added
import numpy as np
import torch
from sklearn.cluster import KMeans

# First we define a function that will help defining initial centroids if needed
# It requires the dimension and the number of clusters
def defineCentroids(dimension=3, numberClusters=5):
    centroids = torch.rand(numberClusters, dimension)
    return centroids

# The cluster labels returned here can later be used to associate a color to each group
# This is our main clustering function
def clusterize(xValues, yValues, zValues, numberClusters=5):
    points = np.array(getPointsDefinition(xValues, yValues, zValues))
    kmeans = KMeans(n_clusters=numberClusters, random_state=0, max_iter=1000, tol=0.01).fit(points)
    return kmeans

# This function enables setting points on a 3-dimensional graph
def getPointsDefinition(xValues, yValues, zValues):
    pointsDefinition = []
    for i in range(len(xValues)):
        newPoint = [xValues[i], yValues[i], zValues[i]]
        pointsDefinition.append(newPoint)
    return pointsDefinition

Our test file:
# This test is set for the visualization of a 3 dimension clustering
# The main graph type used is the TDIMDR
from data_visualization.charts.TDIMDR import *
from cleaning.csv_parsing import *
result = readCSVfile("../cleaning/files/countryball.csv", columns) # First we read our CSV file
data_cleaned = reformatDataset(result) # Then we clean and reformat it
data_ordered = rearrange_dataset(data_cleaned, ["Region","Country"]) # After that we rearrange considering Regions then Country
get3DScatterPlot(data_ordered,"Coastline","Industry","GDP",clusters=5)
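The earlier advice was to plot a clustering only when it validates. Here is a hedged sketch of what that check could look like (this validation step is an assumption on my part, not part of the modules above), using scikit-learn's silhouette score on the clusters produced by the clusterize function: a score close to 1 means well-separated clusters, a score near 0 or below means the clustering is not worth plotting.
# Hedged sketch: validate the clustering before deciding to plot it
import numpy as np
from sklearn.metrics import silhouette_score

xValues = list(data_ordered["Coastline"])
yValues = list(data_ordered["Industry"])
zValues = list(data_ordered["GDP"])
kmeans = clusterize(xValues, yValues, zValues, numberClusters=5)   # defined in the clustering module above
points = np.array(getPointsDefinition(xValues, yValues, zValues))  # values assumed numeric after cleaning
score = silhouette_score(points, kmeans.labels_)                   # ranges from -1 (bad) to 1 (well separated)
if score < 0.25:                                                   # arbitrary threshold for this sketch
    print("Clustering does not validate (silhouette = %.2f), do not plot it" % score)
else:
    print("Clustering validates (silhouette = %.2f), safe to plot" % score)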
Outro
Understanding what you have to do and planning your analysis is an important part of data science. You must know your features and their importance to dig out the important facts and the crucial patterns necessary to boost your work. The goal of data science is to gather the data necessary to improve performance and reduce costs. Your work as a programmer is to design and implement precise and conclusive analyses to reach the predefined goal. The rules listed here are based on my personal experience as a research assistant dedicated to text mining. Next time we will talk about a simple tool called Jupyter and how to use it efficiently to display a step-by-step analysis of your data. See you soon.
