First try at data science

The data and the science

Kevin Tiba
Aug 31, 2018

In my previous publication, I introduced you to my set of tools used for data science, along with some tips to install them. Now that we are all set, I want to introduce you to some basics I learned from my different trials and numerous mistakes. I will go over the dos and don'ts of data science. I am no expert, but these are fundamentals that will help you better understand the process of establishing a solid dataset and finding significant results, rather than spending numerous hours explaining inconclusive ones.

I have been there; I don't advise it.

What I should not do

Analyzing data without a goal in mind

Do you ever go to an interview without a goal in mind? Did you get the job when you didn't set a goal? I can tell you that if you answered yes to both questions, you are either really lucky or a real badass. Data science follows the same logic. You cannot dive into your dataset and simply apply every data science technique you know hoping to find some results. This approach is too naive and has two main drawbacks:

  • Inconclusive results: You have a high probability of finding results that are irrelevant or inconclusive to your analysis. You cannot even draw a conclusion, as you never raised a hypothesis to guide your observations.
  • Time consuming: For a dataset with a small number of features and little need for normalization and cleaning, you will surely be fine. But as you may have guessed, most of the time you will deal with multivariate sets of features whose mere listing would give headaches to the bravest. If no planning or scoping is established, be sure to lose many precious hours mingling with dirt you hoped to turn into gold, but that remained dirt.

Not cleaning your messy data

No data is perfectly clean until you clean it. Consider data as thrift shop clothes. They sure look good for their price (kaggle.com has many datasets for free), but they need to be cleaned before being used. When talking about cleaning, I refer to removing NaN values, empty cells, incoherent data, incomplete data and data that will skew your analysis. This is a lot to remove or replace, but it definitely improves the overall quality of your analysis. Refusing to clean your dataset leads to two problems:

  • NaN values and empty cells simply cannot be processed and will result in exceptions during your processing. You do not want to waste precious minutes because of an empty cell.
  • Incoherent and incomplete data will always skew your analysis. If you are found worthy of the glorious house of text mining, you will become familiar with the disdain we hold for sentences that do not make sense and sentences that are either too long or too short. They generally skew your analysis and produce noise that will affect any clustering or regression analysis.

Diving straight into visualization

Whenever we receive a new dataset, we absolutely want those grandiose plots and curves to plaster our screens with their mighty colors, but halt right there. Not all curves are pleasant to everyone's eyes, and some are pure abominations. Plotting your data should be the conclusion of your analysis, not the starting point of your observation. Plots are there to give meaning to the numerical results you have already established. A simple example would be plotting a graph that is supposed to represent a linear regression and ending up with a kid's doodle. Only interesting numerical results should be plotted, to give better insight into their meaning. Plotting a scatter chart with a correlation coefficient of 0 is useless, but plotting one with a correlation coefficient really close to 1 is relevant. The same goes for clustering. Before plotting your clusters, always refer to the results of your cluster validation. A cluster that does not validate is a bad cluster and hence should not be plotted. Do not be afraid of having nothing to plot. Sometimes it is better to say that the dataset is inconclusive than to show incoherent plots.
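To make that "validate before you plot" rule concrete, here is a minimal sketch, assuming matplotlib and NumPy; plot_if_correlated and its 0.7 threshold are illustrative choices of mine, not a prescribed recipe:

```python
# A minimal sketch of "validate before you plot", assuming matplotlib
# and NumPy; the 0.7 threshold is an illustrative choice.
import matplotlib.pyplot as plt
import numpy as np

def plot_if_correlated(x, y, threshold=0.7):
    """Draw a scatter chart only when the correlation justifies it."""
    r = np.corrcoef(x, y)[0, 1]
    if abs(r) < threshold:
        # Inconclusive result: better to report it than to plot noise.
        print(f"correlation {r:.2f} below {threshold}: nothing to plot")
        return
    plt.scatter(x, y)
    plt.title(f"correlation = {r:.2f}")
    plt.show()
```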

Hardcoding

As surprising as it may be, most of the datasets you will work with come as CSV, XML or SQL. Python provides a wide set of tools to read those files and extract the relevant values. Once those values have been gathered, you can have fun. But hold your horses and think about this principle of object-oriented programming that I worship and love: reusability. Datasets are not all the same; nevertheless, they are matrices and all follow the same rules as matrices. The only variations are their features and the amount of data, and those variations matter most in the testing phase. It is not an absolute must, but as a programmer, you should never hardcode features into your code (see the sketch after this list). Hardcoded features lead to three main problems:

  • Undermining reusability: The ability to apply similar processing to a new dataset is a great advantage if your work consists of analyzing data from different companies working in the same field. Don't work hard, work smart.
  • Code correctness and feature discrepancies: If a particular feature changes in the system, such as a reformatting of a company's database or a slight change in the naming of the features, your whole code, or even your whole system, will be paralyzed by the mismatch between the hardcoded features and the new ones. This can cost a lot of time and also a lot of money. Don't work hard, work smart.
  • It is ugly and unprofessional. Just… don’t do that.
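As a hedged illustration of the point, here is a minimal sketch assuming pandas; numeric_features is a hypothetical helper of mine that reads the feature names from the dataset itself instead of baking them into the code:

```python
# A minimal sketch of avoiding hardcoded features, assuming pandas.
# The feature names come from the file itself, never from the code.
import pandas as pd

def numeric_features(df):
    """Discover the analyzable (numeric) features from the dataset."""
    return df.select_dtypes(include="number").columns.tolist()

if __name__ == "__main__":
    df = pd.read_csv("countries of the world.csv")  # any CSV works here
    for feature in numeric_features(df):
        print(feature, "->", df[feature].mean())
```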

Not understanding your features

Thanks to @blockchaindude for this relevant point, which I found while doing some research. Having a set of 200 features can be challenging. But at the end of the day, you must be able to understand those features and know which ones have higher business value or greater real-life impact. As I said earlier, do not dive into your analysis without preparation. A beautiful analysis full of high-accuracy results can be awesome, but it becomes irrelevant if the data analyzed has no real-life impact. Suppose we have the following set of features: percentage of population with degrees, industrial demand, coastline length and GDP. It would not be wise to analyze the first three and leave the GDP to waste. Knowing the importance of each feature gives an orientation to the scheduling and the content of the data analysis. I advise you to read @blockchaindude's article at this link:

https://hackernoon.com/12-mistakes-that-data-scientists-make-and-how-to-avoid-them-2ddb26665c2d

Now that we know what not to do, let's do things

Do it like you mean it

Find the right data

What would data science be without data? It would be science, but science still needs data. So let's find some. You can find useful datasets on kaggle.com. Some are already cleaned (but clean them anyway, you never know). There is a wide range of data available, and a lot of it comes in a reasonable size for a short analysis. In my case, I got a dataset called Countries of the World from Kaggle, a 13 kB CSV file providing important information concerning more than 200 countries of the world. You can find it at this link:

https://www.kaggle.com/fernandol/countries-of-the-world

Once you find data, always think about what you will do with it. Some features are trivially related (population, area, population density) and do not need to be analyzed together. Others are related in interesting ways (Industry, Agriculture, GDP) and should be our center of interest while producing an analysis of the dataset.

Start with the good habits

As said earlier, always remember to parse and clean the dataset to remove incoherent and incomplete data before processing. Failure to do so will produce improper results. Here is a sample module used to parse the CSV file and clean the data before processing. This is not perfect cleaning; the right cleaning depends on the type of data you are working with, hence the need to define your own cleaning functions:
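What follows is a minimal sketch of such a cleaning module, assuming pandas and the Countries of the World CSV; the file name, column names and the comma-to-dot decimal fix are assumptions about this particular dataset rather than a general recipe:

```python
# cleaning.py - a minimal cleaning sketch, assuming pandas.
# File name, column names and the comma-decimal fix are assumptions.
import pandas as pd

def load_and_clean(path, numeric_columns):
    """Parse a CSV file and drop rows that would skew the analysis."""
    df = pd.read_csv(path)

    # Strip stray whitespace from text cells (country names are padded).
    df = df.apply(lambda col: col.str.strip() if col.dtype == object else col)

    # Coerce numeric columns; this dataset uses commas as decimal marks.
    for col in numeric_columns:
        df[col] = pd.to_numeric(
            df[col].astype(str).str.replace(",", ".", regex=False),
            errors="coerce",
        )

    # NaN and empty cells cannot be processed: drop incomplete rows.
    return df.dropna(subset=numeric_columns)
```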

Some important tools you might need for data mining are listed in my previous article:

https://medium.com/@tibakevin/my-simple-tool-set-for-data-science-62c9d2001b9b.

Make sure to take a look at them and get yourself set up for the analysis.

Get our data visualized

After understanding that some features are correlated, it is important to provide a visualization of them. Here is the code used to represent the data on a 2D bar chart.
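Below is a minimal sketch of such a plotting module, assuming matplotlib and the cleaned DataFrame from the module above; plot_bars and the per-feature normalization are assumptions of mine, made so that features on wildly different scales fit one chart:

```python
# plotting.py - a minimal grouped bar chart sketch, assuming matplotlib.
import matplotlib.pyplot as plt
import numpy as np

def plot_bars(df, label_column, feature_columns, title=""):
    """Compare several features across rows on one 2D bar chart."""
    x = np.arange(len(df))
    width = 0.8 / len(feature_columns)
    for i, feature in enumerate(feature_columns):
        # Normalize each feature to [0, 1] so scales stay comparable.
        values = df[feature] / df[feature].max()
        plt.bar(x + i * width, values, width, label=feature)
    plt.xticks(x + 0.4, df[label_column], rotation=90, fontsize=7)
    plt.title(title)
    plt.legend()
    plt.tight_layout()
    plt.show()
```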

Then we proceed to our test file:
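Here is a hypothetical test file tying the two sketches together; the module names cleaning and plotting, and the exact column labels, are assumptions about the dataset:

```python
# test_plot.py - hypothetical driver for the sketches above.
from cleaning import load_and_clean
from plotting import plot_bars

features = ["Industry", "Pop. Density (per sq. mi.)", "GDP ($ per capita)"]
data = load_and_clean("countries of the world.csv", features)

# Plot a manageable subset so the country labels stay readable.
plot_bars(data.head(20), "Country", features,
          title="Industry, population density and GDP")
```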

Comparison plot among industry, population density and GDP

Another way of interpreting our data is clustering: associating elements that are similar based on their features. Clustering is an unsupervised learning technique whose goal is to group elements based on their spatial position, density and linkage, defining new groupings that can then be interpreted further. K-means is one of the simplest forms of clustering: it uses k centroids and iteratively assigns data points to each centroid based on distance, until the centroids' displacement falls below a certain threshold. Here is the code for our k-means clustering:
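Below is a minimal k-means sketch using scikit-learn (the post does not say which implementation was used); the silhouette score stands in for the cluster-validation step discussed earlier, and k=3 is an illustrative choice:

```python
# clustering.py - a minimal k-means sketch, assuming scikit-learn.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def cluster(df, feature_columns, k=3):
    """Cluster rows on the given features and print a validation score."""
    # Standardize first so no single feature dominates the distances.
    X = StandardScaler().fit_transform(df[feature_columns])
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = model.fit_predict(X)
    # Validate before plotting: near 0 means a cluster not worth showing.
    print("silhouette score:", silhouette_score(X, labels))
    return labels
```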

Our test file:
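Again a hypothetical test file, reusing the cleaning and clustering sketches; the column labels remain assumptions about the dataset:

```python
# test_cluster.py - hypothetical driver for the clustering sketch.
from cleaning import load_and_clean
from clustering import cluster

features = ["Coastline (coast/area ratio)", "Industry", "GDP ($ per capita)"]
data = load_and_clean("countries of the world.csv", features)
data["cluster"] = cluster(data, features, k=3)

# Inspect what each cluster looks like on average.
print(data.groupby("cluster")[features].mean())
```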

Clustering based on Coastline, Industry and GDP

Outro

Understanding what you have to do and planning your analysis is an important part of data science. You must know your features and their importance in order to dig out the important facts and the crucial patterns necessary to boost your work. The goal of data science is to gather the insights necessary to improve performance and reduce costs. Your work as a programmer is to design and implement precise and conclusive analyses that reach the predefined goal. The rules listed here are based on my personal experience as a research assistant dedicated to text mining. Next time we will talk about a simple tool called Jupyter and how to use it efficiently to display a step-by-step analysis of your data. See you soon.

Written by Kevin Tiba

Elegant researcher💡Avid learner 📚and crazy coder💻. That's all you need to know :)
