An introduction to Data Science.

Steps to becoming a Data Scientist.

Onejohi
8 min read · Aug 29, 2018


Raise your hand if you’ve heard about Data Science… Put your hand down, dummy, I can’t see it. Ever since we built our first computers, we’ve been generating an endless stream of data. This data is mostly disorganized and can’t do much but just sit there.

To put things into perspective, everything we do produces data. We live on information: knowing how high or low the temperature is, that’s data. When you go to work, your journey produces massive amounts of data. Your speed, your route, how many turns you make, what you do on your way, whether you grabbed a snack, how many people you cut off, how many times you cursed the other drivers, how much fuel you consumed, the car mileage, the bumps you hit… The list is almost endless. But the fact that you didn’t take this data into account means no one is using it for anything; it’s there but useless. And that’s where a data scientist comes in.

A data scientist is someone who can take a set of data, develop a use case for that data, create a hypothesis on how to make use of it, perform experiments using the developed hypothesis, analyze the results and come up with a solution.

The analysis-hypothesis-experiment loop.

Computers are the best artificial tools we have for data collection. Our brains are much better at it, but they don’t work with mathematical concepts, and we still don’t understand how they structure, organize and make sense of the data they collect.

Machine Learning has accelerated the analysis-hypothesis-experiment loop; prior to ML, data scientists had to work through it manually. An ML algorithm is capable of developing thousands if not millions of hypotheses, running through billions of experiments and analyzing thousands of conclusive results to determine the best solution. This may still take hundreds of hours depending on the computing power of the host machine.
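
To make that loop concrete, here is a minimal sketch of a machine “proposing” many hypotheses and scoring each one automatically. The data and the candidate models (polynomials of increasing degree) are made up purely for illustration.

```python
import numpy as np

# Made-up observations: noisy samples of an unknown relationship.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 3 * x + 5 + rng.normal(0, 2, size=x.shape)

# "Hypotheses": polynomial models of increasing degree.
# "Experiment": fit each one and measure its error on held-out data.
train, test = slice(0, 150), slice(150, 200)
results = {}
for degree in range(1, 6):
    coeffs = np.polyfit(x[train], y[train], degree)
    predictions = np.polyval(coeffs, x[test])
    results[degree] = np.mean((predictions - y[test]) ** 2)

best = min(results, key=results.get)
print(f"Best-performing hypothesis: degree {best}, MSE {results[best]:.2f}")
```

The loop here is trivial, but the same propose-test-compare pattern is what ML systems run at a scale no human could manage by hand.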

So now that you know the processes a data scientist goes through when handling data, let’s discuss them one by one, starting with… drum roll, please…

Data Organization.

62% of your time is spent organizing data.

Well, I must be honest, this is a cooked-up figure. But the point is, most of your time will be spent organizing data. Data doesn’t always come in clean, and the way people enter data is almost always different. People organize data in a way that makes sense to them, and this may not always make sense to you as a data scientist. This means you have to go through the data, clean it up and organize it before you can use the set.

There are different levels of data organization. The lowest level deals with human-readable data. This data mostly comes from small businesses: you might be handed an Excel spreadsheet of sales in which people buy less at specific intervals, so you decide to check what the reason could be. Maybe it’s the weather? You decide to pull up information on the weather during the sales period, but the company does not have this data. You’ll have to mash up data from the weather channel to explain the downtime.

The tools most suitable for this level-one kind of data organization are Excel spreadsheets, Power BI tools, SQL technologies and a good understanding of CSV files. Most of the time, the data is fairly human-readable, clean, well-structured and comes from a single source, unlike level two, where you combine data from different sources. This level is mostly found in small-business data and well-designed experiments.
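
As a sketch of that first level, here is how the sales-versus-weather mashup described above might look in Python with pandas; the file names and column names are hypothetical.

```python
import pandas as pd

# Hypothetical files: a small business's sales log and weather records
# pulled from an external source, both keyed by date.
sales = pd.read_csv("sales.csv", parse_dates=["date"])      # date, units_sold
weather = pd.read_csv("weather.csv", parse_dates=["date"])  # date, temperature, rainfall

# Mash the two sources together on the shared date column.
combined = sales.merge(weather, on="date", how="left")

# A first look at whether rainy days line up with slow sales.
print(combined.groupby(combined["rainfall"] > 0)["units_sold"].mean())
```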

Gene sequencing data (source: https://www.ncbi.nlm.nih.gov/nuccore/CY138640)

Higher levels of data organization handle data that may not be entirely human-readable. Consider the gene sequencing data in the picture above: it’s fairly readable but cannot be understood as quickly as data on a spreadsheet. This data is mostly separated by commas to mark columns and new lines to mark rows. First, you need to understand what you want to do with the data so you can restructure it into useful data sets that can then be used to solve particular problems.

Tools useful for these kinds of data sets are Power BI tools, Python with Pandas and Spark, R, Java, Scala and a deep understanding of the subject matter. For instance, I’m not a microbiology specialist, so the genome sequencing process is completely alien to me. This means I might not be well equipped to handle gene sequencing data, since I don’t understand what the data represents even before I decide to organize it into useful pieces.
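
As a rough sketch of this kind of restructuring, here is how you might load a raw, headerless comma-separated export into pandas and reshape it; the file name and column names are invented for illustration, not taken from any real sequencing format.

```python
import pandas as pd

# Hypothetical raw export: comma-separated values, newline-separated records,
# no header row and no obvious meaning until you name the columns yourself.
raw = pd.read_csv(
    "sequencing_export.txt",
    header=None,
    names=["sample_id", "position", "base", "quality"],
)

# Restructure into something a downstream analysis can actually use:
# one row per sample, positions as columns, bases as values.
structured = raw.pivot(index="sample_id", columns="position", values="base")
print(structured.head())
```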

The skills required for data organization are data mashup, being able to pull data from different sources and combine it so it makes sense; a little data intuition, so that when you look at a visualized data set you already have an idea of how to approach it; and data clean-up, since you might find a data set where, instead of a monetary value, someone entered a string such as “fourteen” instead of $14. Knowledge of the subject is also required; you cannot work with data whose purpose you don’t understand.
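
A minimal sketch of that clean-up step, using pandas on a made-up column of messy monetary values:

```python
import pandas as pd

# Made-up column where people typed amounts however they liked.
df = pd.DataFrame({"amount": ["$14", "14", "fourteen", " 23.5 ", None]})

# Strip currency symbols and whitespace, then coerce to numbers;
# anything unparseable (like "fourteen") becomes NaN for manual review.
cleaned = (
    df["amount"]
    .astype(str)
    .str.replace(r"[$,\s]", "", regex=True)
    .pipe(pd.to_numeric, errors="coerce")
)
print(cleaned)
print("Rows needing manual cleanup:", cleaned.isna().sum())
```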

Data Visualization.

Art is science made clear — Wilson Mizner.

Data visualization means using images to represent data so that other people can understand what is being presented in a clear, concise manner. The information can be shown as charts, histograms or maps that people have seen before, so they can quickly identify key information without getting lost in the complexity of numbers. You might make very beautiful graphics, but if someone else cannot understand them, your presentation is as useless as the raw data.

There are two pieces to data visualization: first, understanding the data yourself, and second, explaining the results using a representation others can perceive. There is a trend in data science of moving away from classical representations of data toward a more machine-learning-based approach, where predicted behaviors are derived from millions of data points. This depends highly on the dynamics of the data, the most common case being a normal distribution, where the data follows a curve you can show in a histogram.

You can use Excel to quickly visualize data using its chart tools, and Excel will create charts depending on how you decide to plot your data. You shouldn’t use these tools blindly; you should understand the work behind them to be able to manipulate the data effectively. You can also use Tableau and some Power BI tools if you wish, but as a caution: you need a proper understanding of these tools before you decide to use them.

More advanced tools are Matplotlib, D3.js and R. Using these tools you can create interactive designs, core analytics methods and decision-making graphics, giving users a grip on data they may not have understood before. To conquer this advanced level, Virtual Reality skills are a plus, since some of your graphics models may end up looking as abstract as something like this…
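
On the simpler end of that toolset, here is a minimal Matplotlib sketch of the histogram-of-a-normal-distribution idea mentioned earlier; the values are randomly generated purely for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated measurements that roughly follow a normal distribution.
rng = np.random.default_rng(42)
values = rng.normal(loc=50, scale=10, size=1_000)

# A histogram is often the quickest way to show the shape of the data.
plt.hist(values, bins=30, edgecolor="black")
plt.title("Distribution of simulated measurements")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
```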

In the future, VR will revolutionize data visualization: you’ll be able to “walk into” your data, observe it from a 3-dimensional perspective and manipulate it while standing “within” it.

Data Analysis.

Data analysis means applying statistical skills to your data to come up with conclusive results.

We’ve developed statistical methods over the past 200 years; computers, using computational mathematics, have brought analytics to a whole new level in just the last ten.

The most important thing in data analysis is a deep understanding of the tools built to tackle data. This is the step you spend the least time on, since you’ve already understood the data’s structure, visualized it and are ready to derive useful information from the previous steps. Once your data fits a specific model, you can apply a number of statistical methods and algorithms to derive meaningful information from that model.

Enabling the Excel Analysis ToolPak add-in.

One tool you should be really competent with is the Excel Analysis ToolPak, which is available as an add-in. You can also use code, specifically machine learning techniques in Python with modules like pandas, numpy and matplotlib (using matplotlib.pyplot to plot data). Machine learning has a huge advantage in data analysis over our classical skill sets: some large two-dimensional data spaces can be handled faster by machine learning algorithms. Here you can use more recent technologies like Microsoft’s Cognitive Toolkit and TensorFlow to attack the problems you face. Your goal at this level should be to build your own tools that integrate with existing tools to solve particular problems.
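
As a small sketch of this step, here is how a quick statistical pass might look in Python with pandas and numpy; the monthly figures are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up monthly sales figures standing in for a cleaned, organized data set.
df = pd.DataFrame({
    "month": np.arange(1, 13),
    "sales": [120, 135, 128, 150, 160, 158, 170, 175, 169, 180, 190, 205],
})

# Basic descriptive statistics come almost for free with pandas.
print(df["sales"].describe())

# A simple least-squares trend line: how much do sales grow per month?
slope, intercept = np.polyfit(df["month"], df["sales"], 1)
print(f"Estimated growth: {slope:.1f} units per month")
```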

Wrapping up…

As you can see, one of the best languages to understand as a data scientist is Python. And although TensorFlow recently released a JavaScript integration (TensorFlow.js), most of it can only run in the browser for a limited set of tasks, though it’s very useful nonetheless. Once you can jump in and see the results of your data, you can make near-accurate predictions using applied statistics. One important thing to know is that this is what Machine Learning is about: it’s not about making a computer that thinks like a human, but one that follows the patterns a human mind would use to derive useful information from millions or billions of data points. That data doesn’t make sense by itself; you’ll have to go through the three processes to come up with useful results.
