A Simple Introduction: The Basics of Data Science

Silen Naihin · Published in Future Vision · Apr 14, 2019 · 6 min read

Data science. Two words that shouldn’t work together, but they beautifully do.

On a very basic level, computer science, math, and statistics decided to get married, and have a baby. That baby was called data science.

There are many interesting things going on in the field of data science, such as (buzzword incoming) machine learning, which is in turn a subset of artificial intelligence and one of the main tools of data science. In ‘machine learning’, the learning aspect means that the algorithm depends on data: a training set is used to fine-tune what is called a ‘model’. To be able to do machine learning, you have to know data science first and foremost.

Machine learning has so many amazing applications, and it is already changing the world today.

Ever heard of Siri? It is powered by AI, using a set of techniques called natural language processing (NLP).

Artificial intelligence is beginning to be used in the medical industry too. For example, it can help detect tumors in medical scans using a type of machine learning model called a CNN (convolutional neural network).

Have you ever heard of The Terminator?

Just kidding, artificial intelligence is nowhere near that advanced… yet. There is a possibility that one day we will become AI’s slaves, but that’s a rabbit hole meant for another discussion.

This article will stick to the absolute basics, introducing you to fundamental concepts in data science.

Data Scientist Workflow

You start by defining the problem you are working on. This could be optimizing supply chain logistics or predicting house prices in a certain area. Then you go out and collect your data: the mileage of each truck, the price of each truck, the delivery times, and so on.

There are two types of data: discrete and continuous. Discrete data can be counted, and there is nothing in between its values; think of categories like colors, foods, travel destinations, and occupations. Continuous data can take on infinitely many values, with values in between; think of height, weight, age, and house prices.
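To make the distinction concrete, here is a tiny Python sketch (the example values are made up):

    # Hypothetical examples of the two kinds of data.
    discrete = ["red", "blue", "green"]     # categories: no value "between" red and blue
    continuous = [172.4, 180.1, 165.8]      # heights in cm: any value in a range is possible

    # Discrete values are counted; continuous values are measured.
    print(len(set(discrete)), "distinct categories")
    print(min(continuous), "to", max(continuous), "cm")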

Data collection is where roughly 70% of a data scientist’s time is spent (there are also datasets available online, from Kaggle and other sources, where the data collection is done for you).

Once you have collected that data, you go through it and remove all of the flaws. You separate the noisy data (inaccurate values, outliers) from the informative data, and you fill in the blanks where there is missing data using techniques like mean imputation (replacing a missing value with the mean of the values that are present).
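Here is a minimal pandas sketch of that cleaning step; the dataset and the outlier rule are made up for illustration:

    import pandas as pd

    # A tiny made-up dataset of house prices with a missing value and an obvious outlier.
    df = pd.DataFrame({"price": [200_000, 210_000, None, 5_000_000, 195_000]})

    # Separate noisy data from informative data. Real projects use statistical
    # rules; here we simply drop anything wildly far from the median.
    median = df["price"].median()
    df = df[df["price"].isna() | (df["price"] < 10 * median)]

    # Mean imputation: fill the blanks with the mean of the remaining values.
    df["price"] = df["price"].fillna(df["price"].mean())
    print(df)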

Lastly, you feed the data to a machine learning model, such as a neural network, and it learns from the data that you provided. Garbage in, garbage out: if you feed the model a bad dataset, the model will be trained inaccurately or badly.
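As a minimal scikit-learn sketch of that last step, a simple linear model stands in for the neural network here, and the sizes and prices are invented:

    from sklearn.linear_model import LinearRegression

    # Made-up training set: house size in square feet -> price.
    X = [[800], [1000], [1200], [1500], [2000]]          # features
    y = [160_000, 195_000, 240_000, 305_000, 400_000]    # labels

    model = LinearRegression()
    model.fit(X, y)                   # the "learning" step: fit the model to the data
    print(model.predict([[1300]]))    # estimate the price of an unseen 1300 sq ft house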

Plotting Data

When you plot the data, you can see the relationships in it much more clearly. This helps you visualize and draw conclusions from the data.
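For instance, a quick scatter plot with matplotlib (the numbers are made up) can reveal a relationship at a glance:

    import matplotlib.pyplot as plt

    # Made-up data: house size vs price.
    size = [800, 1000, 1200, 1500, 2000]
    price = [160_000, 195_000, 240_000, 305_000, 400_000]

    plt.scatter(size, price)          # each dot is one observation
    plt.xlabel("Size (sq ft)")
    plt.ylabel("Price ($)")
    plt.show()

There are a few statistical concepts that are useful here.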

Central tendency is an extremely complicated concept that is sure to scare you away from data science. Jokes, it’s just a fancy word for the mean, median, and mode. Yea, that stuff you learned about in grade 4.

Those definitions may be a little bit tricky: the mean is the average, the median is the middle value, and the mode is the most frequent value. Just think of them as the middle of a bell curve on a graph.

Sometimes, the median doesn’t match the mean. This means that the distribution is not perfectly bell shaped, and could be skewed to one side.
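A quick sketch in Python makes this concrete; the prices below are made up, with one expensive outlier that drags the mean away from the median:

    from statistics import mean, median, mode

    # A made-up, right-skewed set of house prices (one expensive outlier).
    prices = [180_000, 190_000, 200_000, 200_000, 210_000, 950_000]

    print(mean(prices))    # about 321,667: dragged up by the outlier
    print(median(prices))  # 200,000: the middle value, unaffected
    print(mode(prices))    # 200,000: the most frequent value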

Another important concept in data science is dispersion, which is the variability (scatter and spread are also used) of the data.

First we have range, which is the difference between the lowest and highest values in the distribution. Think of a number line from 2 to 342: the range would be 340. Hand in hand with this comes the interquartile range (IQR), the range of the middle 50% of the data, from the first quartile to the third. For example, if the first quartile is 100 and the third is 205, the IQR is 105. The IQR is useful because of its resistance to outliers, which can badly skew the plain range.
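Here is how you could compute both with NumPy (the observations are made up):

    import numpy as np

    data = [2, 15, 100, 150, 205, 342]        # made-up observations

    print(max(data) - min(data))              # range: 342 - 2 = 340

    q1, q3 = np.percentile(data, [25, 75])    # first and third quartiles
    print(q3 - q1)                            # interquartile range (IQR)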

Then we get into the more heavy-duty stuff. Sample variance measures how spread out the observations are around their mean, and it is used for mathematical calculation in actual algorithms.

Say you were collecting American people’s weights for a dataset that you need. It would not make sense, from a time and monetary point of view, to measure the weight of every single person in the population. The workaround is to take a sample of the population, say 1,000 people, and use that sample to estimate the whole population. Variance helps you figure out how spread out those values are.

The sample variance formula is

    s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2

Here x_i is a single observation, x̄ (x with a bar over it) is the average of all of the observations, and n is the number of observations. The reason why you divide by n − 1 is to remove bias when estimating from a sample. If you measured the entire population rather than a sample, you would divide by n instead.

The next concept is sample standard deviation, which is used to understand what is actually happening in the data. It gives us a standard way of knowing what is normal, but also what is unusually large or small. The easiest way to describe the calculation: it is just the sample variance, square rooted, so s = \sqrt{s^2}.

Just to portray how simple these equations actually are, here is a short worked example that might help you understand these concepts better.
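This is a minimal Python sketch with made-up weights; it simply computes the two formulas above by hand:

    import math

    weights = [68.0, 72.5, 80.2, 55.3, 90.1]   # a made-up sample of weights (kg)
    n = len(weights)
    x_bar = sum(weights) / n                   # the sample mean

    # Sample variance: squared distances from the mean, divided by n - 1.
    s_squared = sum((x - x_bar) ** 2 for x in weights) / (n - 1)

    # Sample standard deviation: the square root of the variance.
    s = math.sqrt(s_squared)
    print(s_squared, s)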

There is also something called covariance. For each point you plot on the graph, it looks at how far the point is from the mean of the X values and how far it is from the mean of the Y values, multiplies those two distances together, and averages the products. It is used when you have two variables, versus the concepts mentioned previously, which account for only one variable.

Covariance is closely related to correlation: by scaling the covariance (dividing it by the product of the two standard deviations), you get the correlation between the X and Y variables.

The correlation formula takes the paired values of the two variables and gives you an output between -1 and 1. The closer the output number is to 0, the less correlated the two variables are. This tells you how much two random variables are related to each other. A fun game to play to test your understanding of this concept is http://guessthecorrelation.com/. Try the multiplayer mode with a friend!
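If you want to see the numbers, here is a small NumPy sketch with invented paired data:

    import numpy as np

    # Made-up paired data: hours studied (x) vs test score (y).
    x = [1, 2, 3, 4, 5]
    y = [52, 60, 63, 71, 80]

    cov = np.cov(x, y)[0, 1]          # sample covariance between x and y
    corr = np.corrcoef(x, y)[0, 1]    # correlation: covariance scaled to [-1, 1]
    print(cov, corr)                  # values near 0 mean little relationship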

After this come algorithms like linear regression and logistic regression. What was introduced in this article are the most basic prerequisites to machine learning. It really is quite simple when you finally understand it!

Key Takeaways:

  • Machine learning is going to have a huge impact in the future
  • The central tendency of some data is its mean, median, and mode
  • Standard deviation tells you how spread out your data is, while covariance and correlation tell you how related, or unrelated, two variables are

Thanks for reading!

I would really appreciate it if you could:

  • Clap if you liked the article or learned something
  • If you didn’t clap, let me know why by giving me some feedback
  • Follow me on LinkedIn and Medium (Silen Naihin)
