Why should you learn Coding? 3 simple data examples to blow your mind

Yash Gupta
Data Science Simplified
11 min read · Jan 11, 2022

The era of pen and paper is long gone. It is the digital age today, and almost everything we do has something to do with data. Gone are the days when organizations needed manual labour for work of the mind; everything can happen on a desktop or laptop today. What required millions of pages of documentation in the 20th century fits on a mere pen drive in the 21st. Well, we’re beyond pen drives too: cloud computing and storage make it possible to store enormous amounts of data on servers worldwide and access it securely using just the internet.

Let’s get back to the point at hand. When SO MUCH has changed with the growth of the digital age, it’s hard to keep up with the revolutions around us. It’s a given that all of us have heard the term ‘coding’ everywhere around us, no matter which field we belong to, be it Economics, Statistics, Math, Engineering, Biostatistics, etc.

Data is everywhere around us and so is coding.

Coding was a term limited to the world of engineering until a few years ago, and before we knew it, organizations were investing in programs that teach children to code at the age of 10!

If you're still skeptical about why coding has boomed so much and why it is so important to learn, this article will give you a small glimpse of its capabilities and get you thinking. We’ll go over examples using one of the most versatile coding languages there is today: Python.

For starters, Python is an open-source, general-purpose programming language that first appeared 30 years ago and has been used for small- and large-scale projects all around the world. It has been around for quite a while, and according to GitHub, there are 350,000+ contributors to the language, which speaks to how rapidly it is growing.

Without any further ado, let’s get right into it.

Example I: Statistics

All of us have been through the basics of statistics in school. We’ve all learned how to calculate the mean, median, and mode. Let’s take this level by level and see how it works today.

1. Say you have to calculate the mean of the following numbers: 1, 2, 3, 4, 5.

It’s pretty simple: you add 1+2+3+4+5 = 15, divide by 5, and voila! The mean is 3.

2. Say you want to calculate the mean of 100 random numbers between 50 and 60. You’re gonna take a minute and tell me that isn’t too hard either. You’ll probably spend around five minutes with a calculator, computing the sum and double-checking it; or, if you’re tech-savvy, you’ll head to Excel, write a simple formula if you know how, and have the answer within a minute.

3. Now let's say you have to do this for 20,000 numbers, decimals included. You’ll probably tell me that Excel supports 1,048,576 rows, so it shouldn’t be a problem there. A calculator is clearly out of the question at this point, but yes, you’re right: Excel still works.

4. Let’s take it up another notch: you have to do this for 20,000+ rows across 8 columns, so roughly 160,000+ numbers. Well, straight up, Excel can still do it!

Excel is super-awesome, no doubt about that, but you’ll see my point soon.

5. Let’s not drag this out any longer. I want you to think about how long it would take you to accurately calculate the mean, median, count, standard deviation, quartiles, minimum, and maximum for all 20,000+ rows in each of the 8 columns.

If you’re a professional with Excel, I’m pretty sure it’ll take you at least 10 minutes to type out all the functions you’ll need and another minute to run them in the spreadsheet.

It’s pretty much impossible to do this accurately on paper, considering the time, effort, and human error involved.

It’s not feasible to try it on Excel either.

Think of any other possibilities…

If you’re hoping for something cool to come along now, here it is. I’ll show you just how easy this becomes with Python.

We’ll use the California housing data, 20,640 rows across 8 columns, along with the Pandas library in Python.

Here’s what the data looks like (using two more pandas methods, df.info() and df.head()):
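If you want to follow along, here is one way to pull the same data into a DataFrame. This is a minimal sketch that assumes scikit-learn’s built-in copy of the dataset; the article doesn’t say exactly how the data was loaded.

from sklearn.datasets import fetch_california_housing

# Load the California housing data as a pandas DataFrame:
# 20,640 rows and 8 feature columns, plus the target column MedHouseVal.
housing = fetch_california_housing(as_frame=True)
df = housing.frame

df.info()         # column names, dtypes, non-null counts
print(df.head())  # first five rows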

As for our task at hand, computing all those statistics for every column, all we need to type is…

df.describe()

where df is the name of our table, or data frame, and describe is what we want Python to do for us. In layman’s terms, we’re just telling Python to describe the data in this table.

It takes less than a second to run and we can see how beautifully we get back the information we need.

It does not get easier than this. If you’ve ever used pen and paper to find the mean, median, count, standard deviation, quartiles, minimum, and maximum of any given set of numbers, even as few as 10, you know what it takes.

If you think that it’s still not giving you the ‘wow’ factor… let’s try this. Let us try to find out the correlation between all the columns in the same dataset.

If you’re new to correlation, it’s the degree of relationship between two variables. Here the columns are our variables: a correlation value close to 1 tells us there is a strong positive correlation, a value close to 0 tells us there is no correlation, and a value close to -1 tells us there is a strong negative correlation. Correlation does not imply causation, but it is still one of the most important statistical measures when studying data. Without diving too deep into the details, let’s see the formula for finding the correlation.
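By default, pandas computes the Pearson correlation coefficient, which for two columns x and y is:

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )

where x̄ and ȳ are the means of the two columns.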

Imagine using this formula by hand to accurately find the correlations across 20,640 × 8 observations.

Essentially, we want to find out how correlated column 1 is to columns 2 through 8, then column 2 to columns 1, 3, and so on, for all 8 columns. Seems like a lot of math? It is.

In Python, it’s as simple as typing

df.corr()

df being the dataset and corr being short for correlation.

Here’s what the output looks like:

This is still too many numbers to take in at a glance. To see what the correlations are actually telling us, we can use pandas together with a beautiful data visualization library, Seaborn. It is as simple as typing…

sns.heatmap(df.corr(), annot = True)

where

sns = the conventional alias for the Seaborn library,
heatmap = the cool visual we want to prepare,
df.corr() = the data we want to plot, specified as the correlations,
annot = True, meaning we want the value labels drawn on the heatmap itself
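Put together, a minimal runnable version (assuming df is the DataFrame we loaded earlier, and that Matplotlib is installed to display the figure) looks like this:

import matplotlib.pyplot as plt
import seaborn as sns

# Compute all pairwise correlations and draw them as an annotated heatmap.
# cmap picks a diverging palette; this choice is ours, not the article's.
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.show()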

The heatmap clearly shows each column’s correlation with itself (1 all along the diagonal, because every column is perfectly correlated with itself) and with the other columns. The strength of each correlation is conveyed by the diverging colors and doesn’t need much explanation.

Hopefully, you now see how easy statistics becomes with a couple of lines of code. Real-world data is more than you can make sense of with merely pen & paper or Excel.

We’ll lay the statistics example to bed now and continue with our heatmap to build something cooler with Python.


Example II: Graphs

As we’ve already witnessed, it’s pretty easy to make graphs in Python. What’s better is that you can customize these graphs to suit your needs.

Let’s take this level by level too.

1. Imagine you have to make a scatterplot for the following values:

X = 1, 2, 3 and Y = 2, 4, 6

Sounds pretty simple.

Using a simple hand drawing, you can pretty much make this in less than a minute.

2. What if we wanted to do this accurately for our California housing dataset and see the scatterplot of the Median House Value against the Median Income, which seem to be fairly correlated with a value of 0.69?

You’re gonna tell me that’s 20,640 × 2 numbers! Well, to Python it’s still less than a second and a one-line piece of code.

sns.scatterplot(x = 'MedInc', y = 'MedHouseVal', data = df)

where scatterplot is the visual we wish to see, and x and y specify the column names from the data, which is df.

The upward trend confirms that there is indeed a positive correlation, along with a few outliers in our dataset. Even then, this seems like something any coding language should be able to do after what we witnessed in the statistics example…

Imagine doing this for all the columns at once.

Yes, all the columns, each plotted against every other column. It’s possible, and with an even simpler piece of code.

sns.pairplot(df)

Depending on the size of the data, it can take anywhere from 10 seconds to a minute. But at the end of it, I think it's evident how awesome and powerful this language is.

Notice how the scatterplot we prepared before is at the bottom left of the image.

Increasing the difficulty a hundredfold is still merely half a minute’s work for Python. And this isn’t the only visual you can create: almost any visualization you can think of can be built with Python. Here are some examples from Seaborn. (Note: you can also create geographical maps, 3D plots, interactive plots, etc. in Python.)

For more beautiful visuals using Seaborn, check out their gallery at https://seaborn.pydata.org/examples/index.html

These are just a few of the endless possibilities you have with Python in your toolbox. You can customize and change them as you prefer. It only takes a few lines of code and a few days of practice to get them right.

Example III: Clean, Manipulate, Transform

We’ve seen a glimpse of Python’s capabilities when it comes to understanding and visualizing our data. But in the real world, data rarely arrives clean and ready to work with. Cleaning data so that insights can be derived from it is the most important job anyone working with data has. We’ll keep this example to a brief overview, given how large its scope is; a separate article would be required to elaborate on all of it (which will come along eventually).

Python libraries like NumPy, Pandas, SciPy, etc. can help in cleaning, manipulating, and transforming datasets as the user requires. Advanced libraries such as scikit-learn help with splitting data, sampling, imputation, and encoding as well.
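As a quick sketch of those scikit-learn utilities (X and y below are made-up stand-ins for your own features and target, purely for illustration):

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, np.nan]])
y = np.array([0, 1, 0, 1])

# Fill each missing value with its column's mean, then split 75/25.
X_filled = SimpleImputer(strategy="mean").fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_filled, y, test_size=0.25, random_state=42
)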

NumPy supports the creation of vectors and matrices and provides the linear algebra routines that power Python’s ability to perform these calculations.

Matrix multiplication, which can be overcomplicated to carry out by hand for any big matrix (more than 5 rows and 5 columns, say), comes down to one call:

np.matmul(A, B)
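For instance, multiplying two random 6 × 6 matrices (the sizes here are arbitrary, just for illustration):

import numpy as np

rng = np.random.default_rng(0)  # seeded generator, for reproducibility
A = rng.random((6, 6))          # two 6x6 matrices: tedious by hand,
B = rng.random((6, 6))          # instant for NumPy

C = np.matmul(A, B)             # equivalently: C = A @ B
print(C.shape)                  # (6, 6)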

Using Pandas, one can work with large datasets much as they would in spreadsheet software such as Google Sheets or Excel, but in an easier and faster way.

Adding calculated columns or grouping data to find insights is as easy as…

df.groupby("Outcome") to group the entire dataset by the categories in the column ‘Outcome’

or to fill in missing values with pretty much any value of our choice…

df.fillna(10), which replaces every missing value with 10

or to remove a column entirely…

df.drop('Column', axis = 1), where 'Column' is the column name and axis = 1 tells Python that we want to drop an entire column rather than a row.
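Tying those three operations together on a tiny made-up table (‘Outcome’, ‘Value’, and ‘Column’ are hypothetical names used only for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Outcome": ["A", "B", "A", "B"],
    "Value": [1.0, np.nan, 3.0, 4.0],
    "Column": [10, 20, 30, 40],
})

print(df.groupby("Outcome")["Value"].mean())  # mean of 'Value' per category
df = df.fillna(10)                            # replace every missing value with 10
df = df.drop("Column", axis=1)                # drop the 'Column' column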

There’s a lot more to it, but I hope this article gave you the right motivation to go check out coding and start learning as early as possible.


Coding is not difficult. There are multiple resources and plenty of documentation available online for anyone to start learning any language easily. You won’t master it in a short span of time; it is mostly about consistency and effort. You can choose from many languages, such as Python, R, SQL, SAS, Ruby, Java, JavaScript, Scala, DAX, M, etc., depending on your use case and industry of interest. It’s not difficult once you get the hang of it, and with any of them in your toolkit you can take your skills to a different league.

For more such articles, stay tuned with us as we chart out paths to understanding data and coding and demystify other concepts related to Data Science. Please leave a review down in the comments.

This was a long article, thank you for reading it all the way.

For further information, drop a comment or just connect with me on my LinkedIn, the link to which is in my profile here!


Yash Gupta
Data Science Simplified

Lead Analyst at Lognormal Analytics and self-taught Data Scientist! Connect with me at - https://www.linkedin.com/in/yash-gupta-dss