Statistics for Data Science I: Measures of Central Tendency Using R and Python

Rahime Yeşil
Published in Data Runner
5 min read · Jun 5, 2020

The historical development of statistics and its applications are essential not only for data science but also for business development, decision-making and other processes. If you want to be good at data science, you need to learn statistics well. At the end of the day, many data scientists and data analysts apply statistical methods to understand data…

This story is the first in a series on statistics for data science. I will cover measures of central tendency using Python and R, with usage tips.

Measures of Central Tendency

A measure of central tendency is a summary statistic that attempts to describe an entire dataset with a single value representing the centre of its distribution. Because it delivers a compact summary of the whole dataset, it is one of the most essential concepts in statistics.

So, how do we measure central tendency? This story covers four measures: mean, median, mode and quartiles.

I would like to start this series with a dataset called Married at First Sight (click to see the dataset).

First, we import the necessary libraries for Python and R. Second, we examine the variables, observations and data types:

Python

To display information about the dataset, we use the info() method in Python.

Python Codes
Output for Python
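
Here is a minimal sketch of this step. The small DataFrame below is a hypothetical stand-in for the real dataset (its column names and values are made up), so only the info() call itself reflects the text above.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real dataset: column names and values are made up.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cem", "Dee"],
    "age": [27, 30, 32, 30],
    "status": ["Married", "Divorced", "Married", "Divorced"],
})

# info() prints to stdout by default; passing a buffer captures the report as text.
buffer = io.StringIO()
df.info(buf=buffer)
info_text = buffer.getvalue()

# The report lists each column with its non-null count and data type.
print(info_text)
```

Calling df.info() with no arguments prints the same report directly, which is usually all you need in a notebook.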

R

To display information about the dataset, we use the str() function in R.

R Codes
Output for R

Third, we select a variable for which to calculate the measures of central tendency. I will use the “age” variable in this series:

Python

With the iloc[] indexer, I select data from the dataset by integer position:

Python Codes
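
A short sketch of selecting a column by position, again on a hypothetical DataFrame with made-up values. Looking up the position with get_loc and converting with to_numpy() are my assumptions about how to get a NumPy array out of the selection, not necessarily the exact steps used for the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: column names and values are made up.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cem", "Dee"],
    "age": [27, 30, 32, 30],
    "status": ["Married", "Divorced", "Married", "Divorced"],
})

# iloc works with integer positions, so first find where the "age" column sits.
age_position = df.columns.get_loc("age")

# Select every row of that column and convert the result to a NumPy array.
age = df.iloc[:, age_position].to_numpy()

print(type(age).__name__, age)
```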

R

To select a variable in R, we put the $ symbol between the dataset and the name of the variable we want:

R Codes

So far, we have imported the libraries we need, explored the variables and observations in the dataset, and selected the “age” variable. It’s time to start!

Mean

In short, the mean is the “average” value, calculated by adding all data points and dividing by the number of data points. Because the mean is sensitive to extreme values, it is not a good choice when your dataset has outliers. The mean works best for roughly symmetric data, such as a normal distribution.

Python

Because the age variable is a NumPy array, we use the np.mean() function to calculate the mean in Python.

Python Codes for Mean
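
As a rough illustration, the sketch below computes the mean of a hypothetical age sample (made-up values, not the real dataset) and shows how a single extreme value shifts it.

```python
import numpy as np

# Hypothetical ages chosen for illustration; these are not the dataset's values.
age = np.array([24, 26, 27, 27, 29, 30, 30, 30, 31, 32, 32, 33, 37])

# np.mean adds all values and divides by how many there are.
mean_age = np.mean(age)
manual_mean = age.sum() / age.size  # the same definition written out by hand

# Adding one extreme value (a made-up outlier) pulls the mean upward.
age_with_outlier = np.append(age, 95)
mean_with_outlier = np.mean(age_with_outlier)

print(round(mean_age, 2), round(mean_with_outlier, 2))
```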

R

We use the mean() function to calculate the mean in R.

R Codes for Mean

Median

The median is the middle number, found by ordering all data points and picking the one in the middle (or, if there are two middle numbers, taking the mean of those two). If the distribution is skewed, we use the median instead of the mean.

Python

Because the age variable is a NumPy array, we use the np.median() function to calculate the median in Python.

Python Codes for Median
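
A small sketch on the same kind of hypothetical sample (made-up values, not the real dataset). It also shows why the median is preferred for skewed data: an extreme value leaves it unchanged here.

```python
import numpy as np

# Hypothetical ages chosen for illustration; these are not the dataset's values.
age = np.array([24, 26, 27, 27, 29, 30, 30, 30, 31, 32, 32, 33, 37])

# np.median sorts the values internally and returns the middle one
# (or the mean of the two middle ones when the count is even).
median_age = np.median(age)

# The same made-up outlier that would shift the mean does not move the median.
age_with_outlier = np.append(age, 95)
median_with_outlier = np.median(age_with_outlier)

print(median_age, median_with_outlier)
```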

R

We use the median() function to calculate the median in R.

R Codes for Median

Mode

In short, the mode is the most frequent value in the dataset.

Python

To calculate the mode, we use stats.mode() in Python.

Python Codes for Mode
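
A minimal sketch, assuming that stats here refers to scipy.stats and using a hypothetical age sample (made-up values). The shape of the returned result differs between SciPy versions, so the sketch normalizes it with np.atleast_1d.

```python
import numpy as np
from scipy import stats  # assumption: "stats" in the text means scipy.stats

# Hypothetical ages chosen for illustration; these are not the dataset's values.
age = np.array([24, 26, 27, 27, 29, 30, 30, 30, 31, 32, 32, 33, 37])

result = stats.mode(age)

# Older SciPy versions return length-1 arrays and newer ones return scalars;
# np.atleast_1d handles both cases.
mode_value = int(np.atleast_1d(result.mode)[0])
mode_count = int(np.atleast_1d(result.count)[0])

print(mode_value, mode_count)
```

Note that stats.mode() reports a single value; when several values tie for the highest count, it returns the smallest of them.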

R

We use the mfv() function to calculate the mode in R.

R Codes for Mode

Quartiles

Quartiles are three values, Q1, Q2 and Q3, that divide an ordered sample into four equal parts, so you can summarize the data without looking at every point. Because they are based on position rather than magnitude, quartiles stay informative even when the data is skewed or contains outliers.

Python

Python Codes for Quartiles

For this example, Q1 is 27, Q2 (the median) is 30 and Q3 is 32.
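
One way to compute quartiles in Python is np.percentile; this is my assumption, since several functions can do the job. The sample below is hypothetical (made-up values picked so the quartiles are easy to read) and is not the dataset itself.

```python
import numpy as np

# Hypothetical ages chosen for illustration; these are not the dataset's values.
age = np.array([24, 26, 27, 27, 29, 30, 30, 30, 31, 32, 32, 33, 37])

# The 25th, 50th and 75th percentiles are Q1, Q2 (the median) and Q3.
q1, q2, q3 = np.percentile(age, [25, 50, 75])

# The 0th and 100th percentiles are simply the minimum and maximum.
lowest, highest = np.percentile(age, [0, 100])

print(q1, q2, q3, lowest, highest)
```

By default np.percentile interpolates linearly between neighbouring values, so other tools or methods can give slightly different quartiles on small samples.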

R

R Codes for Quartiles

In the R output, Q1 is 27, Q2 (the median) is 30 and Q3 is 32. The other two values are 0%, which is the minimum (24), and 100%, which is the maximum (37).

Bonus

The good news is that you don’t need to compute these one by one: a single function shows all of them except the mode :)

Python

After converting the age variable from a NumPy array to a DataFrame, we use the describe() method to see the central tendency of the data.
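
A short sketch of that conversion and the describe() call, using a hypothetical age sample (made-up values, not the real dataset). The column name "age" is my own label for the new DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen for illustration; these are not the dataset's values.
age = np.array([24, 26, 27, 27, 29, 30, 30, 30, 31, 32, 32, 33, 37])

# Wrap the NumPy array in a one-column DataFrame.
age_df = pd.DataFrame(age, columns=["age"])

# describe() returns count, mean, std, min, the three quartiles and max.
summary = age_df.describe()

print(summary)
```

Individual values can be read from the result by label, for example summary.loc["50%", "age"] for the median.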

R

We use the summary() function to see the central tendency of the data.

To view all the Python code used in this story, click here.

To view all the R code used in this story, click here.

I hope this story helps you learn how to measure the central tendency of data using Python and R. In the next part of this series, I will talk about central distributions. Until then, keep learning and keep following → Data Runner
