Minimal guideline to get descriptive statistics using Python.

Iván Andrés Trujillo

Ivan Andres Trujillo
7 min readAug 27, 2020

Universidad Surcolombiana — Facultad de Salud

source: Fedora workstation desktop gnome

The following Notebook, is a minimalist text, that aim introduce to new users, and students to get descriptive statistics that are frequently required in curricular courses and exploratory research. Python is an object-oriented language, therefore it is very easy use and write code to get descriptive statistics. Python allow us work with some important libraries that have been designed by other persons to handle several common tasks, in this text we will use only one of them and a built-in module.

Pandas it is a library designed to handling datasets, and it is a open source project.

Note: you must include in your script only the code that beginning with the header #include, due this code is consecutive at least until specify that not.

#include
import os
import pandas as pd

The reserved word import allow us called this code (libraries) , if you are in Windows operating system the most probable is that you need open your Command prompt and type sequentially: pip install pandas

os is a built-in library and allow us specify the working directory, namely set up in your computer where is the path to load or save your dataset or excel files. In Windows operating system this path usually begin with C: or a letter. “C:/my_user/my_folder”

#include

os.chdir("C:/my_user/my_folder")

pandas provide a object class similar to excel spreadsheet that allow us put the information in rows and columns, or observations and variables. See how a dataframe looks in Python.

data=[[1,2,3,4,5],["A", "B", "A", "B", "A"]]
print("A data Frame with 2 Rows and 4 columns : \n ")
print("-"*16)
print(pd.DataFrame(data))
print("-"*16)
A data Frame with 2 Rows and 4 columns :

----------------
0 1 2 3 4
0 1 2 3 4 5
1 A B A B A
----------------

Now the objective it is used the pd.read_excel() to load your dataset in a Python DataFrame.

#include

df=pd.read_excel("file.xlsx")

There are important points that are necesary stress, the first of them is that in some cases you need specify the sheet name. df=pd.read_excel("File.xls, sheet="sheet_name"). One of the most important point is that each variable name in the excel file it is put only in the first row, and there are not merged cells. Also the acronym df of DataFrame it is only used by tradition ( Altough this a recommended practice).

To work with data is important distinguish among string and numerical variables due, each of them has a different ways of described. Categorical variables as educational level, profession and other are described with absolute and relative frequency, in otherwise numerical could be described with median and standard deviation or median and interquartile range according its distribution since either normal or not.

>>> lista=[[1,"one"], [2,"two"], [3,"three"]]
>>> print("Dataset:\n")
>>> df=pd.DataFrame(lista, columns=["Numerical", "String"])
>>>
print(df)
print("\n")
Dataset:

Numerical String
0 1 one
1 2 two
2 3 three

In pandas the type object is similar to string or categorical and we can associated int with integers and float with floating point numbers or decimal. To check the type of variable in a Dataframe, we need used the method dtype. Suppose that your dataset looks like the above and it is ready to load in python memory. To verify the type of data, is important select columns or rows. In a general way we can access to a column thus: df["Variable_name"]

df["Numerical"]0    1
1 2
2 3
Name: Numerical, dtype: int64
print("The type of the Numerical variable is ",df["Numerical"].dtypes)


print("The type of the String Variable is ",df["String"].dtypes)
The type of the Numerical variable is int64
The type of the String Variable is object

Notice that is important to get the appropriate descriptive measures according to data type. Therefore, we use the methods, describe() in numerical variables to get a set of measures as; min,max, median, p50(median) and another relative position measures. In otherwise, in categorical variables we need used .value_counts() to get the number of times that appear a measure in a variable.

df["Numerical"].describe()count    3.0
mean 2.0
std 1.0
min 1.0
25% 1.5
50% 2.0
75% 2.5
max 3.0
Name: Numerical, dtype: float64
df["String"].value_counts()three 1
one 1
two 1
Name: String, dtype: int64

Notice that each number only appear one time in dataset, however to get the relative frequency we need add the number of rows or observations that are recorded in the variable or dataset (this method count missing values?) with the built-in function len() we can get this number you can also use shape or index.

print("The number it is", len(df))The number it is 3print(df["String"].value_counts()/len(df))three    0.333333
one 0.333333
two 0.333333
Name: String, dtype: float64

This add up the 100% how we hope.

There are another ways of get this results automatically for all variables, but before we need to understand an important concept in programming denominated loop.

for k in range(5):
print(k)
0
1
2
3
4

Intuitively, this loop repeat the operation of print the current number of k, that go from 0 to 4, the program not include 5.

for x in range(4):
print("This message will be printed in screen four times")
This message will be printed in screen four times
This message will be printed in screen four times
This message will be printed in screen four times
This message will be printed in screen four times

This introduce us in a important concept in some Python objects denominated iterable this means that this data type we allow us access to each one of its elements. for instance, a list is a class of object that is iterable, now we will see how this works.

elements=[1,2,3,4]
elements_st=["first","second","third","fourth"]
for a in elements:
print("The number of position in arabic is", a)
The number of position in arabic is 1
The number of position in arabic is 2
The number of position in arabic is 3
The number of position in arabic is 4
for x in elements_st:
print(x)
first
second
third
fourth
for a in elements:
print("The number of position in arabic is", a , "While in words", elements_st[a-1])
The number of position in arabic is 1 While in words first
The number of position in arabic is 2 While in words second
The number of position in arabic is 3 While in words third
The number of position in arabic is 4 While in words fourth

Note that each element has been printed in screen, in the appropriate order.

This will be important due df.columns save all variables names in a iterable object. Now we can see how to apply a loop to get the name of each variable and its type.

for x in df.columns:
print("The variable",x, "was stored in", df[x].dtypes)
The variable Numerical was stored in int64
The variable String was stored in object

In a general way we can put some statements to be executed only by some true conditions for instance, we have an iterable object that contain 10 elements and we need only print in screen the numbers higher to 6.

for x in range(11):
if x>6:
print(x)
7
8
9
10

Notice that only the numbers are major to six are printed in screen. The operator $>$ not is the only one to assess expressions, we could need check if two values are equal using == or if they are different != and the results of this expression turn back a Boolean value to indicate if the expression is True or False.

set_n=[1,2,3,4,5,6]

In the list set_n we have a set of numbers and we are interested in check if the value 7 is contained in this list, then if the condition is true then the program could execute a set of statements.

4 in set_nTrue

Because of the list contained the value 4, then the result will be True. This is important, due given a set of data types or features we can discriminate code to each one.

if 4 in set_n:
print("The number is in the list!")
The number is in the list!

Note in the above example, that the statement print("The number it is the list!") will be not executed if the number not it is in the list.

if 999 in set_n:
print("The number is in the list!")

Effectively there are not a output due the condition not is true, then we could take advantage of this and get automatic descriptive statistics upon all string and numeric variables in any dataset.

#include
for x in df.columns:
if df[x].dtypes==object :
print("Due",x ,"is a ", df[x].dtype)
print("\n we can get its frequency table")
print(df[x].value_counts() / len(df)*100)
Due String is a object

we can get its frequency table
three 33.333333
one 33.333333
two 33.333333
Name: String, dtype: float64

We can extend to a more general dataset, to see how this work without explicit the name of variables.

dataset=[["A",1,"circle",10], ["A",2,"circle",11],["C",2,"line",12]]

We have a list with four variables now we convert to dataframe.

df=pd.DataFrame(dataset, columns=["letter","number","shape","number2"])
print(df)
letter number shape number2
0 A 1 circle 10
1 A 2 circle 11
2 C 2 line 12
#include
for x in df.columns:
if df[x].dtypes==object :
print("Due",x ,"is a ", df[x].dtype)
print("\n We can get its frequency table")
print(df[x].value_counts() / len(df)*100)
Due letter is a object

We can get its frequency table
A 66.666667
C 33.333333
Name: letter, dtype: float64
Due shape is a object

We can get its frequency table
circle 66.666667
line 33.333333
Name: shape, dtype: float64

The method describe() we can used directly in the object df

print(df.describe())number  number2
count 3.000000 3.0
mean 1.666667 11.0
std 0.577350 1.0
min 1.000000 10.0
25% 1.500000 10.5
50% 2.000000 11.0
75% 2.000000 11.5
max 2.000000 12.0

This last output could be save in a object and be exported to a spreadsheet. Using pd.to_excel() function of pandas

numeric_results=df.describe()
numeric_results.to_excel("numeric_resutls.xls")

This will created a file with xls extension in the folder we defined with os.chdir(PATH..).

At the end your script must be similar to the following compile code, to describe any dataset.

import os
import pandas as pd
import numpy
os.chdir("C:/my_user/my_folder")
df=pd.read_excel("file.xlsx")
# To get frecuency table.
for x in df.columns:
if df[x].dtypes==object :
print("Due",x ,"is a ", df[x].dtype)
print("\n we can get its frecuency table")
print(df[x].value_counts() / len(df)*100)

# To get numeric measures.
print(df.describe())

--

--