Data science stylistics

Stylistic differences between R and Python for exploratory data analysis

Nicola Giordano
Published in Analytics Vidhya
8 min read · Mar 11, 2020

After preparing the data as explained in my previous blog, individuals might have an a priori hypothesis that they would like to test. In other cases, exploratory data analysis (EDA) drives the search for significant patterns within a dataframe. The approach to exploration can be multifaceted and should lead to visual outputs that describe relationships and set the parameters for developing an accurate statistical model. The purpose of data exploration based on visual outputs is summarised in the figure below.

Purposes during exploratory data analysis

The following steps aim to provide an initial range of visual solutions to address these three objectives and to advance the development of the most suitable statistical model. Another objective is to underline some differences and similarities in the semantics of Python and R.

1. Re-create testable dataframes

The generation of visual outputs benefits from having multiple categorical variables available to create overlays. To augment the complexity of our dataframe in Python, the code below adds the categorical variables “Place” and “Balance”, plus a “Geolocation” column derived from “Place”.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 3)), columns=['A', 'B', 'C'])
Place = ['PlaceA', 'PlaceB', 'PlaceC']
df["Place"] = np.random.choice(Place, size=len(df))
Balance = ["Credit", "Debit", "Zero"]
df["Balance"] = np.random.choice(Balance, size=len(df))
df['index'] = pd.Series(range(0, 100))
df['Geolocation'] = df['Place']
dict_place = {"Geolocation": {"PlaceA": "London", "PlaceB": "Delhi", "PlaceC": "Rome"}}
df.replace(dict_place, inplace=True)
df.head()
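As a quick sanity check, the augmented dataframe can be re-created and inspected before plotting. This is a sketch, not part of the original walkthrough: it uses Series.map as an equivalent to the dict-based replace above.

```python
import numpy as np
import pandas as pd

# Re-create the dataframe from the steps above (self-contained illustration)
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 3)), columns=['A', 'B', 'C'])
df["Place"] = np.random.choice(['PlaceA', 'PlaceB', 'PlaceC'], size=len(df))
df["Balance"] = np.random.choice(["Credit", "Debit", "Zero"], size=len(df))
df['index'] = pd.Series(range(0, 100))
# Series.map is an equivalent alternative to df.replace with a nested dict
df['Geolocation'] = df['Place'].map({"PlaceA": "London", "PlaceB": "Delhi", "PlaceC": "Rome"})

# Seven columns in total: A, B, C, Place, Balance, index, Geolocation
print(df.shape)  # (100, 7)
print(sorted(df['Geolocation'].unique()))
```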

In R the generation of an augmented dataframe follows a similar process. The most noticeable difference is in the addition of random categorical variables, which requires the specification of a probability distribution instead of an automated function from another library. The structure of the code is otherwise very similar, with a few differences in semantics.

df <- data.frame(replicate(3, sample(0:100, 100, rep = TRUE)))
colnames(df) <- c("A", "B", "C")
df$Place <- sample(c("PlaceA", "PlaceB", "PlaceC"), size = nrow(df), prob = c(0.76, 0.14, 0.1), replace = TRUE)
df$Balance <- sample(c("Credit", "Debit", "Zero"), size = nrow(df), prob = c(0.70, 0.1, 0.45), replace = TRUE)
n <- dim(df)[1]
df$index <- c(1:n)
library(plyr)  # provides revalue()
dict_place <- revalue(x = df$Place, replace = c("PlaceA" = "London", "PlaceB" = "Delhi", "PlaceC" = "Rome"))
df$Geolocation <- dict_place
head(df)

2. Bar graphs with response overlay

Once the dataframes have been created in both languages, the next step is to visualise relationships through graphical means. A bar graph with response overlay can serve the purpose of exploring the relationship between a categorical predictor and a target variable.

In order to do that, the first step is to control the size of each graph by establishing modifiable parameters. In Python, one of the most established libraries is matplotlib.pyplot, typically followed by %matplotlib inline, a critical command for rendering plots within a Jupyter notebook. The library enables the output to follow a figure size as shown below.

import matplotlib.pyplot as plt
%matplotlib inline
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 10
fig_size[1] = 5
plt.rcParams["figure.figsize"] = fig_size
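For what it's worth, the same setting can also be assigned in a single step; a small sketch (using the non-interactive Agg backend so it runs outside Jupyter as well):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, for running outside a notebook
import matplotlib.pyplot as plt

# Equivalent to mutating fig_size element by element: assign a (width, height) tuple
plt.rcParams["figure.figsize"] = (10, 5)
```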

After setting up a figure size based on optimal values, the next step would be to visualise a relationship between two categorical variables. In this case “Place” and “Balance” would feature in the output as a simple stacked bar chart showing a frequency count distributed across the three attributes of the response variable. In pandas logic, the first variable passed to the crosstab function is the predictor (x) and the second features as the target (y).

crosstab_01=pd.crosstab(df['Place'],df['Balance'])
crosstab_01.plot(kind='bar', stacked=True)
plt.legend(loc='center left', bbox_to_anchor=(1, 0.9))

The frequency distribution then needs to be normalised into proportions within each categorical attribute. The combination of the div and sum commands divides each row of the crosstab (one attribute of the predictor variable “Place”) by its row total: sum(1) sums across the columns, i.e. the attributes of “Balance”, and axis=0 aligns that division row by row, so the proportions within each “Place” sum to one.

crosstab_norm=crosstab_01.div(crosstab_01.sum(1),axis=0)
crosstab_norm.plot(kind='bar', stacked=True)
plt.legend(loc='center left', bbox_to_anchor=(1, 0.9))
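Worth noting: pd.crosstab also exposes a built-in normalize parameter, so the div/sum combination above can be written more compactly. A minimal sketch with a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    "Place":   ["PlaceA", "PlaceA", "PlaceB", "PlaceB", "PlaceC"],
    "Balance": ["Credit", "Debit",  "Credit", "Zero",   "Debit"],
})

crosstab_01 = pd.crosstab(df["Place"], df["Balance"])

# Manual normalisation, as above: divide each row by its row total
manual = crosstab_01.div(crosstab_01.sum(1), axis=0)

# Built-in shortcut: normalize="index" divides each row by its sum
shortcut = pd.crosstab(df["Place"], df["Balance"], normalize="index")

print(manual.equals(shortcut))  # the two tables should match
```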

When setting a size dimension for a plot, R allows the definition of a function that can be applied to any graph through a simple call (set_plot_dimensions) with modifiable numerical size parameters. The function controls the behaviour of the repr package without calling it directly and provides flexibility in changing the parameters within the brackets.

set_plot_dimensions <- function(width_choice, height_choice) 
{options(repr.plot.width = width_choice, repr.plot.height = height_choice)}
set_plot_dimensions(8,4)

Once the plot size is set, R can produce appealing visual outputs with simpler code. After loading the ggplot2 library, the parameters to define are first the name of the dataframe and the value of interest (the predictor). The second step is to describe the target variable within the geom_bar function by defining the variable that characterises the aesthetic mappings (aes). Another handy feature in R is the ability to flip the graph using coord_flip, along with the built-in design principles that generate attractive visual outputs.

library(ggplot2)
ggplot(df, aes(Place))+geom_bar(aes(fill=Balance))+ coord_flip()

To normalise the distribution, adding position='fill' to the geom_bar specification achieves the proportional distribution of the response within each categorical attribute of the predictor variable (Place). This simple addition requires no further specification, since the computation is already embedded in the function applied to the target variable.

ggplot(df, aes(Place))+geom_bar(aes(fill=Balance), position='fill')+ coord_flip()

3. Histograms with response overlay

The histogram is a graphical representation of a frequency distribution for a numerical variable. To ascertain patterns in the response distribution, a similar approach of transforming frequency counts into proportions applies. In Python, the construction of a layered frequency graph for a numerical variable requires the creation of subsets, one per categorical attribute, each holding that attribute's distribution along the numerical variable. When visualising the numerical distribution, bins sets the number of divisions along the distribution while stacked compiles the disaggregation along one bar.

import numpy as np
import matplotlib.pyplot as plt

df_Debit = df[df.Balance == 'Debit']['A']
df_Credit = df[df.Balance == 'Credit']['A']
df_Zero = df[df.Balance == 'Zero']['A']
plt.hist([df_Debit, df_Credit, df_Zero], bins=10, stacked=True)
plt.legend(['Debit', 'Credit', 'Zero'])
plt.title('Histogram of variable A with response overlay')
plt.xlabel(''); plt.ylabel('Count'); plt.show()

To normalise the distribution while keeping the bin spacing, the code would get considerably more complex. Instead, an initial three-column matrix, whose columns hold the heights of each bar, is normalised by dividing each row by the sum across that row. After producing this table of adjusted proportions, the visual distribution simply requires a bar plot of the table; this approach does not define the actual spacing between bins and treats each bin as a separate column. From this example, Python appears versatile: it allows the user to be hands-on in defining spaces and boundaries between objects but also offers shortcuts, though at the cost of reduced control over the aesthetics.

(n, bins, patches) = plt.hist([df_Debit, df_Credit, df_Zero], bins=10, stacked=True)
n_table = np.column_stack((n[0], n[1], n[2]))
n_norm=n_table/n_table.sum(axis=1)[:,None]
n_norm = pd.DataFrame(n_norm, columns= ('Debit','Credit','Zero'))
n_norm.plot(kind='bar', stacked=True)
plt.legend(['Debit','Credit','Zero'])
plt.title('Normalised histogram of variable A with response overlay')
plt.xlabel('Variable A'); plt.ylabel('Proportion'); plt.show()

Aesthetics are a central feature in R, which simplifies the amount of code required for graphs. In this case, to produce a frequency count, setting the fill parameter in the geom_histogram function is sufficient to disaggregate by categorical attribute.

ggplot(df,aes(A))+geom_histogram(aes(fill=Balance),color='black')

To produce a histogram representing the proportional distribution within each bin, the calculation is embedded within geom_histogram. By specifying position='fill' alongside the fill variable, the proportion is automatically computed and presented in visual form.

ggplot(df,aes(A))+geom_histogram(aes(fill=Balance), color='black', binwidth=10, position='fill')

4. Prediction-based binning

Binning as a strategy to derive new categorical variables places values within specific ranges based on how different sets of values of the numeric predictor behave with respect to the response variable. A recurrent distribution within a specific bin indicates a relationship with potential predictive value. This analysis helps identify stronger or weaker associations between categorical and numerical variables.

In Python, the first step is to graph a frequency count based on new binning criteria, which can also reflect percentiles or quantiles and can be set as preferred cut points within the bins argument. Based on the new cut points, a bar plot of the crosstab stacks a frequency count for the various categorical attributes.

df['binning'] = pd.cut(x=df['A'], bins=[0, 30, 60, 100], labels=['Under 30', '30 to 60', 'Over 60'], right=False)
crosstab_02 = pd.crosstab(df['binning'], df['Balance'])
crosstab_02.plot(kind='bar', stacked=True, title='Bar graph of binned A variable with response overlay')
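As mentioned, the cut points can also reflect quantiles; pandas provides pd.qcut for that purpose. A hedged sketch on synthetic data (the "Q1"…"Q4" labels are illustrative, not from the original example):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # reproducible illustration
df = pd.DataFrame({"A": np.random.randint(0, 100, size=100)})

# Equal-width style bins, as above, use explicit cut points:
df["binning"] = pd.cut(df["A"], bins=[0, 30, 60, 100],
                       labels=["Under 30", "30 to 60", "Over 60"], right=False)

# Quantile-based bins instead place roughly equal numbers of rows in each bin:
df["quartile"] = pd.qcut(df["A"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])
print(df["quartile"].value_counts().sort_index())
```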

The normalisation of the distribution of values across the new bins is based on proportions across categorical values. This computation can be achieved by introducing a lambda function that produces a proportional table to plot in Python. The other features of the plot (legend, title, …) are defined as before.

table = pd.crosstab(df.binning, df.Balance).apply(lambda r: r / r.sum(), axis=1)
table.plot(kind='bar', stacked=True)
plt.legend(['Credit', 'Debit', 'Zero'])  # matches the crosstab's alphabetical column order
plt.title('Normalised bar graph of binned A variable with response overlay')
plt.xlabel('Variable A'); plt.ylabel('Proportion'); plt.show()
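One way to check the lambda normalisation is to confirm that every binned row now sums to one. A self-contained sketch on synthetic data (fixed seed, illustrative only):

```python
import numpy as np
import pandas as pd

np.random.seed(1)  # reproducible illustration
df = pd.DataFrame({"A": np.random.randint(0, 100, size=100)})
df["Balance"] = np.random.choice(["Credit", "Debit", "Zero"], size=len(df))
df["binning"] = pd.cut(df["A"], bins=[0, 30, 60, 100],
                       labels=["Under 30", "30 to 60", "Over 60"], right=False)

# Same lambda normalisation as above: each row divided by its row total
table = pd.crosstab(df.binning, df.Balance).apply(lambda r: r / r.sum(), axis=1)

# Each binned row is now a distribution over Balance and sums to one
print(table.sum(axis=1).tolist())
```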

In R the binning process follows a similar language and the function cut generates breaks in the distribution. The visual output in ggplot without specifying the position of the fill shows a frequency count.

df$binning<-cut(x=df$A, breaks=c(0,30,60,101), right=FALSE, 
labels= c("Under 30","30 to 60", "Over 60"))
ggplot(df, aes(binning))+geom_bar(aes(fill=Balance))+ coord_flip()

By specifying the position of the fill and stat="count", the visual output provides the proportion within each categorical attribute. The calculation of this proportion is done automatically by stat, which makes the height of each bar proportional to the number of cases in each group.

ggplot(df,aes(binning))+ geom_histogram(aes(fill=Balance), color='black', stat="count", position='fill')

After this step, the next phase of the proposed data science methodology is to set up a model for computation. This is explained in the next blog.
