A Summary of Visualization Practices for Beginners

Julia Yang
Published in Analytics Vidhya
4 min read, Jan 10, 2021

This is a summary of visualization methods that can be used for different types of data. I tried out several visualization practices during my first two EDA attempts, so here I summarize my trials for future reference.

Here are my two attempts:
Who, Where and What — 2019 NYC Airbnb Analysis

https://www.kaggle.com/juliayyy/who-where-and-what-2019-nyc-airb-b-analysis

EDA, Visualization & NLP on US Data Analyst Jobs

https://www.kaggle.com/juliayyy/eda-visualization-nlp-on-us-data-analyst-jobs

I mainly used three packages:

  1. Matplotlib
  2. Seaborn
  3. plotly.express

In general, I feel that the core of visualization is always about making sense. Although a pretty look makes a report more readable, we should be cautious about using fancy methods. So in this article I want to summarize the techniques by use case, as below:

  1. Counting Numbers
  2. Comparing Numbers of Different Categories
  3. Contrasting Components of a Category
  4. Plotting Multiple Graphs Together

Here are the details:

1. Counting Numbers (typically exploring one column)

When doing EDA, counting is one of the most common tasks.
For instance, we may want to see the distribution of salaries across all jobs, or to know how many Airbnb homes each host owns. Based on the purpose, we can further divide these tasks into Distribution and Count&Rank.

Distribution Type:
Purpose: To see the overall distribution of the values in a particular column.
Tools: Seaborn (histplot; distplot is deprecated as of seaborn 0.11), Matplotlib (hist)

Code:
Seaborn:

sns.histplot(df['column'], color = 'y')

https://seaborn.pydata.org/generated/seaborn.histplot.html?highlight=hist#seaborn.histplot

Or Matplotlib can do the same job, here through the pandas hist wrapper:

df['column'].hist(bins=n)

https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.hist.html?highlight=hist#matplotlib.pyplot.hist

Count&Rank Type:

Purpose: To count the occurrences of each value and rank them in decreasing / increasing order
Tools: Seaborn (countplot, barplot), Matplotlib (bar)

Seaborn:

sort_list = df["column"].value_counts().head(n)  # value_counts() sorts in descending order; keep the top n
order_list = list(sort_list.index)  # category names in ranked order
sns.countplot(x="column", data=df, order=order_list)

https://seaborn.pydata.org/generated/seaborn.countplot.html?highlight=countplot#seaborn.countplot

Or we can also use Matplotlib to do the trick:

df["column"].value_counts().plot(kind="bar", title="")  # count first, then draw the bars
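As a rough, self-contained sketch of the Count&Rank pattern, the example below ranks categories with `value_counts()` and passes the ranked names to `order`. The `host` column and its values are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data: one row per listing, labelled with its host
df = pd.DataFrame({"host": ["ann", "bob", "ann", "cat", "ann", "bob"]})

n = 2  # how many top categories to keep
top = df["host"].value_counts().head(n)  # already sorted, largest first
order_list = top.index.tolist()

fig, (ax_sns, ax_pd) = plt.subplots(1, 2, figsize=(10, 4))

# Seaborn counts the rows itself; order controls which bars appear and where
sns.countplot(data=df, x="host", order=order_list, ax=ax_sns)

# pandas/Matplotlib: plot the precomputed counts directly
top.plot(kind="bar", ax=ax_pd, title="Top hosts by number of listings")

fig.tight_layout()
```

Categories left out of `order_list` (here `cat`) are simply not drawn by `countplot`, which is a convenient way to show only the top n.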

2. Comparing numbers of different categories (typically exploring multiple columns)

When we have continuous variables, such as salary or number of reviews, we may want to compare them across different categories. Depending on how much detail we need, there are different ways of plotting:

Compact Type:
Purpose: To gain a general idea of the comparison
Tools: Seaborn (pointplot)

Seaborn:

sns.pointplot(x="column1" ,y="column2", data=df)

https://seaborn.pydata.org/generated/seaborn.pointplot.html?highlight=pointplot#seaborn.pointplot
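A minimal sketch of the compact comparison, assuming hypothetical `room_type` and `price` columns. By default `pointplot` draws the mean of the y variable for each category with an error bar, so the group means are computed alongside for reference.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data: nightly price by room type
df = pd.DataFrame({
    "room_type": ["Entire home", "Entire home", "Private room",
                  "Private room", "Shared room", "Shared room"],
    "price": [200, 240, 90, 110, 40, 60],
})

fig, ax = plt.subplots(figsize=(6, 4))

# One point per category at the mean price, joined by a line
sns.pointplot(data=df, x="room_type", y="price", ax=ax)

# The same means, computed directly for comparison with the plot
means = df.groupby("room_type")["price"].mean()
print(means)
```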

Detailed Type:
Purpose: To compare the median and IQR clearly.
Tools: Seaborn (boxplot, violinplot)

Seaborn Boxplot:

sns.boxplot(x='column1', y='column2', data=df, whis=n, order=[list], palette="")

https://seaborn.pydata.org/generated/seaborn.boxplot.html?highlight=boxplot#seaborn.boxplot

Or, if we want to show how densely the values of the variable (e.g. salary) are distributed, we can choose a strip or swarm plot:
Tools: Seaborn (stripplot, swarmplot)

Seaborn stripplot:

sns.stripplot(x = df['column1'], y = df['column2'], order =[list] )

https://seaborn.pydata.org/generated/seaborn.stripplot.html?highlight=stripplot#seaborn.stripplot
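The box plot and strip plot can also be layered on the same axes, which shows the median/IQR summary and the individual points together. This is only a sketch: the `city` and `salary` columns are hypothetical, and ordering the categories by median is my own choice for the `order` list.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data: salaries (in thousands) for three cities
df = pd.DataFrame({
    "city": ["NYC"] * 3 + ["SF"] * 3 + ["LA"] * 3,
    "salary": [90, 100, 120, 110, 130, 150, 70, 80, 85],
})

# Rank the categories by median salary, highest first (an assumption)
order_list = (
    df.groupby("city")["salary"].median()
    .sort_values(ascending=False)
    .index.tolist()
)

fig, ax = plt.subplots(figsize=(6, 4))

# Box plot: median, IQR, and whiskers at 1.5 * IQR
sns.boxplot(data=df, x="city", y="salary", order=order_list, whis=1.5, ax=ax)

# Strip plot on top: every observation as a dot
sns.stripplot(data=df, x="city", y="salary", order=order_list,
              color="black", size=4, ax=ax)
```

Passing the same `order_list` to both calls keeps the boxes and the dots aligned.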

3. Contrasting components of a category

When doing EDA, we often want to see what the components of a category are and what proportion of the total each one makes up. Depending on the number of components, we can use different plotting methods.
Few components type:
Purpose: To contrast the percentages of a few components of a single-layer category
Tools: Matplotlib (pie)

plt.pie(df['column'].value_counts(), labels =[list], autopct = "", radius = n)

https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.pie.html?highlight=pie#matplotlib.pyplot.pie
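A small pie chart sketch, assuming a hypothetical `room_type` column. The labels are taken from the index of `value_counts()` so that they always match the order of the slices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: six listings across three room types
df = pd.DataFrame({
    "room_type": ["Entire home"] * 3 + ["Private room"] * 2 + ["Shared room"],
})

counts = df["room_type"].value_counts()  # Entire home 3, Private room 2, Shared room 1

fig, ax = plt.subplots(figsize=(5, 5))

# autopct formats each slice's share as a percentage with one decimal
ax.pie(counts, labels=counts.index, autopct="%1.1f%%", radius=1)
ax.set_title("Share of listings by room type")
```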

Complex components type:
Purpose: To contrast the percentages of many components of a multi-layer category
Tools: Plotly Express (treemap)

fig = px.treemap(df, path=['column1', 'column2', 'column3'], values='', color= '')
fig.data[0].textinfo = 'label+text+value+percent parent'
fig.show()

https://plotly.com/python/treemaps/

4. Plotting multiple graphs together

From time to time we may want to show plots of the same or different types together, either to compare them or to give more information. I categorize the plotting strategies as below:
Same components with different information:
Tool: Matplotlib (Piechart) + Seaborn (pointplot)

fig,(ax0,ax1) = plt.subplots(nrows=1,ncols=2, sharey=False, figsize=(20,4))
ax0.pie(df['column'].value_counts(), labels =[list], autopct = "", radius = n)
sns.pointplot(x='column1',y="column2", data=df, ax = ax1)
plt.show()

Or
We can also combine countplot and pointplot in one graph:

plt.figure(figsize=(20, 8))
sns.countplot(x="column", data=df, order=[a list])
ax2 = plt.twinx()  # second y-axis that shares the same x-axis
sns.pointplot(x="column1", y="column2", data=df, order=[a list], ax=ax2)
plt.show()
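The count-plus-point combination can be sketched end to end as follows. The `room_type` and `price` columns are hypothetical; the left axis carries the counts and the right axis carries the mean price.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data: listings with a room type and a nightly price
df = pd.DataFrame({
    "room_type": ["Entire home"] * 3 + ["Private room"] * 2 + ["Shared room"],
    "price": [200, 240, 220, 90, 110, 50],
})

# Use one shared category order so bars and points line up
order_list = df["room_type"].value_counts().index.tolist()

fig, ax1 = plt.subplots(figsize=(8, 4))

# Left y-axis: number of listings per room type
sns.countplot(data=df, x="room_type", order=order_list, ax=ax1)

# Right y-axis: mean price per room type, drawn on a twin of ax1
ax2 = ax1.twinx()
sns.pointplot(data=df, x="room_type", y="price", order=order_list,
              color="black", ax=ax2)

ax1.set_ylabel("number of listings")
ax2.set_ylabel("mean price")
```

Passing `ax=` explicitly to each call avoids depending on which axes happens to be current.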

Different components with same information:
Tool: Seaborn (catplot / relplot)

sns.catplot(x="column1", data=df, kind="count", col="column2")
plt.show()

https://seaborn.pydata.org/generated/seaborn.catplot.html?highlight=catplot#seaborn.catplot
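A minimal faceting sketch with `catplot`, assuming hypothetical `borough` and `room_type` columns. Each distinct value of the `col` variable gets its own panel with the same kind of plot.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data: room types of listings in two boroughs
df = pd.DataFrame({
    "borough": ["Manhattan"] * 3 + ["Brooklyn"] * 3,
    "room_type": ["Entire home", "Entire home", "Private room",
                  "Private room", "Private room", "Shared room"],
})

# catplot creates its own figure and returns a FacetGrid;
# col="borough" draws one count plot per borough
g = sns.catplot(data=df, x="room_type", kind="count", col="borough")

# g.axes is a 2D array of the panels (1 row x 2 columns here)
print(g.axes.shape)
```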

These are my major takeaways from my two visualization projects. The gist is to categorize and organize my thoughts when dealing with different data types. I feel the work of a business analyst or data analyst is like polishing jewelry for clients, so we need to pick techniques and strategize differently when treating different materials. This topic will be updated continuously along with my future attempts.
