Visualization with Python: Seaborn Library (Part 1)

Serap Baysal
Published in CNK Tech
6 min read · Apr 13, 2021

In this article, we’ll begin visualizing data with Python. I’m still learning and following a course, so this material is based on that course. Let’s start!

I will use Kaggle and a dataset hosted there. You can visit my Kaggle profile and examine my notebooks:

https://www.kaggle.com/serapbaysal61

Of course, you don’t have to use Kaggle; you can download the dataset and use it on your own PC. The dataset’s link is here too:

https://www.kaggle.com/kwullum/fatal-police-shootings-in-the-us

Firstly, we’ll import libraries that we’ll use:

import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter

This dataset contains several .csv files, which we’ll read with the pandas library’s read_csv function:

median_house_hold_in_come = pd.read_csv('../input/fatal-police-shootings-in-the-us/MedianHouseholdIncome2015.csv')
percentage_people_below_poverty_level = pd.read_csv('../input/fatal-police-shootings-in-the-us/PercentagePeopleBelowPovertyLevel.csv')
percent_over_25_completed_highSchool = pd.read_csv('../input/fatal-police-shootings-in-the-us/PercentOver25CompletedHighSchool.csv')
share_race_city = pd.read_csv('../input/fatal-police-shootings-in-the-us/ShareRaceByCity.csv')
kill = pd.read_csv('../input/fatal-police-shootings-in-the-us/PoliceKillingsUS.csv')

We can’t use all of them in a single article, but we will in the next ones. For now, we’ll use the PercentagePeopleBelowPovertyLevel.csv file.

First, we need to examine the data. For this, we’ll use the head method:

percentage_people_below_poverty_level.head()

This tells us a few things about the data: we have a geographic area, a city, and the poverty rate in each city. America is divided into geographic areas, and each area contains cities.

We’ll also use the info method. With it, we learn the data’s columns, non-null counts, and data types.
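The call looks like this; here it runs on a tiny stand-in frame with the same columns as the CSV (the values are made up), so the example works without downloading the dataset:

```python
import pandas as pd

# Tiny stand-in frame with the same columns as the CSV (values are made up)
df = pd.DataFrame({
    'Geographic Area': ['AL', 'AL', 'AK'],
    'City': ['City A', 'City B', 'City C'],
    'poverty_rate': ['78.8', '29.1', '-'],  # note the '-' placeholder
})
df.info()  # prints each column's name, non-null count, and dtype
```

Note that poverty_rate shows up as `object` because the '-' placeholder forces pandas to read the column as strings.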

Now, we’ll check our data for meaningless values. For this, we’ll use the value_counts method on the data’s columns. We have 3 columns: Geographic Area, City, and poverty_rate. The poverty_rate column contains some meaningless values, so we’ll fix them. There are a few ways to do that; I will fix them by replacing them with 0.
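To see how value_counts exposes these placeholders, here is the idea on a toy column standing in for poverty_rate:

```python
import pandas as pd

# Toy poverty_rate column standing in for the real one
rates = pd.Series(['78.8', '29.1', '-', '-', '12.3'])

# value_counts shows how often each value occurs, exposing placeholders
counts = rates.value_counts()
print(counts['-'])  # the '-' placeholder appears twice here
```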

For this, we’ll use replace function:

percentage_people_below_poverty_level.poverty_rate.replace(['-'],0.0, inplace = True)

With the info method, I mentioned data types. poverty_rate is an object (that means string), which isn’t useful for calculations. We’ll convert it with astype(float).

percentage_people_below_poverty_level.poverty_rate = percentage_people_below_poverty_level.poverty_rate.astype(float)
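As a side note, the replace-then-astype pair can also be collapsed with pandas’ to_numeric, which turns unparseable entries into NaN instead of 0; which you prefer depends on whether a missing rate should drag the state averages toward zero. A sketch on a toy series:

```python
import pandas as pd

rates = pd.Series(['78.8', '29.1', '-'])

# The article's approach: treat '-' as 0.0
as_zero = rates.replace(['-'], 0.0).astype(float)

# Alternative: coerce unparseable values to NaN and decide later
as_nan = pd.to_numeric(rates, errors='coerce')

print(as_zero.tolist())          # [78.8, 29.1, 0.0]
print(int(as_nan.isna().sum()))  # 1
```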

Now, we’ll look at the poverty rates in the given states. For this, we need an area_list built with unique:

area_list = list(percentage_people_below_poverty_level['Geographic Area'].unique())

Then, create an empty list. We’ll loop over area_list, calculate each area’s poverty rate, and append it to the list:

area_poverty_ratio = []
for i in area_list:
    x = percentage_people_below_poverty_level[percentage_people_below_poverty_level['Geographic Area'] == i]
    area_poverty_rate = sum(x.poverty_rate)/len(x)
    area_poverty_ratio.append(area_poverty_rate)

We’ll create a DataFrame named data from area_list and area_poverty_ratio, build a sorted index, and reorder the data with the reindex method:

data = pd.DataFrame({'area_list': area_list, 'area_poverty_ratio':area_poverty_ratio})
new_index = (data['area_poverty_ratio'].sort_values(ascending = False)).index.values
sorted_data = data.reindex(new_index)
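By the way, the reindex trick above is equivalent to sorting the frame directly with sort_values; a minimal sketch on made-up numbers:

```python
import pandas as pd

data = pd.DataFrame({
    'area_list': ['TX', 'CA', 'NY'],
    'area_poverty_ratio': [14.2, 19.7, 11.3],
})

# One step instead of sorting the index and then calling reindex
sorted_data = data.sort_values('area_poverty_ratio', ascending=False)
print(sorted_data['area_list'].tolist())  # ['CA', 'TX', 'NY']
```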

Now, we’ll visualize the data. It’s simple; we’ll use a barplot:

plt.figure(figsize = (15,10))
sns.barplot(x=sorted_data['area_list'], y = sorted_data['area_poverty_ratio'])
plt.xticks(rotation = 90)
plt.xlabel('States')
plt.ylabel('Poverty Rate')
plt.title('Poverty Rate Given States')

The barplot takes two series, one for the x axis and one for the y axis, and we also set a title and axis labels.

Next, we’ll find the 15 most common first and last names among the people who were killed. For this, we’ll use the kill data.

We’ll examine the data with the head and value_counts methods and see that the name column contains the placeholder name ‘TK TK’. We’ll eliminate those rows, split the names, zip first and last names together, count them with Counter, and take the 15 most common:

kill.head()
kill.name.value_counts() # the 'TK TK' names are placeholders
separate = kill.name[kill.name != 'TK TK'].str.split()
a,b = zip(*separate)
name_list = a + b
name_count = Counter(name_list)
most_common_names = name_count.most_common(15)
x,y = zip(*most_common_names)
x,y = list(x),list(y)
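To see what the zip/Counter pipeline does, here is the same idea on three made-up names:

```python
from collections import Counter

names = ['John Smith', 'Jane Doe', 'John Miller']
separate = [n.split() for n in names]  # [['John', 'Smith'], ...]
first, last = zip(*separate)           # ('John', 'Jane', 'John'), ('Smith', 'Doe', 'Miller')
name_count = Counter(first + last)     # counts first and last names together
print(name_count.most_common(1))       # [('John', 2)]
```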

Let’s visualize that data too:

plt.figure(figsize = (15,10))
ax = sns.barplot(x=x,y=y,palette = sns.cubehelix_palette(len(x))) # cubehelix_palette just gives colors
plt.xlabel('Name or Surname of killed people')
plt.ylabel('Frequency')
plt.title('Most common 15 Name or Surname of killed people')

Now, we’ll look at people over 25 who completed high school. For this we have the percent_over_25_completed_highSchool data. We’ll examine it with head and value_counts. We have Geographic Area, City, and percent_completed_hs columns, and percent_completed_hs contains some meaningless values. So we’ll replace them with 0.0 and continue.

percent_over_25_completed_highSchool.percent_completed_hs.replace(['-'], 0.0, inplace = True)

With info, we’ll see the datatype is object, and we’ll convert it to float:

percent_over_25_completed_highSchool.percent_completed_hs = percent_over_25_completed_highSchool.percent_completed_hs.astype(float)

Again, we’ll find the unique area_list in Geographic Area. We’ll loop over area_list, calculate each area’s high school graduation rate, and append it to the area_highschool list.

area_list = list(percent_over_25_completed_highSchool['Geographic Area'].unique())
area_highschool = []
for i in area_list:
    x = percent_over_25_completed_highSchool[percent_over_25_completed_highSchool['Geographic Area'] == i]
    area_highschool_rate = sum(x.percent_completed_hs)/len(x)
    area_highschool.append(area_highschool_rate)

Then, we’ll sort data with new_index:

data = pd.DataFrame({'area_list':area_list,'area_highschool_ratio':area_highschool})
new_index = (data['area_highschool_ratio'].sort_values(ascending = True)).index.values # sort ascending
sorted_data2 = data.reindex(new_index)

I won’t explain this part in detail, since we did the same steps before. Now, we’ll visualize the data with a barplot:

plt.figure(figsize = (15,10))
sns.barplot(x = sorted_data2['area_list'], y = sorted_data2['area_highschool_ratio'])
plt.xticks(rotation = 90)
plt.xlabel('States')
plt.ylabel('High School Graduate Rate')
plt.title("Percentage of Given State's Population Above 25 that has Graduated High School")

Let’s look at the percentage of each state’s population by race: white, black, Native American, Asian, and Hispanic. In this data, we have geographic area, city, and a share column for each of those races.

When we examine the data with info and value_counts, we’ll see ‘-’ and ‘(X)’ placeholders. We’ll replace them and change the rates from object to float. Then we’ll find the unique geographic areas, create 5 different lists for the races, and append their averages:

share_race_city.replace(['-'], 0.0, inplace = True)
share_race_city.replace(['(X)'], 0.0, inplace = True)
share_race_city.loc[:,['share_white', 'share_black', 'share_native_american', 'share_asian', 'share_hispanic']] = share_race_city.loc[:,['share_white', 'share_black', 'share_native_american', 'share_asian', 'share_hispanic']].astype(float)
area_list = list(share_race_city['Geographic area'].unique())
share_white = []
share_black = []
share_native_american = []
share_asian = []
share_hispanic = []
for i in area_list:
    x = share_race_city[share_race_city['Geographic area'] == i]
    share_white.append(sum(x.share_white)/len(x))
    share_black.append(sum(x.share_black)/len(x))
    share_native_american.append(sum(x.share_native_american)/len(x))
    share_asian.append(sum(x.share_asian)/len(x))
    share_hispanic.append(sum(x.share_hispanic)/len(x))
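The five manual lists can also be produced in one step with groupby, assuming the share columns have already been converted to float; a sketch on a toy frame with two of the columns:

```python
import pandas as pd

# Minimal stand-in for share_race_city after the type conversion
share_race_city = pd.DataFrame({
    'Geographic area': ['AL', 'AL', 'AK'],
    'share_white': [60.0, 80.0, 50.0],
    'share_black': [30.0, 10.0, 5.0],
})

# Per-state mean of every share column at once
means = share_race_city.groupby('Geographic area')[['share_white', 'share_black']].mean()
print(means.loc['AL', 'share_white'])  # 70.0
```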

We’ll visualize these with five overlapping horizontal barplots. For each barplot, we give the rate as x, area_list as y, a color, a transparency as alpha, and the race as the label.

f, ax = plt.subplots(figsize = (9,15))
sns.barplot(x = share_white, y= area_list, color = 'blue', alpha = 0.5, label = 'White')
sns.barplot(x = share_black, y= area_list, color = 'pink', alpha = 0.5, label = 'African American')
sns.barplot(x = share_native_american, y= area_list, color = 'yellow', alpha = 0.5, label = 'Native American')
sns.barplot(x = share_asian, y= area_list, color = 'cyan', alpha = 0.5, label = 'Asian')
sns.barplot(x = share_hispanic, y= area_list, color = 'green', alpha = 0.5, label = 'Hispanic')
ax.legend(loc = 'lower right', frameon = True)
ax.set(xlabel = 'Percentage of Races', ylabel = 'States', title = "Percentage of State's Population According to Races")

Now let’s compare the high school graduation rate with the poverty rate of each state. We defined sorted_data and sorted_data2 before. We’ll normalize them; if we don’t, the values won’t be comparable because some are very high and others very small. Then we’ll concatenate the two with the concat function.

sorted_data['area_poverty_ratio'] = sorted_data['area_poverty_ratio']/max(sorted_data['area_poverty_ratio'])   # normalization
sorted_data2['area_highschool_ratio'] = sorted_data2['area_highschool_ratio']/max(sorted_data2['area_highschool_ratio'])
data = pd.concat([sorted_data, sorted_data2['area_highschool_ratio']], axis = 1)
data.sort_values('area_poverty_ratio', inplace = True)
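Dividing each series by its maximum scales it into the 0–1 range, so both curves fit on a single axis; a quick numeric check:

```python
values = [5.0, 10.0, 20.0]
normalized = [v / max(values) for v in values]
print(normalized)  # [0.25, 0.5, 1.0]
```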

To see the relation between high school graduation and poverty rate, we’ll visualize with a pointplot:

f, ax1 = plt.subplots(figsize = (20,10))
sns.pointplot(x = 'area_list', y = 'area_poverty_ratio', data = data, color = 'pink', alpha = 0.8)
sns.pointplot(x = 'area_list', y = 'area_highschool_ratio', data = data, color = 'blue', alpha = 0.8)
plt.text(40, 0.6, 'high school graduate ratio', color = "blue", fontsize = 17, style = 'italic')
plt.text(40, 0.55, 'poverty ratio', color = 'pink', fontsize = 17, style = 'italic')
plt.xlabel('States', fontsize = 15, color = 'red')
plt.ylabel('Values', fontsize = 15, color = 'red')
plt.title('High School Graduate VS Poverty Rate', fontsize = 20, color = 'red')
plt.grid()

The Seaborn library is huge, so I can’t explain everything in one article. Let’s meet in the next article!
