Simple Pandas Usage and Data Visualization

Furkan Kızılay
5 min read · Aug 7, 2022


Hi everyone, I'm Furkan Kızılay.

In this article, we will explore a dataset of average temperatures for countries and cities using pandas' entry-level functions. After that, we will visualize the data with matplotlib and seaborn.

We will keep things simple: we will use pandas and visualize the data without questioning its accuracy, and we will skip noisy-data, missing-data and outlier analysis.

Let’s Start

  • First, let's import the pandas library.
import pandas as pd

But first, what is pandas?

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool.

  • Let's load the dataset containing the city temperatures into the variable data using pandas.
data = pd.read_csv("city_temperature.csv")
  • Let's look at the first 5 rows (observations) of the dataset.
data.head()
  • head() shows 5 rows by default; to see more, pass the number of rows as an argument, for example data.head(10).
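
As a small illustration of the row-count argument, here is a sketch on a made-up toy DataFrame (the cities and values below are hypothetical and only stand in for city_temperature.csv). It also shows tail(), the counterpart of head() for the last rows.

```python
import pandas as pd

# Hypothetical toy data standing in for city_temperature.csv
toy = pd.DataFrame({
    "City": ["Ankara", "Paris", "Tokyo", "Lima", "Cairo", "Oslo", "Perth"],
    "AvgTemperature": [54.1, 57.3, 61.0, 66.2, 72.4, 42.8, 65.5],
})

first_five = toy.head()    # default: the first 5 rows
first_two = toy.head(2)    # pass n to choose how many rows to show
last_three = toy.tail(3)   # tail() returns the last rows instead

print(first_two)
```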

Which data types do these columns hold? Let's find out.

data.dtypes
  • If you want more information about the dataset, you can use the info() method.
data.info()
  • info() shows each column's non-null count and data type, as well as how much memory the DataFrame uses.

Let's keep examining. We can see summary statistics of the numeric features (count, mean, standard deviation, minimum, quartiles and maximum) with a single line of code.

data.describe().T
  • The T at the end transposes the result, so each feature becomes a row and each statistic becomes a column.
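
To make the effect of the transpose concrete, here is a minimal sketch on a hypothetical four-row DataFrame (the values are invented for illustration). It compares the shape of describe() with and without .T.

```python
import pandas as pd

# Hypothetical toy data with two numeric columns
toy = pd.DataFrame({
    "Month": [1, 2, 3, 4],
    "AvgTemperature": [30.0, 40.0, 50.0, 60.0],
})

summary = toy.describe()       # statistics as rows, features as columns
summary_t = toy.describe().T   # features as rows, statistics as columns

print(summary.shape)     # 8 statistics x 2 features
print(summary_t.shape)   # 2 features x 8 statistics
print(summary_t.loc["AvgTemperature", "mean"])
```

With many features, the transposed form is usually easier to read because every feature gets its own line.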

How many rows and columns does the dataset have? I want to see both together.

data.shape

(2906327, 8)

Are there any null values in the dataset?

data.isnull()

That way it would be very difficult to see whether there is any missing data, wouldn't it? isnull() returns True or False for every single cell. Summing those values per column gives the number of missing values in each column.

data.isnull().sum()
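
Here is a minimal sketch of the same idea on a hypothetical three-row DataFrame (the values are invented). Besides the count of missing values, taking the mean of the boolean mask gives the share of missing values per column, which can be handier on large datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing values
toy = pd.DataFrame({
    "Country": ["US", "US", "France"],
    "State": [np.nan, np.nan, np.nan],
    "AvgTemperature": [50.0, np.nan, 60.0],
})

missing_per_column = toy.isnull().sum()   # number of missing cells per column
missing_share = toy.isnull().mean()       # fraction of missing cells per column

print(missing_per_column)
print(missing_share)
```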

Oops, all the data in the State column seems to be missing, so let's drop the column.

data.drop("State",axis = 1 ,inplace=True)
data.head()
  • The drop method lets us delete columns or rows (by index label) from the dataset.
  • Setting the axis parameter to 1 makes drop delete columns, and inplace=True modifies data directly instead of returning a new DataFrame.
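
The difference between the in-place and the returning form of drop can be sketched as follows, on a hypothetical two-row DataFrame. The columns= keyword is an equivalent spelling of axis=1 that some readers find clearer.

```python
import pandas as pd

# Hypothetical toy data with an empty State column
toy = pd.DataFrame({
    "Country": ["Turkey", "France"],
    "State": [None, None],
    "AvgTemperature": [60.0, 55.0],
})

# Without inplace, drop returns a new DataFrame and leaves toy untouched
without_state = toy.drop("State", axis=1)
same_thing = toy.drop(columns="State")   # equivalent to axis=1
columns_before = list(toy.columns)

# With inplace=True, toy itself is modified and nothing is returned
toy.drop("State", axis=1, inplace=True)
columns_after = list(toy.columns)

print(columns_before, columns_after)
```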

Let's convert the first three columns, which have the object data type, to the categorical data type.

columns = data.columns[0:3]
for i in columns:
    data[i] = pd.Categorical(data[i])
data.dtypes
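
One common reason to make this conversion is memory: a categorical column stores each distinct label once plus a small integer code per row. The sketch below measures this on a hypothetical column with only two distinct values; the exact byte counts depend on the pandas version, so none are shown.

```python
import pandas as pd

# Hypothetical column with many rows but only two distinct labels
toy = pd.DataFrame({"Region": ["Europe", "Asia", "Europe", "Asia"] * 1000})

bytes_before = toy["Region"].memory_usage(deep=True)

# astype("category") is an alternative spelling of pd.Categorical(...)
toy["Region"] = toy["Region"].astype("category")
bytes_after = toy["Region"].memory_usage(deep=True)

print(bytes_before, bytes_after)
print(list(toy["Region"].cat.categories))
```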

Let’s explore the data frame further to answer some specific questions.

How many cities are in the dataset?

data["City"].nunique()

321

How many rows do we have for each country in our dataset?

data["Country"].value_counts()

How many rows do we have for a specific country?

len(data[data["Country"] == "US"])

1455337

What is the mean of the average temperatures?

data["AvgTemperature"].mean()

56.004920781458054

What is the maximum average temperature?

data["AvgTemperature"].max()

110.0

It's 110! I wonder where this place is. Let's find out.

data[data["AvgTemperature"] == data["AvgTemperature"].max()]

How many rows are there for each city in the countries we choose?

data[data["Country"].isin(["Turkey","Australia","US"])]["City"].value_counts()
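
The isin filter followed by value_counts can be sketched on a hypothetical six-row DataFrame like this (the rows are invented for illustration):

```python
import pandas as pd

# Hypothetical toy data: one row per measurement
toy = pd.DataFrame({
    "Country": ["Turkey", "Turkey", "US", "France", "US", "Turkey"],
    "City": ["Ankara", "Istanbul", "Boston", "Paris", "Boston", "Ankara"],
})

# isin builds a boolean mask that is True for the listed countries
selected = toy[toy["Country"].isin(["Turkey", "US"])]

# value_counts then counts the rows per city among the selected rows
city_counts = selected["City"].value_counts()

print(city_counts)
```

Note that when City is a categorical column, as it is after the conversion above, value_counts() also lists the cities that were filtered out, with a count of 0.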

This example is a bit silly, but we should learn to use our own functions too.

Let's select the rows of countries with the letter F in their name. From these rows we could then compute their average temperature.

def re_find(f):
    # True if the country name contains the letter "f" (case-insensitive)
    return "f" in f.lower()

data[data["Country"].apply(re_find)]
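
The same filter can also be written without a custom function, using the vectorized str.contains method. The sketch below runs on a hypothetical four-row DataFrame and also computes the mean temperature of the matching rows.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Country": ["France", "Turkey", "Finland", "US"],
    "AvgTemperature": [55.0, 60.0, 40.0, 58.0],
})

# case=False makes the match case-insensitive, like f.lower() above
mask = toy["Country"].str.contains("f", case=False)
with_f = toy[mask]
mean_temp_with_f = with_f["AvgTemperature"].mean()

print(with_f)
print(mean_temp_with_f)
```

apply with a Python function is more flexible, while the string methods are usually shorter to write for simple text conditions.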

So how do we find the average temperature of each continent (the Region column)? The groupby method will help us.

data.groupby(by = "Region")["AvgTemperature"].mean()
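
Here is a minimal sketch of groupby on a hypothetical four-row DataFrame. Selecting the column before calling mean() keeps the aggregation away from non-numeric columns, and agg can compute several statistics at once.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Region": ["Europe", "Europe", "Asia", "Asia"],
    "AvgTemperature": [50.0, 60.0, 70.0, 90.0],
})

# One mean per region
mean_by_region = toy.groupby("Region")["AvgTemperature"].mean()

# Several statistics per region in one call
stats = toy.groupby("Region")["AvgTemperature"].agg(["mean", "min", "max"])

print(mean_by_region)
print(stats)
```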

It may be easier to see average temperatures on a graph than in a stack of numbers. Let's do some visualization.

  • We need to import the matplotlib library to make visualizations.

But first, what is matplotlib?

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.

import matplotlib.pyplot as plt
# Jupyter magic command: shows plots inside the notebook (remove it in a plain .py script)
%matplotlib inline
data.groupby(by = "Region")["AvgTemperature"].mean().plot(kind = "bar")

It's easier to see which continent is warmer on average this way.

It is possible to draw a graph where we can compare the average temperatures of the continents with each other. For this, we import the more advanced seaborn library, which is based on matplotlib.

import seaborn as sns

(sns
 .FacetGrid(data,
            hue = "Region",
            height = 5,
            xlim = (0, 120))
 # fill=True replaces shade=True, which newer seaborn versions deprecate
 .map(sns.kdeplot, "AvgTemperature", fill = True)
 .add_legend()
);

What about comparing the average temperature by month?

plt.ylim(0, 120)
sns.boxplot(x = "Month", y = "AvgTemperature",data = data);

The value -99 in the AvgTemperature column appears to be a placeholder for missing measurements rather than a real temperature, so these rows distort the results and we need to remove them from the dataset.

data = data[~(data["AvgTemperature"] == -99)]
data.shape

(2826655, 8)

  • The ~ operator negates the boolean condition in the parentheses to its right, so the rows that satisfy the condition are dropped and all other rows are kept.
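
The negated mask can be sketched on a hypothetical three-row DataFrame as follows. As an alternative that is not used in this article, the placeholder can also be replaced with NaN, which keeps the rows but makes pandas skip those values in calculations such as mean().

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one -99 placeholder
toy = pd.DataFrame({
    "City": ["A", "B", "C"],
    "AvgTemperature": [50.0, -99.0, 70.0],
})

is_placeholder = toy["AvgTemperature"] == -99   # True only for the -99 row
cleaned = toy[~is_placeholder]                  # ~ flips True and False
same = toy[toy["AvgTemperature"] != -99]        # equivalent filter

# Alternative: keep the rows but mark the value as missing
as_nan = toy["AvgTemperature"].replace(-99, np.nan)

print(cleaned)
print(as_nan.mean())   # NaN is skipped by mean()
```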

Let's try to draw the density plots of 3 countries we have chosen on the same figure.

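data_selected = data[data["Country"].isin(["China","France","US"])]
sns.set_style("whitegrid")
plt.figure(figsize=(15,5))
countries = data_selected["Country"].unique()
for country in countries:
    # Use a separate name so the data DataFrame is not overwritten
    temps = data_selected[data_selected["Country"] == country]["AvgTemperature"]
    # histplot with kde=True replaces distplot, which newer seaborn versions deprecate
    sns.histplot(temps, kde=True, stat="density", label=country)
plt.legend()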

What did we do?

  • We examined the dataset using pandas and used it to answer the questions we asked about the data.
  • We visualized the data we obtained so that the data can be read and understood more easily.

The dataset:

https://www.kaggle.com/datasets/subhamjain/temperature-of-all-countries-19952020

Furkan Kızılay

Computer Science Student — Interest in Data Science, ML, DL and AI