Exploratory Data Analysis using Python — A Case Study
This dataset consists of prices and price % change for coarse wool, copra, cotton, fine wool, hard log, hard sawn wood, hide, plywood, rubber, soft log, soft sawn wood, and wood pulp.
The following topics are covered in this article:
Downloading the Dataset
In this tutorial, we’ll analyze the agricultural-raw-material-prices dataset.
There are several options for getting the dataset into Jupyter:
- Download the CSV manually and upload it via Jupyter’s GUI
- Use the
urlretrievefunction from the
urllib.requestto download CSV files from a raw URL
- Use a helper library, e.g.,
opendatasets, which contains a collection of curated datasets and provides a helper function for direct download.
We’ll use the
opendatasets helper library to download the files.
Let’s begin by downloading the data, and listing the files within the dataset.
dataset_url = 'https://www.kaggle.com/kianwee/agricultural-raw-material-prices-19902020'import opendatasets as od
Data Preparation and Cleaning
# Import required packages
import numpy as np
import pandas as pd
Loading the dataset into a data frame using Pandas
Exploring the dataset
df.shape (361, 25)df.info
#Checking Null Values of each column
Handle missing, incorrect and invalid data
# Replacing %, "," and "-"
df = df.replace('%', '', regex=True)
df = df.replace(',', '', regex=True)
df = df.replace('-', '', regex=True)
df = df.replace('', np.nan)
df = df.replace('MAY90', np.nan)# Dropping rows with NaN values
df = df.dropna()# Check to see if all NaN values are resolved
df.isnull().sum()# Converting data type to float
lst = ["Coarse wool Price", "Coarse wool price % Change", "Copra Price", "Copra price % Change", "Cotton price % Change","Fine wool Price", "Fine wool price % Change", "Hard log price % Change", "Hard sawnwood price % Change", "Hide price % change", "Plywood price % Change", "Rubber price % Change", "Softlog price % Change", "Soft sawnwood price % Change", "Wood pulp price % Change"]
df[lst] = df[lst].astype("float")df.dtypes
formatting the datetime column and setting it as index for the dataset
df.Month = pd.to_datetime(df.Month.str.upper(), format='%b%y', yearfirst=False)# Indexing month
df = df.set_index('Month')
Exploratory Analysis and Visualization
Let’s begin by importing
import seaborn as sns
import matplotlib.pyplot as plt
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
The Heatmap depicts correlation between the raw-materials higher the correlated value higher chance of being two raw-materials related but not necessarily
Explore one or more columns by plotting a graph below, and add some explanation about it Correlation matrix
From this plot we can say that there is almost no relation between % change of raw-material prices
- the negative value implies two variables are negatively correlated (one increase,other decrease)
- Zero implies no relation
- other wise higher the value higher the chance of relation.
prices and their % change plots
Similarly we could find the visualization of all other features given in dataset.
Asking and Answering Questions
Q1: Find out the normal price change for each raw material
We can observe that most raw-materials have ideal frequent %change less than 5%
Q2: Find the raw-material that has lowest price over years
- We can see cotton and rubber are of lowest prices
- lets compare prices to better understand which is lowest
cotton is lowest price rawmaterials in recent years
Both cotton and rubber are low price materials
From the graphs we could analyze raw materials into types according their price over years
low price materials
-cotton,hide,softlog,Hard log,Soft sawnwood Price,rubber
High price materials
-coarse wool,copra,fine wool,hard sawnwood,woodpulp,plywood
Q3: which raw material has the highest and lowest price % change
import random as random
We can see the highest % change at more than 60 for soft sawnwood and lowest % change is for plywood at less at 20
Q4: Find the raw materials with drastic price change
Price change is drastic for hard log price among low price range materials and Among high price materials it is Fine wool prices
Q5: Figure out the price range of low priced raw-materials
Box Plot gives us the distribution of data
It includes: Inter quartile range is between Q3 and Q1 minimum, first quartile (Q1), median, third quartile (Q3), and maximum and outliers
Let us save and upload our work to Jovian before continuing.
Inferences and Conclusion
- We found out the high range and low range raw-materials according to their prices.
- high and low %Change materials
- We could identify the the range of prices change over the years.
- Correlation between them using a heatmap
- Further we could find out relationship between raw-materials in details and with the help of business knowledge we could figure out the closest related raw materials
- We could also predict the prices of raw-materials by analysing the previous years
- Predicting the range of price change and based on that we could plan our raw material usage in advance for the coming the year and figure out any alternative materials.
Thanks for reading out …
Hope you find this article helpful:)