Data Science

Exploratory Data Analysis using Python — A Case Study

Analyzing agricultural-raw-material-prices dataset over the years.

4 min readJul 7, 2022


Exploratory data analysis on raw-materials dataset

This dataset consists of prices and price % change for coarse wool, copra, cotton, fine wool, hard log, hard sawn wood, hide, plywood, rubber, soft log, soft sawn wood, and wood pulp.

The following topics are covered in this article:

Index Of Contents
· Downloading the Dataset
· Data Preparation and Cleaning
· Exploratory Analysis and Visualization
· Asking and Answering Questions
· Inferences and Conclusion
· Future Work
· References

Downloading the Dataset

In this tutorial, we’ll analyze the agricultural-raw-material-prices dataset.

There are several options for getting the dataset into Jupyter:

  • Download the CSV manually and upload it via Jupyter’s GUI
  • Use the urlretrieve function from the urllib.request to download CSV files from a raw URL
  • Use a helper library, e.g., opendatasets, which contains a collection of curated datasets and provides a helper function for direct download.

We’ll use the opendatasets helper library to download the files.

Let’s begin by downloading the data, and listing the files within the dataset.

dataset_url = ''import opendatasets as od

Data Preparation and Cleaning

# Import required packages
import numpy as np
import pandas as pd

Loading the dataset into a data frame using Pandas


Exploring the dataset

df.shape (361, 25)
#Checking Null Values of each column

Handle missing, incorrect and invalid data

# Replacing %, "," and "-"
df = df.replace('%', '', regex=True)
df = df.replace(',', '', regex=True)
df = df.replace('-', '', regex=True)
df = df.replace('', np.nan)
df = df.replace('MAY90', np.nan)
# Dropping rows with NaN values
df = df.dropna()
# Check to see if all NaN values are resolved
# Converting data type to float
lst = ["Coarse wool Price", "Coarse wool price % Change", "Copra Price", "Copra price % Change", "Cotton price % Change","Fine wool Price", "Fine wool price % Change", "Hard log price % Change", "Hard sawnwood price % Change", "Hide price % change", "Plywood price % Change", "Rubber price % Change", "Softlog price % Change", "Soft sawnwood price % Change", "Wood pulp price % Change"]
df[lst] = df[lst].astype("float")

DateTime columns

formatting the datetime column and setting it as index for the dataset

df.Month  = pd.to_datetime(df.Month.str.upper(), format='%b%y', yearfirst=False)# Indexing month
df = df.set_index('Month')

Exploratory Analysis and Visualization

Let’s begin by importingmatplotlib.pyplot and seaborn.

import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'


The Heatmap depicts correlation between the raw-materials higher the correlated value higher chance of being two raw-materials related but not necessarily

Explore one or more columns by plotting a graph below, and add some explanation about it Correlation matrix

From this plot we can say that there is almost no relation between % change of raw-material prices

  1. the negative value implies two variables are negatively correlated (one increase,other decrease)
  2. Zero implies no relation
  3. other wise higher the value higher the chance of relation.

prices and their % change plots

Coarse wool

Similarly we could find the visualization of all other features given in dataset.

Asking and Answering Questions

Q1: Find out the normal price change for each raw material

We can observe that most raw-materials have ideal frequent %change less than 5%

Q2: Find the raw-material that has lowest price over years

  1. We can see cotton and rubber are of lowest prices
  2. lets compare prices to better understand which is lowest

cotton is lowest price rawmaterials in recent years

Both cotton and rubber are low price materials

From the graphs we could analyze raw materials into types according their price over years

low price materials

-cotton,hide,softlog,Hard log,Soft sawnwood Price,rubber

High price materials

-coarse wool,copra,fine wool,hard sawnwood,woodpulp,plywood

Q3: which raw material has the highest and lowest price % change

import random as random

We can see the highest % change at more than 60 for soft sawnwood and lowest % change is for plywood at less at 20

Q4: Find the raw materials with drastic price change

Price change is drastic for hard log price among low price range materials and Among high price materials it is Fine wool prices

Q5: Figure out the price range of low priced raw-materials

Box Plot gives us the distribution of data

It includes: Inter quartile range is between Q3 and Q1 minimum, first quartile (Q1), median, third quartile (Q3), and maximum and outliers

Let us save and upload our work to Jovian before continuing.

Inferences and Conclusion

  1. We found out the high range and low range raw-materials according to their prices.
  2. high and low %Change materials
  3. We could identify the the range of prices change over the years.
  4. Correlation between them using a heatmap

Future Work

  1. Further we could find out relationship between raw-materials in details and with the help of business knowledge we could figure out the closest related raw materials
  2. We could also predict the prices of raw-materials by analysing the previous years
  3. Predicting the range of price change and based on that we could plan our raw material usage in advance for the coming the year and figure out any alternative materials.


Dataset link :

Thanks for reading out …

Hope you find this article helpful:)