Data Analysis on Global Trade Statistics

Exploratory Data Analysis

Meenakshi Ravikumar
4 min read · Nov 12, 2021

Are you curious about fertilizer use in developing economies? The growth of Chinese steel exports? American chocolate consumption? Which parts of the world still use typewriters? You’ll find all of that and more here using a process called EDA.

Exploratory Data Analysis (EDA) is an approach to analyzing data sets that summarizes their main characteristics, often using statistical graphics and other data visualization methods.

Agenda

The steps involved in Exploratory Data Analysis are:

  • Import the required libraries and their dependencies.
  • Download the dataset.
  • Prepare and clean the data.
  • Explore the data.
  • Ask and answer questions about the data.
  • Represent the data pictorially using visualization techniques.

Importing the libraries
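
The exact imports used in the original notebook are not shown here, so the list below is an assumption: a minimal sketch of the libraries the rest of this article relies on.

```python
import numpy as np                # numerical helpers, e.g. masking a matrix
import pandas as pd               # loading, cleaning and aggregating the data
import matplotlib.pyplot as plt   # base plotting
import seaborn as sns             # statistical plots built on matplotlib

# Optional extras, installed separately (pip install opendatasets wordcloud):
# import opendatasets as od        # downloading the dataset from Kaggle
# from wordcloud import WordCloud  # drawing the word cloud
```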

Download the data

To download the dataset from Kaggle:

  • Copy the link to the dataset, store it in a variable, and pass it to the opendatasets library, providing your Kaggle credentials when prompted.
  • Copy the path of the downloaded file and store it in a variable (here, data_filename).
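
The two steps above might look roughly like this. The URL is a placeholder and the helper names download_from_kaggle and expected_folder are hypothetical, introduced only for illustration; the download call itself needs the opendatasets package and a Kaggle account.

```python
import os

# Placeholder: paste the link copied from the dataset's Kaggle page.
dataset_url = "https://www.kaggle.com/<owner>/<dataset-name>"

def expected_folder(url, data_dir="."):
    """Folder opendatasets is expected to create: named after the last URL segment."""
    return os.path.join(data_dir, url.rstrip("/").split("/")[-1])

def download_from_kaggle(url, data_dir="."):
    """Download the dataset; opendatasets prompts for a Kaggle username and API key."""
    import opendatasets as od  # third-party, imported here so the sketch loads without it
    od.download(url, data_dir=data_dir)
    return expected_folder(url, data_dir)
```

After calling download_from_kaggle(dataset_url), data_filename would be the returned folder joined with the name of the CSV file inside it.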

Data preparation and cleaning

After downloading the dataset, the next step is to prepare the data by eliminating unwanted or missing values, keeping only the information needed for the analysis. NumPy and Pandas do most of the work here.

Our dataset is stored as a pandas dataframe in the variable Global_commodity_df.

After getting the source file, check for missing values, clean them and retain the useful data.
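
As a rough illustration of that check, here is a sketch on a tiny stand-in for Global_commodity_df. The column names are assumptions, and dropping incomplete rows is only one possible cleaning strategy; the original notebook may have handled missing values differently.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for Global_commodity_df; column names are assumed for illustration.
sample_df = pd.DataFrame({
    "country_or_area": ["Albania", "Zambia", "Albania", "India"],
    "year": [2001, 2001, 2016, 2010],
    "flow": ["Export", "Export", "Import", "Export"],
    "trade_usd": [1200.0, 560.0, 980.0, 4300.0],
    "weight_kg": [300.0, np.nan, 150.0, np.nan],
})

missing_per_column = sample_df.isna().sum()  # number of missing values in each column
cleaned_df = sample_df.dropna()              # keep only rows with no missing values

print(missing_per_column)
print(len(sample_df), "rows before,", len(cleaned_df), "rows after")
```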

Exploratory Analysis

After preparing the data, analyze it with the help of NumPy and Pandas. Use describe() to get the count, mean, standard deviation, minimum, maximum and quartiles of each numeric column.

The info() method shows the data type of each column. Here, two columns are int64, two are float64, and the majority are of object type.
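
A small sketch of both calls on made-up rows (the column names are assumptions, not necessarily those of the real dataset):

```python
import pandas as pd

sample_df = pd.DataFrame({
    "country_or_area": ["Albania", "Zambia", "India", "China"],
    "year": [2001, 2001, 2010, 2016],
    "trade_usd": [1200.0, 560.0, 4300.0, 9800.0],
})

summary = sample_df.describe()  # by default, summarizes numeric columns only
print(summary)

sample_df.info()                # prints each column's dtype and non-null count
```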

Global_commodity_df.columns retrieves the column labels of the dataframe, and Global_commodity_df.shape[0] gives the number of rows (shape is a (rows, columns) tuple, so index 0 is the row count).
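
For example, on a small stand-in dataframe with assumed column names:

```python
import pandas as pd

sample_df = pd.DataFrame({
    "country_or_area": ["Albania", "Zambia", "India", "China"],
    "year": [2001, 2001, 2010, 2016],
    "trade_usd": [1200.0, 560.0, 4300.0, 9800.0],
})

column_names = list(sample_df.columns)  # the column labels
n_rows = sample_df.shape[0]             # number of rows
n_cols = sample_df.shape[1]             # number of columns
print(column_names, n_rows, n_cols)
```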

The dataset we are working with had 8.2 million rows, which cleaning reduced to 7.8 million.

.iloc selects rows and columns by integer position, so it can pick out a specific column from the dataframe. Here, we pick out the countries column to find the number of countries in this dataset.
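
A sketch of that selection, assuming the countries column is the first column (position 0); its actual position in the real dataset is not shown here.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "country_or_area": ["Albania", "Zambia", "Albania", "India", "Zambia"],
    "year": [2001, 2001, 2016, 2010, 2005],
})

country_column = sample_df.iloc[:, 0]    # all rows, column at position 0
n_countries = country_column.nunique()   # number of distinct countries
print(n_countries)
```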

Next, let us focus on some interesting questions taken from this dataset.

  1. Which country imported the most commodities in the year 2016?
  2. What were the top 5 exports for Zambia in the year 2001?
  3. What was the best year for Albania between 2000 and 2016 in terms of exports?
  4. What were the top 5 countries that had the highest exports in the year 2010?

For the answers to these questions and the code used to obtain them, see the code linked at the end of this article.
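
Questions like these usually reduce to a filter followed by a group-and-sum. The sketch below shows the pattern on made-up rows; the column names, the helper functions top_commodities and top_countries, and the choice to rank by trade value in USD are all assumptions rather than the article's actual code.

```python
import pandas as pd

# Made-up rows for illustration only.
trade_df = pd.DataFrame({
    "country_or_area": ["Zambia", "Zambia", "Zambia", "India", "India", "China"],
    "year": [2001, 2001, 2001, 2016, 2016, 2016],
    "commodity": ["Copper", "Cobalt", "Copper", "Rice", "Steel", "Steel"],
    "flow": ["Export", "Export", "Export", "Import", "Import", "Import"],
    "trade_usd": [500.0, 200.0, 100.0, 300.0, 250.0, 400.0],
})

def top_commodities(df, country, year, flow, n=5):
    """Top n commodities by total trade value for one country, year and flow."""
    mask = (df["country_or_area"] == country) & (df["year"] == year) & (df["flow"] == flow)
    return df[mask].groupby("commodity")["trade_usd"].sum().nlargest(n)

def top_countries(df, year, flow, n=5):
    """Top n countries by total trade value for one year and flow."""
    mask = (df["year"] == year) & (df["flow"] == flow)
    return df[mask].groupby("country_or_area")["trade_usd"].sum().nlargest(n)

print(top_commodities(trade_df, "Zambia", 2001, "Export"))
print(top_countries(trade_df, 2016, "Import"))
```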

Data Visualization

Finally, we will use some visualization techniques to view the data pictorially.

1. Let's see how line and scatter plots can be used to compare the exports and imports of the United Kingdom.

LINE PLOT
SCATTER PLOT
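
One way to draw such a comparison with matplotlib is sketched below. The numbers are invented, not real UK figures, and the column names are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented values for illustration only.
uk_df = pd.DataFrame({
    "year": [2014, 2014, 2015, 2015, 2016, 2016],
    "flow": ["Export", "Import"] * 3,
    "trade_usd": [5.0, 6.5, 5.4, 6.1, 5.1, 6.8],
})

# One row per year, one column per flow.
yearly = uk_df.pivot_table(index="year", columns="flow", values="trade_usd", aggfunc="sum")

fig, (ax_line, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))
for flow in yearly.columns:
    ax_line.plot(yearly.index, yearly[flow], marker="o", label=flow)
    ax_scatter.scatter(yearly.index, yearly[flow], label=flow)
ax_line.set_title("Line plot")
ax_scatter.set_title("Scatter plot")
for ax in (ax_line, ax_scatter):
    ax.set_xlabel("year")
    ax.set_ylabel("trade_usd")
    ax.legend()
```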

2. Display the most frequent words in the commodities column using a word cloud.

WORD CLOUD
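
A word cloud is driven by word frequencies, so a sketch can start by counting words. The commodity names below are invented, and draw_word_cloud is a hypothetical helper that requires the third-party wordcloud package.

```python
from collections import Counter

# Invented commodity descriptions for illustration only.
commodities = ["Wheat flour", "Steel bars", "Wheat grain",
               "Cocoa beans", "Steel wire", "Wheat bran"]

words = " ".join(commodities).lower().split()
word_counts = Counter(words)           # word -> number of occurrences
print(word_counts.most_common(3))

def draw_word_cloud(frequencies):
    """Build a word cloud image from a word -> count mapping."""
    from wordcloud import WordCloud    # third-party: pip install wordcloud
    cloud = WordCloud(width=800, height=400, background_color="white")
    return cloud.generate_from_frequencies(frequencies)
```

The object returned by draw_word_cloud(word_counts) can be shown with matplotlib's imshow.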

3. Compare the import statistics of India and China.

HISTOGRAM
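
A sketch of overlaid histograms on synthetic values (randomly generated, not real import data). Plotting log10 of the trade value is an assumption, chosen here because trade values tend to span several orders of magnitude.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for the import rows of the two countries.
imports_df = pd.DataFrame({
    "country_or_area": ["India"] * 50 + ["China"] * 50,
    "trade_usd": np.concatenate([rng.lognormal(10, 1, 50), rng.lognormal(11, 1, 50)]),
})

fig, ax = plt.subplots()
for country, group in imports_df.groupby("country_or_area"):
    # Semi-transparent bars so both distributions stay visible.
    ax.hist(np.log10(group["trade_usd"]), bins=15, alpha=0.5, label=country)
ax.set_xlabel("log10(trade_usd)")
ax.set_ylabel("number of rows")
ax.legend()
```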

4. Pick out three countries (Australia, China and India) from the countries column and depict them with a raincloud plot, Andrews curves and a violin plot.

To create the raincloud effect, we combine a half violin plot with a jittered strip plot and a box plot.

RAIN CLOUD
ANDREWS CURVE
VIOLIN PLOT
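
The sketch below approximates these three figures on synthetic data. It is a simplification: the raincloud-style panel overlays a full violin (not a half violin) with a jittered strip plot and a narrow box plot, and the numeric column names are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from pandas.plotting import andrews_curves

rng = np.random.default_rng(1)
# Synthetic stand-in: 10 rows per country, three numeric features.
three_df = pd.DataFrame({
    "country_or_area": np.repeat(["Australia", "China", "India"], 10),
    "log_trade_usd": rng.normal(6, 1, 30),
    "log_weight_kg": rng.normal(5, 1, 30),
    "log_quantity": rng.normal(4, 1, 30),
})

fig, (ax_rain, ax_violin, ax_andrews) = plt.subplots(1, 3, figsize=(15, 4))

# Raincloud-style panel: violin for the density, dots for the rows, box for quartiles.
sns.violinplot(data=three_df, x="country_or_area", y="log_trade_usd",
               inner=None, color="lightblue", ax=ax_rain)
sns.stripplot(data=three_df, x="country_or_area", y="log_trade_usd",
              jitter=True, size=3, color="black", ax=ax_rain)
sns.boxplot(data=three_df, x="country_or_area", y="log_trade_usd",
            width=0.15, color="white", ax=ax_rain)

# Plain violin plot.
sns.violinplot(data=three_df, x="country_or_area", y="log_trade_usd", ax=ax_violin)

# Andrews curves: one curve per row, colored by country.
andrews_curves(three_df, "country_or_area", ax=ax_andrews)
```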

5. Use a count plot to find the number of occurrences of the following countries:

  • Guatemala
  • Spain
  • Swaziland
  • Djibouti
  • New Caledonia
  • Mozambique
  • Honduras
COUNT PLOT

We picked out the countries listed above, stored them in a dataframe, countries7_df, and charted it with a count plot to find how often each occurs in the trade statistics over the years.
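
A count plot shows how many rows fall into each category, which is the same information value_counts() returns. A sketch with invented rows and an assumed column name:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented rows standing in for countries7_df.
countries7_df = pd.DataFrame({
    "country_or_area": ["Spain", "Spain", "Spain", "Honduras", "Honduras", "Djibouti"],
    "year": [2000, 2001, 2002, 2000, 2001, 2000],
})

counts = countries7_df["country_or_area"].value_counts()  # rows per country
print(counts)

fig, ax = plt.subplots()
sns.countplot(data=countries7_df, x="country_or_area", ax=ax)  # one bar per country
ax.set_ylabel("number of rows")
```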

6. With the same countries7_df data, use a hexbin (honeycomb) plot to compare the amount of commodities exported by these countries. Also create a correlation plot for the same data.

HEXBIN PLOT
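
A hexbin plot bins two numeric variables into hexagons and colors each by the number of points inside it. Which two columns the original figure used is not stated, so the weight-versus-value pairing below is an assumption, and the data is synthetic.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
# Synthetic log-scaled values standing in for two numeric columns.
log_weight = rng.normal(5, 1, 200)
log_trade = log_weight + rng.normal(1, 0.5, 200)

fig, ax = plt.subplots()
hb = ax.hexbin(log_weight, log_trade, gridsize=20, cmap="viridis")
fig.colorbar(hb, ax=ax, label="rows per hexagon")
ax.set_xlabel("log10(weight_kg)")
ax.set_ylabel("log10(trade_usd)")
```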

A correlogram is a way to visualize a correlation matrix. To create one, we use Seaborn's heatmap method. Before plotting, we use the Pandas corr method to create the correlation matrix, and then use NumPy to remove (mask out) its upper half.

CORRELATION PLOT
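
The steps just described can be sketched as follows, on synthetic numeric columns with assumed names. Note that this mask hides the diagonal along with the upper triangle.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(3)
weight = rng.normal(size=100)
# Synthetic numeric columns; trade_usd is built to correlate with weight_kg.
numeric_df = pd.DataFrame({
    "trade_usd": 2 * weight + rng.normal(size=100),
    "weight_kg": weight,
    "quantity": rng.normal(size=100),
})

corr = numeric_df.corr()                           # Pearson correlation matrix
mask = np.triu(np.ones(corr.shape, dtype=bool))    # True on and above the diagonal

fig, ax = plt.subplots()
sns.heatmap(corr, mask=mask, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
```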

Summary

In this project, we explored import and export statistics for over 5,000 commodities across most countries on Earth over the last 30 years. In the course of this EDA, we:

  • selected a real-world dataset from https://www.kaggle.com
  • downloaded it and stored it as a pandas DataFrame
  • identified the missing values with NumPy and Pandas, and cleaned the dataset for further analysis
  • explored the numerical statistics of the data
  • asked questions and illustrated some of the answers using visualization techniques

Future Work

We can obtain the same statistics for the pandemic period and compare the exports and imports of commodities to get a detailed view of how trade has fared over the past couple of years. This comparison would show the effects of the pandemic on global trade and which countries were hit hardest by the outbreak.

References

Here are a few references that I found useful during my study.

Check the link below for the full code:

