Exchange Rate Prediction (Part 1): EDA & Data Visualisation

This part kicks off the project with Exploratory Data Analysis (EDA) to clean and structure the dataset, followed by data visualisation and hypothesis testing.

Andrew Nguyen
Analytics Vidhya
9 min read · May 25, 2020

--

Just some disclaimers before diving in:

1. This project is one of a series built and developed with the aim of improving my Data Science skillset in the long run, both technical and analytical.
2. However, I’m not formally trained in data science, nor do I hold a higher degree in it. Everything here is self-taught, from online courses and Medium reads to everyday practice.

Hence, your feedback is greatly appreciated!

Table of Contents

Let’s get started! This is an end-to-end project divided into 3 parts and covering the following sections:

  1. Part 1: Exploratory Data Analysis (EDA) & Data Visualisation (Bonus: Hypothesis Testing)
  2. Part 2: Machine Learning with 4 Regression Models
  3. Part 3: Machine Learning (cont.) with ARIMA

As a kick-off, this article covers the 1st part of the project, so stay tuned for the other two in the coming days! If you’d like to see the entire repository, scroll down to the end for my Github link ;)

Introduction

Exchange rates are a daily topic that captures everyone’s attention, especially in the current Covid-19 situation where the global economy is severely affected. Therefore, I’m highly interested in analysing past exchange rate patterns so as to be able to forecast future values. If you are as keen as I am, you can access the dataset here on Kaggle for your own exploration.

A. Exploratory Data Analysis (EDA)

First, let’s import two fundamental Python libraries, Pandas and NumPy, and read the data:
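A minimal sketch, assuming the Kaggle CSV has been downloaded locally (the file name below may differ from yours):

```python
import pandas as pd
import numpy as np

# File name is an assumption; point this at your local copy of the Kaggle dataset
df = pd.read_csv("Foreign_Exchange_Rates.csv")
```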

To briefly understand our dataset, use df.head() to see the first 5 rows and df.columns to see the variables:
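For instance:

```python
df.head()     # preview the first 5 rows
df.columns    # list all column names
```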

At first glance, I see an unusual variable named “Unnamed: 0”, which should be removed.

Next, the column names are long and awkward to reference later, so I will standardise them to a “CURRENCY_USD” format (e.g. AUD_USD).

As you can see, I created a new list of currency codes, capitalised them, and appended “_USD” to each column name. There you go, a much cleaner dataset to look at!
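A sketch of that clean-up; the list of codes below is deliberately shortened and hypothetical, so match it to your actual columns (in order):

```python
# Drop the stray index column
df = df.drop(columns=["Unnamed: 0"])

# Hypothetical, shortened list of currency codes, one per remaining rate column
codes = ["aud", "eur", "nzd", "gbp", "sgd"]
df.columns = ["Date"] + [c.upper() + "_USD" for c in codes]
```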

Looking through the entire dataset again, there are some unusual values named “ND”. I replaced them with NaN using np.nan and then dropped all NaN rows in one go. If df.isna().sum() reports no remaining NaN values, we are good to go!
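In code, roughly:

```python
# Replace the "ND" placeholders with NaN, then drop those rows
df = df.replace("ND", np.nan).dropna()

# Sanity check: every column should report 0 missing values
print(df.isna().sum())
```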

B. Data Visualisation

Let’s look at the dataset in a more visually engaging format!

First, we need to make sure that the values are numeric rather than objects, by calling df.dtypes to check the types and pd.to_numeric for the conversion.

Use a for loop to apply the change to all columns at once instead of repeating the process for each one. The errors parameter of the function can be set to “ignore” (return invalid values as-is) or “coerce” (convert invalid values to NaN).
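A minimal version of that loop:

```python
# Convert every rate column from object to numeric;
# "coerce" turns any stray non-numeric strings into NaN
for col in df.columns:
    if col != "Date":
        df[col] = pd.to_numeric(df[col], errors="coerce")
```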

As the date format is currently YYYY-MM-DD, which is rather cramped on charts, I will add another column formatted as YYYY-MM and use it instead.

We use pd.to_datetime to convert the values from object to datetime, then extract the month and year with .dt.month and .dt.year. Finally, we create a new column called “month_year” using .dt.to_period(‘M’). You can refer to this documentation for full details.
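Roughly, assuming the date column is named “Date”:

```python
df["Date"] = pd.to_datetime(df["Date"])          # object -> datetime
df["month"] = df["Date"].dt.month
df["year"] = df["Date"].dt.year
df["month_year"] = df["Date"].dt.to_period("M")  # e.g. 2008-09
```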

Let’s visualise the exchange rate of AUD/USD for simplicity first.

By using .count(), we know that there are 5k+ values for each column. That’s a huge number! So instead, we will take the average rate of each month, assuming the mean represents the entire month. Groupby to the rescue!
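For a single currency, that looks something like:

```python
# One row per month: the mean of all daily AUD/USD rates in that month
aud_monthly = df.groupby("month_year")["AUD_USD"].mean()
print(aud_monthly.count())   # far fewer rows than the 5k+ daily values
```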

Let's import the Matplotlib library for our first chart!
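A minimal sketch of the first chart:

```python
import matplotlib.pyplot as plt

aud_monthly.plot(figsize=(12, 6), title="Monthly average AUD/USD")
plt.xlabel("Month")
plt.ylabel("Exchange rate")
plt.show()
```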

Okay, we now can see the pattern for AUD/USD. Let’s upgrade our chart with more than one exchange rate.

To apply groupby across multiple columns, we use a lambda that applies a function (in this case, .mean()) to each group, then convert the result back to a DataFrame for easier use.

The reason I created a USD_mean column filled with the value 1 is to visually illustrate how all the exchange rates differ from the baseline (USD against itself is always 1). A list comprehension is the easiest way to create such a list.
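Putting the grouping, the baseline, and the plot together; the “_mean” column names below are my assumption, chosen to match the variables referenced later (AUD_mean, SGD_mean, USD_mean):

```python
# Group every rate column by month via a lambda, then back to a DataFrame
rate_cols = [c for c in df.columns if c.endswith("_USD")]
df_grouped = df.groupby("month_year")[rate_cols].apply(lambda x: x.mean())

# Rename with a "_mean" suffix (AUD_mean, SGD_mean, ...)
df_grouped.columns = [c.replace("_USD", "_mean") for c in df_grouped.columns]

# USD against itself is always 1: the baseline, built with a list comprehension
df_grouped["USD_mean"] = [1 for _ in range(len(df_grouped))]

df_grouped.plot(figsize=(12, 6))
plt.ylabel("Exchange rate against USD")
plt.show()
```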

Here you are! The pattern of all exchange rates against the USD baseline is pretty much the same. A peak appears between 2008 and 2010, which can be explained by the Great Recession of 2008.

Apart from Matplotlib, there are several other visualisation libraries, such as Plotly and Bokeh, which offer more interactivity. If you are keen to explore, check out my Github at the end for the full code. Here’s a brief chart using Bokeh:
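The interactive chart itself lives in the repo; a bare-bones Bokeh equivalent might look like this (column names assumed from the grouping step above):

```python
from bokeh.plotting import figure, show

p = figure(title="Monthly mean rates against USD",
           x_axis_label="Month index", y_axis_label="Rate",
           width=800, height=400)
x = list(range(len(df_grouped)))
p.line(x, df_grouped["AUD_mean"], legend_label="AUD_mean", color="navy")
p.line(x, df_grouped["SGD_mean"], legend_label="SGD_mean", color="orange")
show(p)
```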

C. Hypothesis Testing

To be completely honest, I’m still struggling with tons of different elements, theories, and functions behind Hypothesis Testing.

That’s why I decided to practice a bit in this project by asking myself whether these variables in the dataset are different from one another or not.

Let’s use AUD_mean and SGD_mean for this example. The reason I chose these two variables is that I see a relative difference between them in the chart above, so this test will either confirm or refute that assumption.

H0 (Null Hypothesis): The mean of AUD is the same as that of SGD.
H1 (Alternative Hypothesis): The mean of AUD is significantly different from that of SGD.

However, before conducting a hypothesis test, there are some prerequisites and a process we need to be aware of:

  1. Draw a sample from each population
  2. Test the normality of each sample set and the correlation between the two
  3. Select the right hypothesis test to use

For the first task, the question is “how much data should we take out of each population?” If you recall, we have 5k+ values (the population) for each variable. To determine the sample size, we can rely on Power Analysis.

But first, let’s prepare the datasets of this section:
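The article names AUD_mean and SGD_mean, so I take the monthly means from the grouping step as the populations here (an assumption on my part):

```python
# Populations for the test, taken from the monthly means used in the chart
aud = df_grouped["AUD_mean"].to_numpy()
sgd = df_grouped["SGD_mean"].to_numpy()
```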

Then, we import the statsmodels library to conduct Power Analysis:
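A sketch of the calculation; the effect size, power, and alpha below are assumptions, chosen because they reproduce the sample size of 118 reported further down:

```python
from statsmodels.stats.power import TTestPower

analysis = TTestPower()
# Solve for the number of observations given effect size, alpha, and power
sample_size = analysis.solve_power(effect_size=0.3, alpha=0.05, power=0.9)
print(round(sample_size))   # -> 118
```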

For full details, please refer to the documentation here. In short, TTestPower is a statistical power calculation for the paired-sample t-test. We know these variables are paired (or connected to some extent) since both are quoted against the USD. .solve_power() is a function with 3 core parameters: effect size, power, and alpha (aka significance level):

  1. Effect size: in layman’s terms, the difference between the two groups (e.g. the difference of their means) divided by the standard deviation of one of the groups. If there is not enough data to calculate it, we can try effect sizes in the range of 0.3 to 0.8 and use the normality tests later on to find the best value.
  2. Power: the probability that we will reject the null hypothesis when we should. It is generally accepted that power should be 0.8 or 0.9, depending on how high you want the chance of detecting a real difference to be.
  3. Alpha: the error rate we are willing to accept, often set at 0.05 or 0.01. In brief, an alpha of 0.05 means we are willing to accept a 5% chance that our results are due to randomness.

If my explanation seems a bit difficult to grasp, I recommend looking at other materials to really understand the concepts; one of those is this document here.

After running the Power Analysis, we find that the required sample size for each variable is 118. We then use this number to randomly draw a sample from each population:
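For example, with NumPy’s random sampling (the seed is arbitrary); drawing a single set of indices keeps the two samples paired, which the paired t-test below relies on:

```python
rng = np.random.default_rng(42)   # arbitrary seed for reproducibility

# Sample the same 118 positions from both series to preserve the pairing
idx = rng.choice(len(aud), size=118, replace=False)
aud_sample, sgd_sample = aud[idx], sgd[idx]
```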

Step 1 is done!

Moving on to step 2, we need to test the normality of each sample set and the correlation between the two. In brief, the two sample sets are required to be normally distributed and not correlated with each other.

For details of H0 and H1 of each test, refer to this quick cheat sheet.
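Shapiro-Wilk is a common choice for the normality check and Pearson’s r for the correlation; a sketch:

```python
from scipy.stats import shapiro, pearsonr

# Shapiro-Wilk: H0 = the sample is normally distributed
for name, sample in [("AUD", aud_sample), ("SGD", sgd_sample)]:
    stat, p = shapiro(sample)
    print(f"{name}: p = {p:.3f} ->", "normal" if p > 0.05 else "not normal")

# Pearson: H0 = no linear correlation between the two samples
r, p = pearsonr(aud_sample, sgd_sample)
print(f"Pearson r = {r:.3f}, p = {p:.3f}")
```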

Back to our example: with an effect size of 0.3, the result indicates the samples are not normally distributed. As I increased the effect size and finally reached 0.8 (a sample size of 18), the sample sets became normally distributed and usable for hypothesis testing. The Pearson test shows they are independent of each other.

Finally, let’s import the SciPy library for the hypothesis test itself. The answer we are looking for is whether the means of the AUD and SGD sample sets are equal or not. If the p-value is < 0.05, we can reject the null hypothesis and conclude that they are different.
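A sketch using the paired t-test, consistent with the TTestPower calculation above:

```python
from scipy.stats import ttest_rel

# Paired t-test: H0 = the two sample means are equal
t_stat, pval = ttest_rel(aud_sample, sgd_sample)
print(f"t = {t_stat:.3f}, p-value = {pval:.3f}")

if pval < 0.05:
    print("Reject H0: the means are significantly different")
else:
    print("Fail to reject H0: no significant difference detected")
```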

However, it turned out that the p-value was > 0.05: we fail to reject the null hypothesis, so my assumption from the chart that their means are different was refuted!

Well, that’s a wrap for the 1st part of this project! Phew, a long read indeed!

I hope you enjoyed my 1st Medium article, and feel free to leave a comment here or on my Github! Do look out for my next 2 parts on Machine Learning to see how I computed the predicted exchange rates for 2020!

Let’s connect!

Github: https://github.com/andrewnguyen07
LinkedIn: www.linkedin.com/in/andrewnguyen07

Thanks!
