# COVID19’s effect on airport traffic: A Data Analysis in R

Download the data from here.

# Load your data (csv format)

`data = read.csv("location/filename.csv")`

# Exploration of Data

# Observing the data

`head(data) `

tail(data)

# Variable Types

`str(data)`

# Understanding the Data

## Important Variables

## Numerical Variables

- PercentOfBaseline

## Categorical Variables

- Date
- AirportName
- Centroid (of airports)
- City
- State
- Country
- Geography (of countries)

## Comments

- This is time-series data.
- The centroid and geography will be only helpful if we want to visualize the data on the world map.
- Country → State → City → AirportName are factor variables inside the next one.

# Focus

- Data Visualization is the main focus of the project.
- Due to the structure of Country → State → City → AirportName, the best way to visualize it in an app with dropdown options. (R Shiny App, Tableau, Power BI, etc.)
- Analyze each of the individual time series and forecast for the next “n” number of days.
- Data visualization of categorical variables and their relationship is not important, because they are interrelated and give no valuable insight.

# Exploratory Data Analysis

# Univariate (Kernel Density of PercentOfBaseline)

`library(ggplot2)`

ggplot(data, aes(x = PercentOfBaseline, fill = "red"))+geom_density(alpha = 0.2)

**Exercise**: Plot individual kernel density of each factor of Country, State, City, and AirportName.

# Bivariate

# Numerical vs Categorical

## - PercentofBaseline vs Country

`ggplot(data, aes(x = PercentOfBaseline, fill = Country))+geom_density(alpha = 0.2)`

## Observations:

- All the countries are behaving like normal mixtures with multiple modes.
- Australia has two distinct modes of normal mixtures. (Why?)
- Canada has a higher PercentOfBaseline than most of the other countries.

**Exercise**: Plot and compare kernel density of PercentOfBaseline of factors of Country, State, City, and AirportName.

# - PercentofBaseline vs Date (Time Series)

`ggplot(data, aes(x =Date, y = PercentOfBaseline, group = Country, colour = Country)) + geom_line() +facet_wrap(~ Country)`

## Observations:

- Except for Australia, the other countries have an increasing trend.
- Australia has two distinct modes of normal mixtures because, over time, the PercentOfBaseline has a major shift.

**Exercise**: Plot and compare the time series of PercentofBaseline of Country, State, City, and AirportName.

# Time Series Forecast

Time Series is a specific data structure in R.

We have to convert data into time series data structures to apply time series algorithms.

# Chile Data

`library(dplyr)`

data_chile = data %>% filter(Country == "Chile")

chile = data_chile[,c(2,5)]

head(chile)

# Convert into time-series data structure*

- 1,2,3,4,… as time index with no NA values.

`library(zoo)`

z <- read.zoo(chile, format = "%Y-%m-%d")#zoo series for dates

time(z) <- seq_along(time(z))#sequential data as 1,2,3,4,...

ts_chile = as.ts(z) #conversion into time series data structure

head(ts_chile)

- Date as a time index with NA values to be removed.

`library(zoo)`

library(tseries)#for removing na from ts()

z <- read.zoo(chile, format = "%Y-%m-%d")#zoo series for dates

ts_chile = as.ts(z)

ts_chile = na.remove(ts_chile)#remove na from time series

head(ts_chile)

If you convert here, like this it will create weird numbers. I prefer to do it as 1,2,3,4,… time index.

# Plot the time series

`plot(ts_chile)`

Two parts of the Time Series, we’ll discuss are

- Decomposition (Trend + Seasonal + Error)
- Forecasting

# Decomposition

`decompose(ts_chile)`

This is happening because the data has no seasonal trend

# Forecast

# Model (ARIMA model)

`library(forecast)`

model = auto.arima(ts_chile)

model

# ACF Plot

`acf(model$residuals, main = "Correlogram")`

# PACF Plot

`pacf(model$residuals, main = "Partial Correlogram")`

# Ljung Box Test

The test determines whether or not errors are iid (i.e. white noise) or whether there is something more behind them; whether or not the autocorrelations for the errors or residuals are non zero.

`Box.test(model$residuals, type = "Ljung-Box")`

We do not reject the hypothesis, that the model shows a good fit.

# Normality of Residuals

`hist(model$residuals, freq = FALSE)`

lines(density(model$residuals))

# Forecast

`forecast = forecast(model,4)#4 = number of units you want to `

library(ggplot2)

autoplot(forecast) #plot of the model

`accuracy(forecast) #performance of the model`

`forecast #forecast values`

**Exercise**: Apply it to other countries as well.

All the best. :D