# COVID19’s effect on airport traffic: A Data Analysis in R

`data = read.csv("location/filename.csv")`

# Observing the data

`head(data) tail(data)`

# Variable Types

`str(data)`

# Understanding the Data

## Numerical Variables

• PercentOfBaseline

## Categorical Variables

• Date
• AirportName
• Centroid (of airports)
• City
• State
• Country
• Geography (of countries)

• This is time-series data.
• The centroid and geography will be only helpful if we want to visualize the data on the world map.
• Country → State → City → AirportName are factor variables inside the next one.

# Focus

• Data Visualization is the main focus of the project.
• Due to the structure of Country → State → City → AirportName, the best way to visualize it in an app with dropdown options. (R Shiny App, Tableau, Power BI, etc.)
• Analyze each of the individual time series and forecast for the next “n” number of days.
• Data visualization of categorical variables and their relationship is not important, because they are interrelated and give no valuable insight.

# Univariate (Kernel Density of PercentOfBaseline)

`library(ggplot2)ggplot(data, aes(x = PercentOfBaseline, fill = "red"))+geom_density(alpha = 0.2)`

Exercise: Plot individual kernel density of each factor of Country, State, City, and AirportName.

# Numerical vs Categorical

## - PercentofBaseline vs Country

`ggplot(data, aes(x = PercentOfBaseline, fill = Country))+geom_density(alpha = 0.2)`

## Observations:

• All the countries are behaving like normal mixtures with multiple modes.
• Australia has two distinct modes of normal mixtures. (Why?)
• Canada has a higher PercentOfBaseline than most of the other countries.

Exercise: Plot and compare kernel density of PercentOfBaseline of factors of Country, State, City, and AirportName.

# - PercentofBaseline vs Date (Time Series)

`ggplot(data, aes(x =Date, y = PercentOfBaseline, group = Country, colour = Country)) + geom_line() +facet_wrap(~ Country)`

## Observations:

• Except for Australia, the other countries have an increasing trend.
• Australia has two distinct modes of normal mixtures because, over time, the PercentOfBaseline has a major shift.

Exercise: Plot and compare the time series of PercentofBaseline of Country, State, City, and AirportName.

# Time Series Forecast

Time Series is a specific data structure in R.

We have to convert data into time series data structures to apply time series algorithms.

# Chile Data

`library(dplyr)data_chile = data %>% filter(Country == "Chile") chile = data_chile[,c(2,5)]head(chile)`

# Convert into time-series data structure*

• 1,2,3,4,… as time index with no NA values.
`library(zoo)z <- read.zoo(chile, format = "%Y-%m-%d")#zoo series for datestime(z) <- seq_along(time(z))#sequential data as 1,2,3,4,...ts_chile = as.ts(z) #conversion into time series data structurehead(ts_chile)`
• Date as a time index with NA values to be removed.
`library(zoo)library(tseries)#for removing na from ts()z <- read.zoo(chile, format = "%Y-%m-%d")#zoo series for datests_chile = as.ts(z) ts_chile = na.remove(ts_chile)#remove na from time serieshead(ts_chile)`

If you convert here, like this it will create weird numbers. I prefer to do it as 1,2,3,4,… time index.

# Plot the time series

`plot(ts_chile)`

Two parts of the Time Series, we’ll discuss are

• Decomposition (Trend + Seasonal + Error)
• Forecasting

# Decomposition

`decompose(ts_chile)`

This is happening because the data has no seasonal trend

# Model (ARIMA model)

`library(forecast)model = auto.arima(ts_chile)model`

# ACF Plot

`acf(model\$residuals, main = "Correlogram")`

# PACF Plot

`pacf(model\$residuals, main = "Partial Correlogram")`

# Ljung Box Test

The test determines whether or not errors are iid (i.e. white noise) or whether there is something more behind them; whether or not the autocorrelations for the errors or residuals are non zero.

`Box.test(model\$residuals, type = "Ljung-Box")`

We do not reject the hypothesis, that the model shows a good fit.

# Normality of Residuals

`hist(model\$residuals, freq = FALSE)lines(density(model\$residuals))`

# Forecast

`forecast = forecast(model,4)#4 = number of units you want to library(ggplot2) autoplot(forecast) #plot of the model`
`accuracy(forecast) #performance of the model`
`forecast #forecast values`

Exercise: Apply it to other countries as well.

All the best. :D

--

--