Geek Culture
Published in

Geek Culture

Detailed Explanation of Panel Data — How to identify Balanced and unbalanced Panel Data.

Photo by ThisIsEngineering from Pexels

A panel dataset is referred to as longitudinal data or cross-sectional time-series dataset, which, ideally, provides repeated measurements of a certain number of variables over a period on observed units, such as individuals, households, firms, cities, and states.

Time-series and cross-sectional data can be thought of as special cases of panel data that are in one dimension only. Time series can be described as a collection of observations for a single entity at different time intervals and cross-sectional data as a collection of observations for multiple entities at a single point in time.

To identify to which category the dataset belongs (that is panel data, time series, or cross-sectional ) is to find out an identifier that uniquely distinguishes one data record from the other records. If the identifier is a time data field then the data set belongs to time series. Whereas if the data records can be uniquely identified with time data filed and along with an identifier that is non-time related like employee id, student id, airline code, firm code, country code, etc then the data set is panel data. If the data records can be uniquely identified with the non-time identifier, then the data set is cross-sectional data.

In a panel dataset, the number of repeated measurements on the same variables on the same population or sample can be as small as two. A panel data is easily conceptualized as a three-dimensional structure for each variable: the vertical dimension as time and the horizontal dimension as multiple observations for each variable.

Here I am taking a panel data set that is used to predict the cost for U.S. Airlines that contains 5 predictors and 1 response and the dataset can be downloaded from here. It has a total of 90 observations on 6 airline firms for a period of 15 Years (1970–1984).

The snippet of the first 9 records of DataSet

Predictors

I (Airline): I indicate different airline firms and here we are having 6 airline firms from 1 to 6.

We can validate the unique values of the airline firm by using the UNIQUE() function. Here the number of observations is 90, which starts from the 2nd row to ends in the 91st row. So we can use function as below

To identify the unique value of the airline firms

The result contains values from 1 to 6, so we confirm the dataset has only values from 1 to 6 for the airline firm.

T (Year): T indicates the year for which cross-sectional data is being observed. Here the duration of T is from 1970 -1984 i.e., for 15 years.

The number of years of the dataset can be also be validated as airline firm, I.

To identify the unique value of years of observations

Here result contains values from 1 to 15, so we can confirm the dataset contains observations for 15 years.

PF (Fuel Price): The index and price data show the global average price paid at the refinery for aviation jet fuel on the reported date. Fuel Price is always determined by a contract negotiated between an Airline firm and the aviation fuel provider. It is usually a multi-year contract with terms and conditions that determine the price irrespective of the market fluctuation.

LF (Load Factor): It is a measure of how much of an airline’s passenger carrying capacity has been utilized. It generally depends upon flight seating capacity, airline route, demand, etc.,

Q (revenue passenger miles): A revenue passenger mile (RPM) is a transportation industry metric that shows the number of miles traveled by paying passengers and is typically an airline traffic statistic.

Response

Cost (1000, the data): Depends on various factors including Fuel Price, Load Factor, Lease & Depreciation, Aircraft Maintenance, Labor, and Airport Handling Charges.

Dataset Analysis

For this dataset, we have a time data field that year, but with the help of this identifier, we are not able to uniquely identify one data record from other. Here we require one more identifier which is non-time related that is I, airline firm number. So we can confirm our dataset is Panel Data.

Balanced and UnBalanced Panel Data

Panel data can be balanced or unbalanced. In a balanced panel, all panel members (cross-sectional data) have measurements in all periods, or each panel member is observed every year. If a balanced panel contains N panel members and T periods, the number of observations (n) in the dataset is necessarily n = N×T. Whereas for an unbalanced panel, each panel member in a data set has different numbers of observations, or at least one-panel member is not observed every period. If an unbalanced panel contains N panel members and T periods, then the following strict inequality holds for the number of observations (n) in the dataset: n < N×T.

For our dataset number of observations, n = 90, number panel, N = 6 and Time period, T = 15.

which is the same as n, dataset number of observation

Here n = N×T, so our dataset is a balanced panel data. We can also confirm it by using a contingency table or cross-table. If any of the cells in the table have a frequency less than observed, then it indicates the dataset is unbalanced, for our case we should get the same frequency for all 6 airline firms.

To create a contingency table, in the CSV or excel file go to the insert table and select PivotTable. Once it is selected create PivotTable window will pop up, it will automatically select the entire data, in case it is not selected, select the entire data by clicking on the up arrow of Table/Range.

Can select a New Worksheet or Existing Worksheet to place the PivotTable, here I have selected the existing worksheet and for location, you can select an empty cell.

After that press, OK and an empty PivotTable will be created at cell O10 and a PivotTable field on the right-hand side of the sheet which contains all columns and the empty filters.

Here we have ‘I’ which represents the Airline firm as the nontime-related identifier, so drag it to Row and just select the time identifier ‘T’ which represents the year. Once it is done then you should get as below

Now double click on ‘Sum of T’ in the pivot table, then a window will pop up as below.

Select Count instead of Sum, then you can see the pivot table get updated as below.

From this we can conclude that each airline has a total of 15 observations, that is one for each year and our dataset is balanced panel data.

Advantages of Panel Data

Panel data can model both the common and individual behaviors of groups. It contains more information, more variability, and more efficiency than pure time-series data or cross-sectional data and can detect and measure statistical effect which other data set fails.

Thanks for reading!!!! If this article was helpful to you, feel free to clap, share and respond.

If you are interested to check detailed econometric analysis of panel data can check it from here.

--

--

--

A new tech publication by Start it up (https://medium.com/swlh).

Recommended from Medium

How to Fail a Data Science Interview Successfully!

How I Got 4 Data Science Offers and Doubled my Income 2 Months after being Laid Off

The Dark Side Of Data Science: The Dangers Of Sentiment Analysis

Topic Modelling for Absolute Beginners

ABC’s to Logistics Costing - What every Buyer/Logistics Professional should know

Case Study: Real-Time Rain Monitoring during Heavy Rain Event in France

The unlikely technical author

Data Science at Instacart

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Sandhya Krishnan

Sandhya Krishnan

Data Quality Engineering Manager — Certified Data Scientist from IIM Calcutta. Visit my Kaggle Profile @ https://www.kaggle.com/sandhyakrishnan02

More from Medium

Technical: Even linear regression can escape the bias-variance tradeoff

All About Confusion Matrix in Machine Learning — Theory & Code

Relative Importance of Variables for Explainable AI: Shapley Values and Linear Regression

Why Accuracy is not a good measure all classification problems?