Data Analysis of Bank of England/NMG Survey of Household Finances — Part I

Melih Ekinci
9 min readJul 11, 2022

--

1. Data Preparation

Information and files regarding the dataset can be found at https://www.bankofengland.co.uk/statistics/research-datasets

Data Source: 2011–2021 NMG Research Surveys

Survey period:2011- 23–29 September 2011
2012- 12 September — 3 October 2012
2013- 12–30 September 2013
2014- 3–24 September 2014
2015- 2–22 September 2015
2016- 31 August-19 September 2016
2017- 6–26 September 2017
2018- 5–26 September 2018
2019- 4–24 September 2019
2020- 25 August-15 September 2020
2021- 1–21 September 2021

1.1 Importing data

In order to prepare a dataset of UK household income survey combined across several years, necessary libraries and excel file must be imported firstly. After importing the excel file, each sheet must assigned to a dataframe. Lastly, we will concat these dataframes into together to have a single dataframe contanining data for the years between 2011–2021.

Below, I imported the libraries that is needed in this work.

import pandas as pd
import numpy as np
import itertools
import matplotlib.pyplot as plt
import os
import warnings
import seaborn as sns
warnings.filterwarnings("ignore")

Changing the working directory to the file folder of choice.

#Changing working directory to a designated file
os.chdir(r"C:\Users\fimmepe\Desktop\BoE")
print("Directory changed")

Importing the excel file and listing sheet names.

#reading the data
xlsx = pd.ExcelFile('boe-nmg-household-survey-data.xlsx')
#listing sheet names in the data
SheetList=xlsx.sheet_names
SheetList

Next, data frames for each sheet is created using a dictionary.

SheetDictionary={}
#creating datasets for each sheet using dictionary
for i in range(len(SheetList)):
SheetDictionary[SheetList[i]]=xlsx.parse(i)
#getting sheet names into a list.
keys=list(SheetDictionary.keys())

Let’s add a column indicating the year of survey, also look at the number of features for each year. It is important to see whether each data frame has the same features and same number of features. Furthermore, names of these features should be written same in order to concatenate datasets properly. So, for the next step I’ll check two things:

  1. Number of features
  2. Whether feature names are same across datasets
#adding a column indicating the year of survey for each data set
for i in range(keys.index('2011'),len(keys)):
SheetDictionary[keys[i]]["year"]=int(keys[i])
print('For the year '+ str(keys[i]) +', number of features is '+ str(len(SheetDictionary[keys[i]].columns))+'.')

So the number of features is different for the year 2021. Let’s check whether the features are same for the years 2011–2020. We know that their number of features are same. My next step will be getting all feature names in a dataset so I can compare them. It’s really important to strip and lower case features name before comparing them.

column_names= pd.DataFrame()
for i in range(keys.index('2011'),len(keys)):
if keys[i]!='2021':
column_names[keys[i]]=list(SheetDictionary[keys[i]].rename(str.lower, axis='columns').columns.str.strip())
else:
column_names[keys[i]]=pd.Series(list(SheetDictionary[keys[i]].rename(str.lower, axis='columns').columns.str.strip()))

Here I created a column based a condition to test whether all values are equal across 2011–2021. If any feature name is different in any year given a row, the column (Test1) will show 1 as a result.

column_names['Test1'] = column_names.apply(lambda x: 0 if x['2011'] ==
x['2012'] and x['2011'] ==x['2013'] and x['2011'] ==x['2014'] and x['2011'] ==x['2015'] and x['2011'] ==x['2016'] and x['2011'] ==x['2017'] and x['2011'] ==x['2018']
and x['2011'] ==x['2019'] and x['2011'] ==x['2020'] else 1, axis=1)

1.2 Merging dataframes

So, we know that feature names and number are same for the years 2011–2020. For the year 2021, there might be additional features. In order to concatenate these datasets into one single dataset, we can use pd.concat function to match features accordingly. It’s important to keep feature names stripped and lower cased.

#merging all data sets across years
df_merged=SheetDictionary[keys[keys.index('2011')]].rename(str.lower, axis='columns').rename(str.strip, axis='columns')
for i in range(keys.index('2012'),len(keys)):
df_merged=pd.concat([df_merged, SheetDictionary[keys[i]].rename(str.lower, axis='columns').rename(str.strip, axis='columns')],axis=0, ignore_index=True)

1.3 Data Quality

Since we merged datasets properly, we can now try to understand quality of this dataset. For example, looking at missing value ratios for each column can be intuitive.

Below, I created a dataframe which has the features in dataset as index. Number of missing values and its ratio to total number of observations is listed.

null=df_merged.isnull().sum().sort_values(ascending = False)df_null = pd.DataFrame(data=null, columns=['null_number'])df_null["null_ratio"]=df_null["null_number"]/59175*100

Also, I will look at number of features which has 0 missing values. There are only 24 features out of 1924 columns which has no missing observation.

df_null[df_null.null_ratio==0]

I will drop all features that have no observations in it.

df_merged2=df_merged[df_null[df_null.null_ratio!=100].index]
df_merged2.head()

We have left with 1461 features, now we can focus on the features that is topic of analysis: household income and other 4–5 variables.
Features related to household income are qincome,qincome_v1,qincome_v2,qincome_free,qincomefree_v2. Let’s check how these features evolve across years.

As there are several variables related to income in data, I wanted to check all variables to see whether we can use them across 2011–2021 meaningfully.

df_merged2["qincomefreev2_calc"]=df_merged2["qincomefreev2_n_1"]+df_merged2["qincomefreev2_n_2"]+df_merged2["qincomefreev2_n_3"]#var_income=list(df_merged2.columns[df_merged2.columns.str.contains("income")])
var_income=["qincome","qincomev1","qincomev2_1_1","qincomefree_1","qincomefreev2_n_1","qincomefreev2_n_2","qincomefreev2_n_3","qincomefreev2_calc","hincomev2_1","hincomev2_2","hincomev2_3","incomev2comb","combincome"]
var_income.append("year")
var_income.sort()

For the year 2011 and 2012, there is only categorical income feature available for analysis. For other years, there are both categorical and continuous features related to income. For comparison between 2011 and 2021, I have to use categorical ones. I will use 2015 and afterwards to see trend of household income.

Now, I want to look at availability of social variables such as working status, age group, gender and marital status across years.

df_merged2[["dage","fgen","qmarital","workingstatus","a20workingstatus","qworking","dworking","a20dworking","dhousing","year"]].groupby(by="year").mean()

I want to look at savings of these individuals, too.

var_saving=list(df_merged2.columns[df_merged2.columns.str.contains("be2a")])
var_saving.append("year")
df_merged2[var_saving].groupby(by="year").mean()

Let’s look at columns which as missing observation less than 30%

df_merged3=df_merged[df_null[df_null.null_ratio<30].index]
df_merged3.columns

2. Data Analysis

I will choose categorical and continuous income variables, age,housing, monthly savings and expectations over economy to analyze in this work.

Since categories for income variables are different for 2011, I will make 2 graphs looking at distributions separately.

# Created a new variable maps explanation of values using data description.
df_merged2["qincome_mapped"]=df_merged2["qincome"].map({
1: 'Up to £4,499',
2: '£4,500 - £6,499',
3:'£6,500 - £7,499',
4:'£7,500 - £9,499',
5:'£9,500 - £11,499',
6:'£11,500 - £13,499',
7:'£13,500 - £15,499',
8:'£15,500 - £17,499',
9:'£17,500 - £24,999',
10:'£25,000 - £29,999',
11:'£30,000 - £39,999',
12:'£40,000 - £49,999',
13:'£50,000 - £74,999',
14:'£75,000 - £99,999',
15:'£100,000 or more',
97:'Unknown/Prefer Not to State',
96:'Unknown/Prefer Not to State'})

Plotting distribution of household income in 2011.

#df_merged2[df_merged2.year==2011]["qincome_mapped"].hist(alpha=0.5)
plt.rcParams.update({'font.size': 13})
income_category=['Up to £4,499',
'£4,500 - £6,499',
'£6,500 - £7,499',
'£7,500 - £9,499',
'£9,500 - £11,499',
'£11,500 - £13,499',
'£13,500 - £15,499',
'£15,500 - £17,499',
'£17,500 - £24,999',
'£25,000 - £29,999',
'£30,000 - £39,999',
'£40,000 - £49,999',
'£50,000 - £74,999',
'£75,000 - £99,999',
'£100,000 or more',
'Unknown/Prefer Not to State']
fig = plt.figure(figsize=(10, 6), dpi=80)
df_merged2[df_merged2.year==2011]["qincome_mapped"].value_counts().reindex(income_category).plot(kind='bar',color="purple")
plt.xticks(rotation='vertical')
plt.xlabel("Income")
plt.ylabel("Number of individuals")plt.title('Distribution of Household Income in 2011')
plt.show()
#Looking at distributions for 2016 and 2021.
order=['Less than 0',
'0',
'£1 – £2,499',
'£2,500 - £4,499',
'£4,500 - £6,499',
'£6,500 - £7,499',
'£7,500 - £9,499',
'£9,500 - £11,499',
'£11,500 - £13,499',
'£13,500 - £15,499',
'£15,500 - £17,499',
'£17,500 - £24,999',
'£25,000 - £34,999',
'£35,000 - £44,999',
'£45,000 - £54,999',
'£55,000 - £74,999',
'£75,000 - £99,999',
'£100,000 or more',
'Unknown/Prefer Not to State']
fig, ax = plt.subplots(figsize=(10, 6))
ax = sns.countplot(x="incomev2comb_mapped", hue="year", data=df_merged2[(df_merged2.year==2016) | (df_merged2.year==2021)], order=order,palette='inferno')
plt.xticks(rotation='vertical')
ax.set_xlabel("Income", fontsize = 12)
ax.set_ylabel("Number of Individuals", fontsize = 12)
plt.legend(title='Year',loc='upper left')
plt.title("Distribution of Household Income in 2016 and 2021")
plt.show()

It can be said that distribution of income generally stays the same between the years 2016 and 2021. However, there is a noticable increase in right tail of the distribution, indicating that percentage of income group > 75,00 sterlin increased more compared to other income groups.

Now, let’s see trend of average the household income between the years 2015–2021.

df_merged2[df_merged2.year>2014][["qincomefreev2_calc","year"]].groupby(by="year").mean()

Plotting average household income across 2015 and 2021.

data=df_merged2[df_merged2.year>2014][["qincomefreev2_calc","year"]].groupby(by="year").mean()
fig, ax = plt.subplots(figsize=(8, 5))
sns.lineplot(data=data, x="year", y="qincomefreev2_calc",marker='*',color='#965786')
ax.set_ylim(50000, 65000)
ax.set_xlabel("Year", fontsize = 12)
ax.set_ylabel("Average Household Income (£)", fontsize = 12)
ylabels = ['{:,.0f}'.format(x) for x in ax.get_yticks()]
ax.set_yticklabels(ylabels)
# label points on the plot
for x, y in zip(data.index, data['qincomefreev2_calc']):
# the position of the data label relative to the data point can be adjusted by adding/subtracting a value from the x &/ y coordinates
plt.text(x = x, # x-coordinate position of data label
y = y-150, # y-coordinate position of data label, adjusted to be 150 below the data point
s = '{:,.0f}'.format(y), # data label, formatted to ignore decimals
color = 'white').set_backgroundcolor("#965786") # set colour of line
plt.show()
  • According to survey, average income for households in U.K were increasing between the period 2015–2020.
  • During the pandemic, U.K GDP fell by 9.8% due to lockdown measures.
  • Similarly, average household income in 2020 fell by nearly 7.1% in survey data.
fig, ax = plt.subplots(figsize=(8, 5))
data2=df_merged2[df_merged2.year>2014]
ax = sns.boxplot(x="year", y="qincomefreev2_calc", data=data2)
ax.set_xlabel("Year", fontsize = 12)
ax.set_ylabel("Household Income (£)", fontsize = 12)
ax.set_ylim(0, 200000)
plt.show()

Median of household income remains stable at £50,000 between the years 2015 and 2021.

Mean household income was increasing until 2020, it rebounds back its previous levels in 2021.

Let’s look at this trend in terms of age group.

#mapping age groups into a variable
df_merged2["dage_mapped"]=df_merged2["dage"].map({
1: '18-24',
2: '25-34',
3: '35-44',
4: '45-54',
5: '55-64',
6: '65+'})
fig, ax = plt.subplots(figsize=(8, 5))
ax.set_xlabel("Year", fontsize = 12)
ax.set_ylabel("Average Household Income (£)", fontsize = 12)
palette = sns.color_palette("hls", 6)
sns.lineplot(data=df_merged2[df_merged2.year>2014][["qincomefreev2_calc","year","dage_mapped"]].groupby(["year","dage_mapped"]).mean(), x="year", y="qincomefreev2_calc",hue="dage_mapped",palette=palette)
ax.set_ylim(20000, 80000)
ylabels = ['{:,.0f}'.format(x) for x in ax.get_yticks()]
ax.set_yticklabels(ylabels)
plt.legend(title='Age',loc='lower left')
plt.title("Average Household Income by Age")
plt.show()

Individuals decide housing standards based on their income. I want to see how these groups changed in the past years.

#mapping housing groups into a variable
df_merged2["dhousing_mapped"]=df_merged2["dhousing"].map({
1: 'Owned Mortg',
2: 'Owned Outright',
3: 'Rent LA',
4: 'Rent Private',
5: 'Other/DK/Ref'})
#preparing line plot
fig, ax = plt.subplots(figsize=(8, 5))
ax.set_ylabel("Average Household Income (£)", fontsize = 12)
palette = sns.color_palette("hls", 4)
ax=sns.lineplot(data=df_merged2[(df_merged2.year>2014)][["qincomefreev2_calc","year","dhousing_mapped"]].groupby(["year","dhousing_mapped"]).mean(), x="year", y="qincomefreev2_calc",hue="dhousing_mapped",palette=palette)
ax.set_xlabel("Year", fontsize = 12)
ax.set_ylabel("Average Household Income (£)", fontsize = 12)
ylabels = ['{:,.0f}'.format(x) for x in ax.get_yticks()]
ax.set_yticklabels(ylabels)
plt.legend(title='Housing',loc='lower left')
plt.title("Average Household Income by Housing")
plt.show()

It can be said that people who lives on rent affected by the pandemic more.

Below, I plotted figures related to expectations of individuals about economy and monthly saving rate.

df_merged2["mq08_mapped"]=df_merged2["mq08"].map({
1: 'Get a lot better',
2: 'Get a little better',
3: "Stay the same",
4: "Get a little worse",
5: "Get a lot worse",
6: "Don't know / Prefer not to state",
7: "Don't know / Prefer not to state"})
fig, ax = plt.subplots(figsize=(9, 6))
ax.set_ylabel("Average Household Income (£)", fontsize = 12)
palette = sns.color_palette("hls", 6)
ax=sns.lineplot(data=df_merged2[(df_merged2.year>2014)][["qincomefreev2_calc","year","mq08_mapped"]].groupby(["year","mq08_mapped"]).mean(), x="year", y="qincomefreev2_calc",hue="mq08_mapped",palette=palette)
ax.set_xlabel("Year", fontsize = 12)
ax.set_ylabel("Average Household Income (£)", fontsize = 12)
ylabels = ['{:,.0f}'.format(x) for x in ax.get_yticks()]
ax.set_yticklabels(ylabels)
plt.legend(title='Expectations About Economy',loc='upper right')
plt.title("Average Household Income by Expectations About Economy Over the Next 12 Months")
plt.show()
fig, ax = plt.subplots(figsize=(8, 5))palette = sns.color_palette("hls", 4)
ax=sns.lineplot(data=df_merged2[df_merged2.year>2014][["qincomefreev2_calc","year","be2afree_1"]].groupby(["year"]).mean(), x="year", y="qincomefreev2_calc",palette=palette)
ax.set_ylabel("Average Household Income (£)", fontsize = 12)
ax.set_xlabel("Year", fontsize = 12)
ylabels = ['{:,.0f}'.format(x) for x in ax.get_yticks()]
ax.set_yticklabels(ylabels)
ax2 = ax.twinx()
ax1=sns.lineplot(data=df_merged2[df_merged2.year>2014][["qincomefreev2_calc","year","be2afree_1"]].groupby(["year"]).mean(),ax=ax2, x="year", y="be2afree_1",color='y')
ax1.set_ylabel("Average Monthly Saving (£)", fontsize = 12,color='y')
ylabels = ['{:,.0f}'.format(x) for x in ax1.get_yticks()]
ax1.set_yticklabels(ylabels)
plt.title("Average Household Income and Average Monthly Saving")
plt.show()

3. Conclusions & Future Work

  • Income levels mainly concentrates between £17,500–50,000 in 2011, 2016 and 2021.
  • There is a noticable increase in the right tail of distribution between 2016 and 2021, indicating that share of people earning more than £75,500 has increased.
  • Age group between 18–24 is the most affected by the pandemic as the decrease in their income is the most.
  • Individuals who owns a house through mortgage has higher income on average. It can be said that people who lives on rent affected by the pandemic more.
  • Monthly saving amount decreased between 2017 and 2019. It can be said that savings increased in 2020, due to lockdown measures.
  • Individuals who earn more have more optimistic expectations about the economy for the upcoming years.

For better conclusions, income distributions should be analyzed as well. Also, calculations can be repeated by including individuals who did not report his income.

--

--

Melih Ekinci

Data Analyst | MSc Candidate in Artificial Intelligence