A new way to find high net worth individuals?

Thomas Peubey
Doorda
5 min read · Jun 12, 2018


Welcome to the first in what I hope will be a regular series helping all you DataTrekers out there uncover new ways of gaining an edge over your competitors, whether they're at college or at another company!

Like many of my fellow analysts, I have a deep interest in how data can be deployed to provide new insights. At Doorda, we’re always on the hunt for new sources we can add to our already extensive catalog.

However, we must first ask the question: is there any value in it? Generally speaking, a single dataset is rarely valuable on its own; the real value only appears when you combine datasets. Luckily, working at Doorda I have access to a range of pre-processed data and advanced tools, so we can answer the question very quickly.

For the purposes of this article, I'm going to work on the basis that you're starting from scratch. To keep us focused, I'm going to ask a simple question:

Where do the culture vultures live?

Why? It demands that we know both where households live and how much they typically spend on recreation and culture. It's also a very effective way of finding out where well-educated, high-net-worth individuals tend to hang out.

Data Used

For this project we used the three datasets below. Most of these we already have access to in our database, but for the benefit of this article I'll take you through the process of acquiring, cleaning and mapping the data.

  • Weekly Household Expenditure data (xls)
  • 2011 Output Area Classification (OAC) Clusters and Names (csv)
  • Output Area (December 2011) Full Extent Boundaries in England and Wales (SHP)

Tools Used

  • RStudio
  • My trusty Mac

I’ll mainly be using R, as it’s my go-to language for processing and analysing data.

When you extract data from public websites you also need to check the reuse licence. Luckily for us, most of the data we’re using is released under the Open Government Licence, so we can fill our boots.

Acquiring Data

All three datasets listed above can be imported into R as-is, other than the Weekly Expenditure data, which will first require conversion from xls to csv.
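
If you'd rather script that conversion than do it by hand, something like the sketch below works. I'm assuming the readxl package here, and the sheet index and file paths are placeholders rather than the real ones:

library("readxl")
# Read the expenditure sheet from the xls workbook (check which sheet
# actually holds the data) and write it back out as csv
expenditure_xls <- read_excel("/home/ubuntu/household_expenditure.xls", sheet = 1)
write.csv(expenditure_xls, "/home/ubuntu/household_expenditure.csv", row.names = FALSE)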

library("data.table")
library("rgdal")
library("plyr")
library("ggplot2")
library("ggmaps")
library("sp")
# Weekly Expenditure Data
household_expenditure <- fread("/home/ubuntu/household_expenditure.csv")
# OAC Cluster and Names
oac_cluster = fread("/home/ubuntu/2011 OAC Clusters and Names csv v2.csv")
# OA Boundaries Polygon (SHP file)
oa_bd <- readOGR("./oa_boundaries", "Output_Area_December_2011_Full_Extent_Boundaries_in_England_and_Wales")

The Weekly Expenditure data comes with 16 different categories:

1. Food and Non-Alcoholic Drinks
2. Alcoholic Drink, Tobacco and Narcotics
3. Clothing and Footwear
4. Housing, Fuel and Power
5. Household Goods and Services
6. Health
7. Transport
8. Communication
9. Recreation and Culture
10. Education
11. Restaurants and Hotels
12. Miscellaneous Goods and Services
13. All Expenditure Groups
14. Other Expenditure Items
15. Total Expenditure
16. Weekly Expenditure per Person
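
One thing to flag before we go further: the join in the next section assumes snake_case column names (oac_group, weekly_expenditure_per_person and so on). If your csv export still carries the original headers, a quick clean-up along these lines will normalise them (a sketch, so adjust the pattern to whatever headers your file actually has):

# Lower-case the headers and replace spaces and hyphens with underscores
# so they match the column names used in the rest of this article
setnames(household_expenditure, gsub("[ -]", "_", tolower(names(household_expenditure))))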

Data Process

Next, we have to process the data into a usable form. We first join the expenditure data with the Output Area Classification data. To do this, we replace spaces with underscores in the Group Name and then join it with the expenditure data.

  1. Match weekly expenditure to OAC cluster names dataset by OAC group (e.g. Farming_Communities_1A, Rural_Tenants_1B)
# Create Group Name in OAC to match group name in expenditure data
oac_cluster$oac_group = paste(gsub(" ", "_", oac_cluster$`Group Name`), toupper(oac_cluster$`Group Code`), sep = "_")
# Left join on oac_cluster and household_expenditure
oac_cluster_expenditure = join(oac_cluster, household_expenditure, by = "oac_group", type = "left", match="first")
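
Before moving on, it's worth a quick sanity check that the join actually matched. Any OAC group without a counterpart in the expenditure data will come back as NA (I'm assuming the expenditure column is called weekly_expenditure_per_person, as used later on):

# Count rows where the join found no matching expenditure record
sum(is.na(oac_cluster_expenditure$weekly_expenditure_per_person))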

Once we have assigned the values in the expenditure data to each Output Area, we need to add that data to the Spatial Polygons dataframe we imported from the SHP file earlier. We also need to convert the spatial data to a normal dataframe so that it can be used by the ggmap and ggplot2 libraries.

2. Add expenditure data to Spatial Data and convert to dataframe

# Standardise oa naming 
colnames(oac_cluster_expenditure)[2] = "oa11cd"
# Merge data with Spatial Polygon Dataframe
full_data = merge(oa_bd, oac_cluster_expenditure, by="oa11cd", sort=FALSE)
# Convert Easting Northing to Lat Lon
full_data_transformed <- spTransform(full_data, CRS("+proj=longlat +datum=WGS84"))
# Convert the spatial data to a dataframe that can be plotted on a map
full_data_transformed@data$id = rownames(full_data_transformed@data)
full_data_transformed.points = fortify(full_data_transformed, region="id")
full_data.df = join(full_data_transformed.points, full_data_transformed@data, by="id")

Visualising

Now that we have cleaned and transformed the data, we can start visualising. We will plot the polygons we imported earlier from the SHP file using the ggplot2 and ggmap libraries. Let's see what the spread of weekly expenditure per person looks like in North West and South East London. (Warning: the spatial data consists of 181,408 polygons, which require a significant amount of memory to render, so create a subset of the data instead, as sketched below.)
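
As a minimal sketch of that subsetting, you can clip the fortified dataframe to a rough bounding box around Greater London; the coordinates below are approximate, so tweak to taste:

# Keep only the points that fall inside a rough Greater London bounding
# box - crude, but it keeps the memory footprint manageable
london.df <- full_data.df[full_data.df$long > -0.51 & full_data.df$long < 0.33 &
                          full_data.df$lat > 51.28 & full_data.df$lat < 51.69, ]

If you go down this route, swap full_data.df for london.df in the plotting code below.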

# Function to Round Up values
RoundUp <- function(from, to) ceiling(from / to) * to
# Round up weekly expenditure to the nearest £100 and convert to factor
full_data.df$weekly_expenditure_per_person = as.factor(RoundUp(as.integer(full_data.df$weekly_expenditure_per_person), 100))
# Get the map of the area using the lat lon of the data
guk <- get_map(location=c(lon=mean(full_data.df$long), lat=mean(full_data.df$lat)))
# Plot on Map
ggmap(guk) +
  geom_polygon(aes(x=long, y=lat, group=group, fill=weekly_expenditure_per_person),
               data = full_data.df,
               alpha = 0.8,
               color = "white",
               size = 0.2) +
  geom_path(color="white") +
  coord_equal() +
  scale_fill_brewer("Weekly Expenditure per Person (£)", palette="YlOrRd")
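
One practical tip: rendering this many polygons in the RStudio plot pane can be painfully slow, so writing the plot straight to disk is often quicker. ggsave picks up the last plot by default, and the dimensions here are just a reasonable starting point:

# Save the last plot straight to a file instead of rendering it interactively
ggsave("weekly_expenditure_map.png", width = 10, height = 10, dpi = 150)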

Results

With the data mapped, we can identify areas in NW and SE London with high expenditure (specifically areas like Kensington and Westminster). The same process can also be used to explore expenditure on other categories across different areas.
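
To answer the original culture-vulture question directly, you can re-run the same plot with the fill switched to the Recreation and Culture column. A sketch, assuming a hypothetical column name (use whatever your expenditure csv actually calls it):

# Same map as before, but coloured by weekly spend on Recreation and Culture
# ("recreation_and_culture" is a hypothetical column name)
full_data.df$recreation_and_culture = as.factor(RoundUp(as.integer(full_data.df$recreation_and_culture), 10))
ggmap(guk) +
  geom_polygon(aes(x=long, y=lat, group=group, fill=recreation_and_culture),
               data = full_data.df, alpha = 0.8, color = "white", size = 0.2) +
  scale_fill_brewer("Weekly Spend on Recreation and Culture (£)", palette="YlOrRd")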

Conclusion

It is relatively simple to plot Census data on a map, but the real power and insight come from combining Census data with other geographical data such as property prices, age groups, ethnicity and so on. Also, since this example only maps the data to Output Area level (175,434 OAs), we cannot get more granular insights at certain locations.

At Doorda we have mapped all our statistical data to a postcode level, which means that you can have more granular insights in your geographical analysis.

Our next article will be on location analysis: how can you tell if an area is getting wealthier or poorer? You'll be surprised what data we use to answer the question.
