Importing and Cleaning up from SPSS Survey Data to Power BI (with R script or SPSS syntax)

Filip Mitrovic
Analytics Vidhya
Published in
7 min readOct 11, 2021

Data from household surveys is usually presented in table after table of useful information with important distinctive background characteristics, like wealth of the respondents, whether they live in a urban or a rural area, . region where they live, age, gender or ethnicity.

Problem is that most users find tables less friendly and easy to explore than if data was visualized. Most surveys, including MICS have started producing attachments to their main findings reports, published in tables in a form of various visualizations. For MICS, these would be Snapshots, a leaflet or a booklet form with data presented in graphs, grouped in topics with added key messages to provide some context to the data.

These are moves in good direction, but data can also easily be presented in a dynamic way, using some of the available tools for data visualization like Tableau or Power BI. Instead of showing fixed graphs to users, they can freely explore data using interactive dashboards.

As in other cases, every survey program has a standard for using certain statistical analysis package. MICS surveys usually create up to seven datasets from each survey taken, and produce datasets into in a SPSS ‘*.sav’ format. SPSS is not open source and various licenses can get expensive.

Tableau can read the SPSS files automatically, and there are no issues in importing them to the platform. Power BI on the other hand cannot. This is a shame really as representing complex household survey data in an interactive manner should be a priority for such a platform, but there are of course multiple work arounds for the current incompatibility between SPSS and Power BI.

To import data in Power BI, if a user has SPSS software license they can simply run a syntax to convert *.sav file to a compatible format, clean up data (remove variables that are not going to be used in the dashboard or calculate additional ones that are not in the original dataset). Other method this post described with code attached is a really cool option to run R scripts in Power BI and use R code to import SPSS survey data into Power BI platform.

Now, in my experience, both Power BI and Tableau are great for visualizing data. Both packages do have their own language (DAX and VizQL respectively) that can be used for additional manipulation and analysis of imported data. However, this analysis is a bit clunky and platforms like SPSS or R are way more elegant way to do any analysis. Doing calculations in DAX, for example, especially on large datasets, causes dashboards to load slower also, and to an extent this happens in Tableau as well. In my experience it is best to just prepare needed dataset as much as possible in data analysis package and upload only variables in a dataset that will actually be used to data visualization platforms, rather than to have the whole file there, or create additional calculated variables, especially if there will be a lot of them.

For this exercise, the snapshot made using MICS Snapshot template for Tonga MICS 2019 survey are reproduced, focusing on the first snapshot in the series showing information on survey and sample characteristics. As suggested before only variables needed to re-create this snapshot as a interactive dashboard in Power BI or Tableau will be imported, and additional ones (like completion rates for different questionnaires) will be calculated prior to import. Please note datasets in MICS surveys only show values from questionnaires, without any additional calculations for the most cases. In this post, will only be focusing on data cleaning and in the following one will show how these dashboards in both Power BI and Tableau are actually created.

Importing using SPSS

This method would require users to have SPSS installed. As suggested user would trim down the file and calculate all additional variables that are not in the original dataset in SPSS and only then make the import. Syntax then creates an excel file (this could be some other type of data format, like a ‘*.csv’ file) that can be easily imported in the Power BI platform. Syntax, with some additional comments is as follows:

cd 'C:\path\file'.* Call include file for the working directory and the survey name.
* Open the household data file.
get file = 'hh.sav'.
* Give value 1 to each HH to calculate total number of hhs sampled.
compute sampled = 1.
* Give value 1 to each HH occupied to calculate total number of hhs occupied.
recode HH46 (1,2,4,7 = 1) (else = 0) into occupied.
* Give value 1 to each hh interviewed to calculate total no of interviewed HHs.
recode HH46 (1 = 1) (else = 0) into complete.
recode HH52 (1 thru hi = 1)(else = 0) into HH52A.* Give value 1 to each hh selected for water quality test.
recode HH9 (1 = 1) (else = 0) into wqSelect.
* exclude cases with incomplete household interview.
if HH46 <> 1 wqSelect = 0.
* Give value 1 to each water quality test completed at household.
recode WQ11 (1 = 1) (else = 0) into wqhhcomplete.
* Give value 1 to each water quality test completed at source.
recode WQ19 (1 = 1) (else = 0) into wqsocomplete.
* Give value 1 to highest each wm questionnaire eligible.
recode HH49 (sysmis = 0) (else = copy) into wmTot.
* Give value 1 to each wm questionnaire completed.
recode HH53 (sysmis = 0) (else = copy) into wmComp.
* Give value 1 to highest each mn questionnaire eligible.
recode HH50A (sysmis = 0) (else = copy) into mnTot.
* Give value 1 to each mn questionnaire completed.
recode HH53 (sysmis = 0) (else = copy) into mnComp.
* Give value 1 to highest each ch questionnaire eligible.
recode HH51 (sysmis = 0) (else = copy) into ch05Tot.
* Give value 1 to each mn questionnaire completed.
recode HH55 (sysmis = 0) (else = copy) into ch05Comp.
* Give value 1 to highest each ch questionnaire eligible.
recode HH52A (sysmis =0) (else = copy) into ch517Tot.
* Give value 1 to each mn questionnaire completed.
recode HH55 (sysmis =0) (else = copy) into ch517Comp.
compute total = 1.
variable labels total "".
value labels total 1 "Total".
save outfile = dataPowerBI2.sav
/keep
HH1
HH2
occupied
complete
HH52A
wqSelect
wqhhcomplete
wqsocomplete
HH49
HH53
HH50A
HH50
HH54
HH51
HH55
HH52
HH56
total
HH6 HH7 religion windex5 HHSEX.
SAVE TRANSLATE OUTFILE = 'C:\Users\Power BI SPSS\dataPowerBI22.xls'
/TYPE=XLS
/keep
HH1
HH2
occupied
complete
HH52A
wqSelect
wqhhcomplete
wqsocomplete
HH49
HH53
HH50A
HH50
HH54
HH51
HH55
HH52
HH56
total
HH6 HH7 religion windex5 HHSEX
/MISSING=IGNORE

Importing using R

Power BI is a powerful tool, integrated in the MS Office environment. One of my favorite features is its integration with R and Phyton scripts. Using R of course, you can create a file in a format that is readable by Power BI from the SPSS file and than import it in the program. But also you can just open Power BI and paste or write your R script there and create desired version of the dataset there.

In this example I kept and calculated the same variables as those that would be imported in the excel file created using SPSS syntax. However instead of creating the file to import, script is written so it generates the dataset based on the ran script in Power BI. Because I will be making slicers (another great feature of Power BI) I have also recoded some of the variables from their numerical response categories to textual ones, making it easier for users to navigate through.

library(foreign)path = 'C:/Users/R SPSS'
setwd(path)
# Tonga Data
data = read.spss('C:/Users/SPSS/hh.sav', to.data.frame = T, use.value.labels = FALSE)
# Give value 1 to each HH to calculate total number of hhs sampled.
data$sampled<-1
data$total<-1# Give value 1 to each HH occupied to calculate total number of hhs occupied.
data$occupied<-ifelse(data$HH46 %in% c(1,2,4,7),1,0)
# Give value 1 to each hh interviewed to calculate total no of interviewed HHs.
data$complete <-ifelse(data$HH46==1,1,0)
data$HH52<-as.numeric(data$HH52)
data$HH52A <-ifelse(data$HH52<0,1,0)
# * Give value 1 to each water quality test completed at household.
data$wqhhcomplete <-ifelse(data$WQ11==1,1,0)
# Give value 1 to each water quality test completed at source.
data$wqsocomplete <-ifelse(data$WQ19==1,1,0)
#Give value 1 to highest each wm questionnaire eligible.
data$wmTot <-ifelse(is.na(data$HH49),0,data$HH49)
data$wmComp <-ifelse(is.na(data$HH53),0,data$HH53)
#Give value 1 to highest each mn questionnaire eligible.
data$mnTot <-ifelse(is.na(data$HH50A),0,data$HH50A)
data$mnComp <-ifelse(is.na(data$HH54),0,data$HH54)
# Give value 1 to highest each ch questionnaire eligible.
data$ch05Tot <-ifelse(is.na(data$HH51),0,data$HH51)
data$ch05Comp <-ifelse(is.na(data$HH55),0,data$HH55)
data$ch518Tot <-ifelse(is.na(data$HH52),0,data$HH52)
data$ch517Comp <-ifelse(is.na(data$HH56),0,data$HH56)
data <- subset(data, select = c("HH1",
"HH2",
"occupied",
"complete",
"HH52A",
"wqhhcomplete",
"wqsocomplete",
"HH49",
"HH53",
"HH50A",
"HH50",
"HH54",
"HH51",
"HH55",
"HH52",
"HH56",
"total",
"HH6", "HH7", "religion", "windex5", "HHSEX", "ethnicity"))
#adding Text Labels to Slicers
#Wealth index
data$wealth[data$windex5 == 1] <- "Poorest"
data$wealth[data$windex5 == 2] <- "Second"
data$wealth[data$windex5 == 3] <- "Middle"
data$wealth[data$windex5 == 4] <- "Fourth"
data$wealth[data$windex5 == 5] <- "Richest"
#Area
data$area[data$HH6 == 1] <- "Urban"
data$area[data$HH6 == 2] <- "Rural"
#Island Division
data$region[data$HH7 == 1] <- "TONGATAPU"
data$region[data$HH7 == 2] <- "VAVA'U"
data$region[data$HH7 == 3] <- "HA'APAI"
data$region[data$HH7 == 4] <- "'EUA"
data$region[data$HH7 == 5] <- "ONGO NIUA"
#religiondata$religion[data$religion == 1] <- "Free Wesleyan Church"
data$religion[data$religion == 2] <- "Latter Day Saints"
data$religion[data$religion == 3] <- "Roman Catholic"
data$religion[data$religion == 4] <- "Free Church of Tonga"
data$religion[data$religion == 5] <- "Other religion"
data

A Power BI dashboard that will be discussed in detail is available here:

--

--

Filip Mitrovic
Analytics Vidhya

Believer in the whole "Better data, better lives" thing. Wish I had a dog.