Data Philanthropy: My experience at Nectar’s Data Swarm Event
Of the many ways to support charities and the causes we believe in, financial donation is both the most straightforward and the most prevalent. 57% of the adult UK population gave directly to a charity in 2018, while only 16% of us volunteered in some capacity. While harder to quantify, the value of donated time can be equally high, with many organisations relying on volunteers to deliver their core programmes of work. When we think of charitable volunteering, we tend to imagine event orchestration and front-line work. Consultancy also constitutes a significant part of the volunteering landscape, with many charities leveraging corporate partnerships to plug internal skills gaps. Skills-based initiatives such as Reach exist to connect specific individuals with the organisations that need them.
Nectar’s Data Swarm, facilitated in tandem with CSR & Philanthropy agency The Giving Department, is an interesting example of highly skilled volunteering, in which more than 80 analysts and data scientists from Nectar and Sainsbury’s assemble for a hackathon designed to solve the data problems of small to medium-sized charities. At Comic Relief we have a comprehensive data warehouse and three in-house analysts to report and advise on a national television, media, digital, and analogue community fundraising campaign, as well as to tailor communications to email and postal audiences. While in the private sector our data setup would likely be considered prohibitively lean for an organisation of our size, revenue, and diverse activity portfolio, we are comparatively fortunate; many charitable organisations lack in-house data professionals entirely, and as such are throttled in their ability to fully leverage the data at their fingertips. As a firm believer in the power of effective data analysis to drive better decisions and de-risk innovation and change, I was keen to see how a charitable donation of data skills at mass scale would play out in practice. I volunteered as a participant as soon as I heard about the event, and a few weeks later I was in the basement of Sainsbury’s London headquarters with a borrowed laptop, part of a small unit attempting to build a predictive model from an open data set in six hours for the benefit of Missing People UK.
Our task, carefully and impressively scoped in advance of the event by the team at Missing People, focused on the Millennial Cohort Study, a three-yearly survey of 18k+ consistent participants who were born in the UK in the year 2000 or 2001. It’s a very comprehensive piece of data collection; thousands of questions touching several areas of a child’s lived experience, including education and home environment, are asked in face-to-face interviews with both children and parents every time a sweep is conducted. In the 2015 study, Missing People were able to include a question asking whether a teenager had stayed out overnight without their parents knowing where they were in the last 12 months. Of the participants who were asked this question, 842 of 11,593 (about 7%) answered positively.
This formed the basis of Missing People’s ask to Nectar and Sainsbury’s: if we take staying out overnight without parental consent as an instance of going missing, we have a wealth of variables within the 2015 study and all prior and subsequent sweeps against which to correlate this target and provide a richer portrait of those teenagers who do go missing, and the factors that precipitate the event. What can we learn from this rich and varied dataset?
Before the day, engineers from Sainsbury’s loaded the data from the 2012 and 2015 surveys into a Snowflake data warehouse. We were given burner Snowflake accounts, data dictionaries, and Jupyter notebooks with which to access the database. This excellent level of preparation by hack day facilitators allowed participants on the morning to get right into the meat of the analysis as soon as the hackathon began.
Six small to medium-sized charities took part in the event. Prior to the day, each analyst was assigned to the data problems of one of them: Onside, Bounce Back, Together for Short Lives, Pursuing Independent Paths, Missing People, or The Anne Frank Trust. On the morning, the large team assigned to Missing People’s questions was sub-divided into smaller units, most looking at correlations describing the likelihood to go missing within different modules of the 2015 survey itself. For example, how does a child’s experience with crime affect their likelihood to go missing? What about their personal wellbeing, education, happiness, or propensity to take risks?
My sub-team’s task was a little bit different. We decided to look at the 2012 study and attempt to build a predictive model, with the question: ‘given the data collected when a child was 11, can we predict whether they will go missing at 14?’ The practical implications of such a model are tangible. If such a predictive link between the two datasets exists, we may be able to use it as the basis for a set of questions to ask 11-year-olds, in order to understand their level of risk and facilitate an early intervention approach.
In Data Science, the lion’s share of the effort is often in iteration and feature engineering. You have a target outcome and some number of variables which potentially predict that outcome; assuming the problem is predictable, there exists some secret recipe of variables, combinations of variables, model types, and parameters that provides the optimal estimative relationship between your input variables and your target field. It’s unlikely you will ever find the absolute optimal solution to the problem space. In commercial Data Science, the decision to put a model into production is usually made when the modelling project has reached such a point of diminishing returns that the Data Scientist’s time would be more valuable focused elsewhere. Within six hours you are unlikely to arrive at that inflection point unless your problem space is tiny. My goal therefore was to establish a baseline for the problem: is the target predictable given the inputs, and what features might contribute to a strong estimation?
Given the time constraints, my approach was a crude one: join all the tables together and use a Random Forest model (known for strong, out-of-the-box performance on poorly understood datasets) to understand the overall predictability and isolate some of the most predictive variables before moving on to a more targeted modelling phase. Due in part to spending the morning chasing a red herring, and to losing significant time after getting locked out of the Sainsbury’s network, I wasn’t able to move to the next step, which would have been homing in on features and playing with different model types to achieve a better outcome. At the end of the hack day, I had produced a model with 94% accuracy but little practical use: it achieved its strong results by predicting that almost no one was going to run away, a common problem when trying to predict a binary target with imbalanced classes. This would make it unsuitable for identifying risk in target populations. Fortunately a member of my sub-team took a smarter approach to her process, and was able to construct a decent logistic regression model and provide Missing People with some key inferences and factors.
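To illustrate why a high accuracy figure can mislead here, the following is a minimal Python sketch, not the code used on the day. It takes the two counts quoted earlier (842 positive answers out of 11,593) and scores a hypothetical "model" that predicts that no one goes missing. The function name `always_negative_metrics` is my own invention for illustration, and the sketch assumes the modelling sample had the same class balance as the full set of respondents, which may not have been exactly true once the 2012 and 2015 tables were joined.

```python
# Counts reported for the 2015 sweep question (taken from the text above).
total_asked = 11_593
answered_yes = 842


def always_negative_metrics(total, positives):
    """Score a hypothetical model that predicts 'did not go missing' for everyone.

    Returns (accuracy, recall), where recall is the share of children who
    did go missing that the model manages to flag.
    """
    negatives = total - positives
    true_negatives = negatives   # every negative case is predicted correctly
    true_positives = 0           # no positive case is ever flagged
    accuracy = (true_negatives + true_positives) / total
    recall = true_positives / positives if positives else float("nan")
    return accuracy, recall


accuracy, recall = always_negative_metrics(total_asked, answered_yes)

print(f"Positive rate:            {answered_yes / total_asked:.1%}")
print(f"Always-negative accuracy: {accuracy:.1%}")
print(f"Always-negative recall:   {recall:.1%}")
```

Under these assumptions the do-nothing baseline already scores roughly 93% accuracy while identifying none of the children at risk, so a 94% model adds very little. For a target this imbalanced, recall and precision on the positive class are more informative measures than accuracy.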
Despite the lack of a strong model, I was able to infer some valuable points from the exercise:
- It appears to be possible to predict with reasonable accuracy whether a child will go missing at age 14 from questions they are asked at age 11. Given that reasonable models were produced in a few hours, it would be worth investing further time in building a model to inform an early intervention approach using the Millennial Cohort questions.
- The survey question with the highest weight in both models we created was ‘how long did the interviewer spend in the house?’. At first this seemed bizarre, but the fact that two completely different models reached the same conclusion lent credence to the finding. Our working hypothesis is that the time it takes to ask a family thousands of questions reflects the level of chaos within that household. This hypothesis could be examined more carefully under less time-constrained conditions.
- Finally, the areas of the Millennial Cohort study from which my model picked its most predictive features were related to the attitudes, cognitive and behavioural issues present in the parents, and not the child. This finding made me a bit sad, illuminating the well-known notion that problem behaviours in children are often a reflection of their home environments.
While my model didn’t work as well as I would have liked it to, I left the day feeling hugely exhilarated by the energy, organisation and, most importantly, tangible impact of the day. Among other deliverables, Missing People had a treasure trove of information to work with, Onside had data-driven backing for their current strategy in increasing gender diversity, and Pursuing Independent Paths walked away with a dashboard that proved the value of the work that they were doing, automated by the Swarm analysts to provide ongoing measurement. The day left me with a conviction that skills should be part of the charity donation fabric, and I’m excited to see how Data Philanthropy can be integrated into the existing partnership between Nectar, Sainsbury’s, and Comic Relief.