TrendFinder: How We Developed a “Trend Detection” BI Tool for DonorsChoose.org (guest post)

Michael Zhang
Making DonorsChoose
6 min read · Jun 19, 2018

This blog post is part of a series of guest posts from the CKM Advisors Pro Bono Team recapping a recent 4-month collaboration, where CKM Advisors designed and developed a product for DonorsChoose.org called TrendFinder.

TrendFinder is a self-sufficient, interactive business intelligence dashboard built to detect and analyze trends on our platform in real time. Check out some of the interesting things we found using TrendFinder here.

This article walks through the overall development process of TrendFinder, covering some of the under-the-hood technical details that power the tool along the way.

Introducing… TrendFinder! (snapshot of the “Overview” and “Demographic Breakdown” for “wiggle” projects)

Getting started

When we first took on this project, we weren’t necessarily planning on building any sort of end-user application, let alone a complete business intelligence pipeline and dashboard. Our initial conversations with DonorsChoose.org revolved heavily around the idea of finding new and interesting resource trends on their platform, with “interesting” meaning either a trend that might be considered nontraditional in the context of education (e.g. “wiggle” chairs) or a trend that behaves unexpectedly in terms of geography or demographics (e.g. iPads being requested by younger and younger classrooms over time). This led us to do a lot of data exploration at the beginning: we brainstormed a list of resource trends we knew had been part of education over the past decade (e.g. the increasing use of technology such as tablets and laptops in classrooms), and each member of the team dug into a specific resource’s geographic and demographic variables to understand on a deeper level what constituted a “trend” and to see whether trends followed any general patterns.

Defining a trend

After doing this for a few weeks, we started to hit a wall in terms of what direction we wanted to take this project: we wanted to build a model for automatically identifying trends within historical resource request data that could also be used to find real-time trends on the DonorsChoose.org platform, but we were struggling to come up with a clear path to achieving this.

To solve this, we took a step back to redefine precisely what a “trend” ought to be for our use cases. We all had some abstract idea of what resources were “trends” to us — basically anything that had risen or fallen dramatically over time — which brought us to defining three general types of trends:

Growth/decline over time and spikes were the primary kinds of trends we were interested in capturing

Looking at these simplified trend lines, we realized that each was marked by a big difference between its minimum and maximum values, which in our case represents the change in a resource’s popularity over time. In statistical terms, this means computing the statistical range of each resource’s prevalence (measured as the proportion of projects mentioning it) across the time periods of a given date range, and then flagging as outliers the resources that varied tremendously. The statistical range here quantifies how much each resource keyword changed within its own history, so it captures resources that have greatly shifted in prevalence over time.
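To make this concrete, here is a minimal sketch (not TrendFinder’s actual code) of how such a range could be computed with pandas. The DataFrame column names (“posted_date”, “text”) and the definition of “mentioning a keyword” are illustrative assumptions.

```python
import pandas as pd

# Illustrative sketch: the statistical range (max - min) of a keyword's
# prevalence, where prevalence is the proportion of projects in each period
# whose text mentions the keyword. Assumes "posted_date" is a datetime column.
def keyword_range(projects: pd.DataFrame, keyword: str, freq: str = "Y") -> float:
    mentions = projects["text"].str.contains(keyword, case=False, regex=False)
    by_period = mentions.groupby(projects["posted_date"].dt.to_period(freq))
    prevalence = by_period.mean()  # share of projects mentioning the keyword
    return prevalence.max() - prevalence.min()
```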

Of course, this approach is blind to things like directionality, as growth and decline are picked up the same way, but we deliberately optimized for high recall, knowing that any results would be filtered through additional qualitative research by the end users of our output. In retrospect, we had essentially stumbled upon a basic approach to anomaly detection in a rather organic way, and we stuck with this framework for detecting both historical and real-time trends.

Developing a methodology

Visual representation of how we’re calculating statistical range with a sample trend

For identifying historical resource trends, we decided to look at the last 10 years of projects on DonorsChoose.org’s platform, using both year-over-year and month-over-month values for each word to ensure that we were working with a sufficient number of data points. After getting the range of proportions for each word over these time groupings, we calculated the mean and standard deviation of this collection of word-proportion ranges itself and highlighted every keyword whose range fell significantly above the mean (beyond μ + 2σ) as a potential outlier, or “trend,” per our pre-established definition. In other words, we compared the changes over time across all the words and pulled out the ones that changed dramatically more than the others.
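A rough sketch of that outlier step, again as illustration rather than the production code: it assumes a pandas Series mapping each keyword to its statistical range of per-period proportions (for example, built with the keyword_range() sketch above), and flags anything beyond two standard deviations of the mean range.

```python
import pandas as pd

# Sketch of the historical detection step described above.
def flag_historical_trends(ranges: pd.Series, n_sigma: float = 2.0) -> pd.Series:
    mu, sigma = ranges.mean(), ranges.std()
    return ranges[ranges > mu + n_sigma * sigma].sort_values(ascending=False)

# Hypothetical usage:
# ranges = pd.Series({kw: keyword_range(projects, kw) for kw in candidate_keywords})
# trends = flag_historical_trends(ranges)
```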

After getting this list of potential resource trends over the last decade, our data scientists also investigated each resource’s demographic and geographic shifts over time in order to identify which trends were “interesting” per the original ask (for an overview of what we found, please check out our other guest post focused on our specific findings on different trends on DonorsChoose.org’s platform). To avoid repeating ourselves, we ultimately consolidated these analytic processes into modularized components, forming an overall pipeline from which we could extract not only resource trends, but also insights for each individual resource that could support DonorsChoose.org’s efforts across different teams.

This methodology worked great for discovering historical resource trends, but DonorsChoose.org wanted something that could also continue to produce insights while tracking live trends on their platform, so we had to rethink our algorithm for “real-time” use. Applying the same framework of outlier detection, we realized that instead of doing “inter”-word comparison, we could split off a “current” time period and perform “intra”-word comparisons on each resource’s history to see if its current period’s proportion deviated significantly from its historical baseline.

Example of a “current” trend that would be flagged by TrendFinder
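The “intra”-word check can be sketched in the same spirit: compare a keyword’s current-period proportion against that keyword’s own historical baseline. The two-sigma threshold below mirrors the historical method and is an assumption for illustration, not necessarily TrendFinder’s exact cutoff.

```python
import pandas as pd

# Sketch of the real-time check: does the current period deviate significantly
# from this keyword's own history?
def is_current_trend(proportions: pd.Series, n_sigma: float = 2.0) -> bool:
    """`proportions` is one keyword's per-period prevalence, ordered in time,
    with the most recent ("current") period as the last element."""
    history, current = proportions.iloc[:-1], proportions.iloc[-1]
    return abs(current - history.mean()) > n_sigma * history.std()
```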

After extensive rounds of testing and research, we settled on a weighting scheme for ranking trendiness among resources in a way that was most applicable for DonorsChoose.org, and configured a program with appropriate defaults to run recurrently on their internal servers. Eventually, we formalized and automated all of our trend detection and analysis steps to generate an interactive Dash dashboard that the folks at DonorsChoose.org could use as a BI tool to identify and investigate trends on their own.
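For readers curious what the Dash side of something like this looks like, below is a minimal, hypothetical sketch using 2018-era Dash imports. The component IDs, keyword list, and data-loading stub are illustrative placeholders, not TrendFinder’s actual code.

```python
import dash
import dash_core_components as dcc
import dash_html_components as html
import pandas as pd
from dash.dependencies import Input, Output

def load_proportions(keyword):
    # Stand-in for the pipeline's output: per-period prevalence of a keyword.
    idx = pd.period_range("2008", "2018", freq="Y")
    return pd.Series(range(len(idx)), index=idx) / 100.0

app = dash.Dash(__name__)

app.layout = html.Div([
    html.H1("TrendFinder (sketch)"),
    dcc.Dropdown(
        id="keyword",
        options=[{"label": kw, "value": kw} for kw in ["wiggle", "ipad", "laptop"]],
        value="wiggle",
    ),
    dcc.Graph(id="trend-graph"),
])

@app.callback(Output("trend-graph", "figure"), [Input("keyword", "value")])
def update_graph(keyword):
    series = load_proportions(keyword)
    return {
        "data": [{"x": series.index.astype(str), "y": series.values, "type": "line"}],
        "layout": {"title": 'Prevalence of "{}" over time'.format(keyword)},
    }

if __name__ == "__main__":
    app.run_server(debug=True)
```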

Takeaway: Build to maximize practical value

Of course, our methodology is not perfect. It is intentionally tuned to be oversensitive, and it does not perform well on gradual, steady changes, even when the direction of change is consistent. However, as a pro bono team, it was important to us not only to build something that could perform robustly, but also something that could integrate seamlessly and live on in the DonorsChoose.org environment, so that they could continually reap the benefits of our work without needing our intervention. Therefore, once we had a viable trend detection algorithm working for both historical and real-time cases, we devoted the entirety of our efforts to creating something user-friendly and scalable to the thousands of projects that come into the platform every single day.

There are countless ways to solve any business problem involving data, but we firmly believe that data scientists should not build for the sake of building. Rather, the end user should always stay top of mind, and creating something easily usable and digestible should always be the first priority. In our case, developing and testing an anomaly detection algorithm built on straightforward statistical foundations fell best in line with our goals: at the end of the day, we wanted to empower DonorsChoose.org to make data-driven decisions in an area where they previously lacked the know-how, in service of their overall mission of providing all students with access to the resources necessary for a great education.
