Originally posted by Suman Deb Roy on the betaworks data blog.
Few datasets are as rich, complex, dynamic, near-chaotic, and close to the real world as weather data. In an age where apps are intuitive and extremely user-friendly, consumers still receive weather forecasts in a traditional format, predominantly as numbers such as temperature and humidity. A new weather app called Poncho aims to disrupt this by transforming weather prediction numbers into personalized verbal messages. In this blog post we detail how we implemented a data-driven approach to scale the Poncho service across the US by detecting specific subspaces in geo-spatial weather patterns. Each subspace is matched with an editorial message that is strongly guided by how humans turn weather predictions into actionable insights.
Data: Poncho has a data-intensive backend, which collects weather data from approximately 42,000 zip codes across the US. Each zip code can be thought of as a data point in vector form, with features such as temperature, humidity, wind speed, precipitation, and a natural language summary (‘cloudy’, ‘rain’). Some features contain time series information, such as precipitation or temperature variation over the next 36 hours. Given these feature vectors, our goal is to find zip codes whose weather patterns are strikingly similar over the next x hours (where x depends on the cycle of computation). During each cycle (see the left panel of the figure below), the algorithm must find these groups despite the dynamic variability in weather conditions. In other words, we must cluster zip codes by weather similarity.
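To make the idea of a zip code as a feature vector concrete, here is a minimal sketch. The class name `ZipForecast`, its field names, the units, and the helper `window_features` are all assumptions made for illustration; the post does not describe Poncho's actual schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ZipForecast:
    """Hypothetical record for one zip code (field names and units are assumed)."""
    zip_code: str
    temperature_f: List[float]     # hourly forecast series
    precipitation_in: List[float]  # hourly forecast series
    humidity_pct: float
    wind_speed_mph: float
    summary: str                   # natural language summary, e.g. "cloudy"


def window_features(forecast: ZipForecast, hours: int) -> List[float]:
    """Collapse the next `hours` hours into a fixed-length numeric vector."""
    temps = forecast.temperature_f[:hours]
    precip = forecast.precipitation_in[:hours]
    if not temps or not precip:
        raise ValueError("forecast window is empty")
    return [
        sum(temps) / len(temps),   # mean temperature over the window
        max(temps) - min(temps),   # temperature swing over the window
        sum(precip),               # total precipitation over the window
        forecast.humidity_pct,
        forecast.wind_speed_mph,
    ]
```

Changing `hours` per computation cycle is one simple way to model the "next x hours" window described above.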
Subspaces: Poncho uses a dynamic heuristic subspace clustering algorithm, which differs from traditional data clustering techniques such as k-means or spectral clustering. This choice is primarily motivated by two constraints: (1) weather data is inherently extremely dynamic, and (2) the retrieved cluster centroids must align with editorial message standards. In practice, many of the data dimensions are irrelevant to editors under a given set of weather conditions. Instead of searching over the entire vector space of data, we search for patterns in specific subspaces within the data space. The choice of subspaces is strongly driven by editorial messaging rules. For example, a message for (wind speed > 12 mph + rain) will be quite different from one for (wind speed > 12 mph + no rain); for the first condition the message might be: ‘you will find it difficult to hold on to your umbrella’. In fact, the priority of these conditions changes seasonally, e.g., you probably want to omit talking about humidity in the winter. Thus, clusters are extracted based not just on the sensor data, but also on the heuristic rules. These rules determine which subspaces of weather to prioritize when clustering. Searching in specific subspaces yields cluster centroids that can be easily transformed into intuitive and actionable messages. Our algorithm is an extension of general subspace clustering, attuned to weather forecasts. The figure below contrasts k-means clusters with the result of dynamic heuristic subspace clustering.
The dynamic heuristic subspace clustering algorithm (C) detects weather clusters based on heuristic rules determined by Poncho’s editors. This creates more intuitive clusters than traditional clustering mechanisms such as k-means (B).
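The rule-driven choice of subspaces can be illustrated with a toy sketch. This is not Poncho's algorithm, which the post does not spell out; it is a simplified stand-in that picks the first matching rule for each zip code and then bins only the dimensions that rule cares about. The `Rule` type, the rule names, the bin widths, and the feature keys are all hypothetical; only the 12 mph wind threshold comes from the example above.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple, Tuple

Features = Dict[str, float]


class Rule(NamedTuple):
    name: str
    applies: Callable[[Features], bool]  # editorial condition
    bins: Dict[str, float]               # subspace dimension -> bin width


# Hypothetical rules, listed in priority order.
RULES = [
    Rule("windy_rain",
         lambda f: f["wind_mph"] > 12 and f["precip_in"] > 0,
         {"wind_mph": 5.0, "precip_in": 0.5}),
    Rule("windy_dry",
         lambda f: f["wind_mph"] > 12 and f["precip_in"] == 0,
         {"wind_mph": 5.0}),
    Rule("default", lambda f: True, {"temp_f": 10.0}),
]


def cluster_by_rules(zips: Dict[str, Features],
                     rules: List[Rule] = RULES) -> Dict[Tuple, List[str]]:
    """Group zip codes using only the dimensions of their first matching rule."""
    clusters: Dict[Tuple, List[str]] = defaultdict(list)
    for zip_code, feats in zips.items():
        rule = next(r for r in rules if r.applies(feats))
        # Dimensions outside the rule's subspace are ignored entirely.
        key = (rule.name,) + tuple(
            int(feats[dim] // width) for dim, width in rule.bins.items()
        )
        clusters[key].append(zip_code)
    return dict(clusters)
```

In this sketch, two windy, rainy zip codes land in the same cluster even if their temperatures differ, because temperature is not part of the subspace selected by that rule. Seasonal priorities could be modeled by reordering or swapping the rule list per cycle.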
Weather Cluster Patterns: Shown below are the dynamic heuristic subspace clusters for New York state. We might assume that a cluster contains only geographically neighboring locations. However, this is not always true. Notice how Long Island and some parts of Rochester have quite similar weather patterns on this day. Other factors, such as distance from a body of water or the terrain, can cause distant locations to have very similar weather conditions.
In some states, we find weather cluster boundaries ordered by their distance from the ocean. A good example of this pattern is Massachusetts, where we often detect four clusters ordered by distance from the coast. This feels intuitive: weather conditions near the coast are strongly influenced by the ocean. Farther inland, the cluster boundaries mark clear demarcations in weather, typically based on wind speed and precipitation levels.
Delivering Weather Forecasts: Once the clusters within a state are determined, they are presented in an admin panel (shown below). Each entry in the admin panel refers to one cluster. Each cluster contains a multitude of zip codes that share very similar weather patterns. The admin panel allows a single message to be broadcast to all the geo-locations within that cluster. The message is constructed from the centroid of the particular cluster, in addition to the editorial voice that humanizes the forecast experience.
Delivering personalized weather forecasts across thousands of zip codes
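As a rough illustration of how a cluster centroid might be turned into a single broadcast message, consider the sketch below. The function names, feature keys, and most of the message wording are assumptions; only the umbrella message and the 12 mph threshold echo the example given earlier in the post. In the real service, editors write the messages in the admin panel.

```python
from statistics import mean
from typing import Dict, List

Features = Dict[str, float]


def centroid(members: List[Features]) -> Features:
    """Average each feature over all zip codes in a cluster."""
    if not members:
        raise ValueError("cluster has no members")
    return {key: mean(m[key] for m in members) for key in members[0]}


def compose_message(center: Features) -> str:
    """Map a centroid to a draft message using hypothetical editorial rules."""
    windy = center["wind_mph"] > 12
    rainy = center["precip_in"] > 0
    if windy and rainy:
        return "You will find it difficult to hold on to your umbrella."
    if windy:
        return "It is windy but dry today."            # placeholder wording
    if rainy:
        return "Bring an umbrella."                    # placeholder wording
    return "Nothing dramatic in the forecast today."   # placeholder wording
```

A draft produced this way would be a starting point for an editor rather than the final text, since the post stresses that the editorial voice is what humanizes the forecast.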
One of the key things we aim to achieve with this form of data science is to connect algorithmic results to interpretable artifacts. A priority is making the cluster results as actionable as possible for editors. This post takes a bird’s-eye view of a critical pipeline: capturing the factors editors use to compose messages, translating them into mathematical heuristics, partitioning the data using dynamic heuristic subspace clustering, and feeding the cluster results back to the admin panel so messages can be broadcast to users in specific zip codes.