Faster Experimentation for Location Data: Iggy + Metaflow
Unlock the potential behind the location fields in your data
Many data scientists have location fields like `zipcode` in their data. Few unlock information about these locations that might improve their models, because sourcing and incorporating geospatial data is hard.
In this blog post we demonstrate how Iggy and Metaflow empower data scientists to quickly enrich locations and then design machine learning workflows to test hypotheses with a simple Python API. Our goal is to get data scientists to the fun part of data science — testing hypotheses and iterating — without the typical overhead.
(Rather go straight to the code?)
Minimizing friction in the data scientist’s creative cycle
Think of the best data scientist you know. Their magic is uncovering insights through a process of exploration, hypothesis generation, and experimentation. This creative cycle — involving a hunch, a quick experiment, a finding, and over again — is where they thrive. The more smoothly and quickly they can cycle through these steps, the more knowledge and value they create for their team.
Quick access to good data accelerates the data science cycle. There’s currently a growing movement around data-centric AI — the realization that accumulating more high-quality training and evaluation data is a surer strategy for building accurate AI models than tweaking complex algorithms (see here, here, and here). But getting access to data that’s both quick and good can be a challenge; it’s typically straightforward to get one or the other, but not both.
Data enrichment challenges are real for those working with location data — think any dataset that has a `city` column. Datasets like these are common among many companies, e.g. rideshare companies who want to optimize pricing, marketers who want to understand their customers, or any modern real estate platform. Each of these locations refers to a real place in the real world, with loads of relevant information to be discovered and incorporated into models: Is the place populated? Who lives there? What sorts of businesses and amenities are nearby? Is it lively or quiet at night? How much does it rain? How good are the schools? The broadband? Unfortunately, incorporating these types of details into models has traditionally been a complex process involving geospatial data sourcing, cleaning, projection, and spatial joins. Even if your favorite data scientist has a hunch that the number of coffee shops within a short walk of a property might have an impact on its sale price, the amount of work required to test this hypothesis — without any guarantee that it will work — likely makes it a non-starter.
Luckily there are solutions emerging that minimize the friction in the data scientist’s creative cycle. In this post we’ll look at two of them: Iggy and Metaflow. Iggy enriches the location-related columns in any dataset with relevant features about people, places, and the natural environment. To put data into action you need to integrate it with an ML application, which is where Metaflow shines. It allows data scientists to build production-ready workflows using a simple Python API. By combining Iggy and Metaflow, data scientists are empowered to quickly test modeling hypotheses without the overhead of building infra or sourcing, cleaning, and formatting location data.
Iggy: Location enrichment without the overhead
Iggy focuses on solving data augmentation problems specifically for data scientists working with locations. For any location in the United States — down to the property level — Iggy can generate features like:
- number of coffee shops reachable within a short walk
- median 2-br apartment rent within the relevant census block group, census tract, zip code, or county
- whether the coastline is reachable within a 10 min drive
Iggy empowers its users by addressing three pain points:
- Location data sourcing & cleaning: Instead of having to find and prepare each type of data separately, Iggy delivers features from multiple sources in one clean, ready-to-use format.
- Joining between levels of geographic granularity: With the `iggyenrich` package, it’s possible to enrich locations at varying levels of resolution — from the state level down to the individual property level.
- Capturing “nearby” details based on the way people actually move between places: Iggy captures local-level detail in terms of walkable and drivable areas that are calculated based on actual road and foot networks, rather than as-the-crow-flies.
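To make the second pain point concrete, here is a minimal pandas sketch of what joining features at two levels of geographic granularity onto one property table looks like once the geography keys exist. The ids and feature values here are toy stand-ins, not real Iggy data:

```python
import pandas as pd

# Toy property rows, each already resolved to its census block group and county
homes = pd.DataFrame({
    "parcel_id": ["A1", "B2"],
    "cbg_id": ["120570268021", "120570201011"],
    "county_fips": ["12103", "12103"],
})

# Feature tables at two different levels of geographic granularity
cbg_features = pd.DataFrame({
    "cbg_id": ["120570268021", "120570201011"],
    "acs_median_age_cbg": [44.2, 38.5],
})
county_features = pd.DataFrame({
    "county_fips": ["12103"],
    "median_2br_rent_county": [1450],
})

# Once keys are resolved, enrichment reduces to left joins on the right key
enriched = (homes
            .merge(cbg_features, on="cbg_id", how="left")
            .merge(county_features, on="county_fips", how="left"))
print(enriched)
```

The hard part — resolving a raw address or lat/lng to the correct block group, tract, or isochrone — is what Iggy handles for you.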
Metaflow: Model training and deployment infra as Python code
Metaflow is an open-source project that addresses many pain points experienced by business-oriented data scientists. It was originally started at Netflix to make it easier and quicker for data scientists to move projects from prototype to production autonomously. Many data scientists have experienced the gap between notebooks, which make the initial phases of exploration and prototyping straightforward, and production ML environments which have many additional requirements: The production workflows need to be robust, scalable, and integrate well with the surrounding infrastructure.
At many companies, a dream setup would allow a data scientist to write simple Python code, like they do in notebooks, iterate on the code and models locally, test easily with larger-scale datasets, and finally deploy to production, often as an A/B experiment, with a single click. Metaflow provides a number of features that make this happen in a way that plays nicely with popular infrastructure choices, like AWS and Kubernetes, ensuring that the solution satisfies needs of both data scientists and engineers responsible for production systems.
With modern ML infrastructure like Metaflow in place, data scientists can confidently experiment with new ideas, such as using novel features provided by Iggy. A sophisticated data science organization should evaluate new features either by backtesting them with historical data, or by subjecting them to a live A/B experiment. To show how this can work in practice, we will demonstrate the backtesting approach below.
Case Study: Predicting real estate sales prices
Let’s take a practical look at how Iggy and Metaflow can speed up the process of experimentation. In this case we’re starting with a dataset of 38k single-family real estate sales in Pinellas County, FL, recorded between 2019 and 2021. The goal is to predict the sales price per square foot for each property as accurately as possible.
The basic dataset includes 195 features that describe the property basics you’d expect to see in a real estate listing — number of beds and baths, age of the home, relevant tax zone, building materials, utilities, etc. Sales price values range from $0.07 to $4214.01 per square foot (-1.13 to 3.62 log dollars per square foot), with a mean value of $143.21/sqft (2.16 log dollars per sqft).
Now let’s imagine that we have a baseline model in production that predicts price per sqft based on the 195 features in the basic dataset. It is implemented as a Metaflow flow with the following steps:
- The `IggyBaselineFlow` class is implemented as a sub-class of `IggyFlow`, which contains all the core functionality we need to build our model (e.g. load and prepare data, train, and evaluate a model).
- `IggyBaselineFlow` loads the benchmark Pinellas real estate dataset into memory. As part of the loading process, `load_dataset` scales all the continuous features to zero mean and unit variance and partitions the dataset into train/validation/test splits.
- The `feature_selection` step uses scikit-learn’s `SelectKBest` module over the training split to select the top-50 features from the dataset based on their mutual information with the target variable, `log_price_per_sqft`.
- Finally, the `train_model` step uses scikit-learn’s `RandomForestRegressor` to fit a model to predict `log_price_per_sqft` based on the 50 features selected in `feature_selection`.
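The steps above can be sketched end-to-end with scikit-learn. This is a simplified stand-alone version, not the actual `IggyFlow` code: it uses synthetic data in place of the 38k-row Pinellas dataset and a smaller `k` for speed:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Pinellas dataset (the real flow: ~38k rows, 195 features)
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

# load_dataset: scale continuous features, then split (validation split omitted here)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# feature_selection: keep the k features with highest mutual information with the target
selector = SelectKBest(mutual_info_regression, k=10).fit(X_train, y_train)
X_train_sel, X_test_sel = selector.transform(X_train), selector.transform(X_test)

# train_model: fit a random forest and evaluate MAE on the held-out split
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train_sel, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test_sel))
print(f"test MAE: {mae:.3f}")
```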
After training and tuning the price prediction model on the training and validation portions of the basic dataset and evaluating on the held-out test set, we see that the baseline model achieves a mean absolute error of 0.104 log dollars per square foot — equating to a $38.79/sqft difference between predicted and actual sales prices at the average home value within the test set.
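The conversion from log-space error to dollars per square foot can be checked in a couple of lines, using the dataset’s mean of $143.21/sqft from above:

```python
# MAE is in log10 dollars per sqft; at the mean price, being off by that
# amount in log space corresponds to this many dollars per sqft:
mean_price = 143.21   # mean sales price per sqft in the dataset
mae_log = 0.104       # baseline model MAE in log10 dollars per sqft
dollars_off = mean_price * (10 ** mae_log - 1)
print(f"${dollars_off:.2f}/sqft")  # ~= $38.75, close to the $38.79 reported
```

The small gap from the reported $38.79 presumably comes from the article using the unrounded MAE.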
Now let’s assume that our favorite data scientist has thought about the problem a bit and wants to play around with features that describe the vicinity of each property in addition to the basic property details. Examples of these features include:
- Area walkable within 10 min, in sqkm
- Population of the area walkable within 10 min
- Number of points of interest reachable within a 10 min walk, per capita
- Area of park land intersecting the area reachable within a 10 min walk (`park_intersecting_area_in_sqkm_qk_isochrone_walk_10m`)
- Whether the coast intersects the property’s census block group
- Median year homes were built within the property’s census block group (`acs_median_year_structure_built_cbg`)
All of these features are available via Iggy enrichment. Because we’re using Metaflow, enriching the basic dataset with these features to create an enriched dataset is as simple as adding a step to the flow:
The `enrich` flow step adds location-based columns to the basic dataset by joining each row to Iggy’s feature set based on its `latitude` and `longitude` columns. It references Iggy data packaged as a collection of feature tables in S3, and the companion `iggyenrich` module, which simplifies the joins.
The remainder of the flow stays the same. We can run this updated flow to enrich, train, and evaluate our model on the enriched dataset and see that the model error on the test set is reduced to 0.100 log dollars per square foot — a $37.12/sqft difference between predicted and actual sales prices at the average home value within the test set. This is $1.67/sqft (3.85%) more accurate than the baseline model trained on the basic, un-enriched data.
Taking a closer look at which Iggy features were most impactful in lowering the model error, we can check the top-10 most important features in the Baseline and Enriched models:
We see that while many of the important features are shared between the Baseline and Enriched models, there are four appearing in Enriched that are not present in Baseline. These are:
- `acs_pct_households_with_no_internet_access`: The percentage of households within the property’s census block group having no internet access, per the U.S. Census American Community Survey (ACS).
- `acs_median_age_cbg`: The median age within the property’s census block group, per ACS.
- `park_intersecting_area_in_sqkm_qk_isochrone_walk_10m`: The amount of park land (in square km) accessible within a 10 min walk of the property.
- `acs_median_year_structure_built_cbg`: The median year built for all homes within the property’s census block group, per ACS.
These additional features are interesting because they tell us something about the neighborhood surrounding each property — not just the property itself.
Iggy-enriched models, per tax district
Pinellas County is a relatively large and diverse place. Looking at the map above we can see the general trend that sales price per sqft is higher along the coast than inland, but there are also micro-trends within the different cities that make up the county. Looking at the six largest tax districts represented within the dataset (Clearwater, Dunedin, Largo, Palm Harbor, Seminole, and St. Petersburg), we can see some pretty drastic differences in the way various features are correlated with price per sqft across the locations:
For example, population age is positively correlated with home prices in Clearwater and St. Petersburg (older people tend to live in more expensive homes there), but is negatively or not correlated in the other four districts. Similarly, while the feature indicating whether a property’s census block group intersects the coast is positively correlated in all tax districts, this is most pronounced in Dunedin, Clearwater, and St. Petersburg.
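Computing this kind of per-district correlation is a one-liner once the data is in a DataFrame. The rows below are toy values chosen to show opposite-signed correlations, not the real Pinellas data:

```python
import pandas as pd

# Toy illustration: the same feature can correlate with price differently by district
df = pd.DataFrame({
    "tax_district": ["Clearwater"] * 4 + ["Largo"] * 4,
    "acs_median_age_cbg": [30, 40, 50, 60, 30, 40, 50, 60],
    "log_price_per_sqft": [2.0, 2.1, 2.2, 2.3, 2.3, 2.2, 2.1, 2.0],
})

# Per-district Pearson correlation between median age and price
corr = df.groupby("tax_district").apply(
    lambda g: g["acs_median_age_cbg"].corr(g["log_price_per_sqft"]))
print(corr)
```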
Given the difference in feature distributions by tax district, your favorite data scientist thinks it would make more sense to run these price prediction models on a per-tax district basis. Again, this is simple to do in Metaflow by just adding another parallel step to the flow:
- The new `feature_selection_and_train_model` step combines the tasks from the `feature_selection` and `train_model` steps in the previous `IggyEnrich` flow. It is carried out in parallel for each tax district.
- The `join` step simply combines the results from each tax district.
We use the updated flow to run separate model training and evaluation for each tax district, and see substantial improvement in most tax districts:
As a check, we can look at how feature importance varies between the district models. Below is a map showing the importance of the `acs_median_year_structure_built_cbg` feature for a sample of homes throughout the districts. Sure enough, this feature takes on different importance depending on tax district, with highest importance in Dunedin (0.044) and Largo (0.051), and lowest importance in Seminole (0).
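Per-district importances like these come straight out of each fitted forest via scikit-learn’s `feature_importances_` attribute; shown here on synthetic data rather than the real per-district models:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Fit one forest (in the flow above, you'd do this once per district)
X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# feature_importances_ sums to 1; a value of 0 (as for Seminole's
# year-built feature) means the feature was never used in a split
importances = dict(zip(["f0", "f1", "f2", "f3", "f4"],
                       model.feature_importances_))
print(importances)
```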
While enrichment with Iggy features helped in all tax districts except one (St. Petersburg), the most important takeaway here is how different types of Iggy features were impactful across different districts, and the ease with which we could test this using Iggy + Metaflow.
In this post we showed how to enrich locations (in this case sold homes) in a dataset with features describing their vicinity with Iggy, and then quickly iterate on experiments involving these features with Metaflow.
If you’d like to try something similar in your own work, here are some links to get you started:
- See the full code for this demo and others
- Download the Pinellas County Real Estate Sales dataset and accompanying sample Iggy data to see if you can do better
- Install the `iggyenrich` package
- Install Metaflow
Additionally, here are links to more background reading: