Faster Experimentation for Location Data: Iggy + Metaflow

tl;dr

Many data scientists have location fields like address or zipcode in their data. Few unlock information about these locations that might improve their models because sourcing and incorporating geospatial data is hard.

In this blog post we demonstrate how Iggy and Metaflow empower data scientists to quickly enrich locations and then design machine learning workflows to test hypotheses with a simple Python API. Our goal is to get data scientists to the fun part of data science — testing hypotheses and iterating — without the typical overhead.

(Rather go straight to the code?)

Minimizing friction in the data scientist’s creative cycle

Think of the best data scientist you know. Their magic is uncovering insights through a process of exploration, hypothesis generation, and experimentation. This creative cycle — involving a hunch, a quick experiment, a finding, and over again — is where they thrive. The more smoothly and quickly they can cycle through these steps, the more knowledge and value they create for their team.

Quick access to good data accelerates the data science cycle. There’s currently a growing movement around data-centric AI — the realization that accumulating more high-quality training and evaluation data is a surer strategy for building accurate AI models than tweaking complex algorithms (see here, here, and here). But getting access to data that’s both quick and good can be a challenge; it’s typically straightforward to get one or the other, but not both.

Data enrichment challenges are real for those working with location data: think any dataset that has an address, zipcode, or city column. Datasets like these are common among many companies, e.g. rideshare companies who want to optimize pricing, marketers who want to understand their customers, or any modern real estate platform. Each of these locations refers to a real place in the real world, with loads of relevant information to be discovered and incorporated into models: Is the place populated? Who lives there? What sorts of businesses and amenities are nearby? Is it lively or quiet at night? How much does it rain? How good are the schools? The broadband? Unfortunately, incorporating these types of details into models has traditionally been a complex process involving geospatial data sourcing, cleaning, projection, and spatial joins. Even if your favorite data scientist has a hunch that the number of coffee shops within a short walk of a property might have an impact on its sale price, the amount of work required to test this hypothesis (without any guarantee that it will work) likely makes it a non-starter.

Luckily there are solutions emerging that minimize the friction in the data scientist’s creative cycle. In this post we’ll look at two of them: Iggy and Metaflow. Iggy enriches the location-related columns in any dataset with relevant features about people, places, and the natural environment. To put data into action you need to integrate it with an ML application, which is where Metaflow shines. It allows data scientists to build production-ready workflows using a simple Python API. By combining Iggy and Metaflow, data scientists can quickly test modeling hypotheses without the overhead of building infrastructure or of sourcing, cleaning, and formatting location data.

Iggy: Location enrichment without the overhead

Visualization of Iggy property-level features in San Francisco

Iggy focuses on solving data augmentation problems specifically for data scientists working with locations. For any location in the United States — down to the property level — Iggy can generate features like:

  • number of coffee shops reachable within a short walk
  • median 2-br apartment rent within the relevant census block group, census tract, zip code, or county
  • whether the coastline is reachable within a 10 min drive

Iggy empowers its users by addressing three pain points:

  • Location data sourcing & cleaning: Instead of having to find and prepare each type of data separately, Iggy delivers features from multiple sources in one clean, ready-to-use format.
  • Joining between levels of geographic granularity: With the iggyenrich package, it’s possible to enrich locations at varying levels of resolution, from the state level down to the individual property level.
  • Capturing “nearby” details based on the way people actually move between places: Iggy captures local-level detail in terms of walkable and drivable areas that are calculated based on actual road and foot networks, rather than as-the-crow-flies.

Traditional location enrichment (left) uses radius-based aggregation, leading to unrealistic walkable areas over the water and crossing physical barriers. Iggy enrichment (right) captures local-level detail based on actual walkable/drivable areas using underlying road and foot path networks.
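
To make the radius-versus-network contrast concrete, here is a minimal sketch on a made-up five-point map. It is not Iggy's implementation: the place names, coordinates, edge lengths, and the 80 m/min walking speed are all assumptions chosen for illustration. It compares a straight-line radius with a shortest-path walking budget computed with Dijkstra's algorithm.

```python
import heapq
import math

# Hypothetical toy map (metres). An inlet separates "home" from
# "cafe_across_water"; the only crossing is a bridge 2 km to the east.
COORDS = {
    "home": (0, 0),
    "cafe_near": (300, 0),
    "cafe_across_water": (0, 400),
    "bridge": (2000, 0),
    "bridge_far_side": (2000, 400),
}
# Undirected walkable edges: (node, node, length in metres).
EDGES = [
    ("home", "cafe_near", 300),
    ("cafe_near", "bridge", 1700),
    ("bridge", "bridge_far_side", 400),
    ("bridge_far_side", "cafe_across_water", 2000),
]


def within_radius(origin, radius_m):
    """Places whose straight-line distance from origin is at most radius_m."""
    return {
        name
        for name, xy in COORDS.items()
        if name != origin and math.dist(COORDS[origin], xy) <= radius_m
    }


def within_walk(origin, minutes, speed_m_per_min=80.0):
    """Places reachable along the walking network within the time budget."""
    budget = minutes * speed_m_per_min
    graph = {}
    for a, b, length in EDGES:
        graph.setdefault(a, []).append((b, length))
        graph.setdefault(b, []).append((a, length))
    best = {origin: 0.0}
    queue = [(0.0, origin)]
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > best.get(node, math.inf):
            continue  # stale queue entry
        for neighbor, length in graph.get(node, []):
            new_dist = dist + length
            if new_dist <= budget and new_dist < best.get(neighbor, math.inf):
                best[neighbor] = new_dist
                heapq.heappush(queue, (new_dist, neighbor))
    return set(best) - {origin}


radius_hits = within_radius("home", 800)       # 800 m is roughly a 10 min walk
walk_hits = within_walk("home", minutes=10)
print("radius:", sorted(radius_hits))
print("walk:  ", sorted(walk_hits))
```

In this toy setup the radius query counts the cafe across the water, while the network query does not, because the walking route over the bridge is far longer than the time budget allows.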

Metaflow: Model training and deployment infra as Python code

Photo by T K on Unsplash

Metaflow is an open-source project that addresses many pain points experienced by business-oriented data scientists. It was originally started at Netflix to make it easier and quicker for data scientists to move projects from prototype to production autonomously. Many data scientists have experienced the gap between notebooks, which make the initial phases of exploration and prototyping straightforward, and production ML environments which have many additional requirements: The production workflows need to be robust, scalable, and integrate well with the surrounding infrastructure.

At many companies, a dream setup would allow a data scientist to write simple Python code, like they do in notebooks, iterate on the code and models locally, test easily with larger-scale datasets, and finally deploy to production, often as an A/B experiment, with a single click. Metaflow provides a number of features that make this happen in a way that plays nicely with popular infrastructure choices, like AWS and Kubernetes, ensuring that the solution satisfies needs of both data scientists and engineers responsible for production systems.

With modern ML infrastructure like Metaflow in place, data scientists can confidently experiment with new ideas, such as using novel features provided by Iggy. A sophisticated data science organization should evaluate new features either by backtesting them with historical data, or by subjecting them to a live A/B experiment. To show how this can work in practice, we will walk through the backtesting approach below.

Case Study: Predicting real estate sales prices

Let’s take a practical look at how Iggy and Metaflow can speed up the process of experimentation. In this case we’re starting with a dataset of 38k single-family real estate sales in Pinellas County, FL, recorded between 2019 and 2021. The goal is to predict the sales price per square foot for each property as accurately as possible.

Baseline Model

The basic dataset includes 195 features that describe the property basics you’d expect to see in a real estate listing: number of beds and baths, age of the home, relevant tax zone, building materials, utilities, etc. Sales price values range from $0.07 to $4,214.01 per square foot (-1.13 to 3.62 log dollars per square foot, using base-10 logarithms), with a mean value of $143.21/sqft (2.16 log dollars per sqft).

This map shows a sample of homes from the training data, colored by the target variable (log price per sqft). We see that homes along the coast generally fetch a higher value, while homes in the more urban inland areas cost less per square foot.

Now let’s imagine that we have a baseline model in production that predicts price per sqft based on the 195 features in the basic dataset. It is implemented as a Metaflow flow with the following steps:

Baseline Model flow

The Metaflow IggyBaselineFlow class is implemented as a sub-class of IggyFlow, which contains all the core functionality we need to build our model (e.g. load and prepare data, train, and evaluate a model).

  • The start step of IggyBaselineFlow loads the benchmark Pinellas real estate dataset into memory. As part of the loading process, load_dataset scales all the continuous features to zero mean unit variance and partitions the dataset into train/validation/test splits.
  • The feature_selection step uses scikit-learn’s SelectKBest module over the training split to select the top-50 features from the dataset based on their mutual information with the target variable, log_price_per_sqft.
  • Finally, the train_model step uses scikit-learn’s RandomForestRegressor to fit a model to predict log_price_per_sqft based on the 50 features selected in feature_selection.
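
The three steps above can be sketched outside Metaflow with plain scikit-learn. This is a minimal illustration, not the IggyFlow code: the data is synthetic, the sizes are shrunk (20 features and k=5 instead of 195 and 50) so it runs in seconds, only a train/test split is made, and fitting the scaler on the training split is a choice made here rather than something the post specifies.

```python
from functools import partial

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the property table: 20 features, 3 of them informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 20))
y = 2.0 + 0.3 * X[:, 0] - 0.2 * X[:, 1] + 0.1 * X[:, 2] + rng.normal(scale=0.05, size=600)

# "start": split the data and scale features to zero mean, unit variance.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# "feature_selection": keep the k features with the highest mutual
# information with the target, scored on the training split only.
selector = SelectKBest(score_func=partial(mutual_info_regression, random_state=0), k=5)
selector.fit(X_train_s, y_train)
X_train_k = selector.transform(X_train_s)
X_test_k = selector.transform(X_test_s)

# "train_model": fit a random forest on the selected features.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train_k, y_train)

mae = mean_absolute_error(y_test, model.predict(X_test_k))
print("selected feature indices:", selector.get_support(indices=True))
print(f"test MAE: {mae:.3f}")
```

Scoring features on the training split only keeps the held-out data out of the selection step, which matters when the test error is later used to compare the baseline and enriched models.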

After training and tuning the price prediction model on the training and validation portions of the basic dataset and evaluating on the held-out test set, we see that the baseline model achieves a mean absolute error of 0.104 log dollars per square foot — equating to $38.79/sqft difference between predicted and actual sales prices at the average home value within the test set.

Iggy-enriched Model

Now let’s assume that our favorite data scientist has thought about the problem a bit and wants to play around with features that describe the vicinity of each property in addition to the basic property details. Examples of these features include:

  • Area walkable within 10 min in sqkm (area_sqkm_qk_isochrone_walk_10m)
  • Population of area walkable within 10 min (population_qk_isochrone_walk_10m)
  • Number of Points of Interest reachable within a 10 min walk, per capita (poi_count_per_capita_qk_isochrone_walk_10m)
  • Area of park land intersecting area reachable within a 10 min walk (park_intersecting_area_in_sqkm_qk_isochrone_walk_10m)
  • Whether the coast intersects the property’s census block group (coast_intersects_cbg)
  • Median year homes were built within the property’s census block group (acs_median_year_structure_built_cbg)

All of these features are available via Iggy enrichment. Because we’re using Metaflow, enriching the basic dataset with these features to create an enriched dataset is as simple as adding a step to the flow:

Enriched Model flow

The enrich flow step adds location-based columns to the basic dataset by joining each row to Iggy’s feature set based on its latitude and longitude columns. It references Iggy data packaged as a collection of feature tables in S3, and the companion iggyenrich module, which simplifies joins.
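
The post does not show the iggyenrich API, so the following is only a generic sketch of what a latitude/longitude join can look like. It assumes the feature table is keyed by quadkey (suggested by the `qk` in the feature names, but not confirmed here), uses the standard Bing Maps tile-system quadkey formula, and fills the table with made-up values. The zoom level, row IDs, and feature value are hypothetical.

```python
import math

ZOOM = 16  # hypothetical tile resolution for the join key


def latlng_to_quadkey(lat, lng, zoom):
    """Quadkey of the Web Mercator tile containing a WGS84 point."""
    lat = max(min(lat, 85.05112878), -85.05112878)
    x = (lng + 180.0) / 360.0
    sin_lat = math.sin(math.radians(lat))
    y = 0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi)
    n = 1 << zoom
    tile_x = min(n - 1, max(0, int(x * n)))
    tile_y = min(n - 1, max(0, int(y * n)))
    digits = []
    for i in range(zoom, 0, -1):
        mask = 1 << (i - 1)
        digit = (1 if tile_x & mask else 0) + (2 if tile_y & mask else 0)
        digits.append(str(digit))
    return "".join(digits)


def enrich(rows, feature_table, zoom=ZOOM):
    """Left-join each row to the feature table on its quadkey."""
    enriched = []
    for row in rows:
        key = latlng_to_quadkey(row["lat"], row["lng"], zoom)
        features = feature_table.get(key, {})
        enriched.append({**row, "quadkey": key, **features})
    return enriched


# Made-up rows and a made-up one-entry feature table.
rows = [
    {"id": 1, "lat": 27.7710, "lng": -82.6400},
    {"id": 2, "lat": 27.9000, "lng": -82.7000},
]
feature_table = {
    latlng_to_quadkey(27.7710, -82.6400, ZOOM): {"population_qk_isochrone_walk_10m": 1234},
}

for row in enrich(rows, feature_table):
    print(row)
```

Rows whose tile has no entry in the feature table simply come back without the extra columns, which mirrors a left join.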

The remainder of the flow stays the same. We can run this updated flow to enrich, train, and evaluate our model on the enriched dataset and see that the model error on the test set is reduced to 0.100 log dollars per square foot, a $37.12/sqft difference between predicted and actual sales prices at the average home value within the test set. This is $1.67/sqft (3.85%) more accurate than the baseline model trained on the basic, un-enriched data.
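
For readers who want to see how a log-scale error maps to dollars, here is a small worked sketch. It rests on two assumptions not stated outright in the post: that the logs are base 10 (consistent with $143.21 corresponding to 2.16), and that the error is applied as an overprediction at the dataset mean of $143.21/sqft, since the test-set average is not given.

```python
MEAN_PRICE = 143.21  # $/sqft, the dataset mean quoted earlier in the post


def log_mae_to_dollars(log_mae, price_per_sqft):
    """Dollar gap when a prediction is high by log_mae in log10 space."""
    return price_per_sqft * (10 ** log_mae - 1)


baseline_gap = log_mae_to_dollars(0.104, MEAN_PRICE)
enriched_gap = log_mae_to_dollars(0.100, MEAN_PRICE)
relative_log_reduction = (0.104 - 0.100) / 0.104

print(f"baseline gap: ${baseline_gap:.2f}/sqft")
print(f"enriched gap: ${enriched_gap:.2f}/sqft")
print(f"improvement:  ${baseline_gap - enriched_gap:.2f}/sqft")
print(f"relative reduction in log-space MAE: {relative_log_reduction:.2%}")
```

Under these assumptions the two gaps land within a few cents of the $38.79 and $37.12 quoted in the post, and the relative reduction in log-space MAE appears to be where the 3.85% figure comes from.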

Taking a closer look at which Iggy features were most impactful in lowering the model error, we can check the top-10 most important features in the Baseline and Enriched models:

Feature importance in the Baseline and Enriched models, calculated using the RandomForestRegressor’s feature importances. Important features introduced in the Enriched model are highlighted in bold.

We see that while many of the important features are shared between the Baseline and Enriched models, there are four appearing in Enriched that are not present in Baseline. These are:

  • acs_pct_households_with_no_internet_access: The percentage of households within the property’s census block group having no internet access, per the U.S. Census American Community Survey (ACS).
  • acs_median_age_cbg: The median age within the property’s census block group, per ACS.
  • park_intersecting_area_in_sqkm_qk_isochrone_walk_10m: The amount of park land (in square km) accessible within a 10 min walk of the property.
  • acs_median_year_structure_built_cbg: The median year built for all homes within the property’s census block group, per ACS.

These additional features are interesting because they tell us something about the neighborhood surrounding each property — not just the property itself.

Iggy-enriched models, per tax district

Pinellas County is a relatively large and diverse place. Looking at the map above we can see the general trend that sales price per sqft is higher along the coast than inland, but there are also micro-trends within the different cities that make up the county. Looking at the six largest tax districts represented within the dataset (Clearwater, Dunedin, Largo, Palm Harbor, Seminole, and St. Petersburg), we can see some pretty drastic differences in the way various features are correlated with price per sqft across the locations:

Correlation of four features with log price per sqft, by tax district.

For example, population age is positively correlated with home prices in Clearwater and St. Petersburg (older people tend to live in more expensive homes there), but is negatively or not correlated in the other four districts. Similarly, while the feature indicating whether a property’s census block group intersects the coast is positively correlated in all tax districts, this is most pronounced in Dunedin, Clearwater, and St. Petersburg.
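
A per-district correlation like the one described above can be computed with a simple group-by. The sketch below uses made-up rows and hypothetical district labels (not the Pinellas data) and a hand-written Pearson correlation so it needs only the standard library.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient; NaN if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


# Made-up rows: (district, median age of block group, log price per sqft).
rows = [
    ("district_a", 35, 2.0), ("district_a", 42, 2.1),
    ("district_a", 50, 2.2), ("district_a", 58, 2.3),
    ("district_b", 35, 2.3), ("district_b", 42, 2.2),
    ("district_b", 50, 2.1), ("district_b", 58, 2.0),
]

# Group the feature and target values by district.
grouped = defaultdict(lambda: ([], []))
for district, median_age, log_price in rows:
    grouped[district][0].append(median_age)
    grouped[district][1].append(log_price)

correlations = {d: pearson(xs, ys) for d, (xs, ys) in grouped.items()}
for district, r in sorted(correlations.items()):
    print(f"{district}: r = {r:+.3f}")
```

When the sign of a correlation flips between groups, as it does in this toy data, a single pooled model has to average over opposing relationships, which is the motivation for fitting one model per district.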

Given the difference in feature distributions by tax district, your favorite data scientist thinks it would make more sense to run these price prediction models on a per-tax-district basis. Again, this is simple to do in Metaflow by just adding another parallel step to the flow:

Per-district Enriched Model flow

  • The feature_selection_and_train_model step combines the tasks from the feature_selection and train_model steps in the previous IggyEnrich flow. It is carried out in parallel for each tax district.
  • The join step simply combines the results from each tax district.

We use the updated flow to run separate model training and evaluation for each tax district, and see substantial improvement in most tax districts:

Complete results for the Basic, Enriched, and Per-District Enriched models

As a check, we can look at how feature importance varies between the district models. Below is a map showing the importance of the acs_median_year_structure_built_cbg feature for a sample of homes throughout the districts. Sure enough, this feature takes on different importance depending on tax district, with highest importance in Dunedin (0.044) and Largo (0.051), and lowest importance in Seminole (0).

Map showing model importance of acs_median_year_structure_built_cbg feature, by tax district. Tax districts north to south are Palm Harbor (0.016), Dunedin (0.044), Clearwater (0.021), Largo (0.051), Seminole (0.000), and St. Petersburg (0.031).

While enrichment with Iggy features helped in all tax districts except one (St. Petersburg), the most important takeaway here is how different types of Iggy features were impactful across different districts, and the ease with which we could test this using Iggy + Metaflow.

Get started!

In this post we showed how to use Iggy to enrich locations in a dataset (in this case, sold homes) with features describing their vicinity, and then how to quickly iterate on experiments involving these features with Metaflow.

If you’d like to try something similar in your own work, here are some links to get you started:

Additionally, here are links to more background reading:

Anne Cocos
