Geographic coordinate encoding using TensorFlow feature columns

Dmitry Yemelyanov
Riga Data Science Club
5 min read · Apr 28, 2020


Introduction

Feature engineering is an essential technique to improve the performance of machine learning algorithms. This article deals with encoding geographic coordinate features to make TensorFlow model training more efficient.

Geographic coordinates

It is quite common to deal with structured data holding geolocation. Usually geo coordinates are stored in separate columns and expressed in decimal degrees. Let’s take a look at the popular New York City Airbnb Open Data dataset as an example:

Part of New York City Airbnb Open Data dataset

No doubt, geographic coordinates hold very important information. In the case of Airbnb listings, the location of a property has a huge impact on the rent price. It is essential to help the TensorFlow model get the most out of this data!

Feature Columns

The easiest way to feed structured data into TensorFlow is through so-called feature columns. The TensorFlow documentation states:

Feature column (tf.feature_column) is a function that specifies how a model should interpret a particular feature.

In simple words, you can think of it as a bridge between raw input data and your model: the output of a feature column becomes the input to the model.

TensorFlow offers several types of feature columns, each differing in how it maps input data.

Numeric Columns

Every neuron in a neural network performs multiplication and addition on weights and input data, so the best format for model input is numeric.

This is exactly what a numeric column, the simplest type of feature column, provides. It represents real-valued features: when you use this column, your model receives the column value from the dataframe unchanged.

Decimal degrees of geographic coordinates are, from a mathematical point of view, perfect values for a neural network to consume, so the most tempting solution would be to construct two numeric columns as follows:

from tensorflow import feature_column
latitude = feature_column.numeric_column("latitude")
longitude = feature_column.numeric_column("longitude")
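
To inspect what the model actually receives, the columns can be materialized with tf.keras.layers.DenseFeatures, which belongs to the same (now-deprecated) feature-column workflow in TensorFlow 2.x. The sample coordinate values below are made up for illustration:

```python
import tensorflow as tf

latitude = tf.feature_column.numeric_column("latitude")
longitude = tf.feature_column.numeric_column("longitude")

# DenseFeatures converts a dict of raw feature values into one dense tensor.
feature_layer = tf.keras.layers.DenseFeatures([latitude, longitude])
batch = {
    "latitude": [[40.7128], [40.6782]],    # made-up NYC-area values
    "longitude": [[-74.0060], [-73.9442]],
}
dense = feature_layer(batch)
# Columns are ordered alphabetically by feature name: latitude, longitude.
print(dense.numpy())
```

Each row of the resulting tensor is one example, with the raw coordinate values passed through unchanged.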

This straightforward approach has some major downsides:

  • Exact coordinates might introduce noise and increase the risk of overfitting to specific locations instead of learning real patterns.
  • A continuous (scalar) representation of coordinates might relate to the target in a complex, non-linear way that is hard for the model to learn from such inputs.
  • The distribution of coordinate values in the dataset might be skewed, so using these features directly can adversely affect the model.

Bucketized columns

Geographic coordinates are better understood by a model when discretized into meaningful groups — buckets (bins).

The transformation of numeric features into categorical features, using a set of thresholds, is called bucketing (or binning).

The concept of bucketing perfectly fits geographic data encoding:

  1. It reduces the impact of small coordinate fluctuations on the model: each bucket smooths out noise in the data.
  2. It provides the model with a richer input compared to a scalar value: the model can now learn an individual weight for each bucket.
  3. It allows unequally spaced buckets, which eliminates the negative effect of coordinate distribution skew within a dataset.

Bucketing has been used in cartography and navigation for centuries, widely known in the form of grid reference maps:

Just as grid lines divide up space on a map to ease orienteering and geospatial analysis for humans, bucketized columns divide the continuous range of coordinates into bins to ease machine learning.
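
To make the idea concrete, here is a minimal sketch of the transformation a bucketized column performs, written with NumPy and made-up boundary values:

```python
import numpy as np

# Hypothetical bucket boundaries for New York City latitudes (degrees).
boundaries = [40.6, 40.7, 40.8]

# np.digitize mirrors what a bucketized column does: each raw value is
# replaced by the index of the bucket it falls into.
latitudes = np.array([40.55, 40.65, 40.75, 40.85])
buckets = np.digitize(latitudes, boundaries)
print(buckets)  # [0 1 2 3]
```

TensorFlow additionally one-hot encodes the bucket index, which is what lets the model learn a separate weight per bucket.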

TensorFlow’s feature_column module provides the bucketized_column function, which constructs a bucketized column from a numeric column. Binning is controlled through the boundaries parameter:

from tensorflow import feature_column
latitude = feature_column.numeric_column("latitude")
longitude = feature_column.numeric_column("longitude")
bucketized_lat = feature_column.bucketized_column(latitude, boundaries=[...])
bucketized_lon = feature_column.bucketized_column(longitude, boundaries=[...])

Defining the right boundaries is a crucial decision. There are two options:

  • Buckets with equally spaced boundaries: the boundaries are fixed and encompass the same range (for example, 0–4 degrees, 5–9 degrees, and 10–14 degrees).
  • Buckets with quantile boundaries: each bucket has the same number of points. The boundaries are not fixed and could encompass a narrow or wide span of values.

Make sure to choose the right boundaries based on the coordinate distribution within your dataset: a skewed dataset requires quantile boundaries; otherwise, equally spaced boundaries should work fine.
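
Quantile boundaries do not have to be picked by hand; they can be derived from the data itself. A minimal sketch with NumPy, using a synthetic skewed latitude sample:

```python
import numpy as np

# Synthetic skewed latitude sample: most listings cluster around 40.70.
rng = np.random.default_rng(0)
latitudes = np.concatenate([
    rng.normal(40.70, 0.01, 900),    # dense cluster
    rng.uniform(40.50, 40.90, 100),  # sparse tail
])

# Quantile boundaries: each bucket receives roughly the same number of points.
n_buckets = 10
boundaries = np.quantile(latitudes, np.linspace(0, 1, n_buckets + 1)[1:-1])
print(boundaries)  # 9 inner boundaries, tightly packed around the cluster
```

The resulting list can be passed directly as the boundaries parameter of bucketized_column.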

The granularity of the boundaries should be evaluated carefully depending on the business value of the model: fine-grained boundaries might allow the model to learn neighborhood-specific or even block-specific effects, while coarse-grained boundaries will result in learning country-specific or city-specific correlations.


With properly designed buckets, the model will learn faster and gain a better understanding of the coordinate features. However, there is still one last thing to improve: latitude and longitude are treated as separate inputs, so the model needs extra effort to learn that a geographic location is in fact the combination of both coordinates.

Luckily for us, TensorFlow has a feature column type that lets us explicitly define the connection between the values of two columns.

Feature crosses come to the rescue!

Feature Crosses

Combining features into a single feature, known as a feature cross, enables a model to learn a separate weight for each combination of feature values.

# hash_bucket_size caps the number of distinct cross values (tune per dataset)
crossed = feature_column.crossed_column(
    [bucketized_lat, bucketized_lon], hash_bucket_size=1000)

This way, instead of having separate latitude and longitude inputs we are explicitly telling the model that both values from these two columns should be treated as a whole.
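
Conceptually, and ignoring the hashing that crossed_column applies under the hood, a feature cross simply enumerates the cells of the grid. A toy sketch with assumed bucket indices:

```python
# Toy bucket indices for one listing (assumed values, not from the dataset).
lat_bucket, lon_bucket = 2, 5
n_lon_buckets = 10

# A feature cross assigns one id per (lat, lon) combination, i.e. one grid
# cell, so the model can learn a separate weight for every cell.
cell_id = lat_bucket * n_lon_buckets + lon_bucket
print(cell_id)  # 25
```

In the real crossed_column, these cell ids are additionally hashed into hash_bucket_size buckets to keep the input dimensionality bounded.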


Using the grid map analogy presented earlier, bucketized columns let the model “see” each coordinate axis separately, providing 1D information for latitude and longitude. The feature cross lets the model “see” the whole 2D picture of the map, making training much more efficient.

Summary

Geographic coordinates hold very important information that requires careful feature engineering decisions. The TensorFlow feature columns API provides all the necessary tools to build an efficient data input pipeline.

Follow these three rules:

  1. Bucketize latitude and longitude using:
    a). equally spaced boundaries when dealing with a balanced dataset
    b). quantile boundaries when dealing with a skewed coordinate distribution
  2. Choose the granularity of the boundaries depending on the map zoom level you expect to learn correlations from:
    a). fine-grained boundaries to learn effects at the local level
    b). coarse-grained boundaries to learn effects at the city or country level
  3. Feature cross bucketized coordinates.
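
Putting the three rules together, here is a sketch of the full pipeline. The boundary values and hash_bucket_size are placeholders to be tuned per dataset, and note that the feature-column API has since been deprecated in favor of Keras preprocessing layers:

```python
from tensorflow import feature_column

# 1. Numeric columns for the raw coordinates.
latitude_num = feature_column.numeric_column("latitude")
longitude_num = feature_column.numeric_column("longitude")

# 2. Bucketize (equally spaced boundaries shown; swap in quantile
#    boundaries for skewed data).
lat_buckets = feature_column.bucketized_column(
    latitude_num, boundaries=[40.6, 40.7, 40.8])
lon_buckets = feature_column.bucketized_column(
    longitude_num, boundaries=[-74.1, -74.0, -73.9])

# 3. Cross the bucketized coordinates; hash_bucket_size caps the number of
#    distinct cross ids the model has to learn.
lat_lon_cross = feature_column.crossed_column(
    [lat_buckets, lon_buckets], hash_bucket_size=100)

# Wrap in an indicator column so it can feed a dense Keras model.
cross_feature = feature_column.indicator_column(lat_lon_cross)
```

The resulting cross_feature can be combined with any other feature columns when building the model's input layer.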

Good luck!
