How BigQuery GIS scales up your geospatial projects

I-Feng Lin
5 min read · Oct 14, 2018


Google released a new feature in open alpha for BigQuery, the data warehouse that enables users to manipulate and query data with SQL commands. The new feature is named BigQuery GIS, and it extends the data platform to support geometry data types and spatial operations. The introduced commands look a lot like those in PostGIS, which is a great benefit for PostgreSQL users. What makes BigQuery GIS stand out is that it applies these commands to data stored in BigQuery with fast response times.

What is BigQuery

BigQuery is one of Google’s big data solutions. It offers a RESTful web service to perform analysis of large datasets, working in conjunction with Google Cloud Storage. It supports two SQL dialects, Standard SQL and Legacy SQL, to retrieve and manipulate terabytes of data within seconds. With the recent integration of machine learning and geospatial information systems, this data warehouse has virtually transformed into a data analytics platform in the cloud.
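To get a feel for what querying in BigQuery looks like, here is a minimal Standard SQL sketch against one of Google’s public datasets (the `usa_names` dataset is used purely as an illustration; any sufficiently large table would do):

```sql
-- Aggregate roughly 6 million rows in a BigQuery public dataset;
-- queries like this typically return within seconds.
SELECT
  name,
  SUM(number) AS total
FROM
  `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY
  name
ORDER BY
  total DESC
LIMIT 10
```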

What is BigQuery GIS

BigQuery Geospatial Information Systems, or BQ GIS for short, is the enhancement of BigQuery with features to support geospatial data types and functions. You can store points, line strings, multi-polygons, and collections in BQ GIS tables with a dedicated data type named GEOGRAPHY. GEOGRAPHY objects can be passed directly to geospatial functions in Standard SQL to retrieve results or store them in a new table. The supported functions include constructors, transformations, predicates, and measures of geospatial objects.
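As a small taste, the following sketch combines a constructor (ST_GeogPoint) with a measure (ST_Distance). The coordinates are illustrative, not from our data:

```sql
-- Construct two GEOGRAPHY points from (longitude, latitude) pairs
-- and measure the geodesic distance between them in meters.
SELECT
  ST_Distance(
    ST_GeogPoint(13.4050, 52.5200),  -- Berlin
    ST_GeogPoint(11.5820, 48.1351)   -- Munich
  ) AS distance_in_meters
```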

Why BigQuery GIS

  • Reduce complexity — everything in one tool
  • Require only one step to export data from the warehouse
  • Increase speed — geometric operations in BigQuery are fast even with large amounts of data
  • Scale up easily with BigQuery

Our Use Case and Implementation

We have users across Germany and other European countries. Let’s say we want to create a heat map of the number of users broken down by zip region. Given the reported coordinates of user requests and the German zip region files, we can map every request to a zip region and count the number of users in each region. We can then color the map of Germany like this, where red areas have the highest numbers of requests:

And this requires just four steps:

Step 1: Import shapefiles into BigQuery

Download the German zip region shapefiles and convert the .shp file into WKT (well-known text) with GDAL’s ogr2ogr:

> ogr2ogr -f csv -lco GEOMETRY=AS_WKT plz-gebiete.csv plz-gebiete.shp

Then import the created csv file into a BigQuery table (see the BigQuery documentation for loading data with the UI, the command line, or scripts).

Step 2: Convert regions into polygons with Standard SQL. Here we use the geospatial constructor ST_GeogFromText to convert the WKT strings in the table loaded from the csv file into polygons and store the results in another table named geodata.de_zip_regions_geo.

CREATE TABLE geodata.de_zip_regions_geo AS (
  SELECT
    ST_GeogFromText(WKT) polygon,
    zip_code
  FROM
    `beaconinside-analytics-test.geodata.de_zip_regions`
)

Step 3: Convert coordinates into points with Standard SQL. Here we use the geospatial constructor ST_GeogPoint to convert pairs of longitude and latitude float values in a table into GEOGRAPHY points and store them in another table named demodata.user_requests_geo.

Standard SQL command:

CREATE TABLE demodata.user_requests_geo AS (
  SELECT
    ST_GeogPoint(lon, lat) point,
    user_id_sdk,
    created_at
  FROM
    `beaconinside-analytics-test.demodata.user_requests`
)

Step 4: Count users in each region with Standard SQL. We select from both the regions and requests tables, using the geospatial predicate ST_Covers as the join condition. Requests whose coordinates are covered by a region are joined into the new table. Using ST_Intersects (or, boundary cases aside, ST_Contains) would work here as well. The rows are grouped by zip code, and we count the distinct values of user_id_sdk, the identifier of a user, to get the number of users per region. The polygon is selected with ANY_VALUE because all rows with the same zip code share the same polygon.

SELECT
  zip_code,
  ANY_VALUE(regions.polygon) polygon,
  COUNT(DISTINCT requests.user_id_sdk) num_users
FROM
  `beaconinside-analytics-test.geodata.de_zip_regions_geo` regions,
  `beaconinside-analytics-test.demodata.user_requests_geo` requests
WHERE
  ST_Covers(regions.polygon, requests.point)
GROUP BY
  zip_code
ORDER BY
  num_users DESC

Each result row consists of the zip code of a region, its GEOGRAPHY (the polygon), and the number of users in that region. The results can also be displayed in a tool named BQ Geo Vis, released together with BQ GIS. It is more of a preview tool, because the number of rows is limited to 2,000 and the data size to 10 MB. That is why, in the following screenshot, only some regions are colored. The results are sorted in descending order so that the top regions are plotted. Red indicates a high number of users. Nonetheless, the table can be downloaded and plotted with other tools, such as the Germany map shown at the beginning, which was plotted with matplotlib.
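One way to respect the BQ Geo Vis limits is to plot only the top regions. A sketch, assuming the aggregated results from Step 4 were saved to a table (the name `geodata.de_zip_regions_users` is hypothetical):

```sql
-- Keep only the top regions so the result fits within the
-- BQ Geo Vis preview limits (2,000 rows / 10 MB).
SELECT
  zip_code,
  polygon,
  num_users
FROM
  `beaconinside-analytics-test.geodata.de_zip_regions_users`  -- hypothetical table
ORDER BY
  num_users DESC
LIMIT 2000
```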

Why we like BQ GIS

With all the advantages mentioned earlier, you may still wonder whether it is worth migrating your project to BQ GIS to get the same outcome. After all, GIS support is nothing new among databases. What really makes a difference is not what you get but how long it takes as the amount of data scales up. As an active API provider in proximity marketing, Beaconinside receives a huge number of user requests per day. Mapping the requests of one day to regions can take seconds. How about mapping the requests of one month? We found that BQ GIS performs exceptionally well when the amount of data is large, and even when it is very large. The following chart shows that the spatial join and aggregation done with BQ GIS take only slightly more time when the number of requests triples.

We are very happy to see a technology that helps us analyze geospatial data at large scale without demanding high computation power on our side. Moreover, it simplifies the data pipeline if you are already using BigQuery as your data warehouse. We look forward to BQ GIS becoming a standard feature and providing more functionality, such as coordinate system conversions.


I-Feng Lin

Data science engineer at home and Brazilian jiu-jitsu fighter at Checkmat Berlin