Geospatial Data Analysis using SQL

Siva Saindra Reddy K
Walmart Global Tech Blog
9 min readJun 24, 2020
Image Source: Google Images

Understanding the Geographical location has now become more important than before. Getting insights from the geospatial data helps in making quicker and optimal business decisions. Microsoft SQL Server provides many functionalities to support geospatial data analysis. In this post, we will start with the overview of geometry vs geography data types in SQL and go on with creating spatial points, circles, polygons, and data manipulating functionalities like spatial intersection, combine, cut, and also analyze some use cases with spatial match, distance, neighbors. In the last section, we will achieve dataflow with SQL for a simple use case on customer data of an E-Commerce company.

1. Geometry vs Geography data types in SQL.

  • The geometry data type is a planar(2-D) spatial data type on a flat surface. We can imagine a geometry spatial object as an object on a map that is on a paper or simply an object drawn on a graph sheet.
  • The geography data type is a terrestrial spatial type object which we can imagine as an object on the curved surface of the earth.
  • Both geometry and geographical spatial type objects are stored in binary format. But, we can easily convert these objects to readable text formats using built-in SQL string functions or even to a GeoJSON by writing custom SQL queries/functions.
  • Some of the functionalities which are available with geometry data type are not available with the geography data type, because of its complex structure. For example, we can easily define a Centroid for Geometry data type object as its planar, but due to the curved surface of the earth, it is difficult to define the centroid for Geography data type object.
  • The advantage here is we can easily interconvert between the Geometry data type and Geographical data type objects. So, we can find the centroid of geography objects by converting it to a geometry object.
  • We can directly use the Geometry data type as we can replicate the entire earth on a single plane. But careful considerations need to be taken while doing calculations like area or distance between two objects etc. Because when we talk about the area of a geospatial object, we are referring to an exact area of an object on the globe. So, we need to convert this object to the Geography data type before doing these kinds of operations.

2. Create a spatial point

If we have latitude and longitude values of the point measured in degrees on the globe then we can create the spatial point as follows.
Longitude = -118.2423
Latitude = 34.0225

https://gist.github.com/Saindra/4a6dca2daf46cb0b5a32af868cba3bdf
Spatial Point

3. Create a spatial circle around a point

To create a circular object around point(-118.2423 34.0225) we have STBuffer() which takes radius value as input. For SRID=4326, the unit of length is meter. So, 5 km=5000 meters.

Note:- The spatial circle which we are referring to here is in the context of the geography of the earth. So, we have to build a circular object with geography data type and then if needed we have to convert it to a geometry data type or a GeoJSON.

We have STAsBinary() that returns the geometry or geography object’s binary representation. With this representation, we can interconvert between the geometry and geography data types.

https://gist.github.com/Saindra/7fc83dcf799e29f905ccbf756852f16f
Spatial Circle

Suppose if we want to create multiple circles around a geo point at a time, we can do the following,
For radius = 5, 10, 15 KMs

https://gist.github.com/Saindra/c2564c64bfbb0b05d832053a2d61a56f
Spatial Circles

4. Create spatial polygon

We can follow a similar approach of creating a spatial point, to create a polygon out of N spatial points. For Point, if P1= -118.2423 34.0225, we used STGeomFromText(‘POINT (P1)’, 4326)

Similarly, if we have N points, we can replace ‘POINT (P1)’ from above with the following ‘POLYGON(( P1, P2, P3, P4, …….. PN-1, PN, P1 ))’

NOTE:- The last point and first point should be the same to close the loop.
Example:- Suppose if we have 3 points as
P1 = -118.3023 33.9927, P2 = -118.1794 33.9902, P3= -118.2436 34.0911

The following query builds the poly object,

https://gist.github.com/Saindra/744b938f90709199295c495ba52f0bbb
Spatial Polygon

Note:- With the data type conversions between geometry and geography, the shape doesn’t change. We are just replicating the curved earth’s surface on a plane and vice versa.

5. Spatial Info

Area:
When we talk about the area of the spatial object, it is in the context of the geography of the earth. So we have to convert the spatial object to geography data type and apply area function. For SRID=4326, the units for the area is square meters.

https://gist.github.com/Saindra/5e2ffbefdb2f88ea7f6bdab280933a5b

This gives 78.51 square meters.

We can apply STArea() on the same geography object by converting it to the geometry data type. But, it will not give exact area w.r.t earth because as discussed geometry data is the exact replica of the geography data (earth surface) on a 2-D plane.

https://gist.github.com/Saindra/bbb82b2f4f70ab6d8f2a13abafc05e10

This will give 0.0000000077. So, we can call this an area factor.

Centroid:

To calculate the centroid of the spatial object we have STCentroid() from SQL. This is only applicable to geometry data type but not geography data type as geography data type objects are not planar.

But we can convert geography data type object to a geometry type object and apply centroid function.

On the top of STCentroid(), we have STX(Longitude) and STY(Latitude).

https://gist.github.com/Saindra/25a6e29c442b7072e6f10355517a8561

6. Spatial Intersection

If we want to find the intersection of any two spatial objects, then we can use STIntersection(). This supports both geography and geometry data types.

https://gist.github.com/Saindra/88cc51daf49667d59cb84b26863d8c7e
Spatial Intersection

Use case: -
We have a table with 2 columns as Shape1 and Shape2 and then if we need a third column Shape3 which is the intersection object of the column: Shape1 and column: Shape2 for all rows in the table.

7. Spatial Cut

Suppose if we have two spatial objects and we need to cut one object from the other, we can use STDifference(). This also supports both geography and geometry data types.

https://gist.github.com/Saindra/7e28fe1f27bf7ed51c274a6cc5f60d2e
Spatial Cut

Use case: -
We have a table with 2 columns as Shape1 and Shape2 and then if we need a third column Shape3 which is the spatial cut object of the column: Shape1 from the column: Shape2 for all rows in the table.

8. Spatial Combine

While working on spatial operations, one might need to combine spatial objects. Combining spatial objects involve two scenarios.

Case:1
Suppose if we have two spatial objects and need to find the union of these two spatial objects or else as mentioned before, we have a table with 2 columns as Shape1 and Shape2 and then if we need a third column Shape3 which is the combine/union object of the column: Shape1 and column: Shape2 for all rows in the table.

https://gist.github.com/Saindra/e82c5166b64d6e11b4df22c0b777041c
Spatial Combine-1

Case: 2
We have a table with 2 columns, one as Shape and other as an ID with some values and we need to combine all the spatial objects in the Shape column group by values in the ID column.

For Example:-
We have a table TBL_GEO_SHAPE ( columns: ID (int), Shape (geometry))

The following query does the transformation as shown in the following figure.

https://gist.github.com/Saindra/87fa9fab3ccaff5dd159e7f27300362d
Spatial Combine-2

9. Spatial Match

This is the most widely used functionality in geospatial analysis. Below are some of the use cases where we need a spatial match.

1. Suppose in a country, if we have all the states and its corresponding geo shapes and also, we have all the zip codes in that country and its geo Shapes. Now, we need to figure out / match each of the zip code to its corresponding state.

2. If we have the location info (latitude, longitude) data of all the Walmart stores or Dominos outlets, etc. and if we need to match all these stores to its corresponding state/city or find out all the stores which are present in a particular state/city or even in a custom geo shape.

We can achieve this by using SQL built-in functions and configuring these with SQL inner join easily.

For Example:

  • We have a table with all the States and its corresponding geo shapes as: TBL_STATES_INFO ( Columns : State_Name(nvarchar), State_Code(int), State_Shape(geometry) )
  • We have another table with all the zip codes and its corresponding geo shapes as: TBL_ZIPCODE_INFO ( Columns : Zip_Name(nvarchar), Zip_Code(int), Zip_Shape(geometry) )

The following query will match all the zip codes to its corresponding states.

https://gist.github.com/Saindra/a79f630f66e7641055b52bbcd851838d

If we have to match latitude, longitude then we need to create a spatial point and do the same. Also, we can append the WHERE clause or ORDER BY depending on our requirements.

Apart from STIntersects(), SQL provides many different functions that we can use depending on our requirements.

  • STContains() — returns 1 if the object passed to this method contains inside the calling object else 0
  • STDisjoint() — returns 1 if both the objects are disjoint i.e. intersection of these objects point sets is empty else 0
  • STEquals() — returns 1 if both the objects are equal i.e. both objects have same point sets else 0
  • STOverlaps() — return 1 if there is a spatial overlap between two objects else 0

if the spatial reference identifier (SRID) of the two comparable objects doesn’t match then null will be returned.

10. Spatial Distance and KNN

A Spatial object is nothing but a collection of all the spatial points that are within the boundary of that spatial object. Suppose if we have two spatial objects, the shortest distance between the two can be defined as a distance between the pair of points, one from each of the spatial objects which are closest among all possible point pairs.

We have STDistance() from the SQL which gives the defined shortest distance between two objects.

If we need the exact spatial distance with reference to earth, as mentioned before we have to apply this STDistance() on geography objects only. Geometry objects needed to be converted to the geography data type.

https://gist.github.com/Saindra/e7baf9929cc467751783944777405ad9

This gives the distance in meters.

If the two objects intersect/touch, then the shortest distance between them will be zero as both objects will have a spatial point in common.
We can apply STDistance() on geometry data type objects also but this will give the relative distance or distance factor similarly as we discussed for the area of the spatial object.

For a spatial object( @g ), we can easily find K-Nearest Neighbours from a given table/dataset using SQL, with the help of SQL distance function in the following way.

For Example:-

  • we have a table TBL_SPATIAL_INFO( Columns : ID(int), Shape(geometry))
  • A spatial Object @g.

The following query gives the K-Nearest Neighbours for spatial Object @g.

https://gist.github.com/Saindra/2038126dee0603c328defba54f6283c5

Where K is a natural number.

One more use case with STDistance() is suppose if we have a spatial object and we need to find all the spatial objects from a dataset/table which touches/intersects with the given spatial object.

As discussed, the distance between two objects is zero if they touch/intersect. So, we can query up as following for this

https://gist.github.com/Saindra/33957ede9fec7f2eae17be74be4a87fa

11. Dataflow with SQL

In this section, we will build a dataflow to analyze the customer sales data of an E-commerce company.

Considering we have customer data i.e., each customer’s purchase amount info and location info (longitude and latitude) and we have cities’ and states’ geo-shapes. We will analyze the total sales from each city and the number of customers from each city of the E-commerce company.

For Example we have

  • TBL_CUSTOMER_DATA (Columns: Customer_Id(int), Customer_Name(nvarchar), Sales_Amount(int), Latitude_dgr(decimal), Longitude_dgr(decimal)
  • TBL_CITIES_INFO (Columns: City_Id(int), City_Name(nvarchar), City_Shape(geometry)
  • TBL_STATES_INFO (Columns: State_Id(int), State_Name(nvarchar), State_Shape(geometry)

The following SQL Script generates the Total_Sales and Customer_Count from each city in all the states.

https://gist.github.com/Saindra/8603f91d2688739113f4a810c27a7d27

With this kind of approach, we can orchestrate the solution to any complex problem or analyze the data by breaking down the problem statement into smaller chunks.

--

--