Apache Hive: ESRI GeoSpatial Support

6 min readJun 15, 2023

What is GeoSpatial?

In simple terms, “GEOSPATIAL INFORMATION” is geography and mapping. It is “place based” or “locational” information. It is data tied to a map and capable of being portrayed on a map. With the accumulation of vast quantities of location information, it has become very important to store and manage this information efficiently. When this information is used efficiently, it can provide many useful insights which could not be generated using raw data. When location information is tagged with normal data, it gives a different dimension to the analytics. The information is useful in different fields like fraud detection, user behaviour prediction, weather forecasting and many more. Due to the nonlinear nature of geospatial data, it can not be stored and processed as normal data. Special storage and indexing techniques are used to make the processing faster.

GeoSpatial Data With Hive

Hive provides adding ESRI based standard geospatial data directly into hive tables. It even provides several GeoSpatial UDFs which can be used to process the geospatial data and provide meaningful insights. Along with the capability to operate on these data, the geospatial data in hive tables can visualized graphically when linked to tools like PowerBI & Tableau. These all capabilities are available in Hive out of the box. The UDFs are autogenerated by Hive. The user need not load the jars or create the functions. Hive internally manages these UDFs so that users need not worry about the compatibility between different versions. The data can be loaded from any of the standard formats like GeoJSON, WKB or WKT. Once the data is loaded, users can operate on this data by using the various inbuilt functions supported.

Creating Tables Supporting GeoSpatial Data

Tables supporting GeoSpatial data can be created normally like any other table. Just make the column that stores the GeoSpatial data a BINARY type. Natively creating Geometry/Geography type is not yet supported. Users can keep the geospatial data in other data types. Those fields can be combined using the UDF provided to form a Geometry type during query execution.

Table with double fields representing coordinates :

In the table below, the earthquake coordinates are stored as double fields. Table earthquakes stores the earthquake occurrence info backed by a CSV file.

CREATE TABLE earthquakes (
      earthquake_date STRING,
      latitude DOUBLE,
      longitude DOUBLE,
      depth DOUBLE,
      magnitude DOUBLE,
      magtype string,
      mbstations string,
      gap string,
      distance string,
      rms string,
      source string,
      eventid string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Table using binary columns to store Geometry data :

Binary fields can be used to load the data stored in standard geometry formats. A Hive provided Json serde can be used to load GeoJson data stored in a file to HIve. Data stored in KWT and KWB are loaded directly and do not need any serde to convert the data before loading. The counties table, stores the boundary info for each county backed by standard geospatial format “EnclosedEsriJsonInputFromat”. Column BoundaryShape stores the boundary of each county.

CREATE TABLE counties (
        Area string,
        Perimeter string,
        State string,
        County string,
        Name string,
        BoundaryShape binary)                  
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.udf.esri.serde.EsriJsonSerDe'
STORED AS
 INPUTFORMAT 'org.apache.hadoop.hive.ql.io.esriJson.EnclosedEsriJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

Adding Data to the Tables

Once the table is created, the data can be ingested into the tables using either direct inserts into the tables, or if the data is available in some standard geospatial formats, loaded using the load method.

Direct Inserts into the Tables:

To Insert data directly into the tables, we can run a normal insert query as follows

INSERT INTO dist_binary          
    VALUES (
       ST_Point(12.5,16.2),
       ST_Point(18.7,4.5)
     );

This query inserts two coordinates into the table. The ST_POINT(X,Y) is a hive provided UDF which converts the coordinate into GeoSpatial Standard binary format.

LOAD data into the tables from Standard GeoSpatial Data Files:

If the data is available in some standard data format in some external files, it can be directly loaded as follows

LOAD DATA INPATH 'earthquake-demo/counties-data/california-counties.json' 
OVERWRITE INTO TABLE counties;

Operating/Querying the GeoSpatial Data:

Hive provides a bunch of GeoSpatial UDFs which can be directly used for simple scenarios like calculating the distance between two points or clubbing them into complex GeoSpatial queries.

Calculating distance between 2 points:

select
 ST_AsText(pointa) as First_Point,
 ST_AsText(pointb) as Second_Point,
 ST_Length(ST_LineString(array(pointa,pointb))) as Distance
 from dist_binary;

This statement computes the distance between the two coordinates in a row.

The output looks something like this:

Figuring out the number of earthquakes in a county using its coordinates

  SELECT counties.name, count(*) cnt
    FROM counties
         JOIN earthquakes
    WHERE 
       ST_Contains(counties.boundaryshape, 
                             ST_Point(earthquakes.longitude, earthquakes.latitude))
    GROUP BY counties.name
    ORDER BY cnt desc;

This calculates the number of earthquakes in various countries using the earthquake coordinates and countries geometry.

The output looks something like this:

Available Hive GeoSpatial UDFs

The Hive Engine provides direct support for all the UDFs provided by ESRI. The list and details of which can be found at:

UDF Documentation

The Spatial Framework for Hadoop allows developers and data scientists to use the Hadoop data processing system for…

github.com

Integrating with CDV/Tableau/PowerBI

Example Integrating with CDV(Cloudera Internal-Similar to PowerBI & Tableau)

Create a table to store the delay time and coordinates of Indian Airports:

create table india_airport_delays(name string, coordinate binary, delay int);

Insert some sample data into the table:

insert into india_airport_delays values
('Chandigarh Airport', ST_Point(30.667767, 76.786232), 8),
('Diu Airport', ST_Point(20.714273, 70.921616), 2),
('Kempegowda International Airport (BLR)', ST_Point(13.199379,77.710136), 11),
('Biju Patnaik International Airport (BBI), Bhubaneshwar', ST_Point( 20.254780 ,85.816521), 3),
('Visakhapatnam Airport, Andhra Pradesh,', ST_Point(17.728647,83.223549), 5),
('Indira Gandhi International Airport (DEL), New Delhi', ST_Point(28.556160,77.100281), 14),
('Chhatrapati Shivaji International Airport (BOM), Vile Parle East, Mumbai', ST_Point(19.097403,72.874245), 10),
('Sardar Vallabhbhai Patel International Airport, Ahmedabad', ST_Point(23.0734,72.6266), 7),
('Lucknow Airport', ST_Point(26.761,80.889), 6),
('Jaipur Airport', ST_Point(26.824,75.812), 4),
('Devi Ahilyabai Holkar Airport, Indore', ST_Point(23.0734,72.6266), 2),
('Guwahati Intl Airport', ST_Point(26.106,91.586), 6),
('Cuttack Airport', ST_Point(20.550,85.886), 7),
('Raxaul Airport, Ekderwa, Bihar,', ST_Point(26.996279,84.820374), 3);

Create a View using this table which can be used by CDV for visualization purpose:

create view india_airports as select name, ST_X(coordinate) as latitude, ST_Y(coordinate) as longitude, delay
from india_airport_delays;

Check the contents of the original table(Optional)

select name, ST_AsText(coordinate), delay from india_airport_delays;

Output:

Check the contents of the view. (Optional)

select * from airports.india_airports;

Output:

Create a CDV connection and dataset using the View. Then create Visuals using normal CDV UI commands. (Different for PowerBI or Tableau)

Sample Output:

The above adds 4 Visuals into the Dashboard. The above two Visuals are full displays of all the airports of India.

The Second row uses CDV filter on the delay column of the view to show only the Airports with a delay time of more than 5 Minutes.

Magnified Output of Airport With Delay > 5 Mins

Future Scope

Add support for Native GeoSpatial data types, rather than storing them as Binary/Text
GeoSpatial data specific performance improvements.
Support for creating QuadTree dynamically during query processing.
Data partitioning on geometry column based on sample data.
CBO support for geometry type.
Connector for ArcGIS and QGis like tools.

OpenSource Jira:

Native GeoSpatial Support in Hive

This is an epic Jira to support GeoSpatial datatypes natively in Hive. This will cater to the applications which…

issues.apache.org