Spark-Based In-Memory Analytics for In-Memory Data Grids & Databases


IMDGs as a critical enterprise architecture component

Pierce Lamb
Aug 23, 2017 · 8 min read

In-memory data grids (IMDGs) have become an indispensable infrastructure component for high-traffic, customer-facing applications. Highly concurrent transactional applications like hotel reservation systems, financial services systems, and real-time telco billing systems all use IMDGs as a critical component of their architecture. IMDGs also power applications that see heavy backend traffic from mobile apps, where personalization, long-lived sessions, and recommendations can be very useful.

IMDGs typically offer a key-value (KV) store held in memory, with optional persistence, high availability and consistency through synchronous replication, and disaster recovery through asynchronous cross-data-center replication. IMDGs tend to be elastic: you can scale them up or down at runtime, and they expand to use the additional CPU, memory, disk, and network bandwidth to increase system capacity without adding latency. Most of these systems are deployed in environments where the workload is read-heavy, with transactional writes being a smaller but important consideration. With almost 15 years of R&D behind them, IMDGs are mature, essential pieces of technology that now power mission-critical applications around the world. We would know, because we helped create one of the most heavily used IMDGs, GemFire, which was later open sourced as Apache Geode.
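To make the key-value model concrete, here is a minimal sketch using the Apache Geode client API; the locator address, region name, and data are illustrative:

    import org.apache.geode.cache.client.{ClientCacheFactory, ClientRegionShortcut}

    // Connect to the grid via a locator (host and port are illustrative)
    val cache = new ClientCacheFactory()
      .addPoolLocator("localhost", 10334)
      .create()

    // A region is the grid's key-value namespace, roughly analogous to a table
    val reservations = cache
      .createClientRegionFactory[String, String](ClientRegionShortcut.PROXY)
      .create("reservations")

    reservations.put("res-42", """{"guest":"A. Smith","nights":3}""")
    println(reservations.get("res-42"))  // low-latency read served from memory

    cache.close()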

Analytics on IMDG data today

Live analytics on transactional data stored in IMDGs, however, is a different story. IMDG-based systems were not designed to support analytic-class queries or workloads. Any attempt to bolt on such functionality through custom coding imposes heavy performance penalties on the functions these systems were built for: low-latency reads and quick transactional writes.

Furthermore, IMDGs do not offer a sophisticated SQL interface or connectors like JDBC and ODBC that would allow BI tools to connect to them. Even if they did, the cost of running analytic queries would swamp the ability of these systems to serve their primary function.

As a result, the traditional approach to IMDG analytics has been to batch transactional data out to files and run overnight loads into MPP systems or Hadoop, which requires the files to be preprocessed into columnar format and loaded before BI tools can act on the data. This offers, at best, stale insights on old data. As transaction volumes continue to grow, the increased time to insight that this approach engenders carries a significant opportunity cost.

To summarize, the problems with IMDG/NoSQL systems that make them unsuitable for live analytic/prediction workloads include:

  • Lack of analytical SQL capabilities in their querying engines
  • Lack of columnar storage, which increases the memory footprint per row of data
  • Inability to use CPU cache lines effectively, which decreases performance
  • Inability to separate analytic workloads from transactional workloads, which ends up impacting performance for the entire system
  • Lack of sophisticated ML/AI libraries that can work natively on the data

To work around these issues, the solutions frequently deployed are complex and slow, waste resources, and ultimately prevent mass adoption within an enterprise. Deployed in the cloud, these solutions drive up compute, memory, network, and storage costs in addition to being slow.

A better approach would make the data available in an in-memory analytics system right after each event is recorded in the transactional layer, and perform analytics on that second system, thereby protecting the transactional system from the cost of CPU-intensive tasks like OLAP querying and machine learning.

Quick review of SnappyData

With SnappyData, you can perform analytics on the most up-to-the-minute transactional data running through your system. SnappyData is a converged, in-memory, scale-out compute-plus-data platform that supports data mutability, real-time analytics, streaming, and machine learning in a single cluster. We realize this by fusing Apache Spark with the core capabilities found in typical SQL/NoSQL databases: concurrency, elasticity, and predictable latency. Because it is based on Apache Spark, SnappyData works seamlessly with HDFS and other legacy data sources through Spark connectors.

Figure 1 shows the conceptual fusion of Apache Spark and GemFire that we have used to build a new class of real-time analytic platform.

Unlike standalone Spark, SnappyData is a full-fledged mutable data store that is 100% compatible with Spark and lets users access the data as SQL tables or as Spark DataFrames. SnappyData offers connectivity to any data store that has a Spark connector, allowing users to perform rich analytics across different data stores (as one might find in a classic microservices architecture).

The resulting architecture (shown in Figure 2) is a Spark data platform that behaves like a hybrid, SAP HANA-style database, with full SQL and Spark capabilities built into the distribution. The data store offers in-memory storage in row- or column-oriented tables, a sophisticated query planner and optimizer, shared-nothing persistence, high availability through replication, and high concurrency through highly available ODBC and JDBC drivers. The product supports mixed query workloads, allowing users to run any Spark SQL query on SnappyData. It also supports low-latency lookups and updates, which make it possible to build continuous applications where data does not have to be reloaded periodically from the data source to reflect state changes.
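As a brief illustration, here is roughly what working with SnappyData tables looks like from Spark; the table name and schema are made up for the example, and a real deployment would point at a running cluster rather than local mode:

    import org.apache.spark.sql.{SnappySession, SparkSession}

    val spark = SparkSession.builder()
      .appName("imdg-analytics")
      .master("local[*]")  // illustrative; point at a real cluster in practice
      .getOrCreate()
    val snappy = new SnappySession(spark.sparkContext)

    // A column table optimized for analytics (name and schema are made up)
    snappy.sql(
      """CREATE TABLE IF NOT EXISTS reservations (
        |  id BIGINT, hotel_id INT, price DOUBLE, booked_at TIMESTAMP
        |) USING column""".stripMargin)

    // Query it with plain Spark SQL...
    snappy.sql(
      "SELECT hotel_id, count(*) AS bookings FROM reservations " +
      "GROUP BY hotel_id ORDER BY bookings DESC").show()

    // ...or treat the same data as a Spark DataFrame
    val df = snappy.table("reservations")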

Live analytics on IMDG data using SnappyData

Next, let’s take a look at how you can perform live SQL-based analytics on your IMDG application data using SnappyData. There are two equally important elements to supporting live analytics through SnappyData:

  • SnappyData is a mutable store: users can propagate CDC (change data capture) events into SnappyData through Spark streaming, where they are reflected in SnappyData tables and can be queried via ODBC/JDBC.
  • SnappyData can create external tables over any data store that has a Spark connector, allowing BI tools to join incoming streams with reference data loaded into SnappyData or sitting in legacy data sources elsewhere in the enterprise (see the sketch below).
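As a sketch of the second element, an external table can be declared over any store that has a Spark data source; the example below uses Spark's standard JDBC source against a hypothetical Postgres reference table (URL, table, and credentials are made up), continuing with the SnappySession from the earlier sketch:

    // External table over a hypothetical Postgres reference table
    snappy.sql(
      """CREATE EXTERNAL TABLE hotels USING jdbc OPTIONS (
        |  url 'jdbc:postgresql://refdb:5432/crm',
        |  dbtable 'public.hotels',
        |  user 'analytics',
        |  password '****'
        |)""".stripMargin)

    // BI tools connected over JDBC/ODBC can now join live data with reference data
    snappy.sql(
      """SELECT h.city, count(*) AS bookings
        |FROM reservations r JOIN hotels h ON r.hotel_id = h.id
        |GROUP BY h.city""".stripMargin).show()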

Figure 3 shows what such a system would look like. Here we depict both NoSQL transactional stores and IMDGs because both of those would work identically when used with SnappyData for analytics.

The approach is to stream the fast moving data sets into SnappyData through Spark streaming (Push), while using the Spark connector to get access to (Pull) reference data that may be relevant to run analytic queries/jobs/algorithms on the data sitting in other legacy stores. One might imagine this as turning SnappyData into the In-memory Analytics Hub.

To accomplish this, the user hooks the relevant source tables' CDC mechanisms up to a stream receiver in SnappyData. The stream processor then bulk-inserts each stream window into column tables in SnappyData. In parallel, for data sitting in other data sources in the enterprise, the user creates external tables over those sources and exposes the schema to BI tools or data exploration applications.
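The push side might look roughly like the following Spark Streaming sketch, where a socket stream and CSV parsing stand in for whatever feed and format the source's CDC mechanism actually provides:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(spark.sparkContext, Seconds(5))

    // Placeholder receiver; real deployments would attach the source's CDC feed
    val events = ssc.socketTextStream("cdc-host", 9999)

    events.foreachRDD { rdd =>
      import snappy.implicits._
      // Parse each micro-batch (the CSV parsing here is a placeholder) and
      // bulk-insert it into the column table created earlier
      val batch = rdd.map(_.split(","))
        .map(f => (f(0).toLong, f(1).toInt, f(2).toDouble,
                   java.sql.Timestamp.valueOf(f(3))))
        .toDF("id", "hotel_id", "price", "booked_at")
      batch.write.insertInto("reservations")
    }

    ssc.start()
    ssc.awaitTermination()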

Once the data is flowing into SnappyData and the external tables have been created and their connectors initialized, the user has a number of options for building analytic applications:

  • Slice and dice using dynamic analytic queries (see the sketch after this list)
  • Use BI tools to build dashboards
  • Use Python, Scala and R to build data science applications
  • Use machine learning libraries to glean insight from the data
  • Build custom visualizations that update in real time
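As an example of the first option, ad-hoc slicing from a Spark shell or notebook might look like this, using the illustrative table and columns from the earlier sketches:

    import org.apache.spark.sql.functions._

    // Last seven days of bookings, diced by hotel
    snappy.table("reservations")
      .filter(col("booked_at") >= date_sub(current_date(), 7))
      .groupBy(col("hotel_id"))
      .agg(count("*").as("bookings"), avg("price").as("avg_price"))
      .orderBy(desc("bookings"))
      .show(20)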

Interesting Potential Applications

The approach outlined above offers some tantalizing analytic possibilities that would otherwise be out of reach for data residing in a high-volume transactional data store.

For instance, in a hotel reservation system, real-time clustering could reveal where reservation hotspots are developing, letting the operator dynamically add capacity or change prices to capitalize on demand.
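As a rough sketch, such clustering could be prototyped with Spark MLlib's KMeans over reservation coordinates; the latitude/longitude columns are hypothetical additions to the illustrative table:

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.feature.VectorAssembler

    // Assemble coordinates into a feature vector (lat/long columns are hypothetical)
    val features = new VectorAssembler()
      .setInputCols(Array("latitude", "longitude"))
      .setOutputCol("features")
      .transform(snappy.table("reservations"))

    // Cluster recent reservations to surface geographic demand hotspots
    val model = new KMeans().setK(10).setSeed(1L).fit(features)
    model.clusterCenters.foreach(println)  // candidate hotspot centroids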

In a telco application, the ability to spot congestion hotspots in real time using machine learning, or to track the top-K towers by traffic over time in a streaming real-time dashboard, would allow operators to take preemptive action to maintain QoS.
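Expressed as plain Spark SQL over a hypothetical call-detail table, the top-K query itself is straightforward:

    // Top ten towers by call volume in the last hour
    // (table and column names are hypothetical)
    snappy.sql(
      """SELECT tower_id, count(*) AS traffic
        |FROM call_records
        |WHERE event_time > current_timestamp() - INTERVAL 1 HOUR
        |GROUP BY tower_id
        |ORDER BY traffic DESC
        |LIMIT 10""".stripMargin).show()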

The ability to use cell phone location data to gauge ridership on BART or other public transit systems would spare operators from sending staff out to conduct ridership surveys. Combined with anonymized rider cohort data, it would be easy to build classification and clustering models over origin-destination graphs as a way to plan capacity or introduce new routes based on real information.

Other potential applications where live analytics makes an immediate impact are mission-critical workloads like fraud detection and cybersecurity, where the ability to combine NOW information with historic patterns to detect anomalies and head off incidents can be of critical importance.

Announcing: Live Analytics for GemFire applications

SnappyData now offers a highly optimized connector for GemFire/Geode that offers the following capabilities:

  • Expose GemFire regions as SnappyData tables/DataFrames
  • SQL joins between SnappyData tables and GemFire regions exposed as external tables in SnappyData
  • Ability to save DataFrames (both static and streaming) to GemFire as regions
  • Ability to push SQL queries down to GemFire as OQL for efficient execution
  • Ability to join data across multiple GemFire grids as if they were part of one virtualized data grid
  • Ability to join GemFire data sitting in SnappyData with reference data sitting in other data stores like Postgres or Cassandra
  • Ability to support PDX instance types in SnappyData
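A usage sketch of the connector follows; note that the provider string and option key shown here are assumptions for illustration, and the connector's documentation has the exact spellings:

    // Expose a GemFire region as an external table; the provider string and
    // option key below are assumptions, not the connector's documented names
    snappy.createExternalTable(
      "gemProfiles",                     // table name in SnappyData's catalog
      "gemfire",                         // assumed data source name
      Map("regionPath" -> "/profiles"))  // assumed option pointing at the region

    // Join grid data with data already in SnappyData (columns are hypothetical)
    snappy.sql(
      """SELECT p.segment, count(*) AS cnt
        |FROM gemProfiles p JOIN reservations r ON p.customerId = r.id
        |GROUP BY p.segment""".stripMargin).show()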

Combined with a streaming receiver that lets users stream data from GemFire into SnappyData, where it is stored in columnar format optimized for analytic processing, the GemFire connector offers an easy and efficient way to do live analytics on changing data in GemFire. More details on the SnappyData GemFire connector can be found here. This is a first-of-its-kind analytics solution for live transactional data stored in GemFire, with connectivity to relevant data held in other repositories across the enterprise.

If you are looking to get live analytics on data stored in NoSQL data stores or IMDGs such as GemFire, reach out to us today. The connectors used in this example are closed source and require a product license for production deployment; contact us to set up an evaluation. You can download the latest version of the product here.
