Lenses on HDInsight

Andrew Stevenson
Published in lenses.io
6 min read · Jun 4, 2019

Building a successful data platform involves not only selecting and deploying the correct infrastructure but also providing access to the platform and visibility into the data flowing through it. Apache Kafka is an excellent choice as a middleware layer, but adding true business value requires opening up access to the data so that every persona in your organization can help drive a data-driven culture. A lack of visibility, governance, monitoring and skills makes this difficult.

Lenses, a DataOps overlay platform, provides the key pillars required for successful DataOps on streaming data: visibility, accessibility and monitoring for all.

With the power of SQL, Lenses brings data browsing, continuous queries and easily deployed SQL processors, allowing for rapid development of real-time applications. Governed by role-based security and backed by Data Policies, Lenses also helps guard sensitive data and helps you move to production faster.

Apache Kafka, although a key component, is not the only piece in your architecture. Typically you will have a range of data stores or applications that act as data sources or sinks. You also need secret management and, most certainly, a platform such as Kubernetes to host your applications and scale them out to meet your processing needs.

Azure HDInsight is the perfect platform for this. Let’s look at what it includes:

  • Best in class security with BYOK (bring your own key) encryption, custom Virtual Networks and topic level security with Apache Ranger
  • Managed Disk integration, enabling much higher scale and low TCO
  • Simplified monitoring with Azure Log Analytics
  • Broad application support on Azure Marketplace

From a DataOps perspective, this is awesome: managed Kafka, plus integration with other HDInsight offerings that can be combined into a complete data platform.

Azure also offers a range of other managed services needed in a data platform, such as SQL Server, PostgreSQL, Redis, Azure IoT Hub and Azure Event Hubs. The list goes on, and it includes Azure Kubernetes Service (AKS) for application scaling and Azure Key Vault so secrets can be stored securely.

These are managed, reliable services that allow organizations to concentrate on building data-driven applications and getting into production faster.

Lenses on HDInsight

Lenses, available in the Azure HDInsight marketplace, allows users to move data in and out of their HDInsight cluster, view and inspect data, and deploy SQL Processors to join, aggregate and transform data in real time.

Lenses on HDInsight Marketplace

Let’s have a look at the data visibility aspect. A common complaint about stream processing is the lack of visibility. Apache Kafka is no exception. Data is streaming through Kafka but inspecting data, debugging and checking for messages is hard. This is difficult for your developers but even harder for anyone who wants to be data-driven. DataOps encourages everyone to be data-driven.

Lenses opens up SQL to streaming data. With the Lenses SQL engine, you can browse data as you would in a traditional RDBMS and also run continuous queries to effectively tail the stream via SQL. Support is included for projections, filtering, joins, aggregations and functions, including user-defined functions.

Lenses browsing SQL

Governance and Security

Data governance and security are paramount. As such, Lenses offers role-based security, Active Directory and LDAP integration, and topic blacklisting and whitelisting. This allows for multi-tenancy on top of Kafka. For example, we can prevent all users on a US trading desk from seeing the data of their European colleagues, or even from knowing that those topics exist.
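To make the whitelist and blacklist idea concrete, here is a minimal sketch of how topic visibility could be decided for a group of users. It is an illustration only and not how Lenses implements or configures it: the `TopicAccess` class, the regular-expression patterns and the topic names are all hypothetical.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class TopicAccess:
    """Hypothetical per-group rules; not a Lenses configuration object."""

    # Patterns a topic must match to be visible (default: everything).
    whitelist: list[str] = field(default_factory=lambda: [".*"])
    # Patterns that hide a topic even if it is whitelisted.
    blacklist: list[str] = field(default_factory=list)

    def can_see(self, topic: str) -> bool:
        allowed = any(re.fullmatch(p, topic) for p in self.whitelist)
        denied = any(re.fullmatch(p, topic) for p in self.blacklist)
        return allowed and not denied


def visible_topics(topics: list[str], access: TopicAccess) -> list[str]:
    """Return only the topics this group is allowed to know about."""
    return [t for t in topics if access.can_see(t)]


# Hypothetical topic names for a US desk that must not see European data.
topics = ["trades-us", "trades-eu", "fx-rates"]
us_desk = TopicAccess(blacklist=[r".*-eu"])
print(visible_topics(topics, us_desk))  # ['trades-us', 'fx-rates']
```

Because hidden topics are filtered out of the listing itself, a user in this sketch cannot tell that `trades-eu` exists at all, which is the multi-tenancy property described above.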

Lenses goes a step further. We can also redact data at the presentation layer, for example to hide sensitive data contained within the message payload from operations staff. We do this via Data Policies. Data Policies allow data stewards to define a set of field aliases and apply redaction to any field in a topic that matches one of those aliases. Lenses will also identify which applications are using this data to help you track the lineage.

Data Policies
Data redaction
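As a rough illustration of alias-based redaction at the presentation layer, the sketch below masks any field whose name matches one of a set of aliases, at any depth of a decoded message. It is a simplified model under stated assumptions, not the Lenses Data Policies implementation: the alias set, the mask string and the sample message are all hypothetical.

```python
from __future__ import annotations

from typing import Any

# Hypothetical aliases a data steward might register for sensitive fields.
POLICY_ALIASES = frozenset({"card_number", "cardnumber", "ssn"})


def redact(record: Any, aliases: frozenset[str] = POLICY_ALIASES, mask: str = "****") -> Any:
    """Return a copy of `record` with every alias-matching field masked.

    Field names are compared case-insensitively; nested objects and
    arrays are walked recursively. The stored record is never modified,
    only the copy that would be shown to the user.
    """
    if isinstance(record, dict):
        return {
            key: mask if key.lower() in aliases else redact(value, aliases, mask)
            for key, value in record.items()
        }
    if isinstance(record, list):
        return [redact(item, aliases, mask) for item in record]
    return record


message = {
    "customer": {"name": "Ada", "SSN": "123-45-6789"},
    "card_number": "4111111111111111",
    "amount": 42.5,
}
print(redact(message))
# {'customer': {'name': 'Ada', 'SSN': '****'}, 'card_number': '****', 'amount': 42.5}
```

Redacting a copy rather than the stored payload mirrors the point made above: the data in Kafka is untouched, and only what is presented changes.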

What about data pipelines and integration?

Data pipelines form the backbone of any data platform: ingesting data, performing transformations and sinking the data to a datastore.

Getting data in and out of the cluster

The Stream Reactor is a collection of over 25 Kafka Connect connectors that include a SQL layer to simplify configuration. Lenses can deploy and manage connectors in multiple clusters, and also supports custom connectors. Connectors are the first step to building a code-free DataOps platform, providing an excellent, fault-tolerant way to achieve solid data integration. See how Connected Homes used our Elastic connector to sink billions of IoT events a day.

Sample Stream reactor connectors

Data transformation

SQL Processors can be deployed and monitored to perform real-time transforms and analytics, supporting all the features you would expect in SQL like joins and aggregations.

Lenses SQL Processor

SQL Processors can run in one of three modes:

  • In Process, inside Lenses for testing only
  • Apache Kafka Connect
  • Kubernetes

If you are on Azure, we recommend Kubernetes and the AKS managed service. Simply grab the kubeconfig from your AKS cluster and tell Lenses you want to run in Kubernetes mode. You can then deploy SQL processors to your AKS cluster, scale them up or down and view logs and metrics and achieve fault tolerance. The Kafka Connect mode offers the same benefits if you don’t have Kubernetes.

Kubernetes runners
Kubernetes pod logs

Application Landscapes

I’ve mentioned a couple of times in this blog that the application landscape is the interesting part for DataOps. After all, it’s your data, and what you do with it will determine the success of your data-driven organization.

Deploying connectors, SQL processors and custom applications can all be done in isolation, decoupled by the middleware layer, such as Kafka, but together they form pipelines. These pipelines form application landscapes that describe the running processes, the lineage, of your data platform.

Lenses brings this landscape together in the form of a topology. Running applications are added dynamically and recovered at startup, along with the topics they read from and write to. Your landscapes are shown in real time with metrics, with the ability to drill into each node.

Application Landscapes
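The idea of a topology built from decoupled applications can be sketched as a small directed graph in which applications and topics are nodes, and an edge runs from a topic to each application that reads it and from an application to each topic it writes. The code below is only an illustration of that idea, not Lenses internals; the `App` type and every application and topic name are hypothetical.

```python
from __future__ import annotations

from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class App:
    """Hypothetical description of a running application."""

    name: str
    kind: str  # e.g. "connector", "sql-processor", "custom"
    reads: tuple[str, ...] = ()
    writes: tuple[str, ...] = ()


def build_topology(apps: list[App]) -> dict[str, set[str]]:
    """Build adjacency sets: topic -> reading apps, app -> written topics."""
    edges: dict[str, set[str]] = defaultdict(set)
    for app in apps:
        for topic in app.reads:
            edges[f"topic:{topic}"].add(f"app:{app.name}")
        for topic in app.writes:
            edges[f"app:{app.name}"].add(f"topic:{topic}")
    return edges


def downstream(edges: dict[str, set[str]], start: str) -> set[str]:
    """Breadth-first walk: everything that depends on `start` (its lineage)."""
    seen: set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in sorted(edges.get(node, ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


apps = [
    App("iot-source", "connector", writes=("sensor-raw",)),
    App("avg-temp", "sql-processor", reads=("sensor-raw",), writes=("sensor-avg",)),
    App("elastic-sink", "connector", reads=("sensor-avg",)),
]
topology = build_topology(apps)
print(sorted(downstream(topology, "topic:sensor-raw")))
# ['app:avg-temp', 'app:elastic-sink', 'topic:sensor-avg']
```

Each application is declared in isolation, yet the pipeline from source connector to sink emerges once the shared topics are linked up, which is what makes lineage questions answerable.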

You can also practice GitOps by exporting and importing topologies. Keep your desired state in Git and have Lenses realize it in your data platform built on Azure HDInsight. For example, use the Lenses CLI to export your landscape:

lenses-cli export acls --dir my-dir
lenses-cli export alert-settings --dir my-dir
lenses-cli export connectors --dir my-dir
lenses-cli export processors --dir my-dir
lenses-cli export quota --dir my-dir
lenses-cli export schemas --dir my-dir
lenses-cli export topics --dir my-dir
lenses-cli export policies --dir my-dir

An exported landscape is a directory tree of YAML files, for example:

my-landscape
├── alert-settings
│   └── alert-setting.yaml
├── apps
│   ├── connectors
│   │   ├── connector-1.yaml
│   │   └── connector-2.yaml
│   └── sql
├── kafka
│   ├── quotas
│   │   └── quotas.yaml
│   └── topics
│       ├── topic-1.yaml
│       └── topic-2.yaml
├── policies
│   └── policies-city.yaml
└── schemas
    ├── schema-1.yaml
    └── schema-2.yaml
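Once an exported landscape is committed to Git, a small script can summarize what it contains, for instance as a review step before importing it elsewhere. The sketch below assumes a layout like the example tree, with YAML files grouped in subdirectories. The `inventory` helper is hypothetical and not part of lenses-cli, and the demo builds a throwaway directory with placeholder files so that it runs on its own.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def inventory(root: Path) -> dict[str, list[str]]:
    """Group every *.yaml file under `root` by its directory, relative to `root`."""
    result: dict[str, list[str]] = {}
    for path in sorted(root.rglob("*.yaml")):
        section = path.parent.relative_to(root).as_posix()
        result.setdefault(section, []).append(path.name)
    return result


# Demo on a temporary copy of part of the example layout (placeholder content).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "my-landscape"
    for rel in (
        "apps/connectors/connector-1.yaml",
        "kafka/topics/topic-1.yaml",
        "kafka/topics/topic-2.yaml",
    ):
        file = root / rel
        file.parent.mkdir(parents=True, exist_ok=True)
        file.write_text("# placeholder\n")
    summary = inventory(root)

print(summary)
# {'apps/connectors': ['connector-1.yaml'], 'kafka/topics': ['topic-1.yaml', 'topic-2.yaml']}
```

Pointing `inventory` at a real export directory would give a quick per-section list of the resources a commit declares.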

Summary

Azure offers a fantastic array of services to build data platforms. HDInsight offers enterprise-grade managed Kafka to reduce your operational burden. Lenses provides the DataOps overlay to empower everybody using your data platform to get into production faster.
