Reposted translation of the original article: https://mp.weixin.qq.com/s/_mNwL5uXSDYyqtLDPx0iDA

1. Introduction

Apache Zeppelin is a web-based notebook that provides interactive data analysis. It makes it easy to create beautiful documents that are data-driven, interactive, and collaborative, and it supports multiple languages, including Scala (using Apache Spark), Python (using Apache Spark), SparkSQL, Hive, Markdown, Shell, and so on. Hive and SparkSQL currently support querying Hudi's read-optimized view and real-time view, so in theory Zeppelin's notebooks should have the same query capabilities.
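Outside the notebook, the same two views can be queried through HiveServer2 over JDBC, which is essentially what Zeppelin's Hive interpreter does under the hood. Below is a minimal sketch, assuming a hypothetical Hudi merge-on-read table registered in Hive as hudi_trips_ro (read-optimized) and hudi_trips_rt (real-time); the host, credentials, and table suffix conventions are placeholders that vary by deployment and Hudi version:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HudiViewQuery {
        public static void main(String[] args) throws Exception {
            // Hive JDBC driver; requires the hive-jdbc jar on the classpath.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hiveserver2-host:10000/default", "hive", "");
                 Statement stmt = conn.createStatement()) {
                // Read-optimized view: scans only compacted base files,
                // so it is faster but may lag behind the latest writes.
                try (ResultSet rs = stmt.executeQuery(
                        "SELECT COUNT(*) FROM hudi_trips_ro")) {
                    while (rs.next()) System.out.println("RO count: " + rs.getLong(1));
                }
                // Real-time view: merges base files with delta log files
                // at query time for the freshest results.
                try (ResultSet rs = stmt.executeQuery(
                        "SELECT COUNT(*) FROM hudi_trips_rt")) {
                    while (rs.next()) System.out.println("RT count: " + rs.getLong(1));
                }
            }
        }
    }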

2. Achieved effect

2.1 Hive

2.1.1 Read-optimized view

[Screenshot: querying the read-optimized view through Hive in Zeppelin]

2.1.2 Real-time view

[Screenshot: querying the real-time view through Hive in Zeppelin]

2.2 Spark SQL

2.2.1 Read-optimized view

[Screenshot: querying the read-optimized view through Spark SQL in Zeppelin]

2.2.2 Real-time view

[Screenshot: querying the real-time view through Spark SQL in Zeppelin]

3. Common problems

3.1 Hudi package adaptation

Zeppelin loads the packages under its lib directory by default at startup. External dependencies such as Hudi are best placed directly under zeppelin/lib, so that Hive or Spark SQL on the cluster does not fail to find the corresponding Hudi dependency. …


Often, data engineers build pipelines to extract data from external sources, transform it, and enable other parts of the organization to query the resulting datasets. While it's easier in the short term to build all of this as a single-stage pipeline, a more thoughtful data architecture is needed to scale this model to thousands of datasets spanning multiple terabytes or petabytes.

Common Pitfalls

Let's understand some common pitfalls of the single-stage approach. First, it limits scalability, since the input data to such a pipeline is obtained by scanning upstream databases (RDBMSes or NoSQL stores), which ultimately stresses these systems and can even cause outages. Further, accessing such data directly allows for very little standardization across pipelines (e.g., standard timestamp or key fields) and increases the risk of data breakages due to the lack of schemas/data contracts. Finally, not all data or columns are available in a single place where one can freely cross-correlate them for insights or build machine learning models.

Data Lakes

In recent years, the Data Lake architecture has grown in popularity. In this model, source data is first extracted with little to no transformation into a first set of raw datasets. The goal of these raw datasets is to faithfully model an upstream source system and its data, but in a way that scales for OLAP workloads (e.g., using columnar file formats). All data pipelines that express business-specific transformations are then executed on top of these raw datasets instead. …
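To make the two-stage idea concrete, here is a minimal sketch in Java using Spark: a business transformation that reads from a raw columnar dataset rather than scanning the upstream database. The paths and column names (orders, order_date, amount) are illustrative placeholders of mine, not from the original article:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class DerivedTableJob {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("derived-table-job").getOrCreate();

            // Stage 1 output: raw dataset extracted from the source system,
            // stored in a columnar format suited to OLAP scans.
            Dataset<Row> rawOrders = spark.read().parquet("hdfs:///data/raw/orders");

            // Stage 2: business-specific transformation on top of the raw
            // dataset, never touching the upstream database directly.
            Dataset<Row> dailyRevenue = rawOrders
                    .groupBy("order_date")
                    .sum("amount")
                    .withColumnRenamed("sum(amount)", "revenue");

            dailyRevenue.write().mode("overwrite")
                    .parquet("hdfs:///data/derived/daily_revenue");
            spark.stop();
        }
    }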


[Reposted from my Blogger]

If you are like me and love to have everything you develop against working locally in a mini-integration environment, read on.

Here, we attempt to get some pretty heavyweight stuff working locally on your Mac, namely:

  1. Hadoop (Hadoop2/HDFS)
  2. YARN (So you can submit MR jobs)
  3. Spark (We will illustrate with the Spark Shell, but it should work in YARN mode as well)
  4. Hive (So we can create some tables and play with it)

We will use the latest stable Cloudera distribution and work off the jars. …


[Reposted from my Blogger]

One more simple thing that has relatively scarce documentation out on the Internet.

As you might know, Hadoop NameNodes finally became highly available (HA) in Hadoop 2.0. The HDFS client configuration, which was already a little tedious, became even more complicated.

Traditionally, there were two ways to configure an HDFS client (let's stick to Java):

  1. Copy over the entire Hadoop config directory with all its XML files and place it somewhere on your app's classpath, or construct a Hadoop Configuration object by manually adding those files.
  2. Simply provide the HDFS NameNode URI and let the client do the rest (a hedged sketch of this approach, extended for HA, follows below). …
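With HA, option 2 alone no longer works, because the active NameNode can change; the client instead needs a logical nameservice plus a failover proxy provider. A minimal sketch, assuming a nameservice called mycluster and placeholder host names (everything here except the standard Hadoop configuration keys is made up for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HaHdfsClient {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Point the client at the logical nameservice, not a single host.
            conf.set("fs.defaultFS", "hdfs://mycluster");
            conf.set("dfs.nameservices", "mycluster");
            conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
            conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
            conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");
            // Proxy provider that discovers which NameNode is currently active.
            conf.set("dfs.client.failover.proxy.provider.mycluster",
                    "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

            try (FileSystem fs = FileSystem.get(conf)) {
                System.out.println("Root exists: " + fs.exists(new Path("/")));
            }
        }
    }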

[Reposted from my Blogger]

This is a brief post on something that is rather important. Your company probably handed you a MacBook or laptop and has a Linux VM hosted somewhere, on which you will do all your development. And now the circus begins.

You like to stay on your laptop, since you get all the nice IDEs and code-diffing tools and whatnot. But your code only runs on the VM, and rightfully so in a highly service-oriented architecture (SOA, basically meaning everything is REST and has Nagios alerts).

So, here’s how to get the best of both worlds. …


[Reposted from Blogger]

This weekend, I set out to explore something that has always been a daemon running in the back of my head: what would it mean to add spatial indexing support to Voldemort, given that Voldemort supports a pluggable storage layer? Would it fit well with the existing Voldemort server architecture? Or would it create a Frankenstein freak show where two systems essentially exist side by side under one codebase? Let's explore.

Basic Idea

The 50,000 ft blueprint goes like this:

  • Implement a new storage engine on top of PostgreSQL (sorry, InnoDB, you don't have true spatial indexes yet, and Postgres is kick…
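Though the excerpt is truncated here, the Postgres side of the idea can be sketched. Here is a minimal illustration assuming PostGIS is installed; the table layout, names, and the notion of storing Voldemort's key/value byte arrays alongside a geometry column are my own placeholders, not the original design:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class SpatialStoreSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/voldemort", "postgres", "")) {
                try (Statement stmt = conn.createStatement()) {
                    // Key/value columns as byte arrays, plus a point geometry.
                    stmt.execute("CREATE TABLE IF NOT EXISTS kv_spatial ("
                            + "k BYTEA PRIMARY KEY, v BYTEA, loc GEOMETRY(Point, 4326))");
                    // A GiST spatial index: the true spatial index InnoDB lacked.
                    stmt.execute("CREATE INDEX IF NOT EXISTS kv_loc_idx "
                            + "ON kv_spatial USING GIST (loc)");
                }
                // Example spatial lookup: keys within ~1 degree of a point.
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT k FROM kv_spatial WHERE ST_DWithin(loc, "
                        + "ST_SetSRID(ST_MakePoint(?, ?), 4326), 1.0)")) {
                    ps.setDouble(1, -122.4);
                    ps.setDouble(2, 37.8);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) System.out.println(new String(rs.getBytes(1)));
                    }
                }
            }
        }
    }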

Traditionally, in high-performance systems, repeatedly allocating and deallocating memory has been found to be costly (i.e., a malloc/free cycle). Hence, people resorted to building their own memory pools on top of the OS, dealing with fragmentation, free-list maintenance, and so on. One of the popular techniques for doing this is the slab allocator.

This post is a reality check on the cost of an explicit alloc() and free() cycle, given that the most popular OSes, specifically Linux, have gotten better at memory allocation recently. …


[Reposted from Blogger]

I am sharing my experience with benchmarking the Voldemort server, effectively a Java key-value storage server application on Linux (referred to as 'server' from here on). The application handles put(k,v) and get(k) like a standard HashMap, the only difference being that these are calls over the network and the entries need to be persistent. What this post talks about is generic and could apply to most Java server applications.

The goal here was to run a workload against the server in a manner that forces it to come off disk, so we exercise the worst-case path. …
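A minimal sketch of the workload shape just described: random keys drawn from a keyspace much larger than RAM, so most reads miss the page cache and hit disk. The KeyValueClient interface, the 90/10 read/write mix, and the value size are all my own illustrative assumptions, not details from the original benchmark:

    import java.util.concurrent.ThreadLocalRandom;

    public class DiskBoundWorkload {
        // Placeholder for the actual store client (e.g., a Voldemort client stub).
        interface KeyValueClient {
            byte[] get(byte[] key);
            void put(byte[] key, byte[] value);
        }

        static void run(KeyValueClient client, long keySpace, int ops) {
            byte[] value = new byte[1024]; // 1 KB values, illustrative only
            long start = System.nanoTime();
            for (int i = 0; i < ops; i++) {
                // Uniform random keys over a keyspace much larger than RAM,
                // so the server cannot serve most gets from memory.
                long k = ThreadLocalRandom.current().nextLong(keySpace);
                byte[] key = Long.toString(k).getBytes();
                if (i % 10 == 0) {
                    client.put(key, value); // ~10% writes
                } else {
                    client.get(key);        // ~90% reads
                }
            }
            double secs = (System.nanoTime() - start) / 1e9;
            System.out.printf("%d ops in %.1f s (%.0f ops/s)%n", ops, secs, ops / secs);
        }
    }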

Vinoth Chandar
