This blog is a repost of this Hudi blog on medium.

Apache Hudi brings stream processing to big data, providing fresh data while being an order of magnitude efficient over traditional batch processing. In a data lake/warehouse, one of the key trade-offs is between ingestion speed and query performance. Data ingestion typically prefers small files to improve parallelism and make data available to queries as soon as possible. However, query performance degrades poorly with a lot of small files. Also, during ingestion, data is typically co-located based on arrival time. However, the query engines perform better when the data frequently…

This blog is a repost of this Hudi blog on medium.

Apache Hudi employs an index to locate the file group, that an update/delete belongs to. For Copy-On-Write tables, this enables fast upsert/delete operations, by avoiding the need to join against the entire dataset to determine which files to rewrite. For Merge-On-Read tables, this design allows Hudi to bound the amount of records any given base file needs to be merged against. Specifically, a given base file needs to merged only against updates for records that are part of that base file. In contrast, designs without an indexing component (e.g…

Reposted translation of the original article :

1 Introduction

Apache Zeppelin is a web-based notebook that provides interactive data analysis. It is convenient for you to make beautiful documents that can be data-driven, interactive, and collaborative, and supports multiple languages, including Scala (using Apache Spark), Python (Apache Spark), SparkSQL, Hive, Markdown, Shell, and so on. Hive and SparkSQL currently support querying Hudi’s read-optimized view and real-time view. So in theory, Zeppelin’s notebook should also have such query capabilities.

2. Achieve the effect

2.1 Hive

2.1.1 Read optimized view

Image for post
Image for post

2.1.2 Real-time view

Often times, data engineers build data pipelines to extract data from external sources, transform them and enable other parts of the organization to query the resulting datasets. While it’s easier in the short term to just build all of this as a single stage pipeline, a more thoughtful data architecture is needed to scale this model to thousands of datasets spanning multiple tera/peta bytes.

Common Pitfalls

Let’s understand some common pitfalls with the single-stage approach. First of all, it limits scalability since the input data to such a pipeline is obtained by scanning upstream databases (RDBMS-es or NoSQL stores) which would ultimately stress these systems and even result in outages. Further, accessing such data directly allows for very little standardization across pipelines (e.g: standard timestamp, key fields) as well increase risk of data brekages due to lack of schemas/data contracts. Finally, not all data or columns are available in a single place, to freely cross-correlate them for insights or design machine learning models.

Data Lakes

In recent years…

[Reposted from my blogger]

If you are like me, who loves to have everything you are developing against working locally in a mini-integration environment, read on

Here, we attempt to get some pretty heavy-weight stuff working locally on your mac, namely

  1. Hadoop (Hadoop2/HDFS)
  2. YARN (So you can submit MR jobs)
  3. Spark (We will illustrate with Spark Shell, but should work on YARN mode as well)
  4. Hive (So we can create some tables and play with it)

We will use the latest stable Cloudera distribution, and work off the jars. …

[Reposted from my blogger]

One more simple thing, that had relatively scarce documentation out on the Internet.

As you might know, Hadoop NameNodes finally became HA in 2.0. The HDFS client configuration, which is already a little bit tedious, became more complicated.

Traditionally, there were two ways to configure a HDFS client (lets stick to Java)

  1. Copy over the entire Hadoop config directory with all the xml files, place it somewhere in the classpath of your app or construct a Hadoop Configuration object by manually adding in those files.
  2. Simply provide the HDFS NameNode URI and let the client do…

[Reposted from my blogger]

This is a brief post on something that is rather very important. Your company probably handed you a macbook or laptop and have a Linux VM hosted somewhere, that you will do all your development on. And now the circus begins.

You like to stay on your laptop since you get all the nice IDEs and Code diffing tools and what not. But, your code only runs on the VM, rightfully so in the highly SOA (Service Oriented Architecture, basically meaning everything is REST and has nagios alerts).

So, here’s how to get the best of…

[Reposted from blogger]

This weekend, I set out to explore something that has always been a daemon running at the back of my head. What would it mean to add Spatial Indexing support to Voldemort, given that Voldemort supports a pluggable storage layer.. Would it fit well with the existing Voldemort server architecture? Or would it create a frankenstein freak show where two systems essentially exist side by side under one codebase… Let’s explore..

Basic Idea

The 50000 ft blueprint goes like this.

  • Implement a new Storage Engine on top Postgres sql (Sorry innoDB, you don’t have true spatial indexes yet and Postgres is kick…

Traditionally, in high performance systems, repeatedly allocating and deallocating memory has been found to be costly. (i.e a malloc vs free cycle). Hence, people resorted to building their own memory pool on top of the OS, dealing with fragmentation/free list maintenance etc. One of the popular techniques to doing this being slab allocators.

This post is about doing a reality check about the cost of explicitly doing an alloc() and free() cycle, given that most popular OS es, specifically Linux gotten better at memory allocation recently. …

[Reposted from Blogger]

I am sharing my experience with benchmarking the Voldemort Server- effectively a Java key-value storage server application on Linux (referred as ‘server’ from here on). The application handles put(k,v) and get(k), like a standard HashMap, only difference being these are calls over the network, and the entries need to be persistent.What the post talks about is generic and could apply to most java server applications.

The goal here was to run a workload against the server, in a manner that it comes off disk, so we exercise the worst case path. …

Vinoth Chandar

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store