Hadoop Vectored IO: your data just got faster!

Published in

Engineering@Cloudera

6 min read · Sep 21, 2023

Over the years, the field of big data has undergone significant transformations, with data volumes increasing from terabytes to hundreds of petabytes, and a shift from clusters with data stored in attached disks to remote cloud storage. However, the original Apache Hadoop POSIX-based file APIs have remained largely unchanged.

While these APIs have served their purpose well, there is room for improvement when it comes to working with remote object stores. The Hadoop Vectored IO API is specifically designed to better suit object stores, particularly for columnar data libraries like ORC and Parquet. By migrating a few libraries to these APIs, significant performance improvements can be achieved across all big data applications.

The S3A connector is the first object store connector to offer a customized implementation, enabling parallel and asynchronous reading of different data blocks. In benchmarks conducted with Apache Hive using a modified ORC library, a 2x speedup was observed compared to using the classic S3A connector through the POSIX APIs.

These advancements in leveraging the Vectored IO API and tailored object store connectors, such as S3A, demonstrate the potential for substantial performance gains in big data applications, particularly when working with remote object stores like S3.

Problem: Reading large data from cloud storage is slow and costly.

Slow queries in big data applications such as Spark, Hive, and Impala that use columnar file formats (ORC, Parquet) in cloud storage (S3, ABFS, GCS) are caused by the many seek() operations issued as they read data from different column stripes in the files. These operations require multiple HTTP GET requests, resulting in slow performance, longer cluster durations, or even the need for additional VMs, with the subsequent increase in cloud computation bills.

The POSIX API is also suboptimal for SSD storage, where DMA block memory reads can deliver high-performance block reads through OS API calls such as Linux’s readv(). Our API was designed to support this use case as well as high-latency cloud storage.

Solution: Migrate your workloads to utilize Vectored IO-enabled ORC and Parquet.

Although the seek()/read() operation sequences may seem unpredictable, the ORC and Parquet formats have a good understanding of the data ranges they need to read. The introduction of the Vectored IO API allows these file formats to specify a set of data ranges to fetch in a single operation, rather than issuing individual read calls for each range. This API is asynchronous, enabling the libraries to perform other tasks while waiting for the data. Furthermore, different implementations of this API can support additional optimizations such as merging close-by data ranges and fetching them in parallel from remote cloud storage. These improvements contribute to faster and more efficient data retrieval from cloud storage.

API Specification:

A new method has been added to the long-standing PositionedReadable interface. This interface is broadly available, and by giving the new method a default implementation we automatically add it to every existing input stream that implements the interface.

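public interface PositionedReadable {
  // ... existing positioned read methods ...

  default void readVectored(
      List<? extends FileRange> ranges,
      IntFunction<ByteBuffer> allocator)
      throws IOException {
    // default body elided here: reads each range in sequence
  }
}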

public interface FileRange {
  /**
   * Get the starting offset of the range.
   * @return the byte offset of the start
   */
  long getOffset();

  /**
   * Get the length of the range.
   * @return the number of bytes in the range.
   */
  int getLength();

  /**
   * Get the future data for this range.
   * @return the future for the {@link ByteBuffer} that contains the data
   */
  CompletableFuture<ByteBuffer> getData();

  /**
   * Set a future for this range's data.
   * This method is called by {@link PositionedReadable#readVectored} to store the
   * data for the user to pick up later via {@link #getData}.
   * @param data the future of the ByteBuffer that will have the data
   */
  void setData(CompletableFuture<ByteBuffer> data);

  /**
   * Get any reference passed in to the file range constructor.
   * This is not used by any implementation code; it is to help
   * bind this API to libraries retrieving multiple stripes of
   * data in parallel.
   * @return a reference or null.
   */
  Object getReference();
}

A FileRange is a simple object with the file offset and length of the data to read, a CompletableFuture<ByteBuffer> that will hold the data, and an optional Object reference to assist library use.

The new readVectored() method takes a list of FileRanges and a function to allocate byte buffers. It sets the data future of each range; each future completes with an allocated buffer holding the file data for that range.

Range reads can be asynchronous and out of order: it is up to the FileSystem implementations to optimize.
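To make the future-based contract concrete, here is a minimal, self-contained sketch of the calling pattern. It does not use the Hadoop classes: SketchRange and VectoredReadSketch are hypothetical stand-ins, and the "file" is just an in-memory byte array read by one asynchronous task per range. Against a real Hadoop 3.3.5+ input stream, the equivalent steps would be to call readVectored(ranges, allocator) and then pick up each result through the range's getData() future.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.IntFunction;

/** Simplified stand-in for a file range; NOT the Hadoop FileRange class. */
final class SketchRange {
    final long offset;
    final int length;
    CompletableFuture<ByteBuffer> data;   // set by readVectored()

    SketchRange(long offset, int length) {
        this.offset = offset;
        this.length = length;
    }
}

final class VectoredReadSketch {
    /** Hypothetical reader over an in-memory "file": one async task per range. */
    static void readVectored(byte[] file, List<SketchRange> ranges,
                             IntFunction<ByteBuffer> allocator) {
        for (SketchRange r : ranges) {
            r.data = CompletableFuture.supplyAsync(() -> {
                ByteBuffer buf = allocator.apply(r.length);
                buf.put(file, (int) r.offset, r.length);
                buf.flip();   // make the buffer readable from position 0
                return buf;
            });
        }
    }

    public static void main(String[] args) {
        byte[] file = "0123456789abcdefghij".getBytes(StandardCharsets.US_ASCII);
        List<SketchRange> ranges =
            Arrays.asList(new SketchRange(2, 3), new SketchRange(10, 4));

        // Returns immediately; the caller can do other work before joining.
        readVectored(file, ranges, ByteBuffer::allocate);

        for (SketchRange r : ranges) {
            ByteBuffer buf = r.data.join();   // wait for this range only
            byte[] out = new byte[buf.remaining()];
            buf.get(out);
            System.out.println(r.offset + "+" + r.length + " -> "
                + new String(out, StandardCharsets.US_ASCII));
        }
    }
}
```

Because each range carries its own future, a caller can start decoding the first stripe that arrives instead of waiting for the slowest one.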

In Hadoop 3.3.5, there are three implementations:

The default implementation, which uses the existing PositionedReadable.readFully() method, fetches each FileRange’s data in sequence. This has exactly the same performance as the application making the seek and read calls itself.
The Local Filesystem implementation uses the Java NIO parallel scatter/gather API for maximum performance reading local data, using the readv() API or equivalent underneath.
The S3A implementation, which coalesces nearby ranges and fetches the data in parallel.
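The coalescing step can be illustrated with a short sketch. This is not the S3A connector's code: RangeCoalescer, its Span type, and the minSeek and maxMerged parameters are hypothetical names chosen for illustration. The idea is to sort the requested ranges by offset and merge neighbours whose gap is small enough that reading the extra bytes is cheaper than issuing another request, while capping the size of any merged read.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RangeCoalescer {
    /** A merged read covering bytes [start, end) of the file. */
    static final class Span {
        final long start;
        long end;
        Span(long start, long end) { this.start = start; this.end = end; }
        @Override public String toString() { return "[" + start + ", " + end + ")"; }
    }

    /**
     * Merge ranges given as {offset, length} pairs.
     * minSeek: largest gap (bytes) worth reading through rather than seeking past.
     * maxMerged: upper bound (bytes) on the size of one merged read.
     * Both are hypothetical tuning parameters for this sketch.
     */
    static List<Span> coalesce(List<long[]> ranges, long minSeek, long maxMerged) {
        List<long[]> sorted = new ArrayList<>(ranges);
        sorted.sort((a, b) -> Long.compare(a[0], b[0]));   // order by offset

        List<Span> out = new ArrayList<>();
        Span current = null;
        for (long[] r : sorted) {
            long start = r[0];
            long end = r[0] + r[1];
            boolean closeEnough = current != null && start - current.end <= minSeek;
            if (closeEnough && Math.max(end, current.end) - current.start <= maxMerged) {
                current.end = Math.max(current.end, end);   // extend the open span
            } else {
                current = new Span(start, end);             // start a new span
                out.add(current);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<long[]> ranges = Arrays.asList(
            new long[] {5000, 100}, new long[] {0, 100},
            new long[] {150, 100}, new long[] {5200, 50});
        // Four requested ranges collapse into two reads.
        System.out.println(coalesce(ranges, 200, 1000));
        // prints [[0, 250), [5000, 5250)]
    }
}
```

Each resulting span can then be fetched with a single request, in parallel with the others, and sliced back into the buffers for the original ranges.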

S3A Architecture: parallelized GETs of coalesced ranges

Performance benchmarks:

Based on our benchmarks, the performance improvements achieved are as follows:

TPC-H Hive queries using ORC data stored in S3 exhibited a 10–20% reduction in execution time. The data size used was 300 GB.

TPC-DS Hive queries with ORC data stored in S3 demonstrated a 20–40% reduction in execution time. The data size used was 300 GB.

A synthetic ETL benchmark displayed a significant improvement, with execution times reduced by 50–70%. The data size used was 250 GB with strings. Long-running queries show the largest improvements because vectored IO internally runs many reads in parallel.

A Java microbenchmark for local storage demonstrated a 7x increase in speed.

These results highlight the notable performance enhancements achieved by leveraging the Vectored IO-enabled ORC and Parquet formats. By optimizing data retrieval and leveraging asynchronous processing, these benchmarks demonstrate significant gains in query execution efficiency and overall workload performance.

For more details, please refer to the conference slides: https://www.slideshare.net/MukundThakur22/acna2022-hadoop-vectored-io-your-data-just-got-fasterpdf

API availability in releases:

The readVectored() API is available in Hadoop 3.3.5. To actually take advantage of it, the ORC and Parquet libraries need to be upgraded to use the new method.

Any application using a suitably upgraded library will automatically gain the speedups: only the ORC and Parquet libraries need to be changed.

The latest Cloudera public cloud runtime 7.2.17 contains the updated ORC library with vectored IO support.

The updated Parquet library is still being qualified and benchmarked and will be released in upcoming CDP runtime releases.

The patches needed for the new API are available as ORC and Parquet pull requests on GitHub, tracked as ORC-1251 and PARQUET-2171. Here is where the challenge of coordinating backward-incompatible changes across the open-source big data stack becomes significant: some form of Java reflection is going to be needed to boost the performance of these libraries on Hadoop 3.3.5+ while still allowing them to compile and run against older releases. We are working on a library to assist with this, https://github.com/apache/hadoop-api-shim. However, it may actually be better to implement something similar in both Parquet and ORC, just to avoid adding yet another dependency to the mix.
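The reflection approach can be sketched as follows. This is not the hadoop-api-shim code, and the class names here (LegacyStream, ModernStream, VectoredShimSketch) are hypothetical stand-ins rather than Hadoop types. The sketch only shows the probing step: look up a public readVectored(List, IntFunction) method at runtime and fall back to sequential reads when the stream class does not have one, so the library compiles without any reference to the newer API.

```java
import java.lang.reflect.Method;
import java.util.List;
import java.util.function.IntFunction;

/** Stand-in for a stream class from a release without vectored reads. */
class LegacyStream { }

/** Stand-in for a stream class from a release that has the new method. */
class ModernStream {
    public void readVectored(List<?> ranges, IntFunction<?> allocator) {
        // a real stream would start the range reads here
    }
}

final class VectoredShimSketch {
    /** Returns the public readVectored(List, IntFunction) method, or null if absent. */
    static Method findReadVectored(Class<?> streamClass) {
        try {
            return streamClass.getMethod("readVectored", List.class, IntFunction.class);
        } catch (NoSuchMethodException e) {
            return null;   // older release: the method does not exist
        }
    }

    /** Decide which read path a library would take for this stream. */
    static String choosePath(Object stream) {
        return findReadVectored(stream.getClass()) != null ? "vectored" : "sequential";
    }

    public static void main(String[] args) {
        System.out.println(choosePath(new LegacyStream()));   // prints sequential
        System.out.println(choosePath(new ModernStream()));   // prints vectored
    }
}
```

In practice the lookup result would be cached per class, since reflective method resolution on every read would add overhead of its own.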

Future work:

As of now, the vectored IO API is implemented for the S3A and local file systems. We would like to extend this implementation to the Azure and Google Cloud storage connectors as well. The Ozone team is also planning to support vectored IO in the OFS protocol (HDDS-7298).

References:

Apache Conference Talk

Add a high-performance Vectored read API.

Support VectoredIO in ORC

Support vectored IO in Parquet

Hadoop Vectored IO: your data just got faster!

Written by Mukund Thakur