Pangool: Hadoop API made easy

Iván de Prado
Published in Iván’s blog, Mar 5, 2012

We are proud to announce Pangool, an open-source Java library that aims to replace the Hadoop API. Hadoop has a steep learning curve. Pangool’s goal is to simplify Hadoop development without losing the performance or flexibility of Hadoop’s low-level API.


Pangool is a Tuple MapReduce implementation for Hadoop. By employing an intermediate Tuple-based schema and configuring a Job conveniently, many of the accidental complexities that arise from using the Hadoop Java MapReduce API disappear. Things like secondary sort and reduce-side joins become extremely easy to implement and understand. Pangool’s performance is comparable to that of the Hadoop Java MapReduce API. Pangool also augments Hadoop’s API by making multiple inputs and outputs first-class and by allowing configuration via object instances instead of static class references.

In previous posts we discussed most of the common difficulties of developing with Hadoop. Pangool was born as a solution to them.

There are common patterns when writing MapReduce applications that are not easy to implement using the Hadoop API. Pangool offers easy solutions for them “out of the box”.

Features

Tuples instead of key/value pairs

By using Tuples instead of (key, value) pairs, you are not forced to write custom data types (e.g. Writables) or use external serialization libraries when working with more than two fields.

Pangool’s Tuples may still contain arbitrary data types, as long as Hadoop can serialize them.
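To make the idea concrete, here is a minimal plain-Java sketch of a record addressed by named fields rather than by a fixed (key, value) pair. The `SimpleTuple` class and its methods are hypothetical names invented for this illustration; they are not Pangool's actual classes, and the sketch ignores serialization entirely.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for a schema-backed tuple; not Pangool's API.
final class SimpleTuple {
    private final List<String> fieldNames;                 // the "schema": ordered field names
    private final Map<String, Object> values = new LinkedHashMap<>();

    SimpleTuple(List<String> fieldNames) {
        this.fieldNames = new ArrayList<>(fieldNames);     // defensive copy of the schema
    }

    // Sets a field by name, rejecting names that are not part of the schema.
    SimpleTuple set(String field, Object value) {
        if (!fieldNames.contains(field)) {
            throw new IllegalArgumentException("Unknown field: " + field);
        }
        values.put(field, value);
        return this;
    }

    // Returns the value stored for a field, or null if it was never set.
    Object get(String field) {
        return values.get(field);
    }
}
```

With something of this shape, a record with three or more fields needs no hand-written composite key or value class: the field list alone describes it.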

Efficient, easy-to-use secondary sorting

In Pangool you can say:

[code lang="java"]
groupBy("user", "country")
sortBy("user", "country", "name")
[/code]

Underneath, Pangool will use an efficient Partitioner, sort Comparator and group Comparator, just as an advanced user would with the plain Hadoop MapReduce API.
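The relationship between the two field lists can be sketched in plain Java, without any Hadoop or Pangool classes. In this illustration the sort comparator covers all three fields, while the group comparator looks only at the (user, country) prefix, so each group receives its names already ordered. The class and method names are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Plain-Java illustration of secondary sort; it does not use Hadoop or Pangool.
final class SecondarySortSketch {

    // A record with the three fields from the example: user, country, name.
    static final class Row {
        final String user;
        final String country;
        final String name;

        Row(String user, String country, String name) {
            this.user = user;
            this.country = country;
            this.name = name;
        }
    }

    // Sort comparator: orders by user, then country, then name.
    static final Comparator<Row> SORT_BY = Comparator
            .comparing((Row r) -> r.user)
            .thenComparing(r -> r.country)
            .thenComparing(r -> r.name);

    // Group comparator: only the (user, country) prefix decides group membership.
    static final Comparator<Row> GROUP_BY = Comparator
            .comparing((Row r) -> r.user)
            .thenComparing(r -> r.country);

    // Sorts the rows, then starts a new group wherever the group prefix changes.
    static List<List<String>> namesPerGroup(List<Row> rows) {
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(SORT_BY);
        List<List<String>> groups = new ArrayList<>();
        Row previous = null;
        for (Row row : sorted) {
            if (previous == null || GROUP_BY.compare(previous, row) != 0) {
                groups.add(new ArrayList<>());   // prefix changed: open a new group
            }
            groups.get(groups.size() - 1).add(row.name);
            previous = row;
        }
        return groups;
    }
}
```

Because the group fields are a prefix of the sort fields, one sorted pass is enough: no group has to be buffered and re-sorted in memory.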

Efficient, easy-to-use reduce-side joins

Doing reduce-side joins with Pangool is as simple as it can get. By using Tuples and configuring your MapReduce jobs properly, you can easily join various datasets and perform arbitrary business logic on them. Again, Pangool will know how to partition, sort and group the data underneath in an efficient way.
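For readers unfamiliar with the pattern, the following plain-Java sketch shows the general mechanics of a reduce-side join. It is not Pangool code. Records from two made-up datasets ("users" and "purchases") are tagged with their source, sorted by join key and then by source, and streamed so that each user record is seen before that user's purchases. All names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Plain-Java illustration of a reduce-side join; datasets and names are made up.
final class ReduceSideJoinSketch {

    static final int USERS = 0;      // source tag for the "users" dataset
    static final int PURCHASES = 1;  // source tag for the "purchases" dataset

    // What a mapper would emit: the join key, a source tag, and the payload.
    static final class Tagged {
        final String userId;
        final int source;
        final String payload;

        Tagged(String userId, int source, String payload) {
            this.userId = userId;
            this.source = source;
            this.payload = payload;
        }
    }

    // Inner join of users and purchases on userId.
    static List<String> join(List<Tagged> mapOutput) {
        List<Tagged> sorted = new ArrayList<>(mapOutput);
        // Sort by join key, then by source, so a user's record precedes that user's purchases.
        sorted.sort(Comparator
                .comparing((Tagged t) -> t.userId)
                .thenComparingInt(t -> t.source));

        List<String> joined = new ArrayList<>();
        String currentUserId = null;
        String currentUserName = null;
        for (Tagged t : sorted) {
            if (!t.userId.equals(currentUserId)) {   // a new group (join key) starts
                currentUserId = t.userId;
                currentUserName = null;
            }
            if (t.source == USERS) {
                currentUserName = t.payload;         // remember the user for this group
            } else if (currentUserName != null) {    // purchases without a user are skipped
                joined.add(currentUserName + " bought " + t.payload);
            }
        }
        return joined;
    }
}
```

The sort on the source tag is the secondary-sort trick again: it lets the joining loop keep only one user record in memory per group.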

Configuration via object instances

Mappers, Combiners, Reducers, Input / Output Formats and Comparators can be passed as object instances. This way, boilerplate configuration code is no longer needed.
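With the plain Hadoop API you register a class (for example via `job.setMapperClass(...)`), so any parameters have to travel separately through the job Configuration. The sketch below illustrates the alternative in plain Java: an object carries its own state, and a framework can ship that instance to the tasks. It assumes Java serialization as the transport, which is an assumption of this example; `PrefixFilter` and `roundTrip` are hypothetical names, not part of Pangool.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Plain-Java illustration of configuring behaviour through an object instance.
final class InstanceConfigSketch {

    // Hypothetical mapper-like object whose behaviour depends on constructor state.
    static final class PrefixFilter implements Serializable {
        private static final long serialVersionUID = 1L;
        private final String prefix;

        PrefixFilter(String prefix) {
            this.prefix = prefix;
        }

        boolean accept(String line) {
            return line.startsWith(prefix);
        }
    }

    // Serializes and deserializes an instance, as a framework might when shipping it to tasks.
    static <T extends Serializable> T roundTrip(T instance) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(instance);
            }
            try (ObjectInputStream in =
                         new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                @SuppressWarnings("unchecked")
                T copy = (T) in.readObject();
                return copy;
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Could not copy instance", e);
        }
    }
}
```

The copy returned by `roundTrip(new PrefixFilter("ERROR"))` still filters on the same prefix, with no Configuration keys to write on one side and parse on the other.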

First-class multiple inputs / outputs

Multiple inputs and outputs are part of Pangool’s standard API.

Input / Output Tuple formats

Tuples may be persisted and used as input to other Jobs by using TupleOutputFormat / TupleInputFormat.

Performance and flexibility

Pangool is an alternative to the Java Hadoop MapReduce API. The same things can be achieved with either one. Pangool’s performance is quite close to that of Hadoop’s MapReduce API. Pangool simply makes life easier for those who need the efficiency and flexibility of the plain Java Hadoop MapReduce API.

[Figure: Pangool performance]

Conclusion

We developed Pangool to give the Hadoop community a tool that flattens the learning curve and so eases adoption of this technology for new users. We hope you’ll find it useful. That will make us happy!

Contributors

Eric Palacios
Pere Ferrera
Iván de Prado

More information

The Pangool website
