November 12, 2015
The Cask Data Application Platform is an integrated developer platform for the Hadoop ecosystem. With CDAP, developers can address a broader set of batch and real-time use cases through easy-to-use abstractions. Developers can write MapReduce programs using CDAP and easily deploy them as CDAP applications, as explained in this guide.
Running MapReduce programs inside CDAP has many advantages over vanilla Hadoop, such as easier deployment and management of applications, APIs for extracting metrics and logs from applications, and simple programmatic APIs for scheduling.
In this blog post, we will cover how developers can easily port their existing legacy MapReduce code into CDAP, gaining the benefits CDAP provides for application manageability, monitoring, and scheduling, as well as access to CDAP’s powerful data abstractions.
Running Legacy MapReduce Jobs inside CDAP
Running legacy MapReduce jobs in CDAP can be done in four easy steps:
- Create a base CDAP application from an archetype
- Add the legacy MapReduce jar as a Maven dependency
- Port the MapReduce driver code from main() into the CDAP application to set the Mapper/Reducer classes and configure inputs and outputs
- Build, deploy and run the application!
In the following sections, we will take a deeper look at the steps discussed above, using the legacy WordCount application as an example.
1. Creating a CDAP application from an archetype:
Use the following Maven command to create an application from the Maven archetype:
$ mvn archetype:generate \
-DarchetypeGroupId=co.cask.cdap \
-DarchetypeArtifactId=cdap-app-archetype \
-DarchetypeVersion=3.2.1
This archetype will create a HelloWorld application which you can edit to suit your needs.
2. Adding the jar containing the legacy MapReduce job as a Maven dependency:
You can make your legacy MapReduce job available to your CDAP application by adding it to the application’s pom.xml file as a Maven dependency. Edit the pom file to add the jar containing the MapReduce code as a dependency; this allows the CDAP application to use the legacy Mapper and Reducer classes.
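For example, if the legacy WordCount jar were published under the hypothetical coordinates com.example:legacy-wordcount:1.0, the dependency entry in pom.xml would look like this:

<dependency>
  <groupId>com.example</groupId>
  <artifactId>legacy-wordcount</artifactId>
  <version>1.0</version>
</dependency>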
3. Writing the MapReduce Driver:
After making your legacy MapReduce program available to your CDAP application, you need to set the Mapper and Reducer classes for the job, just as you would in a typical main() of a MapReduce program. In CDAP, you do this in the beforeSubmit() method of the class that extends AbstractMapReduce in your application.
The driver class will have the following structure:
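(The original post embedded the snippet here; the following is a minimal sketch against the CDAP 3.2 API, reusing the Mapper and Reducer from Hadoop’s classic WordCount example, TokenizerMapper and IntSumReducer.)

import co.cask.cdap.api.mapreduce.AbstractMapReduce;
import co.cask.cdap.api.mapreduce.MapReduceContext;
import org.apache.hadoop.examples.WordCount;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class ClassicWordCountDriver extends AbstractMapReduce {

  @Override
  public void configure() {
    setName("ClassicWordCountDriver");
    setDescription("Runs the legacy WordCount Mapper and Reducer inside CDAP");
  }

  @Override
  public void beforeSubmit(MapReduceContext context) throws Exception {
    // Obtain the underlying Hadoop Job and configure it exactly as the
    // legacy main() driver did.
    Job job = context.getHadoopJob();
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
  }
}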
The next step is to configure the input and output for the MapReduce job. A typical MapReduce job needs input and output paths, which are generally provided as command-line arguments and used to configure the job in the program’s main() function. When the legacy program runs inside CDAP, you can provide these paths as runtime arguments instead. For example, Hadoop’s classic WordCount needs an input and an output file path; we pass both to our MapReduce program ‘ClassicWordCountDriver’ as runtime arguments, and beforeSubmit() uses them by calling the preparePaths() function, which sets the paths on the job.
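A sketch of such a helper, assuming the runtime arguments are passed under the illustrative keys input.path and output.path:

import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Called from beforeSubmit() as: preparePaths(job, context.getRuntimeArguments());
private void preparePaths(Job job, Map<String, String> args) throws IOException {
  // Wire the paths supplied at runtime into the Hadoop Job, just as a
  // classic driver would do with command-line arguments in main().
  FileInputFormat.addInputPath(job, new Path(args.get("input.path")));
  FileOutputFormat.setOutputPath(job, new Path(args.get("output.path")));
}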
4. Building, Deploying, and Running the MapReduce Application:
After downloading the example, you will need CDAP to run it. You can download and install the CDAP SDK by following these instructions. Once the CDAP SDK is up and running, follow the instructions in the README.md of the downloaded code to build and run the example.
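As a rough illustration (the jar path, application name, and argument keys below are hypothetical), deploying and starting the program with the CDAP CLI looks something like:

$ cdap-cli.sh deploy app target/wordcount-1.0.jar
$ cdap-cli.sh start mapreduce WordCountApp.ClassicWordCountDriver "input.path=/tmp/wc/input output.path=/tmp/wc/output"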
Congratulations! You just ran a legacy MapReduce program in CDAP without any code changes. In a follow-up blog post, we will show how your legacy MapReduce program can harness even more of CDAP, again without any code changes.
In the meantime, to learn more about the many possibilities with CDAP, check out our guides, which include hands-on introductions to various CDAP concepts.