Reasons not to install Hadoop on Windows
A few years ago, my colleagues kept telling me, “Don’t ever think about installing Hadoop on the Windows operating system!” I was not convinced, since I am a big fan of Microsoft products, especially Windows.
In the past few years, I have worked on several projects where we were asked to build a Big Data ecosystem using Hadoop and related technologies on Ubuntu. Working with these technologies was not easy, especially given the lack of online resources. Last month, I was asked to build a Big Data ecosystem on Windows, with three technologies to install: Hadoop, Hive, and Pig.
By the end of the project, the only conclusion I had reached was: “Think 1000 times before installing Hadoop and related technologies on Windows!”
In this article, I will briefly describe the main reasons behind this conclusion.
Are these technologies developed to run on Windows?
The first releases of Hadoop were demonstrated and tested on GNU/Linux, but were not tested on the Win32 operating system. For Hadoop 2.x and later releases, Windows support was added, and a step-by-step guide was provided in the official documentation.
This should be fine if you only need to install Hadoop. But when it comes to related technologies such as Apache Hive, not all releases are supported (for Apache Hive, only the 2.x releases support Windows). You will need some hacks and workarounds to install these technologies, such as using the Cygwin utility to execute GNU/Linux shell commands or copying cmd scripts from other releases.
Besides, some services may not work correctly. As an example, after installing Apache Hive and Apache Pig, we tried to connect to them through the WebHCat REST API and through the Microsoft Hive ODBC driver. The connection was established successfully, but we couldn’t execute anything, since even the simplest command threw timeout exceptions.
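To make the timeout symptom concrete, here is a minimal sketch, in Python, of the kind of check we ran against the WebHCat REST API. The host, port, user name, and timeout values are assumptions for illustration (50111 is the usual WebHCat default port), so treat it as a hedged example rather than the exact script we used.

```python
# Minimal sketch: probing WebHCat (Templeton) before submitting a Hive command.
# The host, port, user name, and timeouts below are assumptions for illustration.
import requests

WEBHCAT_URL = "http://localhost:50111/templeton/v1"  # 50111 is the usual default port
USER = "hadoop"  # hypothetical user name

def check_webhcat_status(timeout_s: float = 10.0) -> None:
    """Call the WebHCat status endpoint; in our case this call succeeded."""
    resp = requests.get(f"{WEBHCAT_URL}/status", timeout=timeout_s)
    resp.raise_for_status()
    print("WebHCat status:", resp.json())

def run_simple_hive_command(timeout_s: float = 30.0) -> None:
    """Submit a trivial Hive statement; this is where we hit timeout exceptions."""
    resp = requests.post(
        f"{WEBHCAT_URL}/hive",
        params={"user.name": USER},
        data={"execute": "SHOW DATABASES;"},
        timeout=timeout_s,
    )
    resp.raise_for_status()
    print("Submitted Hive job:", resp.json())

if __name__ == "__main__":
    check_webhcat_status()
    try:
        run_simple_hive_command()
    except requests.exceptions.Timeout:
        print("Hive command timed out, which is the symptom we kept hitting on Windows.")
```

On our setup, the status call behaved while the actual command submissions timed out, which is what pointed us to the Hive services themselves rather than basic connectivity.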
In brief, these technologies are neither as stable nor as well supported on Windows as they are on Linux.
Other Reasons
Cost
One other reason is licensing cost. Linux is a free and open-source operating system, while Windows licenses are not; deploying a multi-node Hadoop cluster on Windows machines can therefore become expensive.
Lack of resources
In general, Big Data technologies do not have many online resources, and most of the resources that do exist relate to Linux, so you may struggle even with a small issue specific to a Windows environment. Even if you ask for support in an online community such as Stack Overflow, most experts work with cloud-based Hadoop clusters or Linux on-premises installations.
What to do if you are using Windows?
If you are using Windows and need to use Hadoop and related technologies, you may:
- Use Linux virtual machines to install Hadoop; note that your machine must have sufficient resources.
- Use a cloud-based Hadoop service, such as a Hadoop cluster on Microsoft Azure.
- If you cannot go with either of those suggestions and you have to install Hadoop directly on Windows, first check which releases of all the required technologies (Hadoop, Hive, Spark …) support Windows, then choose compatible ones. As an example, only the Hive 2.x releases support Windows, while 3.x needs some hacks and not all of its features may work properly. So, if you need to install Apache Hive, you have to use a Hadoop 2.x release, since newer Hadoop releases are not compatible with Hive 2.x. A small version-check sketch follows this list.
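As a small illustration of that compatibility check, the sketch below shells out to the standard hadoop version and hive --version commands and warns when the installed pair does not match the Hive 2.x / Hadoop 2.x pairing described above. The command names are standard, but the parsing and the pairing rule are assumptions based only on this article, not an official compatibility matrix.

```python
# Hedged sketch: read the installed Hadoop and Hive versions and warn when they
# do not match the Hive 2.x / Hadoop 2.x pairing discussed above.
# Assumes `hadoop` and `hive` are on the PATH; shell=True lets the .cmd wrappers
# resolve on Windows as well.
import re
import subprocess

def tool_version(command: str) -> str:
    """Run a version command and extract the first x.y(.z) token from its output."""
    result = subprocess.run(command, capture_output=True, text=True, shell=True)
    match = re.search(r"\d+\.\d+(\.\d+)?", result.stdout + result.stderr)
    if not match:
        raise RuntimeError(f"Could not parse a version from: {command}")
    return match.group(0)

def main() -> None:
    hadoop_ver = tool_version("hadoop version")
    hive_ver = tool_version("hive --version")
    print(f"Hadoop {hadoop_ver}, Hive {hive_ver}")

    # Encodes only the claim made above: the Windows-friendly Hive 2.x line
    # is expected to run against Hadoop 2.x.
    if hive_ver.startswith("2.") and not hadoop_ver.startswith("2."):
        print("Warning: Hive 2.x on Windows is expected to pair with Hadoop 2.x.")

if __name__ == "__main__":
    main()
```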
References
- Data Developers, “Don’t install Hadoop on Windows!”
- Stack Overflow Q&A website
- Quora Q&A website
- Cloudera online community