Spark Standalone Mode

Since the startup I work with is fairly small, I was given the task of building its data infrastructure. The first step was to create a cluster so that we could run our algorithms in a distributed environment.

For the cloud provider, we had several options to choose from: Microsoft Azure, AWS, DigitalOcean and Google Cloud.
Since pricing did not differ much between the providers and AWS support is top notch, we decided to go with AWS. We considered DigitalOcean mainly because of the credits we had there. AWS also has some products that we will definitely use in the future, so going with AWS made perfect sense for us. Whichever provider you choose, they are all good; what matters is your use case. If you are going to use Microsoft technologies and tools, Azure is the better fit. If you want a slightly cheaper option, Google Cloud is good and reliable.

Once we had decided on a cloud provider, I had to choose between Hadoop and Spark. I would say Spark and Hadoop complement each other more than they compete with one another. You can use YARN and HDFS, which were developed for Hadoop, with Spark too.

For our use case we opted for Spark. Thanks to its in-memory computation, Apache Spark is much faster than Hadoop in benchmarks. Spark also comes bundled with an ML library (MLlib) that has a pretty good collection of ML algorithms, a graph computation framework (GraphX), and Streaming and SQL libraries. Another factor that worked in Spark's favour was its ability to integrate with R, which is currently our main language for writing ML algorithms. Spark was developed to extend the MapReduce (MR) paradigm to iterative and interactive jobs.
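To make the point about iterative jobs a bit more concrete, here is a toy sketch in plain Python. It is not Spark or Hadoop code and does not use either API. `CountingStore`, `step`, `iterate_rereading` and `iterate_cached` are all hypothetical names invented for this illustration. The sketch assumes that the expensive part of an iterative job is reading the input, and it simply counts how often that happens when every pass re-reads the data versus when the data is read once and kept in memory.

```python
# Toy illustration only: plain Python, not the Spark or Hadoop API.
# All names here are hypothetical and exist only for this sketch.


class CountingStore:
    """Stand-in for slow storage (for example HDFS); counts full reads."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def read_all(self):
        # Every call simulates one full scan of the input.
        self.reads += 1
        return list(self._records)


def step(estimate, data, rate=0.5):
    """One gradient-descent step that moves the estimate towards the mean."""
    gradient = sum(estimate - x for x in data) / len(data)
    return estimate - rate * gradient


def iterate_rereading(store, iterations):
    """Every pass re-reads the input from storage."""
    estimate = 0.0
    for _ in range(iterations):
        estimate = step(estimate, store.read_all())
    return estimate


def iterate_cached(store, iterations):
    """Read the input once, keep it in memory, and reuse it on every pass."""
    data = store.read_all()
    estimate = 0.0
    for _ in range(iterations):
        estimate = step(estimate, data)
    return estimate


if __name__ == "__main__":
    values = [1.0, 2.0, 3.0, 4.0]
    rereading_store = CountingStore(values)
    cached_store = CountingStore(values)
    iterate_rereading(rereading_store, 10)
    iterate_cached(cached_store, 10)
    print("full reads when re-reading each pass:", rereading_store.reads)
    print("full reads when cached in memory:", cached_store.reads)
```

Both loops compute the same estimate. The only difference is how many times the input is scanned, which is roughly the cost that keeping data in memory is meant to avoid for iterative algorithms such as the ML workloads mentioned above.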