Introduction Into Big Data With Apache Spark

Mikhail Raevskiy
Published in The Startup
5 min read · Aug 1, 2020


Last time we reviewed the wonderful Vowpal Wabbit tool, which is useful when you have to train on samples that do not fit into RAM. Recall that its distinguishing feature is that it builds primarily linear models (which, by the way, generalize well), and that high quality is achieved through feature selection and generation, regularization, and other additional techniques. Today we will look at a more popular tool designed for processing large amounts of data: Apache Spark.

We will not go into the details of the history of this tool or its internal structure; let's focus on practical things. In this article we will look at the basic operations and concepts in Spark, and next time we will take a closer look at the MLlib machine learning library, as well as GraphX for processing graphs (the author of this post mainly uses Spark for the latter — this is exactly the case when a graph often needs to be kept in RAM across the cluster, while Vowpal Wabbit is often enough for machine learning). There won't be a lot of code in this tutorial, because it discusses the basic concepts and philosophy of Spark. In the next articles (about MLlib and GraphX) we will take a dataset and look at Spark in practice.
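To give a flavor of the "basic operations" mentioned above, here is a minimal sketch in PySpark (assuming a local installation of Spark; the app name and the numbers used are purely illustrative). It shows the core Spark pattern: lazy transformations on a distributed collection, followed by actions that actually trigger computation.

```python
# A minimal sketch of basic Spark operations, assuming PySpark in local mode.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("spark-intro").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Create an RDD from an in-memory collection.
rdd = sc.parallelize(range(1, 1001))

# Transformations are lazy: nothing is computed yet.
evens = rdd.filter(lambda x: x % 2 == 0)
squares = evens.map(lambda x: x * x)

# Actions trigger the actual computation.
print(squares.count())                     # number of elements
print(squares.take(5))                     # first few results
print(squares.reduce(lambda a, b: a + b))  # sum of squared even numbers

sc.stop()
```

The key design point illustrated here is laziness: `filter` and `map` only build up a computation graph, and Spark executes it (and can optimize it) only when an action such as `count`, `take`, or `reduce` is called.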
