Introduction to Spark GraphX
With the emerging of Apache Spark analysis of big data became essential easier, Spark brings a lot implementation of useful algorithms for data mining, data analysis, machine learning, algorithms on graphs. Spark takes on the challenge of implementing sophisticated algorithms with tricky optimization and ability to run your code on distributed cluster, Spark effectively solve problems like fault tolerance and provide simple API to make the parallel computation.
In this article I would like to tell you about the significant part of Spark — GraphX, it’s a component for graphs and graph-parallel computation. GraphX reuses Spark RDD concept, simplifies graph analytics tasks, provides the ability to make operations on a directed multigraph with properties attached to each vertex and edge. There a lot of algorithms in graph theory which is widely used in data analysis and computer science from computers network to the searching engine and social networks. GraphX provides API for fast and robust development which is related with leveraging graphs.
Let’s start with GraphX and unleash the power of Spark for data analysis. Graphs are a perfect data structure for describing social network — for this reason, companies like Facebook make the accent on developing software in this direction.
Let’s consider an example how to use Spark GraphX for analysis of social graph of users on the social network. Firstly, we need to users data, in this example we will use two TSV (data separated by tabs) files — first describes user’s metadata — it’s a simple tuple in form user_id -> user_login (I took fake user names for educational purposes), and the second file — it describes connections of users — the first number in the row is user’s id, remainder is connections of this user — see snapshot below or links on originals on github:
5988 748 1722 3752 4655 5743 1872 3413 5527 6368 6085 4319 4728 1636
5989 4080 4264 4446 3779 2430 2297 6169 3530 3272 4282 6432 2548 5982 217 595 1194 3308 2940 1815 794 1503 5197 859 5096 6039 2664 651 2244 528 284 1449 1097 1172 1092 108 3405 5204 387 4607 4545 3705 4930 1805 4712 4404 247 4754 4427 1845 536 5795 5978 533 3984
Firstly we need to parse data from file with names — method parseNames and build edges between connection which will be obtained from the second file:
Next we construct graph:
Let’s find the most connected users in our social network we need to join our graph with verts and sort by connections.
Next task is searching for the degree of separation in our social network between the single user and other users and between two users. Degrees of separation is the idea that all living things and everything else in the world are a few steps away from each other so that a chain of “a friend of friend”.
For the searching connection between users was used Breadth-first search algorithm with Pregel. Breadth-first search (BFS) is an algorithm for traversing or searching on the graph, it starts from the root — in our case its id of the user defined as VertexId, and explores the neighbor nodes first, before moving to the next level neighbors. Pregel is a data flow paradigm and system for large-scale graph processing created at Google to solve problems that are hard or expensive to solve using only the MapReduce framework. Pregel is essentially a message-passing interface constrained to the edges of a graph. The idea is to ”think like a vertex” — algorithms within the Pregel framework are algorithms in which the computation of state for a given node depends only on the states of its neighbors. More about - Pregel.
You can find full listing of source code on my github repository.