Databricks Community Edition: An Entry Into Big Data With Spark
During a local Spark meetup, I was introduced to the Databricks Community Edition. Databricks is well known within the Spark community, so I was excited to receive an early invite to try out the Community Edition.
This brief article is a mix of overview, step-by-step instruction, and opinion. By the end, you’ll have your very own bright and shiny Spark cluster loaded with a large dataset (a few gigabytes, at least) waiting on whatever fun data analysis you can throw at it.
The beauty of the Databricks platform and approach is that anyone, regardless of technical background, can try this out. I’m a firm believer in getting your hands on the technology and trying to do some fun and interesting things with it immediately, so we are going to jump right in.
Get Started With Databricks Community Edition
Step 1: Sign Up For A Databricks Community Edition Account
Step 2: Sign In To Your Account
Step 3: Create A New Cluster
- From the left-hand menu, choose “Clusters”
- Click “Create Cluster”
- Give the cluster a name and select a Spark version (I went with the default)
Step 4: Load An Application Example
- From the left-hand menu, choose “Workspace”
- Under the “Workspace” column, choose “Shared”
- Click the down arrow next to “Shared” and choose “Import”
- In the “Import Notebooks” dialog, choose “Import from URL” and enter the following url: https://docs.cloud.databricks.com/docs/latest/sample_applications/Sample%20Analysis/Wikipedia%20Clickstream%20Analysis.html
- Click “Import”
Let’s pause to appreciate what just happened. In four steps, you have a development Spark cluster up and running with a large dataset to play with. That saves you the time of installing and configuring a local Spark cluster (which can be a difficult process, depending on your machine) and the hassle of hunting down a dataset interesting and large enough to make working with Spark truly meaningful. There are a number of places to grab large datasets, but if you're new to the scene, the search can seem daunting and frustrating. The dataset we are working with in this article contains clickstream data from 3.2 billion Wikipedia page requests collected during February 2015.
Load & Optimize Your Data
The Wikipedia Clickstream Analysis notebook is broken up into functional code blocks that you can run individually within your cluster by pressing Shift + Enter. You’ll use that shortcut throughout this walkthrough.
Start by running the first code block. This spins up a few distributed Spark jobs to load the external dataset and convert it into a format that is more efficient to work with.
The first time you run a code block, you may get the following message:
Choose “Attach and Run”. This just means you’re trying to run commands in a notebook that has no cluster attached to it yet. You only have to attach it once.
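To get a feel for what that first code block is doing, here is a rough, Spark-free sketch of the parsing step. The real notebook does this with distributed Spark jobs at full scale; the sample rows and column names below are illustrative, loosely following the clickstream dataset's tab-separated layout of referrer page, target page, and request count.

```python
import csv
import io

# A tiny, made-up sample in a clickstream-style tab-separated layout:
# referrer page, target page, number of requests.
raw_tsv = (
    "prev_title\tcurr_title\tn\n"
    "other-google\tSpark_(software)\t1000\n"
    "Apache_Hadoop\tSpark_(software)\t250\n"
)

# Parse the raw TSV into structured, typed records -- conceptually
# what the notebook's first code block does across the cluster before
# writing the data back out in a format that is faster to query.
reader = csv.DictReader(io.StringIO(raw_tsv), delimiter="\t")
records = [
    {"prev": row["prev_title"], "curr": row["curr_title"], "n": int(row["n"])}
    for row in reader
]

print(records[0])
# -> {'prev': 'other-google', 'curr': 'Spark_(software)', 'n': 1000}
```

The payoff of this conversion step is that every later query works against typed, structured records instead of re-parsing raw text.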
Create A DataFrame
In Spark, a DataFrame is equivalent to a relational table in Spark SQL. For this example application, we are going to load our data into a DataFrame so we can run some SQL queries against it in a bit.
Now, if we display this DataFrame, its layout may look a bit more familiar...
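If you want to poke at the same idea on your own machine without a Spark cluster, a pandas DataFrame gives the same "relational table" feel: named, typed columns with one record per row. This is only a local analogue of the Spark DataFrame in the notebook, and the rows and column names here are made up for illustration.

```python
import pandas as pd

# Made-up sample rows in a clickstream-like shape: where the reader
# came from, where they landed, and how many requests were made.
df = pd.DataFrame(
    {
        "prev_title": ["other-google", "Apache_Hadoop", "other-empty"],
        "curr_title": ["Spark_(software)"] * 3,
        "n": [1000, 250, 120],
    }
)

# Displayed, a DataFrame reads like a relational table.
print(df)
```

Spark's `display()` in the notebook renders the same kind of tabular view, just backed by data distributed across the cluster.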
Data Analysis: What Percentage of Traffic Comes From Within Wikipedia?
We now have good data loaded into our cluster, so let’s do some analysis. What if we wanted to know how much of Wikipedia’s traffic comes from within Wikipedia itself (that is, readers reaching one Wikipedia page by clicking a link on another)?
A few lines of code later, and we know that 33% of all traffic to Wikipedia comes from inside Wikipedia itself!
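The notebook computes this with Spark SQL over the full dataset, but the underlying arithmetic is just a filtered sum over referrers. Here is a plain-Python sketch of that logic on made-up rows; it assumes, as a simplification of the real data's labeling, that external sources carry an `other-` prefix (e.g. `other-google`) while internal referrers are ordinary article titles. The sample was chosen so the result echoes the article's 33% figure.

```python
# Made-up clickstream rows: (referrer, request count).
rows = [
    ("other-google", 600),   # external: search traffic
    ("other-empty", 200),    # external: no referrer
    ("Apache_Hadoop", 300),  # internal: another Wikipedia article
    ("MapReduce", 100),      # internal
]

total = sum(n for _, n in rows)
internal = sum(n for prev, n in rows if not prev.startswith("other-"))

pct = 100 * internal / total
print(f"{pct:.0f}% of this sample's traffic is internal")  # 33% by construction
```

Spark SQL runs this same filter-and-sum as a `GROUP BY`/`SUM` distributed across the cluster, which is why it stays fast even over billions of page requests.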
Next Steps: Watch & Experiment
When I started with the Community Edition, I immediately gravitated toward the Wikipedia Clickstream Analysis project as a starting point. I soon learned that Michael Armbrust at Databricks created the notebook for a demo at Spark Summit East 2016, and the session was recorded. Definitely check out the video here and follow along as he works directly from the example notebook. Beyond what we covered here, he goes on to build some interesting D3 visualizations of the data and shows how to run the same code against different versions of Spark, all from the same notebook.
You can check out my notebook here. Clone it, experiment, and try some of the other analyses Michael discusses in the video.
Finally, if you're interested in taking a closer look at the dataset we used in this article, you can do that here.