A simple example of using Spark in Databricks with Python and PySpark.

German Gensetskiy
Published in Go Wombat
4 min read · May 28, 2019


Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. Setting up a cluster on bare-metal machines can be complicated and expensive, which is where cloud solutions come in. They are especially useful when you're a developer who just needs to run some experiments with data.

Databricks provides a very fast and simple way to set up and use a cluster. It also has a Community Edition that you can use for free (that's the one I will use in this tutorial). For usage beyond development there are also Azure Databricks and Databricks on AWS.

Create an account and let’s begin.

Setting up the cluster

So, as I said, setting up a cluster in Databricks is easy as heck. Just click “New Cluster” on the home page, or open the “Clusters” tab in the sidebar and click “Create Cluster”. You will see a form where you choose a name for your cluster and a few other settings. For this tutorial you can leave everything except the name at its default value.

Example of cluster creation

Preparing the data

For this example, I chose the data from the Google Play Store:

After downloading the CSV with the data from Kaggle, you need to upload it to DBFS (Databricks File System). Once the file is uploaded, Databricks will offer to “Create Table in Notebook”. Let’s accept the offer.

Example of uploading data to DBFS

You will end up with a notebook like this. Note: for this dataset you’ll need to set first_row_is_header = “true”; it is “false” by default.
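The auto-generated cell looks roughly like the sketch below. The file path is just an example; use the DBFS path Databricks shows for your own upload.

# File location and type (example path; use the one Databricks shows for your upload)
file_location = "/FileStore/tables/googleplaystore.csv"
file_type = "csv"

# CSV options -- first_row_is_header is set to "true" for this dataset
infer_schema = "false"
first_row_is_header = "true"
delimiter = ","

# Read the uploaded CSV into a DataFrame
df = spark.read.format(file_type) \
    .option("inferSchema", infer_schema) \
    .option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .load(file_location)

display(df)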

Data preparation

When reading CSV files into DataFrames, Spark performs the operation eagerly, meaning the data is read before the next step begins execution, while a lazy approach is used when reading files in the Parquet format. Generally, you want to avoid eager operations when working with Spark, so we will save the data as a Parquet-backed table.

To do that, we first need to remove spaces from the column names, since Parquet doesn’t allow them. I decided to convert the names to snake case:
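The notebook cell itself isn’t reproduced here; a minimal sketch of such a renaming step, using a small helper I’m calling to_snake_case, could look like this:

import re

def to_snake_case(name):
    # "Content Rating" -> "content_rating"
    return re.sub(r"\s+", "_", name.strip()).lower()

# Rename every column of the DataFrame
df = df.toDF(*[to_snake_case(c) for c in df.columns])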

After converting the names, we can save our DataFrame to a Databricks table:

df.write.format("parquet").saveAsTable(TABLE_NAME)

To load that table back into a DataFrame later, use spark.read.table:

df = spark.read.table(TABLE_NAME)

Processing the data

Let’s find the average rating for each category and try to understand which kinds of apps customers are happy to use.

Code and result of processing the ratings of the apps
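The code from the screenshot isn’t reproduced here, but based on the steps described below, the pipeline looks roughly like the sketch that follows. The mapping inside rename_category is a hypothetical stand-in for the one used in the notebook.

from pyspark.sql.functions import avg, col, udf

@udf('string')
def rename_category(category):
    # Hypothetical mapping: turn raw codes like "ART_AND_DESIGN"
    # into friendlier labels like "Art and design"
    if category is None:
        return None
    return category.replace('_', ' ').capitalize()

ratings = (
    df.select(rename_category('category').alias('category'), 'rating')
      .where(df.rating != 'NaN')
      .groupBy('category')
      .agg(avg('rating').alias('avg_rating'))
      .where(col('avg_rating').isNotNull() & (col('avg_rating') <= 5.0))
      .sort('avg_rating')
)

display(ratings)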

Don’t worry, I will now walk through everything we did above.

UDF (the @udf(‘[output type]’) decorator) — user-defined functions. PySpark UDFs work similarly to pandas’ .map() and .apply(). The main difference is that with a PySpark UDF you have to specify the output data type. All the types supported by PySpark can be found here.

rename_category — a simple function that renames categories to slightly more human-readable names, wrapped as a UDF.

display — Databricks’ helper to display a DataFrame as a table or plot it as a graph.

select(rename_category(‘category’).alias(‘category’), ‘rating’) — as in SQL, selects the columns you specify from the table. Here we’re selecting the category and rating columns and applying our rename_category UDF to the categories. alias is used on the renamed result so the column keeps the name category.

where(df.rating != ‘NaN’) — filters out rows in which rating is NaN.

groupBy(‘category’) — grouping, as in a SQL query, so that data can be aggregated per group. We group the apps’ ratings by category so that we can then find an average rating for each category.

agg(avg(‘rating’).alias(‘avg_rating’)) — applies an aggregation to the grouped data; avg calculates the average. alias is used in the same way, to give the aggregated column a new name. By default it would be called avg(rating).

where(col(‘avg_rating’).isNotNull() & (col(‘avg_rating’) <= 5.0)) — filters out rows with a null rating or a rating greater than 5 (there are only 5 stars in the Google Play Store). This is required because the dataset contains some invalid data.

sort(‘avg_rating’) — simply sorts the data by average rating.

As a result, we can see a bar chart in Databricks with the average rating per category.

After getting the average rating, I decided to check the average number of downloads for each category as well. The data processing is not very different, but there is one tricky part.

The Google Play Store does not show the actual number of downloads, only a figure like 10,000+ (more than 10,000 downloads). So we can’t simply apply an aggregation to the installs column, since it’s a string.

Code and result of processing the installs amount of the apps

So, to convert installs from a string to an integer we can use a UDF: a simple function, convert_installs, that strips the “+” and “,” symbols and converts the result to an integer. If the value can’t be parsed as an integer, it simply returns None.
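The notebook cell isn’t reproduced here either; a minimal sketch of that UDF and the aggregation (grouping by the raw category column for brevity) could look like this:

from pyspark.sql.functions import avg, col, udf

@udf('long')
def convert_installs(installs):
    # "10,000+" -> 10000; return None for values that aren't numbers
    try:
        return int(installs.replace('+', '').replace(',', ''))
    except (AttributeError, ValueError):
        return None

installs = (
    df.select('category', convert_installs('installs').alias('installs'))
      .where(col('installs').isNotNull())
      .groupBy('category')
      .agg(avg('installs').alias('avg_installs'))
      .sort('avg_installs')
)

display(installs)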

Results

You can find the resulting notebook here:

Thanks for reading! Hope you liked it.

German Gensetskiy under the support of Go Wombat Team.

P.S. If you run into any trouble, I’ll be happy to hear your feedback. You can reach me by email: Ignis2497@gmail.com
