Sentiment Analysis of Twitter Hashtags With Spark

Revisited with PixieDust & Jupyter Notebooks

--

Why do it again?

This is yet another blog post where I discuss the application I built for running sentiment analysis of Twitter content using Apache Spark™ and Watson Tone Analyzer. Before you quit reading, let me assure you that there is a good reason to revisit this code, and that you will hopefully learn something new.

Let’s first recap the first two installments:

  1. In Part 1, I built a Spark Streaming application in Scala that I invoked in a Scala Notebook to fetch live Twitter content. I then used a Python Notebook to build analytics on the data.
  2. In Part 2, I tried to improve the application to update in real time. To do that, I ported the analytics into the Scala Spark Streaming piece, sending the results to a dashboard via MessageHub events (based on Apache Kafka™). The dashboard was a Node.js application (deployed on IBM Bluemix) that displayed D3 charts, updated continuously from the data received via MessageHub.

Great! But a few things about the application still bothered me:

  1. The application was accessible to developers, but other users found it hard to deploy. To be fair, configuring the Node.js dashboard, the MessageHub service, the Weather service, the Watson Tone Analyzer service, and Spark Streaming was error-prone. You had to ensure that the credentials, Kafka topics, and other settings matched in every place. Pretty painful, no?
  2. Data scientists lost the flexibility to further analyze the results. Everything was now in black boxes, namely the Spark Streaming Scala app and the Node.js dashboard code.

Can we do better?

For the third version of this app, I had two basic goals:

  1. Use only one flavor of data science notebook. Asking people to use a Scala Notebook first and then a Python Notebook created a lot of friction in previous versions of the app.
  2. The front-end should require no deployment or configuration changes.

The user story should be simple: a developer, data scientist, or line-of-business user should be able to run the application end-to-end from within a single Python Notebook. Difficult? Yes. Impossible? No.

Now that I’ve piqued your interest, read on!

PixieDust

To achieve my goals I needed three capabilities:

  1. The ability to run Scala code from within a Python Notebook (the Spark Streaming application from Part 1 requires Scala).
  2. The ability to install third-party Java packages into a Python Notebook.
  3. The ability to run a fully functional UI from within a Python Notebook.

Thankfully, PixieDust provides these three capabilities and more. You can learn more about PixieDust on GitHub: https://github.com/ibm-cds-labs/pixiedust. There’s also an intro post: PixieDust: Magic for Your Python Notebook.

You can easily install the PixieDust Python module in your notebook by using the following command: !pip install --user --upgrade pixiedust

Running Scala from a Python Notebook

In this section, we’ll recreate the app from Part 1, using only one Python Notebook.

Step 1: Install the Spark Streaming Scala application into your Python Notebook using the installPackage API.
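
For example (a sketch; the Maven coordinates below are illustrative, so use the ones published for the Part 1 application):

```python
import pixiedust

# Install the Spark Streaming Scala application (a Maven artifact) into the notebook.
# Coordinates are "groupId:artifactId:version"; these are placeholders.
pixiedust.installPackage("com.ibm.cds.spark.samples:streaming-twitter:1.6")
```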

You should see the following output:

“PixieDust database opened successfully” & more victorious output!

Look out for the red message asking you to restart the kernel. There’s more information on the PixieDust package manager.

Step 2: Store your credentials. We’ll reuse them a lot, so it’s best to store them in Python variables. We’ll then use the PixieDust Scala bridge to make Python variables available in Scala code.

Replace the XXXX below with your own credentials as explained in Part 1.
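
Here's a sketch of the credentials cell (the variable names are illustrative):

```python
# Store the credentials in Python variables so that both Python and Scala
# cells can use them. Replace each XXXX with your own values (see Part 1).
twitterConsumerKey = "XXXX"
twitterConsumerSecret = "XXXX"
twitterAccessToken = "XXXX"
twitterAccessTokenSecret = "XXXX"
taUserName = "XXXX"   # Watson Tone Analyzer service username
taPassword = "XXXX"   # Watson Tone Analyzer service password
```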

Step 3: Run the Spark Streaming app. We can now use the %%scala magic to write the code that runs the Spark Streaming application. Notice how the Scala code uses the credential variables declared above without redeclaring them, thanks to PixieDust variable auto-binding.
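
The cell looks something like this, a sketch in which the object and method names follow the Part 1 application:

```scala
%%scala
// Sketch of the streaming cell; the credential variables come from Python.
val demo = com.ibm.cds.spark.samples.StreamingTwitter
demo.setConfig("twitter4j.oauth.consumerKey", twitterConsumerKey)
demo.setConfig("twitter4j.oauth.consumerSecret", twitterConsumerSecret)
demo.setConfig("twitter4j.oauth.accessToken", twitterAccessToken)
demo.setConfig("twitter4j.oauth.accessTokenSecret", twitterAccessTokenSecret)
demo.setConfig("watson.tone.username", taUserName)
demo.setConfig("watson.tone.password", taPassword)

import org.apache.spark.streaming._
demo.startTwitterStreaming(sc, Seconds(30))  // collect tweets for 30 seconds
```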

You should see the following results:

Here’s more information on the PixieDust Scala bridge.

Step 4: Collect tweets enriched with Tone Analyzer scores and move them into a Spark DataFrame. Wait until Spark Streaming has finished and run the following cell:
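
A sketch of the cell (createTwitterDataFrames comes from the Part 1 application; the returned tuple shape is an assumption):

```scala
%%scala
// Variables prefixed with __ are bound back into the Python namespace.
val demo = com.ibm.cds.spark.samples.StreamingTwitter
val (__sqlContext, __df) = demo.createTwitterDataFrames(sc)
```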

In this cell, we again use Scala to call the createTwitterDataFrames API. Notice how we prefix each variable with the special characters __. The underscores tell the Scala bridge that these variables should be bound back as Python variables. We’ll use them again to do some data science.

You should see the following results:

Step 5: Do some data science. (See? I told you.) We are now able to run the same analytics from the Python Notebook in Part 1:
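
For example, handing the auto-bound DataFrame straight to PixieDust:

```python
import pixiedust

# __df is the Spark DataFrame created by the Scala cell above,
# bound into Python by the Scala bridge.
display(__df)
```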

Here we use the PixieDust display API to explore the data. The PixieDust GitHub wiki again has everything you need to know about display.

You should see the following results:

You can optionally run the other analytics from Part 1. They are pretty much the same.

What about the line-of-business user?

The previous section represents a great improvement over Part 1, as we can now run the application end-to-end from within a single Python Notebook. However, we still have a lot of code and syntax to deal with, and we don’t yet have real-time analytics.

In this section, we’ll show how to create a PixieDust embedded UI with real-time analytics.

Step 1: Create a PixieDust plugin. Create a new GitHub repo and add a setup.py file as follows:
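
A minimal sketch (the package name and metadata are illustrative):

```python
# setup.py -- minimal sketch of the plugin packaging
from setuptools import setup

setup(
    name='pixiedust-twitterdemo',
    version='0.1',
    description='PixieDust plugin: real-time Twitter sentiment demo',
    packages=['pixiedust_twitterdemo'],
    install_requires=['pixiedust']
)
```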

See it in context on GitHub in the pixiedust_incubator/twitterdemo repo.

Step 2: Create your controller class. The controller class tells PixieDust when to trigger the display class. It must inherit from DisplayHandlerMeta, found in display.py. Then, in __init__.py:
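
Here's a sketch; the registration call and decorators may differ across PixieDust versions, and the "twitterdemo" entity string is illustrative, so treat the repo as the authoritative version:

```python
# __init__.py -- sketch of the controller class
from pixiedust.display.display import *
from .twitterDemo import TwitterDemoDisplay

class TwitterDemoMeta(DisplayHandlerMeta):
    @addId
    def getMenuInfo(self, entity, dataHandler):
        # Trigger only for the entity our wrapper API (Step 4) passes in
        if entity == "twitterdemo":
            return [{"id": "twitterdemo"}]
        return []

    def newDisplayHandler(self, options, entity):
        return TwitterDemoDisplay(options, entity)

# Register the controller with the PixieDust display framework
registerDisplayHandler(TwitterDemoMeta())
```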

Step 3: Create the display class. This class contains the logic for processing and displaying the results. It must inherit from the Display class, found in display.py. Then, in twitterDemo.py:
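
A sketch (the template and script names are placeholders):

```python
# twitterDemo.py -- sketch of the display class
from pixiedust.display.display import Display

class TwitterDemoDisplay(Display):
    def doRender(self, handlerId):
        # Load the client-side JavaScript the dialog needs (placeholder URL)
        self._addScriptElement("https://example.com/twitterdemo/demo.js")
        # Render the jinja2 template; PixieDust passes it entity, prefix, etc.
        self._addHTMLTemplate("demo.html")
```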

There is a lot happening in this example. If you’re interested in diving in deeper, we’ll publish a hello-world example in January to help you follow along.

Let’s look at a few key points. The doRender method is called by the framework to process and render the visualization. Inside it, the class has access to the following:

  1. self.entity: the entity containing the data being displayed (not used above, but available for use)
  2. self.options: dictionary of state variables
  3. self._addHTMLTemplate: helper method that takes a path to a jinja2 template; the template is guaranteed to be passed a few variables like entity, prefix, etc.
  4. self._addScriptElement: helper method that lets you insert a JavaScript file into the browser client

I’m glossing over many details in this article. I encourage you to study the code to understand how I’m creating a dialog box for the UI from within the demo.html template—specifically, how I use the basedialog.html macro.

Step 4: Create your simple API. We are now ready to create a simple wrapper API that the user can just call from a notebook. In __init__.py:
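
A sketch, in which a simple string token stands in for the plugin-specific data:

```python
# Appended to __init__.py -- the user-facing API
from pixiedust.display import display

def twitterDemo():
    # Wrap display(), passing the entity our controller class recognizes
    display("twitterdemo")
```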

The twitterDemo method wraps a call to display, passing data that is specific to this plugin.

Step 5: Use our new application in a notebook. First, set the credentials again, this time for the PixiedustStreamingTwitter class:
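
A sketch mirroring the Step 3 cell (the fully qualified class name is an assumption based on Part 1's package):

```scala
%%scala
// Same setConfig calls as before, now on the streaming class the plugin uses
val demo = com.ibm.cds.spark.samples.PixiedustStreamingTwitter
demo.setConfig("twitter4j.oauth.consumerKey", twitterConsumerKey)
demo.setConfig("twitter4j.oauth.consumerSecret", twitterConsumerSecret)
demo.setConfig("twitter4j.oauth.accessToken", twitterAccessToken)
demo.setConfig("twitter4j.oauth.accessTokenSecret", twitterAccessTokenSecret)
demo.setConfig("watson.tone.username", taUserName)
demo.setConfig("watson.tone.password", taPassword)
```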

And then in the next cell, call the twitterDemo API:
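
Assuming the package name from the setup.py sketch above:

```python
from pixiedust_twitterdemo import twitterDemo

twitterDemo()
```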

You should see a dialog with a field for specifying word filters in its top-left corner; it displays incoming tweets as tiles as they arrive. Clicking an individual tweet tile reveals the Watson Tone Analyzer scores for that tweet. Click the Start Streaming button, and you should start seeing tweets that match your filters, along with real-time sentiment analysis charts. To go back to the notebook, simply click the Back to Notebook button. The dialog is dismissed, and a new Spark DataFrame called __tweets, containing the collected tweets, is created. Feel free to run the analytics above on this DataFrame.
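
For example, you can hand it straight to display:

```python
# __tweets is bound into the Python namespace when the dialog closes
display(__tweets)
```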

Finally, you can create and customize a real-time analytics environment from the context of a Python Notebook, with the UI elements your line-of-business colleagues typically expect.

Running the application from GitHub

While the application can be run from any Jupyter Notebook environment, the instructions outlined below will use the IBM Data Science Experience (DSX). The first step is to get the Twitter sentiment with Pixiedust notebook into DSX:

  1. Sign into DSX
  2. Create a new project (or select an existing project)
  3. Add a new notebook within the project:

If prompted, select a kernel for the notebook. The notebook should successfully import.

Conclusion

There you have it: the third installment of the Twitter sentiment application is an elegant solution running entirely in a Python Notebook—and it’s easy to install and configure, thanks to the magic of PixieDust.

I currently have no plans for a fourth installment, but if you have an idea you’d like to share, please contact me via email or Twitter. I can’t wait to hear your ideas for improving this application.
