Sentiment Analysis of Twitter Hashtags With Spark
Revisited with PixieDust & Jupyter Notebooks
Why do it again?
This is yet another blog post where I discuss the application I built for running sentiment analysis of Twitter content using Apache Spark™ and Watson Tone Analyzer. Before you quit reading, let me assure you that there is a good reason to revisit this code, and that you will hopefully learn something new.
Let’s first recap the first two installments:
- In Part 1, I built a Spark Streaming application in Scala that I invoked in a Scala Notebook to fetch live Twitter content. I then used a Python Notebook to build analytics on the data.
- In Part 2, I tried to improve the application to update in real time. To do that, I ported the analytics in the Scala Spark Streaming piece, sending the results to a dashboard via MessageHub events (based on Apache Kafka™). The dashboard was a Node.js application that displays the D3 charts (deployed on IBM Bluemix), which were updated continuously from the data received via MessageHub.
Great! But a few things about the application still bothered me:
- The application was accessible to developers, but other users found it hard to deploy. To be fair, configuring the Node.js dashboard, the MessageHub service, the Weather service, the Watson Tone Analyzer service, and Spark Streaming was error prone. You had to ensure that the credentials, Kafka topics, etc. all matched in every place. Pretty hateful no?
- Data scientists lost the flexibility to further analyze the results. Everything was now in black boxes, namely the Spark Streaming Scala app and the Node.js dashboard code.
Can we do better?
For the third version of this app, I had two basic goals:
- Use only one flavor of data science notebook. Asking people to use a Scala Notebook first and then a Python Notebook created a lot of friction in previous versions of the app.
- No deployment and configuration changes should be required for the front-end.
The user story should be simple: a developer, data scientist, or line-of-business user should be able to run the application end-to-end from within a single Python Notebook. Difficult? Yes. Impossible? No.
Now that I’ve piqued your interest, read on!
PixieDust
To achieve my goals I needed three capabilities:
- The ability to run Scala code from within a Python Notebook (the Spark Streaming application from Part 1 requires Scala).
- The ability to install third-party Java packages into a Python Notebook.
- The ability to run a fully functional UI from within a Python Notebook.
Thankfully, PixieDust provides these three capabilities and more. You can learn more about PixieDust on GitHub: https://github.com/ibm-cds-labs/pixiedust. There’s also an intro post: PixieDust: Magic for Your Python Notebook.
You can easily install the PixieDust Python module in your notebook by using the following command:
!pip install --user —-upgrade pixiedust
Running Scala from a Python Notebook
In this section, we’ll recreate the app from Part 1, using only one Python Notebook.
Step 1: Install the Spark Streaming Scala application into your Python Notebook using the installPackage
API.
import pixiedust
jarPath = "https://github.com/ibm-cds-labs/spark.samples/raw/master/dist/streaming-twitter-assembly-1.6.jar"
pixiedust.installPackage(jarPath)
You should see the following output:
Look out for the red message asking you to restart the kernel. There’s more information on the PixieDust package manager.
Step 2: Store your credentials. We’ll reuse them a lot, so it’s best to store them in Python variables. We’ll then use the PixieDust Scala bridge to make Python variables available in Scala code.
Replace the
XXXX
below with your own credentials as explained in Part 1.
twitterConsumerKey = "XXXX"
twitterConsumerSecret = "XXXX"
twitterAccessToken = "XXXX"
twitterAccessTokenSecret = "XXXX"
toneAnalyzerPassword = "XXXX"
toneAnalyzerUserName = "XXXX"
Step 3: Run the Spark Streaming app. We can now use the %%scala
magic to write the code that runs the Spark Streaming application. Notice how we are using the credential variables declared above without the need to declare them explicitly—all thanks to PixieDust variable auto-binding.
%%scala
val demo = com.ibm.cds.spark.samples.StreamingTwitter
demo.setConfig("twitter4j.oauth.consumerKey",twitterConsumerKey)
demo.setConfig("twitter4j.oauth.consumerSecret",twitterConsumerSecret)
demo.setConfig("twitter4j.oauth.accessToken",twitterAccessToken)
demo.setConfig("twitter4j.oauth.accessTokenSecret",twitterAccessTokenSecret)
demo.setConfig("watson.tone.url","https://gateway.watsonplatform.net/tone-analyzer/api")
demo.setConfig("watson.tone.password",toneAnalyzerPassword)
demo.setConfig("watson.tone.username",toneAnalyzerUserName)//Run the Spark streaming for a limited time
import org.apache.spark.streaming._
demo.startTwitterStreaming(sc, Seconds(30))
You should see the following results:
Starting twitter stream
Twitter stream started
Tweets are collected real-time and analyzed
To stop the streaming and start interacting with the data use: StreamingTwitter.stopTwitterStreaming
Receiver Started: TwitterReceiver-0
Batch started with 105 records
Batch completed with 105 records
Batch started with 246 records
Stopping Twitter stream. Please wait this may take a while
Receiver Stopped: TwitterReceiver-0
Reason: : Stopped by driver
Batch completed with 246 records
Twitter stream stopped
You can now create a sqlContext and DataFrame with 24 Tweets created. Sample usage:
val (sqlContext, df) = com.ibm.cds.spark.samples.StreamingTwitter.createTwitterDataFrames(sc)
df.printSchema
sqlContext.sql("select author, text from tweets").show
Here’s more information on the PixieDust Scala bridge.
Step 4: Collect tweets enriched with Tone Analyzer scores and move them into a Spark DataFrame. Wait until Spark Streaming has finished and run the following cell:
%%scala
val demo = com.ibm.cds.spark.samples.StreamingTwitter
val (__sqlContext, __df) = demo.createTwitterDataFrames(sc)
In this cell, we again use Scala to call the createTwitterDataFrames
API. Notice how we add special characters __
to each of the variables. The underscores tell the Scala bridge that these variables need to be bound as Python variables. We’ll use them again to do some data science.
You should see the following results:
A new table named tweets with 24 records has been correctly created and can be accessed through the SQLContext variable
Here's the schema for tweets
root
|-- author: string (nullable = true)
|-- userid: string (nullable = true)
|-- date: string (nullable = true)
|-- lang: string (nullable = true)
|-- text: string (nullable = true)
|-- lat: double (nullable = true)
|-- long: double (nullable = true)
|-- Anger: double (nullable = true)
|-- Disgust: double (nullable = true)
|-- Fear: double (nullable = true)
|-- Joy: double (nullable = true)
|-- Sadness: double (nullable = true)
|-- Analytical: double (nullable = true)
|-- Confident: double (nullable = true)
|-- Tentative: double (nullable = true)
|-- Openness: double (nullable = true)
|-- Conscientiousness: double (nullable = true)
|-- Extraversion: double (nullable = true)
|-- Agreeableness: double (nullable = true)
|-- EmotionalRange: double (nullable = true)
Step 5: Do some data science. (See? I told you.) We are now able to run the same analytics from the Python Notebook in Part 1:
tweets=__df
tweets.count()
display(tweets)
Here we use the PixieDust
display
API to explore the data. The PixieDust GitHub wiki again has everything you need to know aboutdisplay
.
You should see the following results:
You can optionally run the other analytics from Part 1. They are pretty much the same.
What about the line-of-business user?
The previous section represents a great improvement over Part 1, as we now can run the application end-to-end from within a single Python Notebook. However, we still have a lot of code and syntax to deal with, and we don’t yet have real-time analytics.
In this section, we’ll show how to create a PixieDust embedded UI with real-time analytics.
Step 1: Create a PixieDust plugin. Create a new Github repo and add a setup.py file as follows:
from setuptools import setup
setup(name='pixiedust_twitterdemo',
version='0.3',
description='Pixiedust demo of the Twitter Sentiment Analysis tutorials',
url='https://github.com/ibm-cds-labs/pixiedust_incubator/tree/master/twitterdemo',
install_requires=['pixiedust'],
author='David Taieb',
author_email='david_taieb@us.ibm.com',
license='Apache 2.0',
packages=['pixiedust_twitterdemo'],
include_package_data=True,
zip_safe=False)
See it in context on GitHub in the pixiedust_incubator/twitterdemo repo.
Step 2: Create your controller class. The controller class tells PixieDust when to trigger the display class. It must inherit from DisplayHandlerMeta found in display.py. Then, in init.py:
from pixiedust.display import *
...class PixieDustTwitterDemoPluginMeta(DisplayHandlerMeta): @addId
def getMenuInfo(self,entity):
if entity==self.__class__:
return [{"id": "twitterdemo"}]
else:
return [] def newDisplayHandler(self,options,entity):
return PixieDustTwitterDemo(options,entity)
Step 3: Create the display class. This class contains the logic for processing and displaying the results. It must inherit from the Display class found display.py. Then, in twitterDemo.py:
class PixieDustTwitterDemo(Display):
...
def doRender(self, handlerId):
self.addProfilingTime = False
stream = self.options.get("stream") if stream is None:
self._addScriptElement("https://d3js.org/d3.v3.js", checkJSVar="d3",
callback=[self.renderTemplate("demoPieChart.js"), self.renderTemplate("demoGroupedChart.js")]
)
self._addHTMLTemplate("demoScript.html")
self._addHTMLTemplate("demo.html") elif stream is True or str(stream).lower() == 'true':
self.startStream() elif stream is False or str(stream).lower() == 'false':
self.stopStream()
def genStartStreamingExecuteCode(self):
return self.renderTemplate("startStreaming.execute",
channel = StreamingChannel.__module__ + "." + StreamingChannel.__name__,
receiver = "com.ibm.cds.spark.samples.PixiedustStreamingTwitter$",
scalaCode = "val demo = com.ibm.cds.spark.samples.PixiedustStreamingTwitter;demo.startStreaming();print(\\\"done\\\")")
There’s is a lot happening in this example. If you’re interested in diving in deeper, we’ll publish a hello-world example in January to help you follow along.
Let’s look at a few key points: the doRender method is called by the framework to process and render the visualization. The class is passed the following objects:
- self.entity: the entity containing the data being displayed (not used above, but available for use)
- self.options: dictionary of state variables
- self._addHTMLTemplate: helper method that takes a path to a jinja2 template; the template is guaranteed to be passed a few variables like entity, prefix, etc.
- self._addScriptElement: helper method that lets you insert a JavaScript file into the browser client
I’m glossing over many details in this article. I encourage you to study the code to understand how I’m creating a dialog box for the UI from within the demo.html template—specifically, how I use the basedialog.html macro.
Step 4: Create your simple API. We are now ready to create a simple wrapper API that the user can just call from a notebook. In init.py:
def twitterDemo():
display(PixieDustTwitterDemoPluginMeta)
The twitterDemo method is wrapping a call to display, passing data that is specific to this plugin.
Step 5: Use our new application in a notebook. First, set the credentials again, this time for the PixiedustStreamingTwitter class:
%%scala
val demo = com.ibm.cds.spark.samples.PixiedustStreamingTwitter
demo.setConfig("twitter4j.oauth.consumerKey",twitterConsumerKey)
demo.setConfig("twitter4j.oauth.consumerSecret",twitterConsumerSecret)
demo.setConfig("twitter4j.oauth.accessToken",twitterAccessToken)
demo.setConfig("twitter4j.oauth.accessTokenSecret",twitterAccessTokenSecret)
demo.setConfig("watson.tone.url","https://gateway.watsonplatform.net/tone-analyzer/api")
demo.setConfig("watson.tone.password",toneAnalyzerPassword)
demo.setConfig("watson.tone.username",toneAnalyzerUserName)
demo.setConfig("checkpointDir", System.getProperty("user.home") + "/pixiedust/ssc")
And then in the next cell, call the twitterDemo API:
from pixiedust_twitterdemo import *
twitterDemo()
You should see a dialog that lets you specify word filters in the top left corner. It also shows you the tweets in tiles as they arrive. Clicking on an individual tweet tile will reveal the Watson Tone Analyzer scores for that tweet. Click the Start Streaming button, and you should start seeing tweets that match your filters. The UI also displays real-time sentiment analysis charts. To go back to the notebook, simply click the Back to Notebook button. The dialog is dismissed, and a new Spark DataFrame called __tweets
containing the collected tweets is created. Feel free to run the analytics above on this DataFrame.
Finally, you can create and customize a real-time analytics environment from the context of a Python Notebook, with the UI elements your line-of-business colleagues typically expect.
Running the application from GitHub
While the application can be run from any Jupyter Notebook environment, the instructions outlined below will use the IBM Data Science Experience (DSX). The first step is to get the Twitter sentiment with Pixiedust notebook into DSX:
- Sign into DSX
- Create a new project (or select an existing project)
- Add a new notebook within the project:
- Click add notebook
- Choose From URL
- Enter notebook name
- Enter the notebook URL: https://raw.githubusercontent.com/ibm-cds-labs/pixiedust_incubator/master/twitterdemo/notebook/Twitter%20Sentiment%20with%20Watson%20and%20PixieDust.ipynb
- Select the Spark Service
- Click Create Notebook
If prompted, select a kernel for the notebook. The notebook should successfully import.
Conclusion
There you have it: the third installment of the Twitter sentiment application is an elegant solution running entirely in a Python Notebook—and it’s easy to install and configure, thanks to the magic of PixieDust.
I currently have no plan for a fourth installment, but if you have an idea you’d like to share, please contact me via email or Twitter. I can’t wait to hear the many good ideas to improve this application.