The Most Popular Search Term at SXSW, According to Our Chatbot

Analyzing conversation data using Spark, Jupyter, and PixieDust

--

A few weeks ago, Raj Singh and I demonstrated a chatbot called Cognitive Event Finder at IBM’s installation at SXSW. The chatbot let users search for tech sessions, music gigs, or film screenings from our mobile-optimized web app, or via SMS using Twilio. We’re still running the demo app, and we also wrote up its architecture. Of course, it’s also on GitHub.

In this article, I’ll look at how we analyzed conversations with our bot. I’ll also share the Jupyter notebook Raj and I worked on, which should give you ideas on how to connect application logs to Apache Spark™ for back-end analytics. The notebook takes you from data access all the way through to visualizing the results.

The chatbot in action. We had about 500 users, with a bias toward tech as part of SXSW Interactive.

Logging conversations

We used Cloudant to log conversations, including every node of the dialog tree traversed in the Watson Conversation service. We captured this data for several reasons:

  1. To allow users to recall previous searches. (At any point a user can say “show me my recent searches,” and the chatbot pulls them from Cloudant and displays them to the user.)
  2. To analyze conversations to see what type of experience users were having with the chatbot. (Answer: mixed, but that’s a story for another day.)
  3. To compile fun stats!
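As a sketch of how that first point might work with the python-cloudant client library (the database name, document fields, and index are illustrative assumptions, not the app's actual schema):

```python
def recent_searches(client, user_id, limit=5):
    """Fetch a user's most recent search terms from Cloudant.

    A minimal sketch using a python-cloudant database handle; the
    database name and field names here are invented for illustration.
    """
    db = client["conversations"]  # hypothetical database name
    # Cloudant Query: select this user's conversations, newest first.
    # Assumes a JSON index exists covering user_id and timestamp.
    result = db.get_query_result(
        {"user_id": user_id},
        fields=["search_term", "timestamp"],
        sort=[{"timestamp": "desc"}],
        limit=limit,
    )
    return [row["search_term"] for row in result]
```

When the user says “show me my recent searches,” the bot would call something like this and format the returned terms as a reply.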

I’m here to talk about #3. Raj and I built a Jupyter notebook that you can access in the IBM Data Science Experience (DSX), which wraps up Spark with notebooks, object storage, and other handy features. Here’s the basic flow of the notebook:

  1. Access Cloudant data directly from DSX.
  2. Transform the Cloudant JSON into the relational, table-like structures Spark needs to work with the data efficiently.
  3. Use PySpark operations to shape the data for analysis.
  4. Visualize application usage patterns via the PixieDust Python helper library for Spark notebooks.
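Step 1, reading Cloudant directly into a Spark DataFrame, might look roughly like this in a DSX notebook. The connector format string and option keys below reflect the Cloudant Spark connector available in DSX-era notebooks; treat the exact names as assumptions and check the notebook for the real calls.

```python
def load_conversations(host, username, password, db="conversations"):
    """Sketch: read a Cloudant database into a Spark DataFrame.

    Assumes the Cloudant Spark connector is on the classpath (it is
    pre-installed in DSX notebooks); parameter names are illustrative.
    """
    from pyspark.sql import SparkSession  # available in a Spark notebook

    spark = SparkSession.builder.getOrCreate()
    return (
        spark.read.format("com.cloudant.spark")
        .option("cloudant.host", host)
        .option("cloudant.username", username)
        .option("cloudant.password", password)
        .load(db)
    )
```

From there, the returned DataFrame supports the usual PySpark operations used in steps 2 and 3.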

In my case, the usage pattern I care about the most is popular search terms.

There are no data science notebooks in the movie “The Notebook,” but if there were, it would be a more exciting film.

The Notebook (2004)

Again, here’s the notebook on IBM’s Data Science Experience. If you plan on doing anything more than just viewing it, you’ll want to download it via the icon in the menu bar.

When you walk through its cells, you will see how we import data from Cloudant. Here’s the basic structure of our JSON documents. Each one represents a single conversation between the user and the chatbot:

This user wanted to check out brass bands at SXSW.
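The screenshot above shows the raw JSON. As a hedged illustration of the shape of such a document (the field names here are invented for this sketch, not our exact schema), a logged conversation might look like:

```python
# An illustrative conversation document (field names are hypothetical):
conversation = {
    "_id": "conv-001",
    "user": "sms-user-42",
    "search_term": "brass bands",
    "dialog_nodes": [            # one entry per node traversed in
        {"name": "greeting"},    # the Watson Conversation dialog tree
        {"name": "search_music"},
    ],
}

# Each document captures the whole exchange, so pulling the search
# term back out is a simple lookup:
assert conversation["search_term"] == "brass bands"
```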

In the notebook, you’ll see how we flatten this data for analysis. In this case, we transform this one document into two rows — one row for each node in the dialog tree:

The relational representation of our Cloudant JSON. It’s a Spark SQL DataFrame, visualized by PixieDust as a simple table.
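In the notebook the flattening is done with Spark, but the transformation itself is easy to sketch in plain Python, assuming the hypothetical document shape from the illustration above:

```python
def flatten(doc):
    """Turn one conversation document into one row per dialog node,
    mirroring the table shown above. Field names are illustrative
    assumptions, not the app's exact schema."""
    return [
        {
            "conversation_id": doc["_id"],
            "search_term": doc["search_term"],
            "dialog_node": node["name"],
        }
        for node in doc["dialog_nodes"]
    ]

doc = {
    "_id": "conv-001",
    "search_term": "brass bands",
    "dialog_nodes": [{"name": "greeting"}, {"name": "search_music"}],
}
rows = flatten(doc)  # two rows, one per traversed dialog node
```

One nested document in, two flat rows out, which is exactly the table-like structure Spark SQL wants.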

Ultimately, the notebook shows how we found the most popular tech event search. (It was “AI”.) You’ll also see how we used PixieDust to visualize a range of popular search terms in a single line of code:

A pie chart, courtesy of the display() API in PixieDust.
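The aggregation behind that chart is just a group-and-count. In plain Python terms, with an invented sample of search terms standing in for the real logs:

```python
from collections import Counter

# Invented sample of logged search terms (not our real data):
searches = ["AI", "VR", "AI", "blockchain", "AI", "VR"]

counts = Counter(searches)
top_term, top_count = counts.most_common(1)[0]
print(top_term, top_count)  # -> AI 3

# In the notebook, the equivalent Spark DataFrame of counts is passed
# straight to PixieDust, which renders the pie chart in one line:
#     display(search_counts_df)
```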

Backing into the end

That’s the overview of the back end of the chatbot I built with Raj. We hope it shows how you can quickly design a plan for monitoring and analyzing application usage, all using cloud-based persistence and analytics services.

So to conclude: feel free to play with the example we set up by uploading the notebook into your own DSX account. What other interesting trends can you find? And if you’ve enjoyed this article, please remember to recommend it on Medium using the ♡ here.
