Analyzing Zendesk Data in Python Notebook (in Mode!)

Yevgeniy S. Meyer
May 23, 2016 · 5 min read


At RJMetrics, we use a variety of tools for our analytics and data science needs. We’ve explored data in standalone SQL clients, created reports in Mode, and wrangled data and built predictive models in Python and R.

Recently we’ve done more and more of our work using Mode and Jupyter Notebook. Jupyter runs directly in your browser and allows you to create and share documents with live code in over 40 languages, including Python and R. In fact, we recently wrote a post about Setting up Your Analytics Stack with Jupyter Notebook & AWS Redshift.

We used to create our initial analyses in Mode, then jump into Jupyter to dive deeper into the data and carry out more sophisticated analyses that are either cumbersome or plain impossible to do in SQL alone. Unfortunately, it’s difficult to share these notebooks and collaborate with people who don’t have Jupyter (and Python) already set up on their machines.

That’s why I was extremely excited when Mode announced the addition of Python Notebooks to their platform. Now, my team can keep all the queries used to get the data, any subsequent analysis in Python, and all of the charts in one place, for anyone in the organization to audit and contribute to.

In this post, we’ll demonstrate how Mode’s Python Notebook feature allows you to significantly cut down the amount of time needed to get from raw data to an insight, as compared to using just SQL.

We’ll be working with demo data from Zendesk. Zendesk is a cloud-based customer service platform that collects an incredible amount of data about your customer experiences that you absolutely should be analyzing. To get Zendesk data into your private data warehouse, you can use a tool like RJMetrics Pipeline, which connects to many popular SaaS tools and databases and streams your data to Amazon Redshift. Once your Zendesk data is in Redshift, you can use an analytics platform like Mode to analyze it.

With the demo Zendesk data in our Redshift instance, the simple query below returns submission information about support tickets: who submitted the ticket, when (the year, in this case), and via which channel.

select
  id as ticket_id,
  date_part('year', created_at) as year,
  submitter_id,
  via__channel
from zendesk.tickets

If we run this query in Mode, the results are immediately converted into a dataframe and made available inside the Python Notebook in the list variable “datasets”. We’ll go ahead and grab that dataframe from the list (which may contain the results of other queries, hence the 0 index).

# get the data from the query
tickets = datasets[0]
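
Each item in datasets is a pandas DataFrame, so standard pandas methods work on it right away; for example, a quick sanity check:

# peek at the first few rows to confirm the query returned what we expected
tickets.head()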

Now, if we wanted to visualize the number of tickets as a function of time, we could simply write a single line of code and get the plot:

tickets.groupby('year').size().plot();
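
If you want labeled axes before sharing the chart, you can go through matplotlib directly. This is a minimal sketch, assuming matplotlib is importable in the Mode environment (pandas plotting is built on it):

import matplotlib.pyplot as plt

# same one-liner, but keep a handle on the axes so we can label them
ax = tickets.groupby('year').size().plot()
ax.set_xlabel('year')
ax.set_ylabel('number of tickets')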

This could have been accomplished in SQL by augmenting the original query with a “count(*)” aggregate and a “group by” clause. However, the simple example above demonstrates how much more compact and readable the analysis can be in Python.

Pivoting data

What if we wanted to look at the breakdown of tickets across different channels? This would be much more difficult to implement in SQL, especially if your database does not support pivot functions (Amazon Redshift, for example, does not). In Python, however, we can do it all in just one line:

# pandas may already be loaded in the notebook; importing it is safe either way
import pandas as pd

pd.pivot_table(tickets, index='year', columns=['via__channel'],
               aggfunc='size', fill_value=0)

And simply chaining .plot() onto the end automatically generates a plot of the pivot table:
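
pd.pivot_table(tickets, index='year', columns=['via__channel'],
               aggfunc='size', fill_value=0).plot();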

Fitting a linear regression model

Say we wanted to determine the average rate of ticket growth for our email and web channels. Fitting a linear regression model in SQL would be extremely time-consuming, if not impossible. However, doing this in Python takes only a few lines of code.

First, let’s get all of the tickets from the web and email channels and count them for every year on record.

# get tickets from the web and email channels
email_tickets = tickets[(tickets.via__channel == 'email') |
                        (tickets.via__channel == 'web')]

# count those tickets for every year on record
df = pd.DataFrame({'num_tickets':
                   email_tickets.groupby('year').size()}).reset_index()

# encode the years as consecutive integers for the regression
df['year_num'] = df.index

Now, fitting a linear model is as simple as importing the statsmodels package and typing:

import statsmodels.formula.api as smf

# fit num_tickets as a linear function of year_num
lm = smf.ols(formula='num_tickets ~ year_num', data=df).fit()

To see the details of the fit, we can simply write “print(lm.summary())”.
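
The coefficient on year_num in that summary is the average yearly growth in ticket volume. To pull just that number out of the fitted model (params is a standard attribute on statsmodels results):

# the slope on year_num is the average yearly change in ticket count
print(lm.params['year_num'])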

And to visualize the data along with the fitted line, we can use seaborn’s regplot function.
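
A minimal call, assuming seaborn is among the packages available in the notebook:

import seaborn as sns

# scatter the yearly ticket counts and overlay the fitted regression line
sns.regplot(x='year_num', y='num_tickets', data=df)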

What this means

Being able to go from an initial query to in-depth, powerful analysis within the same platform is incredibly useful for any data scientist. The possibilities are exciting, to say the least, when a tool gives you this level of auditing and collaboration capability alongside visualization tools and team-wide transparency.

I hope the examples above show just how much faster a user can go from asking a question to producing an answer. The team at Mode has done an incredible job of seamlessly integrating the SQL editor and the Python notebook, and I’m genuinely optimistic about the analyses my own team can now run, without having to worry about who has what set up on their computer.

While the Notebook functionality is still a little rough around the edges (the currently available packages are somewhat limited, cells can’t be moved around the notebook, and LaTeX is not yet supported), I’m confident the team at Mode will iterate quickly, and I really look forward to using it in my own workflow.

If you’d like to perform the above analysis on your own data set, just sign up for RJMetrics Pipeline to quickly sync your Zendesk data with Redshift, and make it available immediately in Mode.
