I Am Not a Data Scientist

But I play one in this blog post, thanks to PixieDust

At a recent All Hands, I shared some thoughts about platforms and notebooks. If you weren’t there, you didn’t miss much. The only takeaway — and takeaway is probably generous — was this Venn diagram:

Readers may notice that there’s an idea lurking in the footnote at the bottom of this diagram. The idea is that notebooks, considered by most to be the domain of the data scientist, have a real shot at helping teams of all types who are working on data problems.

I’m happy with the colors, but to bring this idea to life, we’ll need more than a Venn diagram, amirite?

Enter PixieDust.

Notebooks for Everyone

PixieDust is a helper library for Python notebooks. It makes working with data simpler.

With PixieDust, I can do this in a notebook…

import pixiedust

# load a CSV with pixiedust.sampleData()
df = pixiedust.sampleData("https://github.com/ibm-cds-labs/open-data/raw/master/cars/cars.csv")

# display the data with pixiedust
display(df)

Instead of doing all this…

from pyspark.sql.types import DecimalType
import matplotlib.pyplot as plt
from matplotlib import cm
import math

# Load the CSV; this assumes the file is already downloaded to a local file system
path = "/path/to/my/csv"
df3 = sqlContext.read.format('com.databricks.spark.csv')\
    .options(header='true', mode="DROPMALFORMED", inferschema='true').load(path)

maxRows = 100

def toPandas(workingDF):
    decimals = []
    for f in workingDF.schema.fields:
        if f.dataType.__class__ == DecimalType:
            decimals.append(f.name)
    pdf = workingDF.toPandas()
    for y in pdf.columns:
        if pdf[y].dtype.name == "object" and y in decimals:
            # Spark converts DecimalType to object during toPandas; cast it as float
            pdf[y] = pdf[y].astype(float)
    return pdf

xFields = ["horsepower"]
yFields = ["mpg"]
workingDF = df3.select(xFields + yFields)
workingDF = workingDF.dropna()
count = workingDF.count()
if count > maxRows:
    workingDF = workingDF.sample(False, float(maxRows) / float(count))
pdf = toPandas(workingDF)

# sort by xFields
pdf.sort_values(xFields, inplace=True)

fig, ax = plt.subplots(figsize=(int(1000 / 96), int(750 / 96)))
for i, keyField in enumerate(xFields):
    pdf.plot(kind='scatter', x=keyField, y=yFields[0], label=keyField, ax=ax,
             color=cm.jet(1. * i / len(xFields)))

# Configure the legend
if ax.get_legend() is not None and (ax.title is None or not ax.title.get_visible() or ax.title.get_text() == ''):
    numLabels = len(ax.get_legend_handles_labels()[1])
    nCol = int(min(max(math.sqrt(numLabels), 3), 6))
    nRows = int(numLabels / nCol)
    bboxPos = max(1.15, 1.0 + ((float(nRows) / 2) / 10.0))
    ax.legend(loc='upper center', bbox_to_anchor=(0.5, bboxPos), ncol=nCol, fancybox=True, shadow=True)

# Configure the x-axis ticks
labels = [s.get_text() for s in ax.get_xticklabels()]
totalWidth = sum(len(s) for s in labels) * 5
if totalWidth > 1000:
    # filter the list down to at most 20 ticks
    xl = [(i, a) for i, a in enumerate(labels) if i % int(len(labels) / 20) == 0]
    ax.set_xticks([x[0] for x in xl])
    ax.set_xticklabels([x[1] for x in xl])

plt.xticks(rotation=30)
plt.show()
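One subtle step buried in all that boilerplate is the Decimal-to-float cast: Spark's toPandas() hands DecimalType columns back as generic Python objects, which plotting libraries can't treat as numbers. Here's that workaround in isolation (a minimal sketch using plain pandas, with made-up sample values; no Spark or PixieDust required):

```python
from decimal import Decimal
import pandas as pd

# Simulate what toPandas() produces for a DecimalType column:
# the values arrive as Decimal objects, so the column dtype is "object"
pdf = pd.DataFrame({"mpg": [Decimal("18.0"), Decimal("26.5"), Decimal("33.9")]})
print(pdf["mpg"].dtype.name)  # object

# Cast to float so Matplotlib can plot the column numerically
pdf["mpg"] = pdf["mpg"].astype(float)
print(pdf["mpg"].dtype.name)  # float64
```

PixieDust's display() does this kind of housekeeping for you, which is exactly the point.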

To get this…

A scatterplot! No code! With options and controls I can use!

That’s data I can explore!

Stepping Through the Benefits

With PixieDust, I can[1]

  1. Visualize my data, without having to RTFM and trial-and-error Matplotlib (or other renderers)
  2. Explore my data in an embedded interface, and switch between renderers (e.g., Matplotlib, Bokeh, Seaborn)
  3. Use Spark, without having to RTFM Spark
  4. Do all of those things, none of which I'd ever done before (not even once), and then share them with people, which I'm doing now!

With PixieDust, data scientists and data engineers can

  • Use Python and Scala in the same notebook
  • Share variables between Scala and Python
  • Access Spark libraries written in Scala from Python notebooks
  • Access Python visualizations from Scala notebooks
  • Use any other tools they like, e.g., hard-coded Matplotlib, Bokeh, etc.
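In notebook terms, that polyglot workflow looks roughly like this (a sketch, assuming PixieDust's %%scala cell magic is enabled; check the PixieDust docs for the exact variable-sharing rules):

```
# Python cell
year = 2017

%%scala
// Scala cell in the same notebook: Python variables like `year`
// are visible here, per PixieDust's Scala bridge
println(year)
```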

Now, people with varied skills and skill levels — even people like me — can use and share notebooks, and collaborate.

But don’t just take my word for it. Ben Hudson, an offering manager on the dashDB team, said this about PixieDust:

I wanted an easy way to map out some geographical data I added to the dataset, but all the Python tutorials I had come across were too complex for my needs, so PixieDust was perfect for me.
Instead of having to import a ton of packages and try to reverse-engineer code from an online tutorial, I only had to do a few clicks to generate a really nice map using PixieDust. PixieDust also made general graphing tasks a lot easier (no need for matplotlib) and it was really straightforward to use in general.

(Ben’s even started logging PixieDust issues on GitHub. Thanks, Ben!)

Use PixieDust

You have a couple of options.

IBM Data Science Experience (DSX): Check out the PixieDust intro notebook on DSX to see PixieDust in action. To play with this notebook in DSX, follow these steps to bring the notebook into your account:

  1. Sign In to DSX
  2. Click the Copy icon[2]
  3. Choose an existing Project to add it to
  4. Click Create Notebook

Jupyter Notebooks: If you’re comfortable on the command line, you can run PixieDust inside Jupyter Notebooks on your laptop, too. The PixieDust installation guide has you covered for an easy install, and takes care of configuration and all the dependencies at once (e.g., installs Spark, Scala, the Cloudant-Spark connector, and a few sample notebooks).
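If you'd rather see the commands than the guide, the install looked like this at the time of writing (hedged sketch; the installation guide is the source of truth):

```shell
# install the PixieDust library
pip install pixiedust

# set up a PixieDust (Spark) kernel for Jupyter
jupyter pixiedust install
```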

Further Reading

Footnotes

[1] I am not a data scientist. Not even close.

[2] We maintain a copy of this DSX-specific notebook at https://github.com/ibm-cds-labs/pixiedust/raw/master/notebook/DSX/Welcome%20to%20PixieDust.ipynb