Python for Data Science & Machine Learning

Gabriel Cypriano
Published in Python Pandemonium
3 min read · May 4, 2017

This past weekend I gave a talk at Python Conference Espirito Santo 2017. Since it was a Python event, I decided to talk about the Python tools I often use in my projects.

Python has 75% more Data Science and Machine Learning job openings than R on Indeed.com.

You can find the raw data I used to calculate that here. As Python solidifies its position as the best programming tool to tackle Data Science and Machine Learning, it comes as no surprise that half the talks at the conference were about these topics.

We started getting our hands dirty by taking a look at Jupyter Notebook, a pretty popular web app for mixing and matching code along with text and plots that I use in most of my projects. Here is an excerpt of the notebook I walked through during the talk:

To make things more fun, we set out to use the Titanic dataset that’s available on Kaggle. It has data about the passengers aboard the Titanic, the largest ship afloat in 1912, which sank after colliding with an iceberg.

We used Pandas to load the dataset and to do some basic statistics like comparing female vs male survival rates, as well as the survival rate of passengers with vs without family members aboard. Python and Pandas make that really easy. I also explained how Pandas uses NumPy to vectorize its code and make it more efficient.
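As a rough illustration of the comparisons described above, here is a minimal sketch. It is not the notebook from the talk: the rows are toy values invented for the example, and the column names (`Sex`, `Survived`, `SibSp`, `Parch`) are assumed to match Kaggle's Titanic training file, which you would normally load with `pd.read_csv`.

```python
import pandas as pd

# Toy rows invented for illustration only; with the real dataset you would
# do something like df = pd.read_csv("train.csv") (file name is an assumption).
df = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Survived": [1, 1, 0, 0, 1, 0],
    "SibSp": [1, 0, 0, 1, 0, 0],   # siblings/spouses aboard
    "Parch": [0, 0, 0, 2, 0, 1],   # parents/children aboard
})

# Survival rate by sex: the mean of a 0/1 column is the share of survivors.
rate_by_sex = df.groupby("Sex")["Survived"].mean()

# Column arithmetic runs as one vectorized NumPy operation, with no Python loop.
has_family = (df["SibSp"] + df["Parch"]) > 0
df["Family"] = has_family.map({True: "with family", False: "alone"})

# Survival rate for passengers with vs without family members aboard.
rate_by_family = df.groupby("Family")["Survived"].mean()

print(rate_by_sex)
print(rate_by_family)
```

The `groupby(...).mean()` pattern works for any 0/1 outcome column, which is why these comparisons take so little code.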

We had some fun plotting the age distribution among all passengers with Matplotlib and among survivors with Seaborn. Matplotlib’s new version has come a long way, but I think most people agreed that adding Seaborn makes plots look much more modern. Here is a pure Matplotlib plot alongside a Seaborn one:

Pure Matplotlib on the left and Seaborn on the right.

I also wanted the audience to see how easy it is to do statistical inference with SciPy’s stats sub-package. I used the following code to show that, yes, at the 5% significance level, fares paid by survivors were higher than fares paid by non-survivors:
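A test of this kind might look like the sketch below. It assumes a one-sided Welch's t-test, which may not be the exact test used in the talk, and it runs on synthetic fares generated for the example rather than on the Titanic data.

```python
import numpy as np
from scipy import stats

# Synthetic fares for illustration only; with the real data these would be
# the fare values of survivors and of non-survivors.
rng = np.random.default_rng(42)
fares_survived = rng.lognormal(mean=3.6, sigma=0.8, size=300)
fares_died = rng.lognormal(mean=3.0, sigma=0.8, size=500)

# Welch's t-test: does not assume the two groups have equal variances.
t_stat, p_two_sided = stats.ttest_ind(fares_survived, fares_died, equal_var=False)

# Convert the two-sided p-value into a one-sided one for the alternative
# hypothesis "survivors paid more on average".
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2

alpha = 0.05  # 5% significance level, i.e. 95% confidence
reject_null = p_one_sided < alpha

print(f"t = {t_stat:.2f}, one-sided p = {p_one_sided:.4g}, reject H0: {reject_null}")
```

Fares are heavily right-skewed, so on real data a rank-based test such as `stats.mannwhitneyu` is a reasonable alternative to compare against.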

For the Machine Learning part of the talk I used scikit-learn to show that we could translate relatively complex Mathematical models into a few lines of Python code. The following snippet shows how we built a Machine Learning model to classify a passenger as survivor or not, as well as to calculate its accuracy on out-of-sample data:
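For a sense of how short such a model can be, here is a minimal sketch. It assumes a logistic regression classifier, which may differ from the model used in the talk, and it trains on synthetic passengers generated for the example. The two features, `is_female` and `fare`, are hypothetical stand-ins for the real columns.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic passengers for illustration only (not the Titanic data).
rng = np.random.default_rng(0)
n = 400
is_female = rng.integers(0, 2, size=n)
fare = rng.lognormal(mean=3.0, sigma=0.8, size=n)

# Made-up rule: women and higher-fare passengers survive more often.
logit = -1.5 + 2.5 * is_female + 0.01 * fare
survived = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = np.column_stack([is_female, fare])
y = survived

# Hold out 25% of the rows so accuracy is measured on out-of-sample data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Out-of-sample accuracy: {accuracy:.2f}")
```

Swapping in another scikit-learn classifier only means changing the line that constructs `model`, since estimators share the same `fit` and `predict` interface.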

If you would like to see more of the code or maybe fork the repository you will be able to find it here (warning: text will be in Portuguese):

And for those of you who speak Portuguese here are the slides:

Disclaimer: I am a Teaching Assistant at K2 Data Science. Our Data Science bootcamp covers all that and much more!
