PyData Amsterdam 2016

Aart Goossens
Published in Elements blog
Mar 15, 2016

PyData is a conference about data analysis using Python. Every aspect of this, from database implementations to ultra-fast matrix manipulation to data visualization, is covered. This weekend (March 12–13) PyData Amsterdam 2016 was held.

As many of you already know, the talks at a conference can be really interesting and useful, but they are not the most important part of a conference: the free stickers are. So, to start with that:

First day

One of the first talks was given by Friso van Vollenhoven, the CTO of GoDataDriven, and was about big data analysis with Meetup data. As a real data nerd he set up a Jenkins server to regularly pull all Meetup data for several cities from their public API, and was able to identify and distinguish certain user groups within the tech community and see how they were related. He posted some of his data online in a blog entry.

The next talk, by Ben Fields, really triggered my interest, because I like unconventional analysis of seemingly uninteresting data. The title of this talk was Do Angry People Have Poor Grammar? Spoiler alert: the answer is no. But the road to this answer was interesting: the speaker used a gigantic database of 59 million (!) Reddit comments and boiled those down to 390k usable comments. He applied some clever Natural Language Processing (NLP): a tool called VADER (Valence Aware Dictionary and sEntiment Reasoner) to analyse the sentiment of each post, and Context Free Grammar parsing to assess its grammatical correctness.
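VADER itself is a lexicon- and rule-based sentiment tool; the analysis from the talk is far too big for a blog snippet, but the core lexicon idea behind it can be sketched in a few lines of plain Python (the mini-lexicon and the averaging rule below are invented for illustration):

```python
# Toy sketch of lexicon-based sentiment scoring, the core idea behind
# VADER. The mini-lexicon and averaging rule are invented; real VADER
# uses ~7500 human-rated terms plus rules for negation, punctuation
# and capitalization.
LEXICON = {"love": 3.0, "great": 2.5, "good": 1.5,
           "bad": -1.5, "awful": -2.5, "hate": -3.0}

def sentiment(text):
    """Average the valence of known words; 0.0 when none are known."""
    scores = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(sentiment("i love this great subreddit"))  # positive score
print(sentiment("what an awful bad take"))       # negative score
```

The real tool is a drop-in replacement for this sketch, returning negative/neutral/positive/compound scores per text.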

After that I went to listen to an excellent tutorial by Andreas Müller about the machine learning library scikit-learn and the current developments of this open source project. He used IMDB reviews to build a model that predicts a review's rating from the words it contains, and was able to do this in a clear and understandable way.
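The tutorial itself is best watched on video, but the bag-of-words approach he demonstrated looks roughly like this in scikit-learn (the four toy reviews below stand in for the real IMDB data):

```python
# Minimal bag-of-words review classifier: count word occurrences,
# then fit a linear model on the counts. The training data here is
# made up; the tutorial used the full IMDB review set.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = ["a wonderful, moving film", "great acting and story",
           "dull and predictable", "a terrible waste of time"]
labels = [1, 1, 0, 0]  # 1 = positive review, 0 = negative

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(reviews, labels)

print(model.predict(["what a wonderful story"]))
```

The pipeline makes the vectorizer and the classifier a single object, so the same `fit`/`predict` interface works on raw text.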

The final block of the first day included an impressive talk by Sergii Khomenko, a DevOps engineer at Stylight, a German e-commerce company in clothing. He managed to squeeze every aspect of his work, from doing analysis in Python to choosing a deployment method, into a 45-minute talk. It was interesting to see how a company in a market I did not expect was doing serious machine learning, and how it chose a variety of deployment options ranging from Docker locally to AWS in the cloud.

The final talk of the day was about data visualization, by Mark Grundland. Most of the day's talks were quite technical, but Mark touched on the psychology of how people see, and want to see, information based on data. His most important message was that data analysts should design their visualizations starting from the expectation of the viewer or client, and not just plot all the data.

With that great talk the first day was finished and only beer and burgers were left.

Second day

Sunday was off to a good start with the keynote presentation by Peadar Coyle titled The PyData Stack State of the Union. He gave an excellent overview of the current state of some of the lesser known but very good libraries available to the Python data analyst, including but not limited to scikit-learn, PyMC3, Jupyter and NumPy.

The only tutorial on Sunday was given by Giovanni Lanzani and was about Pandas, a data analysis library that makes all the speed of NumPy very easily accessible. I always thought I would want to manually take care of importing, structuring and labeling my data, but Pandas makes this effortless. So if you ever need to analyse large amounts of data stored in arrays, you should definitely give Pandas a try. And one more reason to use Pandas: it makes using Matplotlib suck less.
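To give a taste of that effortlessness, here is a small, hypothetical example: labeled columns on import, one-line aggregation, and `.plot()` handing the labeled data straight to Matplotlib:

```python
# A taste of why pandas removes the bookkeeping: labeled columns,
# one-line split-apply-combine, and sane Matplotlib defaults via .plot().
# The data is invented for the example.
import pandas as pd

df = pd.DataFrame({
    "city":      ["Amsterdam", "Amsterdam", "Utrecht", "Utrecht"],
    "attendees": [120, 80, 45, 60],
})

per_city = df.groupby("city")["attendees"].sum()
print(per_city)
# per_city.plot(kind="bar") would give a labeled chart in one line
```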

The next talk was by Maciej Kula, who works at Lyst, another e-commerce clothing company (since when did the exciting stuff start happening at those companies?!). He talked about LightFM, a machine learning tool that uses a hybrid approach: it adds user and item metadata to a traditional matrix factorization algorithm. By adding metadata, newly added products can immediately profit from the knowledge already present in the model.
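This is not LightFM's actual API, but the hybrid idea can be illustrated in a few lines of NumPy: represent an item as the sum of the embeddings of its metadata features, so an unseen item with known features immediately gets a usable representation (feature names and values below are invented):

```python
# NumPy illustration of the hybrid idea behind LightFM (not its API):
# an item's latent vector is the sum of its feature embeddings, so a
# brand-new item with known metadata scores sensibly without retraining.
# Random vectors stand in for learned embeddings here.
import numpy as np

rng = np.random.default_rng(0)
dim = 4
feature_emb = {f: rng.normal(size=dim) for f in ["shoes", "red", "blue"]}

def item_embedding(features):
    return sum(feature_emb[f] for f in features)

user = item_embedding(["shoes", "red"])       # a user who liked red shoes
old_item = item_embedding(["shoes", "red"])   # item seen during training
new_item = item_embedding(["shoes", "blue"])  # never seen, but has features

# The unseen item already scores via its shared "shoes" feature.
print(user @ old_item, user @ new_item)
```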

At the end of the day, Margaret Mahan, a computational biologist, gave an interesting talk about using the HDF5 file format for data storage. I really liked this talk because most current developments target big data that can only be handled by specialized database structures, while most analyses are actually done on smaller data sets by people who do not want to worry about managing a database system. HDF5 and its Python library h5py fill this gap by offering a structured, portable, cross-platform file format that is quite fast as well.
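A minimal h5py sketch (with an arbitrary file name) shows how little ceremony this takes: one file holds labeled, typed arrays plus their metadata, readable from any HDF5-aware tool on any platform:

```python
# Minimal h5py round trip: write a named dataset with an attached
# attribute, then read both back. The file name is arbitrary.
import numpy as np
import h5py

with h5py.File("experiment.h5", "w") as f:
    f.create_dataset("voltages", data=np.linspace(0.0, 1.0, 5))
    f["voltages"].attrs["unit"] = "mV"  # metadata travels with the data

with h5py.File("experiment.h5", "r") as f:
    v = f["voltages"][:]
    print(v, f["voltages"].attrs["unit"])
```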

The conference ended with an hour of lightning talks: short, five-minute talks about all kinds of things, ranging from AWS Lambda to monitoring traffic with Twitter, and from medical image analysis to a more philosophical presentation about the gap between software developers and data scientists. The concept of these very short talks worked very well: the attention span of visitors at the end of a conference is short, and on top of that it gave people who would otherwise not speak at the conference the opportunity to present their ideas.

Afterthoughts

The nicest thing about conferences like this is that afterwards you have a list of tools or techniques you want to master or use more often. Here is my list:

  • SciKit-Learn
  • Pandas
  • Jupyter
  • AWS Lambda

As a Python/Django back-end developer at Elements I am not exactly doing data analysis on a daily basis, but I think these new insights are nonetheless applicable to my field of work:

  • Python makes data analysis really easy.
  • Django makes web development really easy.
  • This combination makes data analysis on the web really easy.

Advanced analysis (of, for example, customer data on an e-commerce website) can be implemented side by side with your standard Django code base: every analysis library for Python can be used directly in your web application. I will certainly try this in my own code in the near future!
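As a sketch of how little glue that takes (model and field names invented): in a Django view, `Order.objects.values()` yields plain dicts, which pandas turns into a labeled frame in one line. Here a hard-coded list stands in for the queryset:

```python
# Sketch of dropping pandas into a Django code base. The rows below
# stand in for what a hypothetical Order.objects.values("customer",
# "total") queryset would return; no Django is needed to run this.
import pandas as pd

rows = [
    {"customer": "alice", "total": 30.0},
    {"customer": "bob",   "total": 12.5},
    {"customer": "alice", "total": 7.5},
]

df = pd.DataFrame(rows)
spend = df.groupby("customer")["total"].sum()
print(spend.to_dict())
```

In a real view you would pass `spend` to a template or serialize it as JSON; the analysis code is identical either way.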

One final data science lesson to end this blog post: how to get respect from a data scientist.

  • Step 1: Casually mention ‘Bayesian Machine Learning’ in your conversation.
  • Step 2: Done ;)

All conference talks are recorded and posted on YouTube.

— — —
Update March 30th, 2016: Added talks on YouTube.

Originally published at www.elements.nl on March 15, 2016.
