Meet Mike Lee Williams: Serverless and its Relevance for Data Scientists
This post is part of a series introducing the speakers at the PyBay2018 conference in San Francisco this August. It’s a great chance to learn and connect with an engaged and diverse community of Python developers. We hope you’ll join us!
What are you going to be speaking about at PyBay 2018, and why are you excited to give this talk?
I’m going to be talking about serverless, which I think of as a way of running applications on lots of machines that each exist only for the duration of a single function call. It’s a HOT HOT HOT buzzword right now in the mainstream operations and engineering community. But I’m going to focus on its relevance to data scientists. With a little work, it offers people working with data an admittedly hacky, but also lightweight and almost arbitrarily scalable alternative to a distributed system like Spark. I think it’s technically cool, but it’s also very empowering for data scientists who would otherwise depend on (and be constrained by) the data engineering infrastructure they have access to.
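To make the idea concrete, here is a minimal local sketch of the fan-out pattern described above. It is not a real cloud deployment: a thread pool stands in for the many short-lived machines, and the names `handler`, `chunked`, and `fan_out` are hypothetical, chosen only for illustration. The `(event, context)` signature mirrors a common convention for serverless function handlers, but nothing here calls a cloud provider.

```python
from concurrent.futures import ThreadPoolExecutor


def handler(event, context=None):
    """A stateless function: everything it needs arrives in `event`.

    On a serverless platform, each call could run on its own
    short-lived machine. Here it simply runs in a local thread.
    """
    numbers = event["numbers"]
    return {"count": len(numbers), "total": sum(numbers)}


def chunked(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fan_out(data, chunk_size=1000, max_workers=8):
    """Split the data, run one handler call per chunk, combine the results."""
    events = [{"numbers": chunk} for chunk in chunked(data, chunk_size)]
    # Assumption: the thread pool is a stand-in for invoking one
    # serverless function per event.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(handler, events))
    count = sum(p["count"] for p in partials)
    total = sum(p["total"] for p in partials)
    return total / count


data = list(range(10_000))
mean = fan_out(data)
print(mean)
```

Because each call depends only on its own `event`, the calls can run anywhere and in any order, which is what makes this style of work easy to spread across many machines.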
How did you get into programming and Python?
As a Physics undergrad back in 2000, I read an article by Eric Raymond, in which he recommended Python as a flexible, easy-to-learn language. I don’t recommend reading articles by Eric Raymond in general, but I have to admit that this one changed my life. I picked up the basics and pitched my physics department on changing their teaching language from Pascal to Python. They let me run with the idea for my undergraduate thesis, and I wrote a course that, although woefully out of date, is still online to this day. This was back in the days when NumPy was called Numeric, there was no matplotlib, and Python was an eccentric, obscure choice in scientific computing. I used an even more eccentric, obscure commercial language called IDL during my astronomy PhD, but by the time I left academia, Python was the only game in town. That early experience turned out to be very useful. You can read more about this bit of history in an article I wrote for Pyzine.
What’s one of the features about Python you like the best?
This isn’t a language feature, but it’s doubtless in part a result of the way the language is designed: hands down my favorite thing about Python is the fact that it’s possible to do “serious” numeric/scientific/machine learning coding in a language that is also a general-purpose programming language. Being able to examine and visualize data, train a model, and develop a web API for the model in a single language is a superpower. My talk is a good example of this: I’ll be talking about machine learning and web operations, and I won’t need to use any languages other than Python.
What’s your favorite Python library (core or third-party), and why?
scikit-learn, for two reasons. Firstly, it’s batteries included, and feature-packed. But secondly, and perhaps more importantly: all those features are not just thrown into a zoo of things you get when you do import sklearn. Instead every model or transformation of the data conforms to the same basic fit/predict or fit/transform API. If you’re not a machine learning person, this might seem like table stakes for good object-oriented design. But it’s surprisingly rare in the data science ecosystem. I think of scikit-learn as the requests of data science :-)
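The value of a shared fit/transform and fit/predict convention can be sketched without scikit-learn itself. The toy classes below (`ToyCenterer`, `ToyCentroidClassifier`) and the helper `fit_and_predict` are hypothetical names written for this illustration. They are not part of scikit-learn; they only imitate the shape of its API to show why the consistency matters.

```python
class ToyCenterer:
    """Toy transformer: subtracts each column's training mean."""

    def fit(self, X, y=None):
        n = len(X)
        self.means_ = [sum(col) / n for col in zip(*X)]
        return self

    def transform(self, X):
        return [[v - m for v, m in zip(row, self.means_)] for row in X]


class ToyCentroidClassifier:
    """Toy model: predicts the label whose class centroid is nearest."""

    def fit(self, X, y):
        groups = {}
        for row, label in zip(X, y):
            groups.setdefault(label, []).append(row)
        self.centroids_ = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()
        }
        return self

    def predict(self, X):
        def dist2(a, b):
            return sum((u - v) ** 2 for u, v in zip(a, b))

        return [
            min(self.centroids_, key=lambda lab: dist2(row, self.centroids_[lab]))
            for row in X
        ]


def fit_and_predict(steps, model, X_train, y_train, X_new):
    """Works with any objects that follow the fit/transform/predict convention."""
    for step in steps:
        X_train = step.fit(X_train).transform(X_train)
        X_new = step.transform(X_new)
    return model.fit(X_train, y_train).predict(X_new)


X_train = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]]
y_train = ["low", "low", "high", "high"]
X_new = [[0.9, 1.1], [5.1, 5.1]]

predictions = fit_and_predict(
    [ToyCenterer()], ToyCentroidClassifier(), X_train, y_train, X_new
)
print(predictions)
```

The point is that `fit_and_predict` never needs to know which transformer or model it was handed. Any object with the same method names slots in, which is the property the uniform scikit-learn API gives you across its real estimators.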
Subscribe to catch more interviews with the PyBay2018 speakers!