This year, I’m honored to be the chair of the Artificial Intelligence and Data Science track at the DataWorks Summit in San Jose. Reviewing the submissions and working with the experienced and sharp committee members has been an education in itself, particularly the chance to see what’s trending in the open source world. My day-to-day data science work lets me dig into a few open source projects, but it’s hard to find time to get an overview of which topics and projects are hot and worth exploring more deeply. The key topics emerging this year are deep learning, graph-based machine learning, and model inference in production.
Not surprisingly, the topics and tools around deep learning (DL) still top the list of big trends, and top-notch research in math and computation is driving progress across vision, speech, and text. Many in the DataWorks audience are already developing cutting-edge deep learning systems, while others are just beginning to play with DL. Either way, I suggest attending Magnus Hyttsten’s talk on getting started with TensorFlow.
As you read this blog, a new DL framework might already be in the works and about to be open sourced. It’s getting harder and harder to keep track of all the new DL frameworks and their capabilities. The complexity can be daunting, especially if you just want to know which DL framework to use for a shiny new project at your company. If that sounds familiar, plan to attend Jeremy Nixon’s talk for some insights about which DL framework to use and why.
Let’s assume you’ve chosen a DL framework and your team of data scientists has created a quick and dirty prototype model on a sample of the data. Now you’re ready to train the model on a much larger data set, but the system dies. If that sounds familiar, check out Wangda Tan’s talk about running distributed TensorFlow.
Let’s assume you’ve trained your model at scale. Now what? It’s vital to actually deploy models in production systems, but that’s not easy in practice. To learn more about the open source tools that companies are using to deploy models, check out the talks by Sriram Srinivasan and Sven Hafeneger.
I’m a big fan of SVD. In fact, a linear combination of SVD, coffee, and a ton of hours programming new ideas added up to my Ph.D. I have a particular fondness for SVD because it almost always helps me understand how hard a problem is, how easily the problem can be broken into smaller ones, and whether I need more or fewer degrees of freedom in a trained embedding model. I’m always looking for ways to use SVD for cool AI applications, which is a perfect reason to attend Trevor Grant’s talk on real-time facial recognition using a distributed implementation of SVD.
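For readers who want to try the degrees-of-freedom check themselves, here is a minimal sketch in Python with NumPy. It is my own illustration, not code from any of the talks: the `effective_rank` helper, the 99% energy threshold, and the synthetic `embeddings` matrix are all assumptions chosen for the example. The idea is to look at how quickly the singular values decay and count how many are needed to capture most of the matrix’s energy.

```python
import numpy as np

def effective_rank(matrix, energy=0.99):
    """Smallest k such that the top-k singular values hold `energy`
    of the total squared spectrum (hypothetical helper for illustration)."""
    singular_values = np.linalg.svd(matrix, compute_uv=False)  # sorted, descending
    squared = singular_values ** 2
    cumulative = np.cumsum(squared) / np.sum(squared)
    # First index whose cumulative share reaches the threshold, as a count.
    return int(np.searchsorted(cumulative, energy) + 1)

# Synthetic stand-in for a trained embedding matrix (an assumption):
# 200 items in 64 dimensions, but generated from only 5 latent factors.
rng = np.random.default_rng(0)
planted_rank = 5
embeddings = rng.normal(size=(200, planted_rank)) @ rng.normal(size=(planted_rank, 64))
embeddings += 0.01 * rng.normal(size=embeddings.shape)  # small noise

k = effective_rank(embeddings)
print(f"{k} of {embeddings.shape[1]} dimensions carry 99% of the energy")
```

If the count comes out far below the embedding width, as it should for this planted low-rank matrix, the model likely has more degrees of freedom than the data needs. A count close to the full width hints that a larger embedding may be worth trying. The threshold is a judgment call, so it pays to look at the whole spectrum rather than a single number.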
Another topic that’s catching my eye is graph-based machine learning methods, which are featured in two separate but related talks. First is Venkatesh Ramanathan’s talk about how to automatically learn features from a network structure and use them for fraud prevention. Second is Namrata Ghadi and Adam Baker’s talk on how to use word embeddings and NLP techniques for job skill normalization. Maybe I’ll get to use this technique to filter and organize resumes for my team’s next batch of hiring. Most of the time, exploiting the structure of a problem, as graph-based methods do, helps improve quality, performance, or both.
I couldn’t be more excited about the summit. Looking forward to seeing you there in just a few weeks.