Spark’s New Deep Learning Tricks

Imagine being able to use your Apache Spark skills to build and execute deep learning workflows to analyze images or otherwise crunch vast reams of unstructured data. That’s the gist behind Deep Learning Pipelines, a new open source package unveiled yesterday by Databricks.

Deep Learning Pipelines, which was unveiled at the Spark Summit conference in San Francisco Tuesday, will essentially provide a way to extend the Spark MLlib library to popular deep learning frameworks like TensorFlow and Keras.

This will allow Spark users to leverage existing work they’ve done in MLlib, and to execute deep learning models directly in Spark’s existing machine learning library, says Reynold Xin, co-founder and chief architect at Databricks, the commercial outfit behind Apache Spark.

“It’s a library to integrate essentially all deep learning libraries with Spark to make deep learning substantially easier without having to actually learn about the specifics of deep learning,” Xin tells Datanami.

Deep Learning Pipelines will start out as its own open source project, separate from the Apache Spark project, Xin says. Over time, depending on how things go, it could become a part of the main Apache Spark project. “It’s possible,” he says. “We haven’t actually thought a lot about it. We want to get it out there and work with users.”

In the meantime, Databricks will include the new deep learning library in its own Spark-based software as a service (SaaS) offering. Databricks’ version will leverage the concept of transfer learning to take existing deep learning models available in the open domain and modify them to make them more applicable to its customers’ specific domains, Xin says.

“There might be a generic model for doing image classification, but maybe one of our customers wants to detect what kind of car is in a picture,” he says. “We have this technique called transfer learning built into this library that, with just a few lines of code, allows users to apply an existing model, published by pretty much anybody on the Internet, and then retrain it on a much smaller amount of data in a much faster fashion, in just a few minutes, and then get a better model for their domain.”
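To make the idea concrete, here is a minimal, plain-Python sketch of the transfer learning pattern Xin describes: the "pretrained" part stays frozen and only a small new head is retrained on a handful of domain examples. It does not use the Deep Learning Pipelines API. The names `frozen_features`, `train_head`, and `predict` are hypothetical, and the fixed feature function is only a stand-in for the lower layers of a real published model.

```python
import math

def frozen_features(x):
    """Stand-in for a pretrained network: fixed features that are never updated."""
    a, b = x
    return [a, b, a * b, a * a, b * b]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(samples, labels, epochs=200, lr=0.5):
    """Fit only a small logistic-regression head on top of the frozen features."""
    feats = [frozen_features(x) for x in samples]  # computed once, reused every epoch
    weights = [0.0] * len(feats[0])
    bias = 0.0
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            p = sigmoid(sum(w * v for w, v in zip(weights, f)) + bias)
            err = p - y  # gradient of the log loss with respect to the logit
            weights = [w - lr * err * v for w, v in zip(weights, f)]
            bias -= lr * err
    return weights, bias

def predict(model, x):
    weights, bias = model
    f = frozen_features(x)
    score = sum(w * v for w, v in zip(weights, f)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0

# Small, made-up "domain" data set: the label is 1 when both coordinates share a sign.
grid = [-1.0, -0.5, 0.5, 1.0]
samples = [(a, b) for a in grid for b in grid]
labels = [1 if a * b > 0 else 0 for a, b in samples]

model = train_head(samples, labels)
accuracy = sum(predict(model, x) == y for x, y in zip(samples, labels)) / len(samples)
print(f"training accuracy: {accuracy:.2f}")
```

Because the feature extractor is never touched, only six numbers (five weights and a bias) are learned here, which is why this style of retraining can get by on far less data and time than training a full model from scratch.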

Another cool feature that Databricks is adding with Deep Learning Pipelines is the capability to expose a trained deep learning model as a SQL function.

“With one line of code now the data scientist or data engineer who actually trains the model can make this model available as a SQL function,” Xin says. “So even a business analyst will be able to build, for example, predictions in their BI tools.”
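The general pattern, registering a prediction function under a name that SQL queries can call, can be sketched with Python's built-in `sqlite3` module. This is only an analogy under stated assumptions: SQLite stands in for Spark SQL, and `predict_score` is a hypothetical placeholder for a trained model, not part of Deep Learning Pipelines.

```python
import sqlite3

def predict_score(width, height):
    """Hypothetical stand-in for a trained model's prediction function."""
    return 1 if width * height > 100 else 0

conn = sqlite3.connect(":memory:")

# One call makes the Python function available to SQL under the given name.
conn.create_function("predict_score", 2, predict_score)

conn.execute("CREATE TABLE images (name TEXT, width REAL, height REAL)")
conn.executemany(
    "INSERT INTO images VALUES (?, ?, ?)",
    [("a", 5, 5), ("b", 20, 10)],
)

# An analyst can now call the model from plain SQL.
rows = conn.execute(
    "SELECT name, predict_score(width, height) FROM images ORDER BY name"
).fetchall()
print(rows)
```

Once the function is registered, anyone who can write a `SELECT` statement can apply the model to a table without touching the code that produced it.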

Deep Learning Pipelines supports TensorFlow and Keras now, but will likely be extended to support other popular deep learning frameworks. MXNet is popular on Amazon, while Theano, Torch, and Caffe are also gaining more attention as deep learning techniques become more popular.

This isn’t Spark’s first foray into deep learning or GPU computing. But the folks at Databricks are bullish that the new Deep Learning Pipelines project could revolutionize deep learning for a more general audience.

“We do see that this library has the potential to do for deep learning what Spark did for big data, to make deep learning much more accessible to everybody,” Xin says. “Deep learning is at a similar stage right now to what MapReduce was for big data.”
