Modern Cloud ML Frameworks

And why they are a joke.

Dima
Live Long and Prosper
Dec 9, 2015

I have a well-formed opinion about modern cloud ML toolkits.

They are games for kids.

Useful for:

  • Teaching purposes.
  • Academic projects.
  • Interview questions and homework of the "describe your solution step by step" kind.

Not suitable for:

  • Industry.
  • Business.
  • Non-technical mentors and advisors guiding the team to "look into that".

Logistic regression at scale. Who cares?

If the data is clean enough to apply logistic regression, most of the job is already done.

If the training data is too large to fit on a single machine, chances are something was not done right.

If the challenge really is learning at scale, where only simple algorithms such as the logistic regression mentioned above work, I guarantee you that the team that brought the company to the point of acknowledging this is strong enough to treat learning at scale itself as a straightforward problem.

And by this time the vast majority of the actual knowledge has already been incorporated during the feature engineering phase, so dumping a [huge] TSV or structured JSON somewhere to apply learning to it is, in reality, just an extra detour.

To win my trust, an ML toolkit would have to transform a landscape that includes a variety of large, complex problems. Take the winners of Kaggle competitions, for example: that would mean suddenly seeing several people in the top ten who all got there thanks to their early adoption of some shiny new toolkit.

Beyond that, I believe the industry is just not mature enough yet for cloud-based ML solutions. We nailed cloud storage, we nailed cloud hosting; good. Before we get into the ML algorithms business, though, there is another huge gap that has to be bridged: data systematization.

If we keep saving data in free-form JSONs / BSONs / Protobuf / Thrift / Avro / MessagePack, we're not ready. If we don't have an established data versioning solution, we're not ready. If we don't have the process to convert and retire the old data, we're not ready.

If the above approaches are different and incompatible for batch and real-time, we're not ready.

If going from hard disk storage to RAM storage, bypassing and/or making a detour towards flash memory, is not streamlined and automated down to a single flag flip, we're not ready.
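To make the versioning and retirement point concrete, here is a minimal sketch of what "every record carries a schema version and old versions get converted" could look like. Everything in it is hypothetical and for illustration only: the `schema_version` field, the record layouts, and the `upgrade` helper are assumptions, not any particular system's API.

```python
from typing import Callable, Dict

Record = dict

# Hypothetical migrations: each one upgrades a record by exactly one version.
def v1_to_v2(rec: Record) -> Record:
    # Assumed layout: v1 stored a single "name" field; v2 splits it in two.
    first, _, last = rec["name"].partition(" ")
    return {"schema_version": 2, "first_name": first, "last_name": last}

def v2_to_v3(rec: Record) -> Record:
    # Assumed layout: v3 adds an explicit "country" field with a default.
    return {**rec, "schema_version": 3, "country": rec.get("country", "unknown")}

MIGRATIONS: Dict[int, Callable[[Record], Record]] = {1: v1_to_v2, 2: v2_to_v3}
LATEST_VERSION = 3

def upgrade(rec: Record) -> Record:
    """Apply migrations one step at a time until the record is current."""
    version = rec.get("schema_version")
    if version is None:
        # Free-form, unversioned data is exactly the "not ready" case.
        raise ValueError("unversioned record: refusing to guess its schema")
    while version < LATEST_VERSION:
        rec = MIGRATIONS[version](rec)
        version = rec["schema_version"]
    return rec
```

With something like this in place, retiring an old format means running `upgrade` over the stored records once and then deleting the oldest migration, rather than teaching every downstream consumer about every historical layout.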

New machine learning in the cloud. Ah, so cool. Sorry, I'm busy structuring my data and phrasing the right questions about it. When I am hiring three interns to help answer those questions, it will be a bonus if one of the candidates has tried your framework and found it useful.

That's about it.
