The role of open source Machine Learning platforms in the startup ecosystem: Conversation with Sandy Steier (1010data), Mark Johnson (Descartes Labs), and Jason Black (RRE)

Published in ThinkTank.vc, Jan 18, 2017

Jason Black [1:49 PM]
Hi everyone, you should all have received the prompt and the article via email. If not, today we’ll be discussing the role of open source ML platforms in the startup ecosystem. Please read Ben Thompson’s take on TensorFlow to get the discussion rolling (https://stratechery.com/2015/tensorflow-and-monetizing-intellectual-property/) and feel free to bring your own articles for the discussion. Talk soon!

Josef Feldman [3:01 PM]
@jason: will be kicking things off momentarily. Here’s a list of everyone in this group, feel free to connect directly!

Dennis Mortensen, CEO and Founder, X.ai
Sandy Steier, CEO and Co-founder, 1010data
Naveen Selvadurai, Co-founder, Foursquare
Ann Miura-Ko, Founding Partner, Floodgate
Sheila Gulati, Managing Director, Tola Capital
Shivon Zilis, Founding Member, Bloomberg Beta
Alex White, Co-founder, Next Big Sound
Amit Karp, Vice President, Bessemer Venture Partners
Jesse Beyroutey, Partner, IA Ventures
Ross Fubini, Partner, Canaan Partners
Mark Johnson, Co-founder and CEO, Descartes Labs
Jason Black, Analyst, RRE
Phil Boyer, Associate, Crosslink Capital
Sven Kreiss, Lead Data Scientist, Wildcard
Jim Hao, Associate, FirstMark
Morgan Polotan, Bloomberg Beta

[3:01]
@morganpolotan: welcome!

Morgan [3:02 PM]
thanks @josef_feldman!

Jason Black [3:03 PM]
Would love to kick things off with some initial comments on the article. I think the most important piece in there is Ben’s mention of the role of the “unreasonable effectiveness of data”

[3:04]
Very few ML companies these days bank on their algorithms being better than other companies’ (DeepMind being an exception), but rather on amassing a unique data set that can be applied to a novel problem domain.

[3:04]
I know that at Dextro, a portfolio company going after video analysis, they emphasize that their training data is the key value driver.

Mark Johnson [3:05 PM]
When VCs ask me what our secret sauce is, I tell them “our dataset and platform.” Though I’d add to that list: measurement and validation (i.e., do you know that your algorithms are performing/improving/working and how can you tell?)

Jason Black [3:05 PM]
the open sourcing of the underlying platforms only helps them (Dextro) drive more data to their platform because they’re able to focus on business goals

Morgan [3:06 PM]
my favorite article on this topic is Matt Turck’s “Data Network Effects” http://mattturck.com/2016/01/04/the-power-of-data-network-effects/

Jason Black [3:06 PM]
@morganpolotan: yeah, that’s a great one

[3:07]
And building on that: http://versionone.vc/data-not-algorithms-is-key-to-machine-learning-success/

Phil Boyer [3:08 PM]
Agree with Ben that open sourcing TensorFlow makes complete sense for Google, as Google’s value is in the size/quality of its data asset and the quality of its data infrastructure. Letting the crowd use and improve the quality of the ML algorithm only helps Google.

Morgan [3:08 PM]
since we all seem to be in agreement, is there anyone that thinks this is a bad move?

Jason Black [3:08 PM]
Well it helps Google, but it also happens to decrease the startup cost of building out that infrastructure if you are starting a new business

Mark Johnson [3:09 PM]
@phil I argue that Google’s great asset is not just their data infrastructure, but their ability to quickly run algorithms through a testing pipeline to see what is effective — that forces a ton of work on the measurement side.

[3:09]
@morgan What does “this” refer to?

Phil Boyer [3:09 PM]
There are bigger hurdles/barriers to replicating Google’s data infrastructure and amassing its dataset than to honing the ML.

Jason Black [3:09 PM]
Just as Google doesn’t directly make money on those algos, other startups can leverage the infrastructure so long as they don’t plan on making money on the commoditized portion.

Morgan [3:10 PM]
@markjohnson: open sourcing TensorFlow

Alex [3:10 PM]
I think it’s a pretty safe move for Google to do something like this given their market share. I think Tesla’s open sourcing of their patents up against GM and big Auto is a much riskier bet.

Sandy Steier [3:10 PM]
You need (1) large amounts of data, (2) the ability to handle it (store, preprocess, etc.), (3) the right algorithms, (4) people who know what they are doing, and (5) an organization that can act on the results.

Jason Black [3:10 PM]
I think the more disruptive thing is Google opening up some part/all of its data streams

Mark Johnson [3:10 PM]
let’s be clear: there’s a big difference between open sourcing algorithms and open sourcing the ML features that you discover.

[3:11]
remember that Google’s original algos were published in a paper — they have a long history of doing that

[3:11]
but how you combine all those features is super tough — again, what % of searches are getting better and what features (and combination of features) are contributing to that

Sven [3:11 PM]
Another reason for open sourcing TensorFlow is that they can now recruit people that have prior exposure to TensorFlow. DeepMind and FAIR use Torch which is open source and you can have an intelligent discussion with a candidate about it because they have all used it before.

Jason Black [3:12 PM]
have any of the entrepreneurs in the room tinkered with/used TensorFlow or MSFT’s CNTK?

Sven [3:13 PM]
I have tinkered with TensorFlow.

Jason Black [3:14 PM]
@svenkreiss: As something you’d leverage? Internally? Sounds like it might help at Descartes Labs as well (cc @markjohnson)

Alex [3:14 PM]
are there numbers released on how widely used or implemented TensorFlow has been post-open source?

Mark Johnson [3:15 PM]
we’ve played around with Caffe… good for prototyping but not sure we’d use it in production yet

Sven [3:16 PM]
I was looking into it in my free time. Wanted to know it better. I would feel more comfortable using TensorFlow in production than Torch, but right now we don’t have a use case for it.

Sandy Steier [3:17 PM]
I looked briefly at TensorFlow. The issue I see is that it is technical (like Hadoop), which brings up my point (4) above. If ML is to be really successful, it needs to be democratized. Someone would need to build a more approachable app on top of TensorFlow or its ilk.

Phil Boyer [3:18 PM]
Do you guys think that open source algorithms will have as big of an impact as open source infrastructure / DBs?

Mark Johnson [3:18 PM]
I think there’s always a hot new algorithm, and it can be a distraction from what you really want, which is results. It’s not the algorithm but, as per @sandysteier’s (4), the people who know what to do with it.

Jason Black [3:18 PM]
Correct me if I’m wrong, but given the complexity of the task TensorFlow seems like a step in the right direction towards simplicity. Just from a documentation/tutorial standpoint, it’s pretty impressive what TF launched with

Sandy Steier [3:20 PM]
TensorFlow is the first step in what seems like a journey of a thousand miles.

Sven [3:20 PM]
The complexity of ML is another reason why open sourcing TF was smart: There is still a lot of research in Deep Learning. Even if you decide you want to put the effort in and build a product with it, you want to benchmark against a published reference implementation. Those are currently mostly in Torch.

Morgan [3:22 PM]
investors: how do you feel about investing in a startup whose ML implementation is dependent on TensorFlow?

Sven [3:24 PM]
versus Caffe or Torch or versus more traditional tools?

Phil Boyer [3:24 PM]
If a company utilizes TensorFlow, I personally wouldn’t think of it any differently than utilizing an open source database. If there are things you can leverage from the OSS, why reinvent the wheel?

Morgan [3:24 PM]
versus writing algos in-house

[3:24]
@markjohnson: explain the thumbs down?

Sandy Steier [3:24 PM]
I’m not a VC but I agree with Phil — as long as the startup adds something valuable of its own.

Morgan [3:25 PM]
@svenkreiss: tool agnostic, i’m mainly curious about startups whose core “machine learning tech” is calling an API…I’ve seen this mostly with IBM Watson

Mark Johnson [3:25 PM]
I’m nervous about just claiming to be a Deep Learning company while using standard infrastructure. Deep Learning promises to end the task of feature selection, and I think that’s still a core and incredibly important part of deep learning.

[3:27]
Deep Learning is a fad right now and I’m not down with fads… I’m down with what works. I remember when boosted trees were hot and we wasted months going down that path when a simpler algo performed way better. Beware companies with religion around a technique.

Jason Black [3:27 PM]
Depends on the startup and use case. I think if it is more of a play in a space that will require top performance, building off of an open source and potentially more generic base decreases the flexibility you might have long term (since you don’t own the whole stack) and has the potential to limit your performance (in some cases). If you are leveraging it for a more vertically oriented business application that just happens to use ML to assist, I’d be more comfortable.

Josef Feldman [3:27 PM]
@morganpolotan: Google has abandoned projects in the past, so I’d think there’s a general danger of overdependence on an open source community unless you actively manage that community (NodeSource/Node.js and Automattic/WordPress being two good examples).

Shivon Zilis [3:28 PM]
Sorry for being late! Trying to catch up on all the brilliance :)

Jason Black [3:29 PM]
Welcome, @shivon!

Jason Black [3:30 PM]
Unfortunately, I’ve got to roll to a call. Please continue the discussion! Really liked the first session and would appreciate feedback. Feel free to DM me. Cheers!

Sandy Steier [3:30 PM]
Arguably, open source is protection against Google (or anyone else) abandoning a software direction. But I’m personally not a huge fan of open source, so I am not going to argue that too strongly.

Josef Feldman [3:32 PM]
Thanks @jason for leading this convo!

Alex [3:32 PM]
i’ve got to jump too, interesting stuff. thanks guys!

Morgan [3:33 PM]
thanks for organizing @jason

[3:33]
@sandysteier: why are you not a huge fan of open source?

Phil Boyer [3:36 PM]
heading out too guys. good stuff!

Josef Feldman [3:38 PM]
@sandysteier: here’s a study that was done on OSS projects: 152k dormant projects on SourceForge. http://staff.lero.ie/stol/files/2013/03/2013-Is-It-All-Lost-A-Study-of-Inactive-Open-Source-Projects.pdf

Sandy Steier [3:38 PM]
It’s hard to maintain something that is an amalgam of efforts. It’s usually not the best stuff out there (if it were, it would be kept under wraps). It’s often a camel (you know, a horse built by a committee). Having said that, I think it’s fine for basic stuff (like JavaScript widget libraries), but I wouldn’t use it for core functionality. I think the disappointment with Hadoop (which was obvious to me from the start) is a good example.

Morgan [3:41 PM]
good point

Sandy Steier [3:45 PM]
I’m going to hop…

Morgan [3:55 PM]
ciao @sandysteier!

Written by Josef Feldman