Investing in the Data Science Market Map

The big data investments that many companies have made over the last few years are finally starting to bear fruit. The combination of tools like NoSQL and Hadoop, the ever-increasing computing power described by Moore’s Law, and the continued advancement of machine learning techniques has pushed the world beyond traditional analytics. The world of the Data Enterprise is here. Companies will use data as a competitive weapon, and those with the largest data sets and the best teams, tools, and platforms will ultimately win.

Because of this shift to the Data Enterprise, we are spending a lot of time thinking about Data Science and Machine Intelligence. At Drive, we take a view that combines several different technologies and concepts into the world of Data Science, and we have built our own internal market map to serve as our guidepost for this area of investment.

As we continue to develop our thinking around this market, we wanted to share with entrepreneurs how we view Data Science technologies and encourage companies interested in the space to reach out and speak with us.

We believe that Data Science, broadly speaking, will continue to be one of the most interesting and rapidly evolving spaces in technology. It touches upon, and even underlies, so many of the innovations that are driving tech right now, and yet we’ve also found that the market can be very clouded, misunderstood, and convoluted as various tools and platforms emerge.

Below we have outlined how we segment and understand the Data Science market right now at Drive, and we welcome feedback and insights from anyone working in the space.

Application of Data

While there are big aspirations for the use of data inside the enterprise or in a consumer setting, we are still in the early innings. Most enterprises have bought “big data,” but have no idea why or what to do with it. On the consumer side, companies have built bots for just about everything and are still figuring out the model.

Drive is still refining our approach to the market as well, and we’re trying to talk to as many startups and customers as we can to gain a perspective.

We have spoken to 487 companies and done a bottom-up analysis of several hundred buyers of data science applications and tools. The biggest insight to date is that, much as in the world of engineering, a new “stack” of data science is emerging.

The promise of machine intelligence is exciting because you can drive new and faster outcomes. The Data Scientist leverages programming languages, tools, packages, applications, and visualization to deliver better, faster outcomes.

The New Stack

Like most things at Drive, we started with a bottom-up approach and looked at the enabling technologies, data, and applications that will shape the world of Data Science. Based on our work to date we broke the stack into five main components and went deep on each. These components are: Inputs (data), Management, Analysis, Outputs, and Applications. What we found from speaking with over a hundred corporate buyers is that there is no one-size-fits-all approach.

*This is just a representative sampling of companies, not an exhaustive list

What seems to be shaping up is a crawl, walk, run, fly journey that most companies (buyers) are just beginning.

A company like Amazon or Google has been leveraging data science for decades and is well into the “fly” stage (quite literally). The other observation is that this market is moving quickly, as are the supporting technologies.

In under five years “big data” enabling technology has gone from the next big thing to table stakes.

Building Blocks: Inputs

For most data scientists, the first step in any process is wrangling the source data. For simplicity we have broken inputs into three categories: Private (self-managed data), Applications (Salesforce, HubSpot, Zendesk, etc.), and Other (IoT, wearables, locations, etc.). While there are lots of ways to get information from these sources, it’s still limited insight and not data science until you can contextualize this data under a single point of view. Vertical applications and systems of record, like Salesforce, are great at providing insights about that specific application. You can query each of these sources and your data warehouse for specific information, but most companies are looking for ways to combine these sources to drive better outcomes.

While necessary to the data science stack, this layer isn’t really “data science” and is better served by companies emerging as the system of record for specific verticals.

Wrangling the Inputs: Management

In our view, data science starts to get interesting when you consider all the management and merging of these different data sets. At Drive we have broken down the management of data into three key categories: Transport, ETL, and Data Lake + Data Backup.

In the early nineties, the concept of ETL (Extract, Transform, Load) was used to get all of your data normalized and moved into an Enterprise Data Warehouse. Today there are thousands of data sources and hundreds of repositories to store your data. A simple ETL process doesn’t take into account the explosion of APIs and endpoints that need to be handled to move your data around, not to mention handling the security and governance of that data. We’ve found that the majority of the time on any data science project is spent in this step. We often heard the term “data wrangling” used to describe the process of managing, moving, storing, and securing data.
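To make the Extract, Transform, Load idea concrete, here is a minimal sketch of a classic single-source ETL pass. It is an illustration only, not something drawn from the companies we spoke with: the CSV contents, the column names, the `accounts` table, and the function names are all assumptions, and an in-memory SQLite database stands in for an Enterprise Data Warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; the columns and values are invented for illustration.
RAW_CSV = """account,signup_date,monthly_spend
 Acme Corp ,2016-01-15,1200.50
globex,2016-02-01,
Initech,2016-03-10,310
"""

def extract(text):
    """Extract: read rows from a CSV source into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalize account names and coerce spend to a float."""
    cleaned = []
    for row in rows:
        cleaned.append((
            row["account"].strip().lower(),       # trim whitespace, lowercase
            row["signup_date"],                   # passed through unchanged
            float(row["monthly_spend"] or 0.0),   # missing spend becomes 0.0
        ))
    return cleaned

def load(records, conn):
    """Load: write the normalized records into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS accounts "
        "(account TEXT, signup_date TEXT, monthly_spend REAL)"
    )
    conn.executemany("INSERT INTO accounts VALUES (?, ?, ?)", records)
    conn.commit()

def run_etl(text, conn):
    """Run the three steps in order against one source."""
    load(transform(extract(text)), conn)

if __name__ == "__main__":
    warehouse = sqlite3.connect(":memory:")  # stand-in for a real warehouse
    run_etl(RAW_CSV, warehouse)
    for row in warehouse.execute("SELECT * FROM accounts ORDER BY account"):
        print(row)
```

Even in this toy form, most of the code deals with cleaning and moving data rather than analyzing it. The sketch also covers only one source and ignores authentication, security, and governance, which is where the difficulty described above comes from.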

As we look at this layer a few things are apparent. First, moving data is hard. Second, securing data in flight is hard. And finally, the natural place for this data to move is the cloud.

Entrepreneurs tend to build great companies where things are hard. We see the movement, security, and normalization of all this data as a big problem to solve. We are meeting with entrepreneurs in this space and continue to learn more about how hard this function is for the Data Science team. Drive believes a big company can be built in this layer of the stack.

The Data Science: Analyze

The high-level description here is tools. This is the layer of the stack where the data science really gets done. We have interviewed multiple data science teams, and they use a combination of tools, both open source and commercial. One thing is clear, though: it’s all about the right tool for the right job.

The real ‘work’ for the data scientist happens in this area and often starts with tools such as R or Python. Model generation and experimentation fall into the Model category, while things like TensorFlow tend to be one step closer to outcomes as you move up the stack. When Drive spent time with data science teams thinking about this layer of the process, it was clear that tools are crucial, but they need to be focused on getting the data scientist to an outcome quickly.

This layer is dominated by open source or very inexpensive as-a-service offerings from the top cloud vendors like Google, Amazon, or Microsoft. At Drive we think this is a critical consideration for a data scientist, but are still forming an opinion on the right approach.

We continue to try to meet with as many companies in this area as possible to learn more.

Value: Output

This layer of the stack is where data science becomes the most visible to people, both literally and conceptually. This is where the models and intelligence can be consumed as a visual representation of the data. All of the companies we have spoken to have a visualization platform to consume data. The big difference is that the tools people are using to visualize data are not Data Science tools, but rather Business Intelligence tools.

From our perspective, the key new element that data science adds to the output equation is taking models into production. In the post-analysis stage of data science you can do two things with the information: you can react to it (report + visualize) or you can make the data proactive. Making the data science work proactive requires the model to be integrated into production.
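The difference between reacting to a model and making it proactive can be sketched in a few lines. Everything here is hypothetical: `risk_score` is a hand-written stand-in for a trained model, and the `Transaction` fields and the 0.7 threshold are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    tx_id: str
    amount: float
    country_mismatch: bool

def risk_score(tx):
    """Stand-in for a trained model: returns a score between 0 and 1."""
    score = min(tx.amount / 10_000, 1.0) * 0.6
    if tx.country_mismatch:
        score += 0.4
    return score

def reactive_report(history, threshold=0.7):
    """Reactive use: summarize scores after the fact for a report or dashboard."""
    flagged = [tx.tx_id for tx in history if risk_score(tx) >= threshold]
    return {"total": len(history), "flagged": flagged}

def proactive_decision(tx, threshold=0.7):
    """Proactive use: the score decides what happens to the live transaction."""
    return "hold_for_review" if risk_score(tx) >= threshold else "approve"

if __name__ == "__main__":
    small = Transaction("a", 500.0, False)
    large = Transaction("b", 9_000.0, True)
    print(reactive_report([small, large]))
    print(proactive_decision(small), proactive_decision(large))
```

The same scoring function serves both paths. What changes is where it runs: the report is computed over history and read by a person, while the decision sits inline in the production flow and acts without one.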

The companies we have spoken with define production in many different ways. Some companies want models running in real time for transactions, while others would like the models informing supply chain or risk decisions. Trusting and utilizing your models in the real world requires a new way to think about what it means to deploy. High-frequency trading (HFT) firms have been using this technology for years, and there are still runaway algorithms and software issues that have substantial impacts on world markets.

At Drive we believe there will be big companies built around managing models in production. This includes general management, security, and interaction with a real-time production environment. We feel that this area of the stack is moving quickly, and we are actively trying to meet with as many great entrepreneurs working on this problem as possible.

Putting it all together: Applications and Outcomes

Drive is currently working on an entirely new market map that is a combination of the data science stack as it relates to vertical applications across multiple industries. Currently this map is focused on things like Aerial Robotics, Marketing, Customer Success, Financial Services, Transportation, Agriculture, Industrial, Legal, and Healthcare.

These applications are using both data science and the technology stack to drive outcomes that weren’t possible just five years ago. Drive will continue to make investments in Data Science-enabled applications that drive proactive outcomes. We believe that this area has the largest opportunity for multiple market-defining companies to be built leveraging data science.

In Summary

We have spent the past year speaking with companies and building a perspective on the new and emerging world of Data Science. We have already made three investments, and continue to see this as a core focus for Drive moving forward. The areas of interest for us are specifically in the Applications, Output, and Management of data. We also include Security as a necessary vertical that runs along each of these areas.

While the market has some very fundamental building blocks for machine intelligence and the Data Scientist is better enabled today than ever before, it is still early.

We believe this market will offer an opportunity similar to the shift to SaaS over the next 5–7 years, and we will continue to learn from the people building this market.

We would love to hear from you if you have comments or would like to add your company to the market map for future revisions.