The Data Makes it Different: ML/AI Startups Should Build Right From the Start

Thoughts on the unique challenges faced by AI startups

Preeti Rathi | Partner, Ignition Partners

Building and scaling any startup is challenging, but ML/AI-focused companies face some unique challenges (and opportunities) that other startups don’t. Because data sources and their management are central to the effectiveness of AI algorithms, ML/AI startup founders must pay focused attention to the data lifecycle if they wish to maximize the potential of their company.

While the core elements of the data lifecycle are broadly well understood, building the ML algorithms is often such an all-consuming task that AI startups end up paying insufficient attention to the data lifecycle and simply implementing the “default” approach. Identifying the right approach for each of these core elements and creating a cohesive data strategy can make the difference between modest and massive success.

Data acquisition: It is no secret that a huge barrier ML companies face in training their models is the lack of labeled data required to train them. Additionally, if the data is proprietary, it can be a source of competitive advantage for the company. Startups may use a variety of strategies to acquire (semi-)proprietary data sets, including:

· Third party. Given the importance of annotated data in ML/AI, a new breed of companies has started to offer it as a service. They will gather and annotate the images for you; you just have to tell them what kind of data and annotations you need. Playment AI and Mighty AI are examples of companies that offer such services.

· Synthetic data. Synthetic data is computer-generated data that mimics real data: algorithms can be designed to create realistic, simulated data. Game engines such as Unreal Engine and Unity, which are built for creating video games and simulations, can be used to create large synthetic data sets. Twentybn is an example of a company that uses synthetic data to train its models.

· Partnering with companies that have data. Traditionally analog industries have incumbents with massive amounts of data, but building software isn’t in their DNA, and they are looking for new mechanisms to help them retain their customers. Partnering with such companies can be a win-win. A startup can also try to identify companies that have data and are looking for ways to monetize it. Examples include geospatial analytics startups that partner with satellite imaging companies to offer insights for sectors such as infrastructure (CrowdAI) and agriculture (Descartes Labs).

· Gathering proprietary data by offering a value-added service to customers. Clarifai and HyperVerge need a large number of images for their core businesses and have built the apps Forevery (photo discovery) and Silver (photo organization), respectively, to gather such images. A word of caution: it costs time and money to build such apps, so it is important to ensure that the offering is compelling enough for users to hand over their data.

· Data scraping. Startups can crawl the internet for data. They can also mine data from publicly available sources. Diffbot is building its business catering to such customers. While such data can be useful in training your model, it cannot be a source of competitive advantage, because anyone else can access the same data.

· Brute force. Gathering data directly from customers is always an option. The issue is that the ML effects won’t kick in until a sufficient amount of data has been gathered, and until then the startup will need a human at the backend to ensure that the app performs per customer expectations. Startups in the conversational AI space have used this method. It should be considered only if it’s easy to gather large amounts of data quickly or if the data network effects can kick in without massive amounts of data.
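
The human-at-the-backend arrangement described under brute force can be sketched as a confidence-based fallback. This is only an illustration under stated assumptions, not a prescribed design: `Prediction`, `answer_with_fallback`, the `ask_human` callback, and the 0.8 threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Prediction:
    label: str
    confidence: float  # model's own estimate, between 0.0 and 1.0


def answer_with_fallback(
    query: str,
    model: Callable[[str], Prediction],
    ask_human: Callable[[str], str],
    training_log: List[Tuple[str, str]],
    threshold: float = 0.8,  # hypothetical cutoff; tune per product
) -> str:
    """Use the model when it is confident; otherwise route to a human.

    Every human answer is logged as a labeled example, so the data set
    grows while customers still get an acceptable response.
    """
    prediction = model(query)
    if prediction.confidence >= threshold:
        return prediction.label
    label = ask_human(query)             # human operator supplies the answer
    training_log.append((query, label))  # keep it as future training data
    return label
```

As the logged examples accumulate and the model is retrained, fewer queries should fall below the threshold, which is the point at which the ML effects start to carry the product.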

Data integration: Data integration is an important piece of the success of an ML company. Reliable and smart data integration results in faster deployment and reduces sales friction. Despite this, data integration strategy is an afterthought for many startups. Those that wait to determine their strategy will face huge costs at a time when fast action is critical, harming their odds of becoming a durable business. Here are a few items that can speed up data integration:

· Connectors: Have connectors to customer data sources built and ready before they are needed.

· Data storage and governance (on-premises and cloud): Important data in enterprises resides in a variety of data warehouses, some on premises and some in the cloud. In addition to being able to connect to these diverse data repositories, it is important to ensure that all data governance requirements can be maintained even when all the analytical processes are performed in the (public) cloud.

· Compliance (SOC 2, FedRAMP, GDPR): Customers in several industries will look for specific certifications before they allow even temporary storage of their data in a vendor’s cloud. Obtaining the necessary compliance certificate(s) in advance can significantly reduce integration time.
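
One way to keep connectors ready, as the first item above suggests, is to put every data source behind a single interface so that adding a new customer source does not touch the rest of the pipeline. The sketch below is a hypothetical illustration: `Connector`, `CsvConnector`, `InMemoryConnector`, and `ingest` are invented names, and a real on-premises or cloud warehouse connector would replace the in-memory stand-in.

```python
from abc import ABC, abstractmethod
import csv
import io
from typing import Dict, Iterator, List


class Connector(ABC):
    """Common interface that every data-source connector implements."""

    @abstractmethod
    def read_records(self) -> Iterator[Dict[str, str]]:
        """Yield one record at a time as a plain dictionary."""


class CsvConnector(Connector):
    """Reads records from CSV text, such as a customer export."""

    def __init__(self, csv_text: str) -> None:
        self._csv_text = csv_text

    def read_records(self) -> Iterator[Dict[str, str]]:
        yield from csv.DictReader(io.StringIO(self._csv_text))


class InMemoryConnector(Connector):
    """Stand-in for a warehouse or API connector in this sketch."""

    def __init__(self, rows: List[Dict[str, str]]) -> None:
        self._rows = rows

    def read_records(self) -> Iterator[Dict[str, str]]:
        yield from self._rows


def ingest(connectors: List[Connector]) -> List[Dict[str, str]]:
    """Pull records from every source through the same interface."""
    records: List[Dict[str, str]] = []
    for connector in connectors:
        records.extend(connector.read_records())
    return records
```

Because `ingest` depends only on the abstract `Connector`, supporting a new customer system means writing one more subclass rather than reworking the integration.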

Despite all the effort put into simplifying and speeding up data integration, many ML/AI products are complex enough that professional services will have to be part of customer acquisition. Once the product has sufficient scale, it’s easy to find a channel partner who wants those professional services dollars, but until then a startup is on its own. Investing the time to create a low-cost pro-serve engagement is therefore critical.

Because professional services revenue is low-margin, a startup should build as low-cost a pro-serve team as possible without compromising the goals for which the team is built. Here are some do’s and don’ts:

· Don’t ask your engineers to also act as a customer service management (CSM) team — build a CSM team separate from the engineering team. Engineers are an expensive resource and are best tasked with building the product.

· Be thoughtful in building a repeatable playbook. The efficiencies it creates can help the startup hire and train lower-cost employees.

· To keep hiring costs low, build as much of the team as possible in less expensive geographies. To do this successfully, build communication mechanisms so that the team can be effective even when working from a different part of the world.

Data network effects: More data lets you train the algorithm better, which creates a more effective product, which brings in more customers, who bring in more data. As a startup scales, data network effects create momentum, allowing the best companies to accelerate away from their competitors. Because strong data network effects can lead to a “winner take all” dynamic, products conceived with network effects in mind have a significant competitive advantage.

To build data network effects, consider strategies such as:

· Try to negotiate upfront that while the data will remain the customer’s property, the vendor will retain the rights to any resulting insights and learnings.

· Reward customers who contribute their data learnings with access to the insights developed from the larger data sets of diverse customers; restrict access to those broad learnings for customers who do not contribute.

· Create a pricing structure that rewards customers who contribute their learnings. Charge more to those that don’t.
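
The pricing idea in the last item can be expressed as a simple rule. The sketch below is purely illustrative: the function name `annual_price`, the 20% contributor discount, and the example base price are hypothetical, not figures from any real pricing model.

```python
def annual_price(
    base_price: float,
    shares_learnings: bool,
    contributor_discount: float = 0.20,  # hypothetical discount rate
) -> float:
    """Return the annual price for one customer.

    Customers who let the vendor retain learnings from their data get a
    discount; everyone else pays the full base price.
    """
    if not 0.0 <= contributor_discount < 1.0:
        raise ValueError("contributor_discount must be in [0, 1)")
    if shares_learnings:
        return round(base_price * (1.0 - contributor_discount), 2)
    return round(base_price, 2)
```

With a base price of 100,000, a contributing customer would pay 80,000 under these assumed numbers, while a non-contributing customer pays the full amount. Framing the difference as a discount for contributors rather than a surcharge for others is a presentation choice; the economics are the same.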

Building a company is hard. Building an AI/ML company may be harder still, because companies that ingest large amounts of data take on additional complexity. Founders who understand this, and build their companies “right” from the start, will have a huge advantage over the rest.

###

Preeti Rathi is a partner at Ignition Partners, a leading venture capital firm investing in early-stage enterprise software. Focused on AI/ML startups, Preeti is passionate about helping Seed and Series A startups become iconic businesses. Preeti holds an M.S. in Computer Science from Stanford University and an MBA from The Wharton School of Business.