Natural Language Processing for Competitive Market Analysis

Mar 11, 2020

Natural Language Processing (NLP) involves the disciplines and techniques that enable computers to process human language. This article discusses how NLP can be leveraged to discover insights that complement traditional market mapping.

Flask app deployed on Heroku to visualize competitive landscape

Problem: When considering new market opportunities or potential investments, evaluating the competitive landscape is standard due diligence. The procedure typically involves first enumerating competitors and then analyzing your business’s relative position via SWOT, SOAR, or another 2x2 matrix-yielding process adored by MBA professors. While these techniques are useful and battle-tested, the danger lies not in lethargic analysis but in failing to properly identify the competition.

Example: Let’s say you’re reviewing an investment in a startup that designs high fidelity portable audio equipment. Your research has thoroughly identified all major players in the audio products universe: Bose, JBL, Sony, along with fledgling upstarts. Good job, let the SWOTing begin... But wait: Does your list include Shenzhen Hangsheng Electronics Company, Ltd? Neither did mine. From their website, Hangsheng is “…the leading Chinese company specialized in developing, manufacturing and sale automotive electronic products.”

Shenzhen Hangsheng Electronics Company: Automotive Electronics Products

Hangsheng has captured 30% market share in China domestically with 2,000 employees, and its European arm, Hangsheng Technology GmbH, has another 4,000 employees distributing parts and components to 48 major automobile manufacturers. Considering car makers consume more electronics than any other industry, our competitors list may be missing one of the world’s largest audio products companies, and a firm already building audio components could more readily enter the space than a power tools manufacturer.

DeWalt 20-Volt MAX Bluetooth Speaker — Home Depot

To have found Hangsheng and accurately painted the competitive landscape in our market research, apparently we would have needed to include companies from industries that manufacture components, consume audio parts, and those generally requiring portability: electronics, automotive, and construction. The point of this example is to demonstrate that truly exhaustive market research is… exhausting. So in lieu of trading depth for breadth, if we want to go a mile wide AND a mile deep, computers are the solution.

Impetus: The idea for this project came from a study by MMC Ventures citing that 40% of EU startups advertising themselves as AI were not materially utilizing artificial intelligence:

“We individually reviewed the activities, focus and funding of 2,830 purported AI startups in the 13 EU countries most active in AI…approximately 60% of the cases — 1,580 companies — there was evidence of AI material to a company’s value proposition.” -MMC Ventures, The State of AI: Divergence, 2019

Massive respect for MMC’s ambition and execution; it would require a regiment of motivated MBAs to analyze nearly 3,000 companies, and precious few could technically evaluate a codebase. The question then emerged: rather than manually reviewing thousands of companies, how can a method be developed to programmatically quantify a company’s position on the competitive landscape?

Source: MMC Ventures (2018 data to October)

Companies define themselves, describe their products, and differentiate their services with language. However, computers only operate on numbers (fun fact: at the hardware level, much of their arithmetic is built from addition). NLP involves converting language to vectors for processing, so it’s the logical weapon of choice. The friction comes from dimensionality: a vector that represents a document (a group of words, such as a company description) can have hundreds and often thousands of dimensions. The problem is that such a hyperspace cannot be visualized. We could still compute competitive proximity, but without a visualization, leadership may (rightfully) ignore your insights.
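To make "converting language to vectors" and "competitive proximity" concrete, here is a minimal sketch using only the Python standard library. It is not the project's code: the three descriptions are invented, the whitespace tokenizer is a crude stand-in for a real one, and cosine similarity is just one common way to measure how close two word-count vectors are.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count lowercase word occurrences (a crude stand-in for a real tokenizer)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical descriptions, invented for illustration only.
speaker_startup = bag_of_words("portable high fidelity audio speakers")
car_audio_maker = bag_of_words("automotive audio electronics and speakers")
tool_maker = bag_of_words("cordless power tools for construction")

similar = cosine_similarity(speaker_startup, car_audio_maker)
dissimilar = cosine_similarity(speaker_startup, tool_maker)
print(f"speaker startup vs. car audio maker: {similar:.2f}")
print(f"speaker startup vs. tool maker:      {dissimilar:.2f}")
```

Each distinct word is one dimension, which is why real corpora quickly reach thousands of dimensions and why a score like this is easy to compute but hard to draw.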

Process: The process I employed involves first vectorizing company descriptions and associated categories from Crunchbase’s list of startups in Artificial Intelligence & Machine Learning with Scikit-learn’s CountVectorizer. The startup list focused on ventures in the Bay Area, Los Angeles, New York, Boston, and Austin. These vectors were then processed with a technique called Latent Dirichlet Allocation (LDA), which estimates the probability that each document belongs to each of a set of groups, or topics. Finally, the visualization problem is solved using a dimensionality reduction algorithm called t-distributed Stochastic Neighbor Embedding (t-SNE), which maps these hyperspace vectors to a 2D plane. The resultant (x, y) coordinates are plotted using Bokeh in a Flask app deployed on Heroku.
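The three modeling steps can be sketched with Scikit-learn as follows. This is an illustration under stated assumptions rather than the project's actual code: the eight descriptions are invented, and the topic count, perplexity, and random seeds are arbitrary choices for a toy corpus. The Bokeh and Flask layers are left out.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE

# Hypothetical company descriptions, invented for illustration only.
descriptions = [
    "computer vision platform for retail analytics",
    "computer vision software for autonomous vehicles",
    "natural language chatbot for customer support",
    "conversational chatbot assistant for customer service",
    "machine learning platform for fraud detection in banking",
    "machine learning risk scoring for banking and insurance",
    "speech recognition software for medical transcription",
    "voice and speech analytics for call centers",
]

# Step 1: turn each description into a vector of word counts.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(descriptions)

# Step 2: LDA turns each count vector into a probability distribution over topics.
# n_components=3 is an arbitrary assumption for this toy corpus.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
topic_probs = lda.fit_transform(counts)

# Step 3: t-SNE maps the topic vectors onto a 2D plane.
# Perplexity must be smaller than the number of documents, hence the low value.
tsne = TSNE(n_components=2, perplexity=3, init="random", random_state=0)
coords = tsne.fit_transform(topic_probs)

# Each row of coords is an (x, y) position that could be handed to a plotting library.
for text, (x, y) in zip(descriptions, coords):
    print(f"{x:9.2f} {y:9.2f}  {text}")
```

With so few documents the layout is not meaningful; the point is only the shape of the data at each step: sparse counts, then one row of topic probabilities per company, then one (x, y) pair per company.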

User Guide: The startups are depicted by dots and colored by city. Hovering over a dot or group of dots will display the company names and categories. The app visualizes the NLP process’s computational findings, but it’s still up to the human to discover and confirm insights:

Perfect Competition: 13 companies sharing a common coordinate

Tightly clustered groups may indicate a more competitive segment. In the Perfect Competition figure, 13 companies share so much commonality that their positions are indistinguishable.
Dots in relative isolation may indicate a novel idea or uninhabited market. This observation may also simply indicate novel language.

Emergence: New York City firms overrepresented in cluster

Stratified or linear groupings may result from firms uniquely applying a common core technology, such as computer vision.
Firms in similar industries may tend to geographically self-organize as described in M. Mitchell Waldrop’s Complexity. This phenomenon is depicted in the Emergence figure, where the cluster shows a relatively high representation of New York City firms.
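The first two observations above (companies sharing a coordinate, and dots in relative isolation) can also be checked programmatically once the (x, y) coordinates exist. The sketch below is not part of the app: the company names, coordinates, rounding precision, and distance threshold are all hypothetical.

```python
import math
from collections import defaultdict


def shared_coordinates(points, decimals=2):
    """Group company names whose rounded (x, y) coordinates coincide."""
    groups = defaultdict(list)
    for name, (x, y) in points.items():
        groups[(round(x, decimals), round(y, decimals))].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


def isolated(points, min_distance):
    """Return names whose nearest neighbor is farther away than min_distance."""
    result = []
    for name, xy in points.items():
        nearest = min(
            math.dist(xy, other_xy)
            for other_name, other_xy in points.items()
            if other_name != name
        )
        if nearest > min_distance:
            result.append(name)
    return sorted(result)


# Hypothetical 2D positions, invented for illustration only.
points = {
    "Alpha": (1.00, 1.00),
    "Beta": (1.00, 1.00),
    "Gamma": (1.20, 0.90),
    "Delta": (9.00, 9.00),
}

overlapping = shared_coordinates(points)
outliers = isolated(points, min_distance=5.0)
print("Sharing a coordinate:", overlapping)
print("In relative isolation:", outliers)
```

A flag from either function is only a prompt for a closer look: as noted above, isolation may reflect novel language rather than a novel market.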

Future Work: This project’s objective was a proof of concept rather than a product. Language is, by necessity, meant to be precise. However, it’s blunted by the regional, cultural, and creative variations people introduce over time. Thus, the most significant source of error is not the use of linguistics to calculate market position (humans do the same), but rather the variance writers introduce when articulating a firm’s value proposition. But just as we understand “puppy” and “small dog” to define essentially the same concept, NLP can also be used to solve this problem by distilling imprecise or verbose language to boilerplate components, so stand by for Version 2!

Thanks for reading, check out the app on Heroku, and reach out with any questions on LinkedIn. For technical information, see the Venture-Market-Proximity repo on GitHub.

Written by Russell W. Myers