An Introduction to Company Embeddings

Ayesha Hafeez
Published in Filament-Syfter · 7 min read · May 12, 2022

Authors: Ayesha Hafeez, Cynthia Masetto

Photo by Manson Yim on Unsplash

“ML is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention.” (SAS) The patterns a model learns capture the relationships between its input features and the output target. We usually expect instances (e.g. pictures, texts, sensor readings) with similar features to lead to similar predictions. Consequently, how these input features are represented directly affects the quality of the learned patterns.

An illustrative example of the power of ML is music streaming platforms like Spotify. To accurately predict what new music you would enjoy, the Spotify engine recommends new content on the basis of your previous listening history. To build your next playlist, Spotify needs to represent each song as a feature vector of real (numeric) values, where each element encodes the value of one feature of the song. Useful features could include duration, number of songs in the album, artist name and genre. In other cases, however, feature representation is more complicated because the input data is unstructured. Such data usually includes text (lyrics), images (album cover) and audio (melody), and to use it for ML tasks we still need real-valued feature vector representations. The question is: what happens when you combine all the data into embeddings and try to solve different problems?

What’s an Embedding?

An embedding is a translation of a high-dimensional vector into a lower-dimensional space. It’s similar to a summary of a long paper in which the most important and salient details are captured and the less useful information is thrown away. In the same way that a business manager and a software engineer might write different summaries of the same book based on what they find useful in their own lives, embedding models learn to capture the detail that is most pertinent to the task they’ve been trained to do. An embedding captures relationships between different input features by learning to place semantically similar inputs close together in the embedding space, analogous to filling a room with objects and putting the similar objects physically close together.

In essence, an embedding is a feature representation technique that incorporates relationships between multiple input features, as opposed to regular feature representation techniques that encode each feature individually. Furthermore, since embeddings can be learned through self-supervised tasks such as input reconstruction, they can be leveraged to train high-fidelity models for downstream tasks more efficiently: the downstream models become more resilient to tasks with low volumes of annotated data. In the context of the Spotify example, encoding a user’s listening history would help the Spotify engine identify meaningful relationships between different songs, artists, etc. The engine can then learn your preferences and group relevant audio content together to create personalized playlists.
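To make this concrete, the sketch below shows one common way an embedding can be learned through input reconstruction: an autoencoder whose bottleneck layer becomes the embedding. This is a generic PyTorch illustration with made-up dimensions and random data, not the specific model we use.

```python
# A minimal sketch of learning embeddings via input reconstruction
# (an autoencoder). FEATURE_DIM, EMBEDDING_DIM and the toy data are
# illustrative assumptions.
import torch
import torch.nn as nn

FEATURE_DIM = 128   # size of the raw feature vector
EMBEDDING_DIM = 16  # size of the learned embedding

class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder compresses the input into the embedding space.
        self.encoder = nn.Sequential(
            nn.Linear(FEATURE_DIM, 64), nn.ReLU(),
            nn.Linear(64, EMBEDDING_DIM),
        )
        # Decoder tries to reconstruct the original input.
        self.decoder = nn.Sequential(
            nn.Linear(EMBEDDING_DIM, 64), nn.ReLU(),
            nn.Linear(64, FEATURE_DIM),
        )

    def forward(self, x):
        z = self.encoder(x)           # z is the embedding
        return self.decoder(z), z

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(256, FEATURE_DIM)     # toy batch of feature vectors
for _ in range(100):
    reconstruction, _ = model(x)
    loss = loss_fn(reconstruction, x)  # self-supervised: target is the input itself
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After training, the encoder output is the embedding.
with torch.no_grad():
    embeddings = model.encoder(x)
```

Because the training target is the input itself, no annotated labels are needed; the bottleneck forces the model to keep only the information that matters for reconstruction, which is exactly the “summary” behaviour described above.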

The concept of using embeddings is most prevalent in Machine Learning domains such as Natural Language Processing (NLP) and Computer Vision. In NLP, embeddings capture the semantic meaning of text. Examples include GloVe, Word2Vec and BERT, which are used to improve the prediction accuracy of downstream tasks such as sentiment analysis, question answering and topic classification, to name a few. However, little work has been done on using embeddings to represent domain-specific entities (although some notable examples include a model for embedding molecular structures and Google’s music embedding model).

Motivation & Investment Analysis in Practice

One of the main challenges our Private Equity clients face is finding suitable investments in line with their investment criteria (sector, size, geography, etc.). Using automated channels for origination can provide them an advantage over their competitors by getting them ahead of an expensive auction where they would have to compete with other funds for an opportunity.

Through our engagements with various private equity clients, we have seen quite a few commonalities between client requirements around finding companies based on investment strategies, analyzing the market for competitors, and predicting the revenue of privately-held companies. Although all the aforementioned challenges are unique, they largely rely on similar data about a company.

A research analyst working for a PE or hedge fund views an investment opportunity through a number of different lenses, including:

  • Overview: for example, geography, industry and size of the company
  • Financials: refers to financial metrics that indicate the performance and valuation of a company. Examples of metrics include EV, EBITDA, total assets, revenue and expenditure
  • Key People: individuals who make up the management and board of the company and/or are in a position to make driving decisions.
  • Funding: recent funding rounds and investors in the company.
  • News: recent signals in the news and annual reports around major events in a company. Examples of such events include strategic changes, growth, hype, distress, activism etc.

The goal of this research is to improve the process of finding new potential investment opportunities from a universe of companies by automating the methods an analyst follows. We foresee huge potential in incorporating company attributes into a representation that captures the interdependencies between these contexts. We call this representation Company Embeddings: our hypothesis is that, if an embedding is crafted correctly, it can be effectively repurposed for multiple downstream prediction and unsupervised tasks (more on that in the following section!).

Potential Use Cases

Company Similarity — Competitor and Industry Analysis

Investors typically focus on a few industries, geographies and companies that fall within a specific bucket of Total Enterprise Value (TEV). Their investment patterns are tried and tested and do not change significantly over time: new opportunities tend to be similar to previous ones. To automate this specific case, we can translate each company into its company embedding (i.e. vector representation) and compute distance metrics in vector space between all possible pairs of companies. If an investor wants to view companies similar to previous successful investments, we can rank candidate companies using the distance metric. This is comparable to the Spotify engine recommending songs based on the user’s listening history (by genre, release date, etc.) and pushing material by creators similar to the user’s liked or most-listened-to artists.
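Here is a minimal sketch of that ranking step, assuming each company has already been encoded as a vector. The company names, vectors and the choice of cosine similarity are illustrative, not a description of our production system.

```python
# A minimal sketch of ranking companies by embedding similarity.
# Names and vectors below are made up for illustration.
import numpy as np

company_embeddings = {
    "AcmeRetail": np.array([0.9, 0.1, 0.3]),
    "ShopRight":  np.array([0.8, 0.2, 0.4]),
    "HeavySteel": np.array([0.1, 0.9, 0.7]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors: 1.0 means identical direction.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(query: str, k: int = 5) -> list[tuple[str, float]]:
    """Rank every other company by similarity to the query company."""
    q = company_embeddings[query]
    scores = [
        (name, cosine_similarity(q, vec))
        for name, vec in company_embeddings.items()
        if name != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

print(most_similar("AcmeRetail"))
# ShopRight ranks first: its vector points in nearly the same direction.
```

In practice the same idea scales to a full universe of companies with an approximate nearest-neighbour index rather than a pairwise loop.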

Lead Recommendation

Recommendations are typically generated using two approaches, content-based filtering and collaborative filtering, or a hybrid of the two. Content-based filtering recommends items similar to those a user has previously interacted with, whereas collaborative filtering recommends items based on the interests of many other users with similar preferences or demographics. In the context of our Spotify example, content-based filtering recommends material based on its similarity to songs from your listening history (the strategy employed by the similarity model above), whereas collaborative filtering looks at how closely your preferences align with those of other users with similar taste and recommends new songs that are popular amongst those users.

For our specific use case, a hybrid model would be most suitable, as it allows us to generate recommendations by modelling both users and companies separately. This approach also performs better when we have limited interaction data (i.e. the cold-start problem) for a new user or a new company, since we can fall back on user similarity or company similarity respectively. A hybrid lead recommendation model would operate as follows: User A, who is interested in Company X from a universe of companies, would be recommended:

  • Company Y which is similar to Company X in feature space (i.e. semantically close embedding vector)
  • Company Z which is liked by User B who has a similar profile to User A based on certain characteristics and prior engagement history (i.e. liked similar companies)

Let’s solidify this further through an example. Suppose you have a universe of 100,000 companies and records of which companies had been potential leads and which of those turned into successful investment opportunities. Our premise for building a recommendation model would be as follows:

  • Investors want to determine new opportunities (i.e. companies) that are similar to their prior successful ones.
  • PE firms tend to have research teams with specific industry focus or operate on a specific investment hypothesis (i.e. mid-sized ecommerce companies). Therefore, they would like members within the same team to be recommended companies within their focus area.

To leverage these assumptions in practice, we would require user profile information, including their area of expertise and position, along with their lead engagement history. We would then generate embedding vectors for both the users and the companies and recommend new companies by applying a combination of the following techniques to the vectors: matrix factorization, singular value decomposition, neural collaborative filtering and neural factorization machines (see the sketch below). Once we have a recommendation system in place, we can collect implicit feedback by monitoring engagement with the recommended leads, or explicit feedback by asking users to rate the recommendations. This is crucial for understanding how well the recommendation system is doing and can also be used to improve its logic through retraining.
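As a toy illustration of the first of those techniques, the sketch below factorizes a small user-company interaction matrix with truncated SVD and ranks companies a user has not yet engaged with. The interaction data and the scikit-learn-based implementation are assumptions for illustration, not our production pipeline.

```python
# A minimal matrix-factorization sketch for lead recommendation,
# assuming a binary user-company interaction matrix (illustrative data).
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Rows are users, columns are companies; 1 = user engaged with the lead.
interactions = np.array([
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1],
])

# Factorize into low-rank user and company latent factors.
svd = TruncatedSVD(n_components=2, random_state=0)
user_factors = svd.fit_transform(interactions)   # shape: (n_users, 2)
company_factors = svd.components_.T              # shape: (n_companies, 2)

# Predicted affinity of every user for every company.
scores = user_factors @ company_factors.T

def recommend(user_idx: int, k: int = 2) -> list[int]:
    """Return the k highest-scoring companies the user has not engaged with."""
    unseen = interactions[user_idx] == 0
    ranked = np.argsort(-scores[user_idx])
    return [int(c) for c in ranked if unseen[c]][:k]

print(recommend(0))  # company indices recommended for user 0
```

A hybrid system would combine scores like these with the embedding-based company similarity from the previous section, which is what lets it handle cold-start users and companies.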

Predicting Missing Values

Another potential application is using company embeddings to predict missing information. The hypothesis here is that companies with missing data points are likely to have similar properties to other companies in the same region of the embedding space (e.g. a company with missing revenue is likely to have revenue similar to that of a neighboring company in the same sector with a similar number of employees, similar financial history and similar prominence in business news). This approach could also help estimate the valuation of private companies by using the baseline information and valuations of their public counterparts. Uncovering such information effectively would prove valuable to a lot of investors. In practice, the missing data point of a private company can be estimated by taking the average, median or mode of its closest neighbors in vector space.
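A minimal sketch of that nearest-neighbor imputation follows, assuming scikit-learn and illustrative embeddings and revenue figures:

```python
# A minimal sketch of filling a missing revenue figure from a company's
# nearest neighbors in embedding space. All data here is illustrative.
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Embeddings for companies with known revenue.
known_embeddings = np.array([
    [0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8],
])
known_revenues = np.array([10.0, 12.0, 55.0, 60.0])  # e.g. in $m

# Embedding of a company whose revenue is missing.
target = np.array([[0.85, 0.15]])

# Find the k closest companies in embedding space.
knn = NearestNeighbors(n_neighbors=2).fit(known_embeddings)
_, neighbor_idx = knn.kneighbors(target)

# Estimate the missing value as the neighbors' mean (median or mode
# would work the same way, as mentioned above).
estimate = known_revenues[neighbor_idx[0]].mean()
print(f"Estimated revenue: {estimate:.1f}")  # mean of 10.0 and 12.0 -> 11.0
```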

Conclusion

We hope this first blog was helpful in understanding what embeddings are and what their potential use cases look like.

We will be conducting a series of experiments on different ways to devise company embeddings, including exploring TabNet for structured data and identifying whether it is best to train the embedding model on the generic task of reconstructing the input vector or to train it specifically for a certain use case.

Literature Review

  1. Machine Learning: What it is and why it matters | SAS
  2. Neural Factorization Machines for Sparse Predictive Analytics, https://arxiv.org/abs/1708.05027
  3. Neural Collaborative Filtering, https://arxiv.org/abs/1708.05031
  4. Modelling tabular data with Google’s TabNet | Mikael Huss
