Signal-to-noise Ratio

Olivier Huez
Red River West
Dec 5, 2022

How to apply a telecom principle to algorithmic sourcing

From Telecom to Data-driven Venture Capital Investing

I graduated as a telecom engineer, taking classes in mobile network dimensioning, communication theory and so on. We learned about the inner workings of mobile networks, and I was fascinated by how such a relatively low-power signal from a phone could allow you to have a clear conversation even when the cell tower was a couple of kilometres away, with trees and walls in the way!

After graduating, I stayed in telecom and worked at Orange for a few years, first in R&D and then in product management in Washington, before moving into more financial roles: CFO of two start-ups, then investor at C4 Ventures in London.

But telecom subjects didn’t completely disappear; I even made one investment in that space, in an Israeli company called DriveNets… Admittedly though, most of the technicalities, concepts and formulas of the telecom world have faded a bit in my memory, replaced by more financial ones.

There is one, however, that has caught up with me over the last few months at Red River West: the signal-to-noise ratio.

Indeed, we are very proud at Red River West to have developed an advanced algorithmic sourcing platform, which we call RAMP (for Red River Algorithmic Management Platform). Abel actually wrote about what we do in a series of articles over the last few weeks.

As we keep expanding the capabilities of this platform and adding new data sources, the concept of signal-to-noise ratio is becoming more and more relevant.

Let’s see what it means in the context of algorithmic sourcing, why it matters and what we can do about it…

Signal-to-noise ratio

SNR, or signal-to-noise ratio, is the ratio between the desired information (the power of the signal) and the undesired part (the power of the background noise). The formula is therefore the power of the signal divided by the power of the noise: SNR = P_signal / P_noise.

A ratio greater than 1 signifies more signal than noise!
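For readers who like to see it written down, here is the textbook formula in a couple of lines of Python, including the decibel form that telecom engineers usually quote (the 30 W / 15 W figures reappear in an example further down):

```python
import math

def snr(signal_power: float, noise_power: float) -> float:
    """Plain signal-to-noise ratio: power of the signal over power of the noise."""
    return signal_power / noise_power

def snr_db(signal_power: float, noise_power: float) -> float:
    """The same ratio expressed in decibels: 10 * log10(SNR)."""
    return 10 * math.log10(signal_power / noise_power)

print(snr(30, 15))     # 2.0   -> twice as much signal as noise
print(snr_db(30, 15))  # ~3.01 dB
```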

Imagine yourself in a bar on a Thursday night (I used to live in London!). Understanding what your friends are saying over the chatter of your fellow drinkers makes this concept very clear: as the pub fills up and the background noise increases, you must increase the power of your signal by speaking louder, getting closer to your friend’s ear or repeating the same sentence several times!

But what does it mean for data-driven VCs focusing on algorithmic sourcing?

First, a quick reminder on what algorithmic sourcing is:

As we live in a digital world, all companies leave footprints online: end users download and review apps, employees update their LinkedIn profiles, companies publish job descriptions, founders talk at public events and so on. These footprints hold interesting information on a company’s health and its business.

At Red River West, we’re collecting millions of data points for more than 25,000 startups in Europe. We combine these data points to rate the companies and identify the best investment opportunities.

We’re looking for the most promising startups. It usually starts with those which are growing the fastest. Therefore, our main “signal” is typically the pace at which a company is developing…

We also look for other signals. For example, we calculate an ESG score for all startups; in that case, the signal we chose is the level of focus on ESG matters by the company’s management.

What does “Noise” mean in the context of Algorithmic Sourcing?

In our case, noise is simply all the unwanted, irrelevant data. To reuse the example of ESG scoring, greenwashing typically adds a lot of noise that could make our reading of the ESG score completely biased… which is exactly the point of greenwashing.

More generally, the various signals are calculated from a combination of smaller indicators that are very hard to identify and leverage in the midst of all the data generated online.

That’s why looking for “weak” signals about young private companies is no small feat, and certainly more complicated than it is for investors focusing on listed companies, where the information is more structured and widely available!

Law of diminishing returns

Algorithmic sourcing’s basic principle is to start from a database of individual startups (at least a few thousand, depending on the investment thesis) and enrich the platform’s “knowledge” of a given startup by collecting a wide variety of data points. These data sources can be structured or unstructured; the data can be quantitative or qualitative, discrete or continuous, and so on.

An example of unstructured data is the technology stack used by a given company, as described in a press release or a job description; an example of quantitative, (quasi-)continuous data is the web traffic to its website.

Some of this data is relatively easy to identify, and a data-driven investor would of course typically start with the most obvious sources. For example, the growth in a company’s employee count (quantitative data) is generally a good proxy for the growth of its revenue.
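As a minimal sketch of what such a quantitative signal can look like (the numbers and layout below are illustrative, not RAMP’s actual schema):

```python
# Hypothetical monthly head-count snapshots for one startup.
headcounts = [42, 45, 51, 58, 66, 74]

def avg_monthly_growth(series: list[int]) -> float:
    """Average month-over-month growth rate, a rough proxy for revenue growth."""
    rates = [(later - earlier) / earlier
             for earlier, later in zip(series, series[1:])]
    return sum(rates) / len(rates)

print(f"{avg_monthly_growth(headcounts):.1%}")  # ~12.0% per month
```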

Let’s take a qualitative data point: a successful entrepreneur who, a few months after selling her company, updates her LinkedIn profile with a new title saying “working on something new” is obviously something early-stage funds would value a lot.
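A toy version of that trigger could be as simple as a keyword rule on profile-title changes; the phrases and function below are hypothetical, just to make the idea concrete:

```python
import re

# Illustrative phrases that often hint at a founder starting again.
TRIGGER_PATTERNS = [
    re.compile(r"working on something new", re.IGNORECASE),
    re.compile(r"\bstealth\b", re.IGNORECASE),
]

def is_new_venture_trigger(old_title: str, new_title: str) -> bool:
    """Fire only when the title actually changes to a trigger phrase."""
    if old_title == new_title:
        return False
    return any(p.search(new_title) for p in TRIGGER_PATTERNS)

print(is_new_venture_trigger("CEO at Acme", "Working on something new"))  # True
```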

Using these kinds of “easy” trigger points is certainly interesting, and a smart approach compared to most VC funds, but as they are rather obvious, they don’t provide any differentiation against other data-driven investors…

In order to differentiate and remain ahead of the pack, we need to look for less “obvious” data points.

We then hit another famous principle: the law of diminishing returns…

As we add new data sources whose quality and relevance are lower than the first ones we integrated, we get a less interesting outcome. Our signal certainly gets stronger, but the signal-to-noise ratio can dip below an acceptable level, which makes it unreadable…

Noise adds up, value doesn’t

Another less obvious challenge is that the noise levels from different sources add up, but the signals don’t…

Indeed, most of the time, the new signal will overlap with information we already hold on a given startup, so adding new sources, even ones with an attractive SNR, can degrade the global ratio…

Let’s take a simple example and say that our existing signal power is 30 W and our noise level is 15 W. We have an acceptable signal-to-noise ratio (SNR = 2).

Now we find an interesting data source whose signal power is 15 W with a noise level of 5 W (SNR = 3). It looks like adding this new source will increase our ratio and therefore the general quality of the data… But in practice, this new signal is likely to overlap with information we already have, so the incremental signal added is maybe only 5 W, while the noise always adds up. In that case, the new signal-to-noise ratio is 1.75 (35/20), which is unfortunately lower than what we had before…
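The arithmetic is worth writing down, because the naive intuition (SNR = 3 > 2, so add it!) is exactly what trips you up:

```python
def combined_snr(signal: float, noise: float,
                 incremental_signal: float, added_noise: float) -> float:
    """Noise always adds up; only the non-overlapping part of the signal counts."""
    return (signal + incremental_signal) / (noise + added_noise)

# The new source alone looks great: SNR = 15 / 5 = 3.
# But only ~5 W of its 15 W of signal is genuinely new information.
print(combined_snr(30, 15, incremental_signal=5, added_noise=5))  # 1.75 < 2.0
```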

Where does the noise come from?

Let’s get practical and list the various reasons why a given data source may generate more noise than signal…

Some of them are obvious: data sources can be incomplete (e.g. there are gaps in the time series), unreliable (i.e. the data is sometimes wrong) or not significant (marginal added value), but there are other reasons we might encounter:

  • Delayed/old: a good example of interesting and reliable data is a company’s financial accounts, which, in most countries, must be reported to the registry of commerce. But accounts for a given year usually become available much later. In a world where things move fast, information that is a year and a half old has limited value.
  • Difficult to extract: if the data comes from unstructured content. E.g. a press release or an analysis of a company written by a good journalist can contain very insightful data, but extracting it in an automated, reliable and consistent way can be complicated. This is also true if the information can’t be scraped from a website.
  • Frozen in time: it can look like the data collected doesn’t get updated anymore. This can happen when a company rebrands, which happens fairly often in the startup world. E.g. Google was initially called BackRub; in France, Chauffeur Privé changed to Kapten and then to FreeNow… It sometimes takes many months for the new name to cascade to the various data sources. In the meantime, data for the company is no longer updated, and a duplicate company with no history can even be created.
  • Misleading: sometimes the data itself doesn’t mean much without additional information or knowledge. E.g. some business sectors show significant seasonal effects: tourism sees a spike in activity in January, so looking at quarter-on-quarter data between Q4 and Q1 would be misleading (see the sketch after this list)…
  • Not consistent with other data: a given data source may display a positive signal while another seems negative… If a startup’s employee count increases fast and the company receives positive sentiment from the press, but its web traffic is declining, what does it mean?
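To make the seasonality point concrete, here is a small sketch with made-up quarterly web-traffic figures for a travel startup; comparing year on year removes most of the seasonal distortion:

```python
# Hypothetical quarterly web traffic (thousands of visits); tourism spikes every Q1.
traffic = {
    "Q1-2021": 120, "Q2-2021": 80, "Q3-2021": 70, "Q4-2021": 60,
    "Q1-2022": 150, "Q2-2022": 105, "Q3-2022": 95, "Q4-2022": 82,
}

qoq = (traffic["Q1-2022"] - traffic["Q4-2021"]) / traffic["Q4-2021"]
yoy = (traffic["Q1-2022"] - traffic["Q1-2021"]) / traffic["Q1-2021"]

print(f"QoQ: {qoq:+.0%}")  # +150% -- mostly seasonality, not real growth
print(f"YoY: {yoy:+.0%}")  # +25%  -- the underlying trend
```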

We’ve come across all these sources of noise at Red River West, and there are other reasons why a data source might add more noise than real value for algorithmic sourcing, but these are the main ones worth covering for now.

Cutting through the noise

Going back to telecom principles…

Communication standards like Ethernet, GSM and others have developed ways to limit the effect of noise and optimize the signal-to-noise ratio, so that your mobile phone doesn’t toast your brain by broadcasting at high power! (E.g. GSM uses frequency hopping, redundancy and so on.)

Likewise, there are interesting strategies to optimize the signal-to-noise ratio for algorithmic sourcing.

Here are a few ideas:

  • Clean data: each data source integration should include a series of checks and clean-ups. It’s important to detect potential errors and, in particular, to avoid duplicates or mis-allocations. This also raises the question of the unique key used to identify a startup: e.g. a startup’s commercial name is sometimes different from its legal name, and it can also be known by its main product’s name!
  • Benchmark: spend time benchmarking the various data sources: a given data point can often be obtained from several of them. E.g. several providers offer web traffic data; some are generally more reliable than others, but it can also vary depending on your investment thesis or geographic target: for example, websites with low traffic are not reported by some providers.
  • Ditch “bad” sources: sometimes, more actually means less… We’ve had cases where we decided to stop using a given data source after a few months of testing. Most of the time, we managed to find similar data points through different, more reliable means!
  • Apply some maths: missing data points or volatile data can be easier to integrate using regression (linear or otherwise), moving averages etc. The methods are not advanced mathematics, but it’s important to choose and apply them carefully (a minimal sketch follows this list).
  • Stick to your investment thesis: focusing on signals that make sense for the fund’s investment thesis is key to optimizing the signal-to-noise ratio. E.g. mining patent databases is useful for deep-tech funds; likewise, a life-science fund will learn a lot from monitoring clinical trials; but neither is relevant for a fund investing in marketplaces.
    Geography is also an important factor: because Red River West invests in Europe at the early-growth stage, our data pipeline is limited to our use case. This allows us to prioritize and select data for this geography, while we can still manually add any other company, typically competitors of the ones we’re tracking.
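For the “apply some maths” point, here is a minimal sketch of the kind of method we mean, on an illustrative series with one missing monthly data point (not real RAMP data):

```python
# A volatile monthly series with one gap.
web_traffic = [100.0, 140.0, None, 95.0, 160.0, 130.0]

# Fill the single gap by linear interpolation between its neighbours.
filled = web_traffic[:]
for i, value in enumerate(filled):
    if value is None:
        filled[i] = (filled[i - 1] + filled[i + 1]) / 2

def moving_average(series: list[float], window: int = 3) -> list[float]:
    """Simple moving average to smooth out volatility."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

print(filled)                  # [100.0, 140.0, 117.5, 95.0, 160.0, 130.0]
print(moving_average(filled))  # [119.17, 117.5, 124.17, 128.33] (rounded)
```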

There are other principles and methods that we’ve applied at Red River West to keep our signals “reliable”. Actually, implementing these strategies was a critical effort during the initial phase of setting up our sourcing platform and it remains an everyday concern today.

Some of these methods are a bit more complex (for example, we’ve developed confidence scores for each signal). We’ll keep those as part of RRW’s secret sauce for now!

But I’m happy to help if you have a particular question, or if these seem obvious and you need a hint about other methods! Drop me a line!

What is an acceptable level then?

This does leave two interesting questions:

  • How can we quantify the signal-to-noise ratio?
  • What is an acceptable level?

If I remember well from my studies, the bare minimum in telecom is around 20 dB, but that obviously doesn’t mean much for venture capital… The telecom analogy is a useful guiding principle, but it doesn’t go all the way!

It’s indeed almost impossible to quantify the power of the signals or the power of the noise in our case, but that doesn’t mean we can’t answer the question.

For that, I need to introduce one last concept: the confusion matrix, which compares what the algorithm predicted with what actually happened. In our case, “positive” means an exciting investment opportunity…

A confusion matrix looks like this: the algorithm’s prediction on one axis, reality on the other, giving four cells: true positives, false positives, false negatives and true negatives.

We obviously want to maximize true positives and true negatives (i.e. the cases where the algorithm correctly assessed the opportunity), but in venture capital we can live with a certain percentage of false positives (i.e. the algorithm “thought” it was an exciting opportunity but got it wrong…). What we want to avoid are the false negatives, i.e. the cases where the algorithm misses an exciting opportunity. Indeed, it’s critical for us not to miss a great investment; in exchange, we can accept that, sometimes, a company detected by the algorithm doesn’t eventually make the cut.

By using a well-crafted data set as a reference, we can see the impact of adding new data sources, new geographies and so on to our confusion matrix: if a new data source improves the confusion matrix, it’s probably a good idea!
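As a toy illustration of that reference-set approach (the labels and predictions below are made up), one can compute the four cells directly and keep an eye on recall, the metric a VC fund most needs to protect:

```python
# 1 = exciting opportunity, 0 = not for us; a tiny hand-labelled reference set.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 1, 0, 0, 1, 1]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # caught great deals
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # missed great deals (the killer)
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # flagged but not great (liveable)
tn = sum(1 for a, p in pairs if a == 0 and p == 0)

recall = tp / (tp + fn)     # share of great deals the algorithm caught
precision = tp / (tp + fp)  # share of flagged deals that were truly great

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")                # TP=3 FP=2 FN=1 TN=2
print(f"recall={recall:.0%} precision={precision:.0%}")  # 75% / 60%
```

If a candidate data source lifts recall on the reference set without wrecking precision, it is probably pulling its weight.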

When I was a student, one of my teachers said that what would make the difference between the brilliant students and the rest is that the former would still be able to do matrix calculations in 20 years… As I reach this milestone, I honestly think it would take me a considerable amount of time to invert a matrix today, but I’m happy that I still remember and apply some of the principles I learned at the time!

Olivier
