Data-related Projects Are Vulnerable To This Underwater Rock

Until the very end, you don’t know whether you will successfully mine knowledge out of real-world data.

Maxim Kolosovskiy, PhD
3 min read · Jun 29, 2022
Photo by Lukas Blazek on Unsplash

Regular vs Data-related projects

By a regular Software Engineering (SWE) project, I mean a project that doesn’t need to collect or mine any knowledge from real-world data. By a data-related project, I mean a project whose success and usefulness depend on mining such knowledge.

In a regular project, most of the risks that could derail it are identified at early stages. Such a project can fail fast, which is much better than failing slowly after a large investment. In a data-related project, certainty about success comes later: only once the necessary data has been collected and the hypotheses for mining useful knowledge have been confirmed.

Image by author

Where does the uncertainty come from?

Let’s consider an abstract classification problem as an example: a classifier needs to separate blue (positive) and red (negative) instances (see below). In practice, a classifier is a compromise between a conservative one (which heavily penalizes false positives, i.e. negative instances classified as positive) and a loose one (which penalizes false negatives, i.e. positive instances that are not classified as such). Think of it as a virtual slider that picks the most reasonable trade-off between these two extremes.

Image by author
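To make the “virtual slider” concrete, here is a minimal sketch in Python using scikit-learn and a synthetic dataset (the data, model, and threshold values are illustrative assumptions, not from any real project). Sliding the decision threshold moves the classifier between the conservative and loose extremes:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

# Synthetic stand-in for the blue (positive) / red (negative) instances.
X, y = make_classification(n_samples=2000, n_features=5, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

# The "virtual slider": a high threshold gives a conservative classifier
# (few false positives), a low threshold gives a loose one (few false negatives).
for threshold in (0.9, 0.5, 0.1):
    preds = scores >= threshold
    print(f"threshold={threshold:.1f}  "
          f"precision={precision_score(y_test, preds):.2f}  "
          f"recall={recall_score(y_test, preds):.2f}")
```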

When designing input signals for a classifier, we normally don’t know the shape of the cluster of positive instances (or whether there are several clusters). Furthermore, we don’t know how a given signal and a cluster relate spatially, namely their relative positions, shapes and sizes. Thus, we don’t know whether we will be able to combine the signals into a good classifier until the data has been collected and we can experiment. In other words, until we have inspected the structure of real data, we cannot really prove or refute any hypothesis about a potential classifier’s performance; anything else is mere speculation.
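As a rough illustration of that uncertainty: only after some data is in hand can we measure how well the candidate signals separate the classes, individually or combined. A minimal sketch, again with synthetic stand-in data (all names and numbers below are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical collected data: each column is one candidate signal.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=2,
                           n_redundant=0, random_state=1)

clf = LogisticRegression()

# How well does each signal separate the classes on its own?
for i in range(X.shape[1]):
    auc = cross_val_score(clf, X[:, [i]], y, cv=5, scoring="roc_auc").mean()
    print(f"signal {i}: AUC={auc:.2f}")

# ...and combined? Only real data can tell whether the combination is good enough.
auc_all = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
print(f"all signals combined: AUC={auc_all:.2f}")
```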

How to address this?

This uncertainty (the ‘underwater rock’ from the title) may complicate pitching a project. Here is some advice for mitigating it:

  • Enrich the initial and design stages with manual data collection so that hypotheses can be tested at a small scale.
  • Collect data for several hypotheses or signals. This leaves room to maneuver: you can keep only the best hypotheses or signals, or a combination that works, without having to start a new signal collection from scratch.
  • Position your product as best-effort assistance rather than a magical substitute for what a human normally does. That way, you can start small with decent results and then improve gradually.
  • Form hypotheses before settling on input signals, e.g. “I expect that if X > T, then the product should output R” (see the sketch below). The opposite approach, collecting some general signals first and looking for insights afterwards, is also valid but less desirable.
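For instance, a hypothesis of the form “if X > T, then output R” can be checked on a small, manually labeled sample long before any large-scale pipeline is built. The sketch below is purely hypothetical: the signal, the threshold and the labels are made up for illustration.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical small, manually labeled sample: one candidate signal x per
# instance plus a human label (1 = the product should output R).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
labels = (x + rng.normal(scale=0.5, size=200) > 0.8).astype(int)

T = 0.8  # the threshold in the hypothesis "if x > T, output R"
preds = (x > T).astype(int)

print(f"precision={precision_score(labels, preds):.2f}, "
      f"recall={recall_score(labels, preds):.2f}")
# If these numbers are far too low, the hypothesis can be rejected cheaply,
# before committing to any large-scale signal collection.
```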

Maxim Kolosovskiy, PhD

SWE & Automation Enthusiast @ Google | PhD | ACM ICPC medalist [The opinions stated here are my own, not necessarily those of my employer.]