Getting data that no one else has may feel like you’re searching for a needle in a haystack…but it CAN be done!

You don’t need big data to build a top-performing AI model. You need good, high-quality data.

Data Curation will be your competitive edge

Mar 10, 2023

If data is carefully prepared, a company needs far less of it than they think. With the right data, companies with just a few dozen examples or a few hundred examples can have A.I. systems that work as well as those built by giant companies that have billions of examples.

Big data was way over-hyped.

It’s important to note that AI models are not highly proprietary and are not highly coveted assets. Case in point: the oh-so-popular Large Language Models (LLMs). At this point models are highly commoditized, open-sourced and mean absolutely nothing in the absence of the data you give them to train on.

Most companies need to move away from a model-centric approach and instead focus on a data-centric approach.

You can break down machine learning & AI into two separate parts: training and inference. The real key is in the training.

So if you’re a company, you are probably asking yourself how to get an advantage over your competitors. Where is the real value? How do I get great precision/recall or amazing sensitivity/specificity without massive amounts of training data?

The real value…the real secret sauce is finding non-obvious + proprietary sources of data to feed your models.

Real-World Example

Let’s say you build a model to detect tumors. There are lots of people doing this. But which company will win? The company that wins may be the one that vertically integrates, buys a hospital system and gets access to patient data that is completely proprietary to them and covers the largest number of women across all age groups and ethnic categories.

The above is an example of the business moves that we’ll see over the next 5–10 years, which is exciting.

No, this does not mean that you must buy an entire hospital to get an advantage. The example above is on the extreme end. But this basic principle can absolutely be applied to industries outside of healthtech/healthcare. It’s also very relevant for:

manufacturing
food & beverage/CPG
agriculture
automotive
electronics
pharmaceuticals and infrastructure

Tips to keep top-of-mind

Statistics is so important. If you don’t know stats you won’t be able to grasp much of machine learning. Statistical significance is crucial (for example, reporting results with 95% confidence intervals), so it’s important to work with great statisticians throughout.
Your model needs to learn from examples that are reflective of the final population that you’ll be making predictions for.
Models should not be assumed to generalize across population groups, so be sure that your population group is represented very well in your training set. I’ve seen model performance vary quite drastically across age and race, as well as the original source of an image.
Stratify. When analyzing performance, be sure to assess each sub-group separately.
Be transparent. Inform everyone involved [in easy-to-understand English]. And get their consent.
Get domain expertise involved.
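
To make the stratification and confidence interval tips concrete, here is a minimal sketch in Python. It assumes binary labels (1 = disease present, 0 = absent) and uses the Wilson score interval as one common way to get a 95% confidence interval for a proportion; that choice is mine, not a requirement. The function names, the sub-group names and the eight example records are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
from math import sqrt


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95%."""
    if total == 0:
        return (0.0, 1.0)  # no data: the interval is uninformative
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def stratified_report(records):
    """records: iterable of (subgroup, true_label, predicted_label), labels 0/1."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "tn": 0, "fp": 0})
    for group, truth, pred in records:
        if truth == 1:
            key = "tp" if pred == 1 else "fn"
        else:
            key = "tn" if pred == 0 else "fp"
        counts[group][key] += 1

    report = {}
    for group, c in counts.items():
        pos = c["tp"] + c["fn"]  # actual positives in this sub-group
        neg = c["tn"] + c["fp"]  # actual negatives in this sub-group
        report[group] = {
            "n": pos + neg,
            "sensitivity": c["tp"] / pos if pos else None,
            "sensitivity_ci": wilson_interval(c["tp"], pos),
            "specificity": c["tn"] / neg if neg else None,
            "specificity_ci": wilson_interval(c["tn"], neg),
        }
    return report


# Hypothetical toy data: (sub-group, true label, predicted label).
records = [
    ("under_50", 1, 1), ("under_50", 1, 1), ("under_50", 0, 0), ("under_50", 0, 1),
    ("over_50", 1, 0), ("over_50", 1, 1), ("over_50", 0, 0), ("over_50", 0, 0),
]
report = stratified_report(records)
for group, stats in report.items():
    print(group, stats)
```

With only a handful of examples per sub-group the intervals come out very wide, which is exactly why each sub-group needs to be well represented before you trust its numbers.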

What is a Data Curator / Data Collector / Data Gatherer?

Allow your mind to imagine the most ideal source for the problem you’re trying to solve. What could be that very unique, singular source of data for your use case? A unique source of data (especially when used in reinforcement learning) will make your output skyrocket!

Today, there are very few of these unique sources of data. And that is where your opportunity lies. Some call it Data Curation, Data Gathering or Data Collection.

A Data Curator builds high quality datasets using traditional data gathering methods and by running experiments. They create unique datasets that become a competitive advantage for the business.

A Data Curator comes into play by doing two things: Data Discovery and Data Curation.

Data Discovery: This starts with understanding the business needs (what the company is trying to accomplish) and the company’s strategic goals. It also means asking which metrics the business uses to measure success.

Data Curation: Creatively go out and gather these datasets — typically through a controlled data gathering experiment.

The data curator needs to have a strong background in data science and machine learning because the datasets they curate are going directly into models to:

deliver specific insights to the business.
enable specific features in a product that no one else outside of that company has, because that unique, high-grade data was curated through this very rigorously controlled process.
enable models to create something novel rather than the same generic insights from datasets that everyone has access to.

Your competitive advantage is going to be your process and your experiment! The discovery is figuring out a very interesting way to go out there and gather data that has never been gathered before.

Discovery is not about going out there to look for APIs or open data; it’s about creating an experiment.

Curation requires:

Knowing what format the data should be in
Deciding what metadata should be associated with the dataset to make it useful
Making it easier to gather follow-up data
Building a catalogue of everything that could be gathered and documenting the entire methodology and experiment (in addition to annotating the data)
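
One way to picture the requirements above is as a catalogue entry that travels with every dataset. The sketch below is only an illustration: the class name, the field names and the example values are all hypothetical, and a real catalogue would likely capture much more (consent, versioning, storage location).

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class CatalogueEntry:
    """One curated dataset, described well enough to repeat the collection."""
    name: str
    data_format: str                 # what format the data is in
    collection_protocol: str         # how the gathering experiment was run
    metadata_fields: list = field(default_factory=list)  # metadata kept per record
    annotations: str = ""            # how the data was labeled
    follow_up_plan: str = ""         # how follow-up data can be gathered

    def missing_fields(self):
        """Names of fields that are still empty and need documenting."""
        return [key for key, value in asdict(self).items() if not value]


# Hypothetical example entry; none of these values describe a real dataset.
entry = CatalogueEntry(
    name="example-image-study",
    data_format="PNG images",
    collection_protocol="Controlled capture session, one device, fixed lighting",
    metadata_fields=["age_group", "capture_device", "collection_site"],
)

print(json.dumps(asdict(entry), indent=2))
print("Still to document:", entry.missing_fields())
```

Checking for empty fields before a dataset goes anywhere near a model is a cheap way to make sure the methodology is actually written down, not just remembered.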

But how? Answer: It’s complicated.

On the face of it, it might not sound that hard to do. But when you dive into discovering unique datasets, you realize that it is hard to gather data that no one else has and that is not publicly available. I’ve experienced this difficulty first-hand and eventually navigated this process end-to-end.

My Experience

I’ve worked on various experiments at 3 different companies (including a clinical trial).
I’ve interfaced with the FDA and, trust me, they have a very high bar for engineering requirements (and for good reason…to protect patient health and safety).
I was the Chief Technology Officer for a Medical AI company that was creating AI that would detect eye disease by taking pictures of the inside of people’s eyes (computer vision).
I built and took an AI product through a full clinical trial (used on hundreds of patients). V1 was created in under 3 months. The process can be quite involved, so you need someone to help you expedite and avoid mistakes that waste time and lots of money. You need someone who has “seen this movie before”.

Contact me if you want to build a successful AI system using the least amount of data possible. I craft and run a robust Data Collection methodology.

Hi, my name is Nwamaka. I’m a fractional CTO/interim CTO.

I’ve built and sold companies.
I’ve hired and led engineering teams.
I’ve raised capital.
I have 11+ years of experience building products and launching companies.
I’ve worked with early-stage startups & at large enterprises.
★Above all, what you’ll get from me is integrity.