AI Strategy: Data Is Your Proxy For Reality

If you put ten AI experts in a room together, and ask them to define “AI”, you will get ten different answers — and a lot of argument.

But if you put ten AI experts in a room together, and ask them about the importance of data to AI models, you will get a single unified answer.

You need lots of highly-relevant data to train a deep neural network.

If the AI model you intend to create is a deep neural network, you will need lots of training data that is specific to the problem you are attempting to solve with your model. ‘Lots’ generally means hundreds of thousands, or even millions, of rows of high-quality data. The more data, the better.

The historical training data set must include all of the features (the inputs that influence the outcome you are modeling), as well as all of the dependent variables that your model will attempt to predict.
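As a rough illustration of that requirement, the sketch below counts how many rows are missing each required feature or dependent variable. The function name `count_missing`, the column names, and the three sample rows are all hypothetical, invented purely for this example. A real data set would be far larger and would likely be checked with a data-frame library.

```python
def count_missing(rows, feature_names, target_names):
    """Count, per required column, the rows where the value is absent or None."""
    required = list(feature_names) + list(target_names)
    missing = {name: 0 for name in required}
    for row in rows:
        for name in required:
            # dict.get returns None both for a missing key and for an explicit None
            if row.get(name) is None:
                missing[name] += 1
    return missing


# Hypothetical toy rows: two features and one dependent variable (sale_price).
rows = [
    {"square_feet": 1400, "bedrooms": 3, "sale_price": 310000},
    {"square_feet": 2100, "bedrooms": None, "sale_price": 455000},
    {"square_feet": 900, "bedrooms": 2},  # no sale_price recorded
]

report = count_missing(rows, ["square_feet", "bedrooms"], ["sale_price"])
for column, gaps in report.items():
    print(f"{column}: {gaps} row(s) missing")
```

Any column with a nonzero count is a gap to fill or explain before training, since the model cannot learn from a feature or target it never sees.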

Simulated data (sometimes called ‘generated data’) may also be used if, and only if, it is indistinguishable from real-world data. It must have all the variability, diversity, noise, and edge cases that real-world data would contain.

To state it another way, if you have two separate data sets, one consisting of real-world data and one consisting of simulated data, a competent data scientist should not be able to tell which one is real and which one is simulated. If simulated data does not pass this ‘indistinguishable from reality’ test, then its fidelity is not acceptable for training a deep neural network.
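One way to approximate that test in code is to let a simple classifier stand in for the data scientist. The sketch below is an assumption-laden illustration, not a definitive method: it pools the two data sets, then uses a leave-one-out 1-nearest-neighbor rule to guess whether each row is real or simulated. Accuracy near 0.5 (a coin flip) suggests the sets are hard to tell apart, while accuracy near 1.0 means the simulation is easy to spot. The function names and the Gaussian toy data are hypothetical, and the "real" rows here are themselves randomly generated stand-ins.

```python
import random


def nearest_neighbor_accuracy(real, simulated):
    """Leave-one-out 1-nearest-neighbor accuracy at telling real rows from simulated rows."""
    pooled = [(row, 0) for row in real] + [(row, 1) for row in simulated]
    correct = 0
    for i, (row, label) in enumerate(pooled):
        best_dist, best_label = float("inf"), None
        for j, (other, other_label) in enumerate(pooled):
            if i == j:
                continue  # a row may not vote for itself
            dist = sum((a - b) ** 2 for a, b in zip(row, other))
            if dist < best_dist:
                best_dist, best_label = dist, other_label
        correct += best_label == label
    return correct / len(pooled)


def make_rows(rng, n, mean, spread):
    """Hypothetical two-feature rows drawn from a Gaussian, for illustration only."""
    return [(rng.gauss(mean, spread), rng.gauss(mean, spread)) for _ in range(n)]


rng = random.Random(0)
real = make_rows(rng, 200, 0.0, 1.0)          # stand-in for real-world data
faithful_sim = make_rows(rng, 200, 0.0, 1.0)  # same distribution as the stand-in
poor_sim = make_rows(rng, 200, 3.0, 0.2)      # shifted, with too little noise

faithful_score = nearest_neighbor_accuracy(real, faithful_sim)
poor_score = nearest_neighbor_accuracy(real, poor_sim)
print(f"faithful simulation: {faithful_score:.2f}")
print(f"poor simulation:     {poor_score:.2f}")
```

A check like this can only reject a simulation, never certify it. A stronger classifier, or a human expert, may still find differences that a nearest-neighbor rule misses.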

Data is your proxy for reality.