Does AI Really Need that Much Data to Train?

Khun Yee Fung, Ph.D.
Programming is Life
Jun 1, 2024

My opinion is that, no, for most purposes, it does not. Then why do all the AI companies want to invade your privacy and suck up all the data everywhere to train AI systems that don’t really need that much data?

Well, the answer is actually very obvious. It is to their advantage. This is especially true for Google, as their whole business model hinges on one thing: knowing what you might want to buy, and targeting the right advertising at you to get you to buy it.

For instance, how much data is needed to train autonomous vehicles? Okay, this one is quite interesting. We have to start from the beginning. Throw out all the levels of autonomous driving, as they don’t mean very much if you only want a vehicle that will drive itself properly. Instead, think about how human drivers drive to a strange place. Do we need a very detailed map to get from any point A to another point B? Well, no. Do we need to send data somewhere else to find out where we are? Also no. How do we do that? It is obvious: we find local clues. We look at street names. We have a rough mental map from A to B. And we remember where we have been. All of it is in the head of each of us. We don’t need to download or upload any data to the skynet.

You may object: how does this kind of autonomous vehicle handle sudden situations? Well, unfortunately, it does not. But the data-sucking, privacy-invading kind of autonomous driving does not either. So, there is no difference when it comes to the thing that neither kind can handle: situations that require cognitive abilities. So far, only human drivers have that ability.

Fine, so how do we do it a different way? Well, by providing road signs that are machine readable without AI. Ha, you say, isn’t that a big undertaking? Why, of course. As big as putting up signs for human drivers that are easy to understand without using human language.

What sign is this? (Transportation Association of Canada)

This is the North American sign for “yield”. No language there. We do have to memorise the shape. And many of us will still confuse it with the next sign.

But that is not a big deal: one is moving, albeit slowly, and the other is stationary. Not a lot of cognitive ability is needed to detect whether something is moving or not.

How do we know where we are without consulting the map? Well, usually, we check the streets we are passing by. An autonomous vehicle can do the same thing, if it can acquire the street signs meant for autonomous vehicles.
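To make this concrete, here is a minimal sketch in Python of finding your way from street signs alone. Everything in it is made up for illustration: the street names, the `ROUGH_MAP` layout, and the helpers `crossing`, `locate`, and `plan_route`. It assumes the vehicle can somehow read the two street names at an intersection, and it is not a description of how any real vehicle works.

```python
from collections import deque

def crossing(street_a, street_b):
    """An intersection is identified only by the two streets that meet there."""
    return frozenset((street_a, street_b))

# Hypothetical rough "mental map": which intersections connect to which.
# No coordinates and no detailed geometry, just the street names.
A = crossing("Main", "1st")
B = crossing("Main", "2nd")
C = crossing("Main", "3rd")
D = crossing("Oak", "2nd")
E = crossing("Oak", "3rd")
ROUGH_MAP = {A: [B], B: [A, C, D], C: [B, E], D: [B, E], E: [D, C]}

def locate(signs_read, last_known):
    """Work out where we are from the signs read on board, nothing uploaded."""
    if len(signs_read) == 2:
        here = crossing(*signs_read)
        if here in ROUGH_MAP:
            return here
    # Sign missing or unreadable: fall back on where we remember being.
    return last_known

def plan_route(start, goal):
    """Breadth-first search over the rough map; returns a list of intersections."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in ROUGH_MAP[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # goal is not reachable on the map we have

position = locate(("2nd", "Main"), last_known=A)
route = plan_route(position, E)
```

The whole thing runs on the vehicle: the map is a small table, the position comes from local clues, and the memory of where we have been is a single variable.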

What happens if a vandal steals a sign meant for autonomous vehicles? The same thing that happens when a sign meant for human drivers is stolen, no? There is no difference there: drivers who are familiar with the place don’t really need the signs that much, and that should be the same for autonomous vehicles in the same situation. If you are not familiar with the place, you might get confused. So, the vehicle might get confused too. Big deal.

But autonomous driving is safer, right? Actually, if we have automatic distance sensing and collision avoidance mechanisms, most of the common human driving mistakes can be eliminated. So, the non-AI parts should definitely be put to use even in human-driven vehicles.
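As a rough illustration of how little "AI" such a mechanism needs, here is a sketch of a time-to-collision check in Python. The function names and the 2-second threshold are assumptions picked for the example, not values from any real vehicle or standard, and a real system would also have to deal with sensor noise and braking distance.

```python
def time_to_collision(gap_m, own_speed_mps, lead_speed_mps):
    """Seconds until we reach the vehicle ahead if neither speed changes."""
    closing_speed = own_speed_mps - lead_speed_mps
    if closing_speed <= 0:
        # We are not catching up, so there is no collision to predict.
        return float("inf")
    return gap_m / closing_speed

def should_brake(gap_m, own_speed_mps, lead_speed_mps, threshold_s=2.0):
    """Plain rule: brake when the predicted time to collision is too short.

    threshold_s is an assumed value chosen only for this example.
    """
    return time_to_collision(gap_m, own_speed_mps, lead_speed_mps) < threshold_s
```

This is one division and one comparison on numbers measured by the vehicle itself. No training data is involved.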

I can go on and on. But the main point is this. If Google knows where and when and why you drive from point A to point B, it can target you better with ads. That is the fundamental point of this whole thing.

What about ChatGPT, you ask? Well, I think it is interesting. I know it can produce an essay about nuclear fission if I ask it to. But since I am not a nuclear physicist, I would not be able to tell if it is any good. Okay, what about me asking it something I know well? Fine. But what is the point? If it produces something good that I already know, I don’t need it. If it produces something good that I don’t know about, maybe there is a reason I don’t need to know about it?

How about it replacing me as a programmer? I mean, that might happen, of course. That would be much cheaper, right? Sure. But you will have to hire someone to make sure what it produces is good. In other words, you might need to hire me back to check that what the AI is producing is good. So, experienced programmers don’t need to worry about AI replacing them, as they are always needed, whether to write programs or to verify programs produced by AI.

Nice and dandy. How do you get experienced programmers? They don’t grow on trees. Programmers need to write programs to become proficient. Reading programs written by someone else to learn programming is like me watching downhill skiers skiing downhill. I would not be able to ski downhill even if my life depended on it, if that were how I learned to ski. That is just how it is. Skills need practice.

Now, if you can produce a cognitive device, that would be different. But that is the same hard problem that Google and all the other AI companies are trying to solve. And so far, they have failed. And the thing is, cognitive ability is not associated with the amount of data possessed. Not all intelligent people can memorise very well, and not all good rote learners are intelligent.

Connectionist AI is hitting its limits, albeit much farther along than rule-based AI (GOFAI). It does very well with noisy situations compared to GOFAI. But it does not do as well with clear rule-based situations (doh!). Neither approach can handle any situation that inherently requires cognition.

Where was I? Yes, why do the current AI systems require so much data? I guess the answer is that they don’t inherently need to, but through data acquisition, the AI companies get data they otherwise would not get.

It is like Alexa getting everything it can hear from your house, and, God forbid, images and videos of the goings-on in your house. And the cherry on top is that you are paying Amazon for that privilege. Why is it always listening and watching? Good question. But an easy one: because they want to know what you are doing so that they can target you for ads. And once you are conditioned to being watched all the time, hey, they can probably sell the data to who knows who to do who knows what. You are paranoid about the government spying on you? Don’t be. They don’t need to, as they can always ask for the data from the tech companies. The tech companies have to produce the data by law in any case. Why spy on you when you ask the tech companies to spy on you 24/7? No warrants required, as you actually pay for the spying yourself. Warrants are needed when the government asks for the data. Sure. But that data is already in the hands of the tech companies at that point.

So, I heard that Windows 11 will take screenshots with Copilot or something so that it can help you. Oh yeah, I definitely need that help. They recalled that? Damn, I guess they found out the trial balloon wasn’t too well received? No problem. They will try again. If you are stuck on Windows, you really need to know why that is the case.

Khun Yee Fung, Ph.D.

I am a computer programmer. Programming is a hobby and also part of my job as a CTO. I have been doing it for more than 40 years now.