What is AI? It’s just processing data and returning similar training instances!

Bob Stark
4 min read · Jun 7, 2023


First, data needs to be defined. In computing, data¹ is a sequence of numbers that can represent either numbers themselves or other values, such as text characters like letters. Those characters together can represent words, and multiple words together can represent n-grams, “a contiguous sequence of n items from a given sample of text or speech”², such as the phrase “under the table.” Separately, a series of numbers may also represent a color, for example as amounts of red, green, and blue³ (e.g., yellow is 255, 255, and 0). Put together, multiple colors can represent the pixels in a picture, and multiple pictures can represent a video.
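To make this concrete, here is a minimal Python sketch (the variable names and values are my own, chosen only for illustration) showing that characters, n-grams, and colors are all just numbers underneath:

```python
# Illustrative only: the text, words, and colors below are made up.

text = "under the table"

# Each character is stored as a number (its Unicode code point).
codes = [ord(c) for c in text]
print(codes)          # [117, 110, 100, 101, 114, 32, ...]

# Splitting on spaces gives words; consecutive words form n-grams.
words = text.split()
trigram = tuple(words[0:3])
print(trigram)        # ('under', 'the', 'table')

# A color is also just numbers: amounts of red, green, and blue.
yellow = (255, 255, 0)

# A tiny 2x2 "image" is then just a grid of such number triples.
image = [[yellow, (0, 0, 0)],
         [(0, 0, 0), yellow]]
print(image[0][0])    # (255, 255, 0)
```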

In addition to representing different types of values, multiple instances of data can also be processed⁴ into meaningful information. For example, the equation x + 1 = y, with an x value of 1, results in 2. Though 2 by itself is just an instance of data⁵, in the context of that equation it becomes information about 1 having been added to the input. This general view of data processing is shown in Figure 1.

Figure 1: Data processing in general
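As a minimal sketch of that idea (the function name here is mine, purely for illustration), the processing step can be thought of as a function applied to data:

```python
# A tiny sketch of Figure 1's idea: a process turns input data into
# information. The function name is invented for illustration.

def process(x):
    """Apply the equation y = x + 1 from the example above."""
    return x + 1

data = 1
information = process(data)   # 2: the result of adding 1 to the input
print(information)
```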

Artificial intelligence (AI)⁶ is processing data into information in a way that seems “intelligent” or, in some way, similar to how humans do it. AI takes three different forms: symbolic AI, data mining, and machine learning. Symbolic AI⁷ is when an AI is fed symbolic, or human-readable, inputs and returns answers, usually to solve problems like finding something in an environment or planning a task. Data mining⁸ is used to identify latent patterns in datasets (i.e., patterns that are not obvious), like customer shopping trends over time. Finally, machine learning⁹ is used to identify latent patterns in datasets with results that improve as more data is fed into the system, which suggests that the machine is “learning” over time.

Most machine learning systems use statistical models to process the data, and many of them pre-process the data to get better results from the model. Such extensions of data processing are visualized in Figure 2.

Figure 2: Extending data processing with statistical machine learning
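As a rough sketch of that pipeline (the numbers and the tiny “model” below are invented; real systems use far richer statistical models), pre-processing might standardize the data before the model looks for anything unusual in it:

```python
# Sketch of Figure 2's pipeline: raw data is pre-processed (here,
# standardized to zero mean and unit variance) before a simple
# statistical "model" sees it. All values are made up for illustration.

from statistics import mean, stdev

raw = [2.0, 4.0, 6.0, 8.0, 100.0]          # raw input data

# Pre-processing step: standardization.
mu, sigma = mean(raw), stdev(raw)
standardized = [(x - mu) / sigma for x in raw]

# "Statistical model" step: flag values more than one standard
# deviation above the mean as unusual.
flags = [x > 1.0 for x in standardized]
print(flags)                               # [False, False, False, False, True]
```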

These systems can seem like magic, but they’re still just processing data into information. Furthermore, because they are using statistical machine learning, they are just outputting a variation of the training data that is most similar to the new input data. This can be illustrated with a simple machine learning model: k-nearest neighbors (kNN)¹⁰. Systems using kNN find the k instances in the training data that are most similar to the new input and return the most common label among them.
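Here is a from-scratch kNN sketch (the toy training data is invented for illustration) that makes the point explicit: the output label is simply copied from the most similar training instances.

```python
# A minimal k-nearest-neighbors classifier written from scratch.
# The training data below is a toy example.

from collections import Counter
import math

# Training data: (features, label) pairs.
training = [
    ((1.0, 1.0), "cat"),
    ((1.2, 0.9), "cat"),
    ((5.0, 5.0), "dog"),
    ((5.2, 4.8), "dog"),
]

def knn_predict(x, k=3):
    """Return the majority label among the k training points nearest to x."""
    distances = [(math.dist(x, features), label) for features, label in training]
    nearest = sorted(distances)[:k]
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]

print(knn_predict((1.1, 1.0)))   # "cat": the nearest neighbors are mostly cats
```

Notice that nothing here “understands” cats or dogs; the prediction is literally a lookup into the training data.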

While neural networks are more complicated because they propagate these similarity “signals” through multiple layers of “neurons,” the outcome is basically the same: classifications are based on similarity to the training data. That is why the current wave of deep neural networks¹¹ started with the ImageNet challenge of classifying images with labels¹² (e.g., recognizing cats in pictures). More recently, predicting the most likely next bit of text has become popular. Earlier attempts at this used natural language processing¹³ algorithms to “understand” the text already entered into the system.
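To illustrate the “layers of neurons” idea before moving on to text, here is a toy forward pass through a two-layer network (the weights below are made up; in a real network they are learned from the training data, which is exactly why the outputs end up reflecting that data):

```python
# A toy forward pass through a two-layer network, to show signals
# propagating through layers of neurons. Weights and inputs are invented.

def relu(x):
    return max(0.0, x)

def layer(inputs, weights, biases):
    """One layer: each neuron takes a weighted sum of the inputs."""
    return [
        relu(sum(w * x for w, x in zip(ws, inputs)) + b)
        for ws, b in zip(weights, biases)
    ]

inputs = [0.5, -1.0]                                   # e.g. two pixel values
hidden = layer(inputs, [[1.0, -0.5], [0.3, 0.8]], [0.0, 0.1])
output = layer(hidden, [[0.7, -1.2]], [0.0])
print(output)                                          # the network's "score"
```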

For that next-text prediction, rather than inputting individual characters or words into machine learning models, results are improved by accounting for the occurrences of all possible n-grams in the training text, thus turning text prediction models into large language models (LLMs¹⁴) with upwards of trillions of parameters¹⁵. However, just like recognizing cats in pictures, these “autocomplete” systems¹⁶ are simply outputting whatever their training data suggests is most likely to follow the input text.
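As a stand-in for that idea, here is a counts-based bigram predictor (real LLMs use neural networks with billions of parameters rather than raw counts, and the training text below is invented), which outputs whatever its training data says is most likely to come next:

```python
# A counts-based bigram "language model": it predicts the next word by
# looking up which word most often followed the current word in training.

from collections import Counter, defaultdict

training_text = "the cat sat on the mat the cat sat on the chair"

# Count which word follows which in the training data.
follows = defaultdict(Counter)
words = training_text.split()
for current, nxt in zip(words, words[1:]):
    follows[current][nxt] += 1

def predict_next(word):
    """Return the word most often seen after `word` in training."""
    return follows[word].most_common(1)[0][0]

print(predict_next("the"))   # "cat": seen twice after "the", vs. once for "mat" or "chair"
```

If a word never appeared in the training text, this sketch has nothing to say about it, which mirrors the point above: the system can only return variations on what it was trained on.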

Even worse is when these “generative AI” systems output an image as their result¹⁷. Although these resulting images are not exactly the same as the ones the systems were trained on (many of which were created and are owned by visual artists), they may well still be breaking intellectual property law¹⁸, and they are not “generating” anything. They are just returning a variation on what has already been created.

As a final example, on May 3rd LinkedIn launched generative AI features that suggest text for various parts of members’ profiles¹⁹. Because of how generative AI models are trained, though, this really just amounts to suggesting text from other, perhaps more “successful,” LinkedIn profiles for your own.

In all of these cases, and because many people now use these systems to answer actual questions²⁰, it is worth emphasizing that they “are good at saying what an answer should sound like, which is different from what an answer should be”²¹.

Want to stay up to date on my work demystifying AI? Sign up for my newsletter here: https://datagotchi.net/newsletter/

  1. ^ https://en.wikipedia.org/wiki/Data_(computer_science)
  2. ^ https://en.wikipedia.org/wiki/N-gram
  3. ^ https://en.wikipedia.org/wiki/RGB_color_model
  4. ^ https://en.wikipedia.org/wiki/Data_processing
  5. ^ A single instance of data is technically called a datum, but that is not commonly used. So I’ll be using instance of data.
  6. ^ https://en.wikipedia.org/wiki/Artificial_intelligence
  7. ^ https://en.wikipedia.org/wiki/Symbolic_artificial_intelligence
  8. ^ https://en.wikipedia.org/wiki/Data_mining
  9. ^ https://en.wikipedia.org/wiki/Machine_learning
  10. ^ https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
  11. ^ Deep neural networks (https://en.wikipedia.org/wiki/Deep_learning#Deep_neural_networks) are commonly used to process and predict text, images, and videos.
  12. ^ https://en.wikipedia.org/wiki/ImageNet#History_of_the_ImageNet_challenge
  13. ^ https://en.wikipedia.org/wiki/Natural_language_processing
  14. ^ https://en.wikipedia.org/wiki/Large_language_model
  15. ^ https://www.theatlantic.com/technology/archive/2023/03/openai-gpt-4-parameters-power-debate/673290/
  16. ^ https://pluralistic.net/2023/03/09/autocomplete-worshippers/
  17. ^ Generative AI systems prompted with text can also output images, like OpenAI’s DALL-E system (https://openai.com/research/dall-e).
  18. ^ https://hbr.org/2023/04/generative-ai-has-an-intellectual-property-problem
  19. ^ https://www.linkedin.com/pulse/linkedin-launches-ai-powered-features-profile-optimization/
  20. ^ Even some lawyers — https://www.cnn.com/2023/05/27/business/chat-gpt-avianca-mata-lawyers/index.html
  21. ^ https://futurism.com/the-byte/ai-expert-chatgpt-way-stupider
