AI Programs Will Only Succeed So Long as We Protect Original, Human-Created Content

Stuart Johnson
Published in b8125-fall2023
Nov 16, 2023

Given the rapid adoption of Artificial Intelligence (“AI”) and AI-produced content, society runs the risk of these programs creating circular content over the next decade, stifling invention and innovation. The core challenge lies in how these AI, or more specifically machine learning, programs develop the content they produce. AI as we understand it today is a far cry from the original objective: replicating the varying ways that neurons interact in the brain, in an attempt to mimic how it originates ideas, understands concepts, and shares and processes information. Given the complications of such an undertaking and the significant resources required to attempt it (alongside a huge risk of failure), that initiative stalled out.

The programs we are familiar with today, ChatGPT, Bard, etc., are commonly referred to as AI programs when in reality they more closely resemble machine learning programs trained on trillions of parameters. These parameters are variables that are readjusted during training to establish how the input data gets transformed into an acceptable output. Ultimately, the parameters and training data are based on information collected and scraped from numerous sources, typically available on the internet. While these machine learning programs may not have real-time awareness of events or changes in information, the responses and “original” content they generate are most often an adaptation, collection, or some combination of the information the model was trained on. The way machine learning programs produce text and written content translates to visual content as well.
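To make “parameters readjusted during training” concrete, here is a deliberately minimal sketch with a single parameter fit by gradient descent. Everything in it (the data, the learning rate, the one-weight model) is illustrative only; real systems adjust billions of parameters this way.

```python
# Minimal sketch: one parameter, readjusted during training so that
# input * weight approximates the desired output (here, y = 2x).

def train(pairs, steps=200, lr=0.05):
    w = 0.0  # the single parameter; real models have billions
    for _ in range(steps):
        for x, target in pairs:
            pred = x * w
            grad = 2 * (pred - target) * x  # gradient of squared error
            w -= lr * grad                  # readjust the parameter
    return w

# Toy training data with the hidden relationship y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
print(round(train(data), 2))  # converges near 2.0
```

The same loop, scaled up enormously, is what “training on trillions of parameters” refers to: the model never stores rules explicitly, it only nudges numbers until inputs map to acceptable outputs.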
Put simply, matching images can be done with k-Nearest Neighbors algorithms (or other methods that select the candidates at minimum distance from a desired output): the program finds the closest visual matches to the information in a given picture within the large and diverse dataset of images it was trained on.

This method of aggregating massive amounts of data to train these programs has a fatal flaw. As the programs are increasingly used to create new content for the internet, that content is in turn used to train updated and more advanced versions, and the loop becomes circular. Eventually, original, human-created content becomes an insignificant portion of the training data. It is not hard to foresee a future, especially given estimates that over 90% of online content could be generated by machine learning programs by 2025, in which the majority of new data is no longer original, reviewed, and unique. For these programs, this amounts to a circular reference: their collective creative and inventive output would no longer be as unique or additive as it is today. The very data practices that enabled this significant technological advancement could ultimately render the programs obsolete and far less beneficial. Consider how X (Twitter), once a great source of up-to-the-minute news, became a bastion of misinformation and significantly less useful after Elon Musk’s acquisition, diluted with unreliable and non-unique information. These programs are a significant leap forward in technology, but the data used in their creation requires us to be mindful of how we preserve their usefulness and limit the repercussions of their taking a leading role in the information we consume.

To ensure the longevity and enhance the quality of these programs, we must ensure that individual contributors, whether artists, writers, researchers, or authors, have the necessary safeguards in place so that they remain incentivized to create substantive and interesting new work. A collaborative effort between those creating the content and those using it to build these programs is the best path forward for both parties: it keeps high-quality data available for training the models and keeps contributors incentivized to develop original and unique information. Several measures are available to ensure we do not dilute the quality of the work used to train these programs, such as:

1) Using expert reviews when evaluating sources of information, rather than relying exclusively on the data as found, since it could be wrong or could itself have been generated by a similar machine learning program.

2) Mandating that data generated by machine learning programs carry an encoded tag, so that users and consumers know the information was generated by a machine learning program.

3) Citing the sources of the information used in the creation of the materials. This is particularly important for machine-generated art, as there have been numerous cases of these programs generating more works in an artist’s style than the artist has produced themselves.

4) Limiting the programs to rely exclusively on trusted and proven facts, preventing them from generating information that is plausible but unverifiable. This would prevent cases like that of the lawyer who cited legal precedents that ChatGPT had fabricated.
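The encoded tag proposed in option 2 could work along these lines. This is only a sketch; the envelope format and field names below are hypothetical, not an existing standard such as C2PA.

```python
import hashlib
import json

def tag_generated_content(text, model_name):
    """Wrap machine-generated text in a provenance envelope so downstream
    consumers (and future training pipelines) can identify it.

    The envelope format here is hypothetical, for illustration only.
    """
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return json.dumps({
        "content": text,
        "provenance": {
            "generated_by": model_name,  # which program produced it
            "sha256": digest,            # detects later tampering
        },
    })

def is_machine_generated(envelope):
    """Check the provenance tag before admitting content into a training set."""
    record = json.loads(envelope)
    return "generated_by" in record.get("provenance", {})

tagged = tag_generated_content("A plausible but unverified claim.", "example-llm")
print(is_machine_generated(tagged))  # True
```

A training pipeline could then filter on this tag, keeping machine-generated material out of the data used to build the next generation of models.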

While we can all be vigilant about the information we consume in this era of machine learning programs, there is still much to be done to ensure these tools are used correctly and built in a way that preserves their long-term usefulness. These programs should be additive to the creative and inventive process, a tool for bolstering human intelligence, not a replacement for it.

Source: https://futurism.com/the-byte/ai-internet-generation
