The Unintended Consequences of AI Data Harvesting

Gary Ma
Published in Exponential Era
Jul 23, 2024 · 5 min read

Artificial intelligence (AI) offers significant benefits across various fields, revolutionizing healthcare, finance, and customer service. Its power lies in learning from vast amounts of data, continuously refining its capabilities. However, AI’s development raises alarming concerns due to its potential to harvest personal information during training and user interactions. This pervasive data collection poses serious ethical and privacy risks, underscoring the urgent need for stricter regulations and greater transparency in AI data usage to protect individuals’ sensitive information and uphold privacy standards.


How AI Can Potentially Harvest Personal Data

ChatGPT receives over 20 million visits a day, yet few users understand what entering their data on the platform can mean. AI can harvest personal data through several methods. During the training phase, models ingest vast amounts of data, including personal information from websites, social media, and other online sources; web crawlers and scrapers collect this material, potentially sweeping up personal and copyrighted content along the way. And when individuals interact with AI systems such as virtual assistants or chatbots, their conversations and inputs may be recorded and analyzed to improve performance. This data collection helps refine AI models, but it also raises significant privacy concerns, as sensitive information could be stored and misused.
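To make the scraping mechanism concrete, here is a minimal sketch of the kind of crawler described above, written in Python with the requests and BeautifulSoup libraries. The URL is a hypothetical placeholder, and real training-data pipelines operate at vastly larger scale, but the principle is the same: any publicly reachable text can be swept into a corpus.

```python
# A minimal sketch of a text scraper, assuming a hypothetical public page.
# Real crawlers run at massive scale and should respect robots.txt.
import requests
from bs4 import BeautifulSoup

def scrape_public_text(url: str) -> str:
    """Fetch a page and return only its human-readable text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Drop scripts and styles so only visible text remains.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)

if __name__ == "__main__":
    # Hypothetical URL: any public profile, post, or bio could be
    # collected this way and end up in a training dataset.
    print(scrape_public_text("https://example.com/public-profile")[:500])
```

Note that nothing in this process asks for consent: if the text is publicly reachable, it can be collected.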

Importance of Data for AI

AI collects and retains data to improve accuracy and functionality. More data helps AI learn and adapt, making predictions and responses more reliable. This data enables AI to understand context, recognize patterns, and personalize interactions, enhancing user experience. Retaining data ensures AI can update and refine its algorithms over time. However, this reliance on data underscores the need for responsible data management to protect user privacy.

AI companies like OpenAI and Meta acknowledge this practice in their privacy policies, which state that they collect personal information to improve their AI tools. For instance, ChatGPT was initially banned in Italy over privacy concerns and was reinstated only after OpenAI refined its privacy policies. OpenAI clarified in a post that personal data is used to train the model, is not sold or shared, and can now be expunged through new user-facing options. While this feature isn’t yet available across all models, others will likely follow suit.

Potential Risks of Data Collection

The issue isn’t that companies sell or misuse your data; it’s that AI tools store it for future use even though they claim to be able to ‘expunge’ memories. A New York Times article revealed that ChatGPT can divulge personal information with the right prompts, even after that information has supposedly been expunged: the author’s public New York Times email address, for example, was retrieved. The conclusion is that email addresses held in public institutions, university databases, company systems, or accessible social media sites can be vulnerable when AI tools are involved.

As more information about these vulnerabilities becomes available, AI developers may change their models’ approach. However, jailbreakers exist: individuals who craft prompts that bypass a model’s safeguards, causing it to reveal personal information.

So, what does all this collected personal information mean for users? First, there are the privacy concerns and the potential for identity theft that unauthorized data collection creates. This can happen in two main ways: an attack on your individual AI tool account, perhaps by a jailbreaker prompting the tool to divulge stored personal information, or an attack on the AI developer itself, which could expose the chat histories of thousands or millions of users. Either route can let nefarious actors assume someone’s identity and use it for their own gain.


With identity theft on the rise, AI has the potential to contribute significantly to identity fraud. If a system can replicate your voice, generate your image, and access your personal data, it can be used for impersonation. The risk stems from AI’s ability to store and draw on extensive personal information: even with measures to expunge data, AI tools may still retain and misuse it, and a breach of an AI developer’s systems could expose users’ chat histories, leaving their identities vulnerable to fraudulent activity. These possibilities underscore the need for stringent data protection measures.

However, one of the most intrusive ways AI can use collected personal information is through its responses. By echoing opinions and misinformation learned from other users, AI can influence behaviour without users’ explicit consent. Because AI learns from the diverse, and sometimes misleading, inputs it receives, it risks reflecting and amplifying those biases in its responses. While many current language models aim for neutrality, they can still tailor information to fit users’ beliefs, reinforcing confirmation bias. Research from IBM revealed that AI can be manipulated into producing false information with the right prompts, raising concerns about user control over AI outputs. Recent reports also suggest that models such as Gemini may exhibit political biases or ‘woke’ tendencies, shaped by the data they are trained on. As AI mirrors the population using it, these biases could shape its responses to align with user personalities and beliefs.

As AI increasingly learns from human data, it doesn’t just mirror us; it risks becoming an exaggerated reflection of our collective biases and ideologies. Tools like ChatGPT, Gemini, and MetaAI are not merely passive observers but active participants in shaping a new digital reality, one in which these systems might adopt and amplify our political, personal, and cultural biases. There is a possible future where AI not only reinforces but intensifies our divisions, subtly nudging users toward particular viewpoints or agendas. This unsettling evolution prompts a critical question: in a world where AI moulds itself from the unauthorized data it gathers from us, will we, the source of that data, become both its architects and its subjects, at the whims of a digital entity that understands us far too well?

— — —

Exponential Era is your source for forward-thinking content about all things Web3 on Medium, powered by Epik. Epik is the world’s leading IP licensing agency and expert in Web3, AI, and the Metaverse, and the premier agency for brand integration in video games, leveraging the largest digital ecosystem and advanced cross-chain technology.

Follow our socials to stay up-to-date on the latest news and developments on partnerships and collaborations. Telegram, Twitter, Instagram, YouTube, and Official Website.
