AI’s Gold Rush: Monetizing Copyrighted Data

Juan Romero
Published in LatinXinAI
5 min read · Jun 16, 2023

As you may know, to create an efficient and powerful AI model you need a large, well-organized dataset to train it on. This is a MUST for every model out there. Have you ever asked where all of this data comes from? Well, according to some estimates, there are around 64 zettabytes of content out there. One zettabyte is 10^21 bytes, a 1 followed by 21 zeros:

(yes I wrote that by hand to experience the pain, pardon my funky 0’s)
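For a sense of scale, here is a minimal Python sketch of that arithmetic. It assumes the SI definition of a zettabyte (10^21 bytes) and takes the 64-zettabyte estimate at face value; the function names and the 1 TB drive used for comparison are made up for illustration.

```python
# Assumption: SI (decimal) units, so 1 ZB = 10**21 bytes and 1 TB = 10**12 bytes.
ZETTABYTE = 10**21
TERABYTE = 10**12


def zettabytes_to_bytes(zb: int) -> int:
    """Convert a whole number of zettabytes to bytes."""
    return zb * ZETTABYTE


def format_bytes(n: int) -> str:
    """Write a byte count out in full, with thousands separators."""
    return f"{n:,}"


one_zb = zettabytes_to_bytes(1)
estimated_total = zettabytes_to_bytes(64)  # the rough estimate quoted in this article

# Hypothetical comparison: how many 1 TB drives would that fill?
drives_needed = estimated_total // TERABYTE

print("1 ZB  =", format_bytes(one_zb), "bytes")
print("64 ZB =", format_bytes(estimated_total), "bytes")
print("1 TB drives needed:", format_bytes(drives_needed))
```

Python integers have arbitrary precision, so the numbers are exact and nobody has to write the zeros by hand.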

And we need 64 of those; we all get it, right? It’s definitely a lot of information. Organizing the data that comes from research is not a hard task, thanks to the immense number of frameworks that help with the process. But what happens when an AI seeks to know it all? Well, it basically scrapes a selected portion of those 64 zettabytes and learns from it. Ask yourself one more question: who created this massive amount of information? … Yes, it was us.

This means that the artists who once uploaded their content hoping to capitalize on their creativity are now having their most valuable asset stolen. For example, rappers are having their intellectual property stolen by random TikTok accounts (and let me tell you, some of those accounts had even better bars than the original rappers).

Are there any benefits to this?

It is definitely hard to find a benefit in companies stealing people’s work or talent to make money, but what if (and only if) those companies allowed you (and only you) to automate your content? For example, just imagine how easy it would become to write, record and publish a song. There are a lot of ethical concerns and legal implications to this.

Moreover, the fight for Intellectual Property is one that has been fought for a long time now. A very famous case is The Pirate Bay, a web platform that indexed torrents for P2P sharing over the BitTorrent protocol and literally made life hell for the owners of the content. Its operators argued that they were not infringing upon the owners’ property because they didn’t HOST any of the content (a Swedish court disagreed in 2009 and convicted the founders of assisting copyright infringement). Does that argument make it ethically correct, though? Not really (still love and admire the innovation, though).

Is there a solution to this?

Companies that negligently scrape the web are infringing upon a lot of people’s ideas and content. “Cutting the water supply” isn’t a viable solution in this case because of the complexity it would add to how we navigate content and learn. As a blockchain lover, I find Proof of Humanity a very interesting concept: you could verify yourself as a human, unlocking a lot of new possibilities on the web, where you’d have to verify your humanity to consume content. This still sounds a bit weird, and it makes me question: is this even necessary? Should ethics be the new way of governing this whole Intellectual Property matter?

There is a lot to discuss about all of this technology and its smooth sailing into our daily lives. As a tech enthusiast, I find it exciting to see how fast AI and all its derivative subfields are growing. However, I understand that we can’t let AI take over people’s work, especially when we know how companies are using it. We have already let companies do too much with this information. Still, coming from an underdeveloped country, I have seen how regulations lead to corruption. That’s why I think that instead of “strictly” regulating, we should attack technology with more technology and fight for our Intellectual Property. Even if we are not famous rappers, we have shared ideas with the world across this web, in the hope of making it a better place or simply to create, entertain and monetize.

A look into: Legal Implications

I am going to divide this section into two parts: the United States and the European Union, given that both are known for their regulations on everything.

The European Union started acting in 2019–2020, including permitting the use of copyrighted material for the sake of research. Under Article 3 of its 2019 copyright directive, this data can be mined only for the purposes of scientific research. As of 2023, a clause has been added to the EU’s proposed AI Act asking developers to disclose the copyrighted data they use, which limits public access and makes a lot of projects that sought to be Open Source legally impossible.

On the other hand, in the US, Senator Michael Bennet has introduced a proposal to create a special cabinet-like task force to ensure specific laws for work on AI and to protect citizens’ intellectual property. It is clear that both of these powers are making active efforts to regulate AI; the fear of being replaced by this technology has only increased with time. Now that AI has been trained on copyrighted content and we have seen the results, big companies feel their backs against the wall, when this could be a huge improvement to their working tools. And if I am being honest, it is hilarious to see senators coming up with accusations and questions for the big tech companies’ CEOs.

In conclusion, there are a lot of regulations and challenges to come. Our world is changing, but for the better. Automation is the dream of every developer out there, and I believe it will greatly impact humanity and its creative people. Having a suite of your own talents automated should be something that excites us all. It only needs the right applications, and we are here to develop them, not to be completely replaced by them.

References

Margoni, Thomas, & Kretschmer, Martin. (2021). A deeper look into the EU Text and Data Mining exceptions: Harmonisation, data ownership, and the future of technology. Zenodo. https://doi.org/10.5281/zenodo.5082012

Juan Romero
LatinXinAI

Just a high school student, aspiring engineer, activist, researcher and tech-bro!