Vegan AI: What is it and why should we care?

Agata Ciesielski
8 min read · Jan 13, 2023

If you follow AI at all, or spend any time on the internet, you’ve probably encountered the strikingly beautiful AI-generated art from apps such as Lensa AI and Dall-E. As with most AI applications, you might also have heard artists say that AI will take their jobs. What you might not have heard are the voices of artists speaking up about the ethics of their work being used, without consent, to train these (monetized) AI models. This quick post introduces the idea of Vegan AI: the concept of training models on data (e.g., images and other information) that has been obtained with the consent of the original creators.

First, let’s start with Lensa AI specifically. Lensa is built on a model called Stable Diffusion, which in turn was trained on a dataset known as LAION-5B. This dataset is essentially a set of 5.85 BILLION images paired with text captions (typically the alt text scraped alongside each image), with the pairs filtered for relevance using OpenAI’s CLIP model. Below are some examples.

Examples of images/text pairs from LAION-5B
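To make this concrete, here is a minimal sketch of inspecting a LAION-style metadata file with pandas. The dataset distributes only URLs, captions, and CLIP similarity scores, not the images themselves; the file name and column names below are assumptions based on how LAION packages its metadata, not a verified schema.

```python
import pandas as pd

# LAION ships its metadata as parquet files of links and captions, not images.
# The filename and column names here are assumptions for illustration.
meta = pd.read_parquet("laion5b-metadata-part-00000.parquet")

# Each row pairs an image URL with its scraped caption and a CLIP similarity score.
for _, row in meta.head(5).iterrows():
    print(row["URL"])
    print(row["TEXT"])
    print(f"CLIP image/text similarity: {row['similarity']:.3f}")
    print("-" * 40)
```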

The LAION dataset was created using Common Crawl, a non-profit org that has been scraping and curating the web since 2008 and releasing the organized data monthly on AWS. According to their Terms of Use, the data is meant to be for research and collaboration and “not to do illegal things,” including… violation of IP.
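For a sense of how accessible this crawl data is, below is a minimal sketch that queries Common Crawl’s public index API for captures of a single domain. The crawl label is just an example; the current list of crawls is published at index.commoncrawl.org.

```python
import json
import requests

# Query the Common Crawl index API for captures of a given domain.
# The crawl label below is just an example crawl.
CRAWL = "CC-MAIN-2023-06"
resp = requests.get(
    f"https://index.commoncrawl.org/{CRAWL}-index",
    params={"url": "example.com/*", "output": "json"},
    timeout=30,
)

# Each line of the response is one JSON record pointing into a WARC file on AWS.
for line in resp.text.strip().splitlines()[:5]:
    record = json.loads(line)
    print(record["url"], record["status"], record["filename"])
```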

The violation of IP is the crucial point, so it is worth a quick reminder. In the US, intellectual property (IP) law covers patents, copyrights, and trademarks. Art is generally protected under copyright automatically once it is fixed in a tangible form. When artists upload their art to sites such as ArtStation (a source of at least 3,000+ images in LAION), these rights are retained. Since scraping and reusing copyrighted work without permission can infringe copyright, and doing so violates Common Crawl’s own terms of use, it would appear that using these data as AI training data is illegal…

... But of course, there is a caveat: the Fair Use doctrine, wherein, under certain circumstances, it is legal to use unlicensed work. Specifically, these circumstances are evaluated by the following four factors.

  1. The purpose and character of the use, including whether the use is commercial or is for nonprofit educational purposes. This usually encompasses non-profit and non-commercial uses, mainly research and/or uses that extend the original intent of the copyrighted work.
  2. The nature of the copyrighted work. This calls into question the amount of creativity that went into the work. Generally speaking, highly creative works such as novels and songs are less likely to fall under fair use, whereas the factual sources used to help research this post are fair game. (Facts are not considered IP.)
  3. The amount used in relation to the work as a whole. This factor addresses the question: “what portion of the work is being used?” For example, is only a small percentage of the work being used (e.g., a paragraph from a novel), and how critical and original is that portion to the value of the work?
  4. The effect of the use upon the potential market for the copyrighted work. In this case, the legality of the use depends on how much the usage harms the creator. For example, does the usage deprive the original creator of income? Think: Napster, the popular P2P sharing service that was found to violate IP in the early 2000s.

Datasets are generally created by students or early-career researchers. At this stage, there is usually no intent for the algorithms or their products to be monetized, and therefore no urgent need to think about things such as IP. To allow for reproducibility, these research groups publish and share the algorithms and data so that others can build upon their work. At this point, things are still pretty kosher. TLDR: Under fair use, for AI research, webscraping is legit.

Here’s where things get very fuzzy. When subsequent for-profit organizations create new products, they use what is out there, starting with the publicly available algorithms and data. Then, based on their application, they alter the training dataset and create models that meet their specific needs. Lensa, like many companies whose goals extend beyond research, has clearly added data to its training set. While they note that they incorporate 10–12 images given to them by the user, beyond those, we do not know what other images were added to the training set.

Yet when you start to use these algorithms, you begin to realize that parts and features of other artists’ work are incorporated into the outputs. For example, when I asked Dall-E 2 for “a painting of a surfer in the style of Monet,” the results below were returned.

Dall-E 2 results for: “a painting of a surfer in the style of Monet”

We can clearly see there are elements of Monet’s style in this image, though we know Monet did not paint surfers.
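For reference, here is a minimal sketch of how such a prompt is sent to the Dall-E 2 API, using the openai Python package as it worked around the time of writing (an API key and per-image billing are assumed):

```python
import openai

# Requires an OpenAI API key; the image endpoint is billed per generated image.
openai.api_key = "sk-..."  # placeholder

# Ask Dall-E 2 for images "in the style of" a named artist.
response = openai.Image.create(
    prompt="a painting of a surfer in the style of Monet",
    n=2,             # number of images to generate
    size="512x512",  # supported sizes include 256x256, 512x512, 1024x1024
)

# The API returns hosted URLs for the generated images.
for item in response["data"]:
    print(item["url"])
```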

When it comes to Monet, whose final works were painted in the 1920s, there is no current-day IP protection: US law generally protects copyright for 70 years after the author’s death. Things are more consequential when it comes to copying the work of artists and creators who are very much alive and still depend on their unique skill for a livelihood. As cited in this article, one such artist is Greg Rutkowski*, who has a very distinct style. According to the article, his work has been used to generate 94,000 images with Stable Diffusion.

*Note: Greg Rutkowski is an artist based in Poland, a country that has signed the Berne Convention, an international IP treaty not covered in this piece.

“Dragon Cage” by Greg Rutkowski.
Images created when Insider typed “Dragon battle with a man at night in the style of Greg Rutkowski” into Stable Diffusion.
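Because Stable Diffusion’s weights are openly released, anyone can reproduce this kind of style-mimicking prompt locally. Below is a minimal sketch using the Hugging Face diffusers library; the specific checkpoint and the GPU assumption are mine, not Insider’s.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load an open Stable Diffusion checkpoint (weights download from the
# Hugging Face Hub on first use; a CUDA GPU is assumed here).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The same kind of style-mimicking prompt reported by Insider.
prompt = "Dragon battle with a man at night in the style of Greg Rutkowski"
image = pipe(prompt, num_inference_steps=30).images[0]
image.save("dragon_battle.png")
```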

When it comes to IP, and re-examining fair use (pretending for a moment that Greg is a US-based citizen), we are drawn to factor 3. You would think these AI-generated images are derivative works, since users are asking the algorithm to generate images in another artist’s style, and under derivative-works law, “Only the owner of copyright in a work has the right to prepare, or to authorize someone else to create, an adaptation of that work.” However, as it turns out, simply copying someone’s style isn’t a violation of copyright.

The one happy note for artists is that, technically, AI-generated art is NOT copyrighted. Therefore, the end user does not retain the rights to these new copy-style images. On the other hand, this also means that anyone can use these new images to generate income as they wish. While this would appear to violate factor 4 of fair use, unfortunately, at present it does not appear to.

Enter Vegan AI.

As we have seen from the arguments laid out, unfortunately, using images (and text and other works) in this way does not violate most IP laws. But undoubtedly, it is ethically fuzzy. This is where we re-introduce the idea of Vegan AI: the concept of training models on images (and information) that have been obtained with the original creator’s consent.

Creating datasets is a heavy and expensive lift. To give you an idea: I’ve created many in my day and used to set aside around $20K in the budget for a custom dataset with 1,000 labelled images. So obtaining consent for large datasets such as the 5.85 billion images contained in LAION-5B would undoubtedly be a significant burden of both time and money, especially as (at the moment) it is legal to use these images for research purposes.
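For a rough sense of scale, here is the back-of-envelope arithmetic, treating my $20K-per-1,000-images budget as a flat per-image rate (an oversimplification, since labelling and consent costs don’t scale linearly):

```python
# Back-of-envelope scaling of the labelling cost quoted above.
cost_per_image = 20_000 / 1_000    # ~$20 per labelled image
laion_size = 5_850_000_000         # images in LAION-5B
print(f"~${cost_per_image * laion_size / 1e9:.0f} billion at that rate")
# -> roughly $117 billion, before even asking anyone for consent
```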

However, obtaining consent could be done on a smaller scale (at least initially) with images obtained from free stock photo sites or known public-domain images. Some entities, such as the National Gallery of Art, have created their own collections of downloadable Open Access images. While it still takes a bit of work to convert these images into AI-ready datasets, there has been some movement forward in this area. An example of a dataset that has attempted to use only Open Access images is OmniArt, which contains 432,217 images from 21,364 artists.
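To show the shape of that conversion work, here is a minimal sketch that wraps a folder of already-downloaded Open Access images, plus a hypothetical metadata CSV of file paths and titles, as a PyTorch dataset. The CSV columns are assumptions for illustration, not the National Gallery’s or OmniArt’s actual schema.

```python
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class OpenAccessArtDataset(Dataset):
    """Wraps downloaded open-access images plus a metadata CSV as a PyTorch
    dataset. The 'path' and 'title' columns are hypothetical; adapt them to
    whatever the source collection actually provides."""

    def __init__(self, metadata_csv, transform=None):
        self.meta = pd.read_csv(metadata_csv)  # expects 'path' and 'title' columns
        self.transform = transform

    def __len__(self):
        return len(self.meta)

    def __getitem__(self, idx):
        row = self.meta.iloc[idx]
        image = Image.open(row["path"]).convert("RGB")
        if self.transform:
            image = self.transform(image)
        return image, row["title"]
```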

In some cases, orgs overseeing these larger datasets have attempted to help with ethical concerns. Recently, OpenAI, the creator of Dall-E, has stated that they have licensed some of the copyrighted images. However, it has been difficult to find confirmation of this outside of a few articles which do not feature direct quotes from OpenAI.

Alternatively, one artist has created Have I Been Trained, a website offering a reverse search of the Stable Diffusion dataset so that artists can opt out of being included. However, for artists who have tried using this, the opt-out process appears to be extremely tedious and clunky. More significantly, it only works on one of the very many public datasets and does not cover the data added by private companies such as Lensa, which have no legal obligation to share their datasets.

Other companies, such as DeviantArt, have disallowed the scraping of their website for AI training unless artists specifically opt in. Based on the law (though I should highlight that I am NOT a legal expert!), it is a bit unclear whether this bold stance is legal and enforceable. However, it does offer a clear path for dataset creators to at least check for consent.
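One way a dataset builder could act on that path is to check a page for the opt-out directives DeviantArt has described, which have been reported as “noai”/“noimageai” robots meta tags. The sketch below assumes those tag names and is only a best-effort check, not a compliance tool or legal advice.

```python
import requests
from bs4 import BeautifulSoup

def page_opts_out_of_ai_training(url: str) -> bool:
    """Best-effort check for 'noai'/'noimageai' robots directives of the kind
    DeviantArt announced. The tag names are assumptions; treat a True result
    as 'do not scrape', not as legal advice."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for meta in soup.find_all("meta", attrs={"name": "robots"}):
        content = (meta.get("content") or "").lower()
        if "noai" in content or "noimageai" in content:
            return True
    return False
```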

So… what is Vegan AI and why should you care? The fact is that, while it is currently legal to use artists’ work in research-focused AI, these works are often extended into commercial applications without consent, which hurts artists. To be fair and ethical, there needs to be a balance between the research benefits to the community as a whole and respect for the creators of IP. In this light, I am compiling a list of Vegan AI datasets. If you have datasets to contribute, comment below.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

P.S. One area that I will cover in a subsequent post is the notion of bias. Specifically, will we create more bias with Vegan AI, knowing certain groups are more likely to consent to data sharing?

Quick note: This piece took a good amount of time to research. Thank you to my friends, family, and Bill Powers for help with editing.
