Can You Leverage Computer Vision To Make Money With NFTs?
If you’ve been keeping an eye on emerging tech over the last few years, you’ve probably heard of NFTs. Million-dollar auctions and stories of people making life-changing money flipping NFTs have attracted the attention of artists, collectors, and now, with the resurgence of the metaverse, investors and brands.
Although the NFT craze reached (near-)critical mass in late 2020 — early 2021, NFTs have been around for a while. In spirit, we have had NFTs for decades in the form of collectibles, artworks, and trading cards. Even the digital asset version has been with us since the beginning of the 2010s.
It was the creation of the ERC-721 and ERC-1155 token standards and the emergence of secondary marketplaces like OpenSea and Rarible that caused the explosion into mainstream adoption.
In this article, we’ll discuss the role of computer vision in NFTs and use GANs to create some NFT-worthy artwork. But before we get into that, let’s briefly go over what an NFT is.
What are NFTs?
NFT stands for Non-Fungible Token. That’s not very helpful on its own, since the term ‘fungible’ leaves a lot to be explained. In simple terms, fungible means replaceable or interchangeable. If something is fungible, it can be replaced easily. The phone or laptop you’re currently reading this on is quite fungible: sure, there might be some personal data on it, but the device itself is replaceable. You, on the other hand, are non-fungible.
In computer science terms, NFTs are bits of data stored on a (preferably decentralized) blockchain ledger. NFTs act as a certificate of authenticity. What makes them really promising is that verifying authenticity doesn’t explicitly require a centralized authority, just code. Ideally, NFTs cannot be altered or stolen, thanks to the cryptography that secures the blockchain. But there are always human vulnerability vectors (*coughs* BAYC seed phrase phishing scams *coughs*).
Now, how blockchains work, how the owners of NFTs can enforce their proprietorship (against right-click-save exploits 😬), and the general utility of the tech are things we’re not going to look into here. As I see it, given an appropriate amount of time, the free markets will separate the wheat from the chaff.
Generative Adversarial Networks (GANs)
A Generative Adversarial Network (GAN) is a class of deep learning frameworks designed by Ian Goodfellow and his colleagues at the University of Montreal. A GAN typically consists of two neural networks, a generator and a discriminator, competing with each other in a game.
The core idea of a GAN is the indirect training of the generator through the discriminator, which is responsible for telling how ‘realistic’ an input is. But the discriminator network itself gets updated dynamically over time. This means that the generator is not trained to minimize the distance to a specific image, but rather to fool the discriminator. This enables the generator to learn in an unsupervised manner.
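To make that adversarial setup concrete, here is a minimal sketch of a single GAN training step in PyTorch. Everything here is an assumption for illustration: the generator G, discriminator D, their optimizers, and the batch of real images are placeholders, not code from any specific library.
import torch
import torch.nn.functional as F

def gan_train_step(G, D, opt_G, opt_D, real_images, latent_dim=128):
    # assumes G maps noise -> images, D maps images -> a single realness logit,
    # and that all tensors live on the same device
    batch_size = real_images.size(0)
    z = torch.randn(batch_size, latent_dim)

    # 1) update the discriminator: real images should be scored as real, fakes as fake
    fake_images = G(z).detach()
    d_loss = F.binary_cross_entropy_with_logits(D(real_images), torch.ones(batch_size, 1)) \
           + F.binary_cross_entropy_with_logits(D(fake_images), torch.zeros(batch_size, 1))
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # 2) update the generator: it is rewarded for making the discriminator call fakes real
    fake_images = G(z)
    g_loss = F.binary_cross_entropy_with_logits(D(fake_images), torch.ones(batch_size, 1))
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()

    return d_loss.item(), g_loss.item()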
But the problem with using a GAN to create things is guiding the generation process. Let’s say we somehow train a GAN to generate impeccable recreations of artworks like The Death of Socrates and Almond Blossoms; the problem now becomes how we reliably generate what we want to generate.
You see, the generator network of a GAN consists of an encoder and a decoder. The encoder network takes input images and learns their vector representation in the latent space.
So given a bunch of photographs of dogs and cats, a good encoder network will find two groups of vectors in the latent space corresponding to the two types of images.
Now, if we simply sample random vectors from the encoder's latent space, we’ll get unsatisfactory results. To properly leverage the generator network, we need to guide the generation process effectively, so that when we want an image of a cat, the decoder knows where in the latent space to sample from. This is where OpenAI’s CLIP revolutionized the world of generative art.
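As a rough illustration of what ‘knowing where to sample’ means, here is a hypothetical sketch (the encoder and decoder are placeholders, not a specific model): encode a handful of cat images, take the centre of their latent vectors, and decode a point near that centre instead of a completely random one.
import torch

def generate_cat_like(encoder, decoder, cat_images, noise_scale=0.1):
    # encoder: images -> latent vectors, decoder: latent vector -> image (both hypothetical)
    with torch.no_grad():
        cat_latents = encoder(cat_images)        # (N, latent_dim)
        cat_center = cat_latents.mean(dim=0)     # centre of the 'cat' region of the latent space
        # sample close to the cat cluster rather than anywhere in the latent space
        z = cat_center + noise_scale * torch.randn_like(cat_center)
        return decoder(z.unsqueeze(0))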
CLIP
Simply put, CLIP can be thought of as a bridge between computer vision and natural language processing. Computer vision models aim to see and understand images the way humans do, and NLP models try to understand natural language in a similarly human way. The two domains are intertwined: when we see a dog, we can label it a ‘dog’, and when we read the word ‘dog’, the image of a dog pops up in our head.
To connect images and text together, they need to be embedded in a common vector space. Even if the idea of embeddings seems new to you, chances are you’ve worked with them before. Ever created an image classifier? Every layer except the last creates embeddings for the image. Worked with text data? Then you’ve almost certainly used models like Word2Vec or Sentence Transformers to create vector embeddings for the text.
Let’s say you have an image of some apples and bananas, you can represent that as a point on a 2-D plane like this:
Now, this doesn’t seem very useful, but we just embedded the information from the image in a simple two-dimensional vector space. If we wanted to include the information about the hand, we could add another feature and make a three-dimensional embedding for the image. In practice, we want to encode more information than just the presence or absence of objects. So, encoders have way more dimensions, think 100-dimensional hyperspheres.
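Sticking with the toy example above, such an embedding is nothing more than a small vector of numbers (the counts here are made up purely for illustration):
import numpy as np

# 2-D embedding: [number of apples, number of bananas] in the image
image_embedding = np.array([3, 2])

# adding a feature for the hand gives a 3-D embedding: [apples, bananas, hand present]
image_embedding_3d = np.array([3, 2, 1])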
The CLIP model consists of two encoders:
- a text encoder to embed text into the latent space, and
- an image encoder to embed images into the same latent space.
The combination of these two encoders enables CLIP to learn generalized cross-domain text-image relationships. Once the CLIP model has been trained it can be used to guide the image generation process of GAN models.
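To get a feel for what this looks like in practice, here is a small sketch using OpenAI’s clip package (installable from the openai/CLIP GitHub repo); the image file dog.jpg and the candidate captions are placeholders:
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# embed one image and a few candidate captions into the shared latent space
image = preprocess(Image.open("dog.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a dog", "a photo of a cat", "a castle of glass"]).to(device)

with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

print(probs)  # the caption that best matches the image gets the highest probability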
How CLIP Helps Us Generate Images From Prompts
The following represents the simplified architecture that we’ll use later to generate images from prompts using the CLIP and GAN models like BigGAN or VQGAN.
The idea is simple: we start with random values for the latent vectors from the GAN’s latent space and use them to generate an image. This will usually produce a noisy image; this initial image is passed to CLIP along with our text prompt, and CLIP generates a score representing how well the image matches the prompt.
Using this score, the values of the latent vectors are updated via backpropagation and gradient descent, as if they were weights in a neural network. The GAN then uses the updated latent vectors to generate another image, and the cycle repeats until CLIP decides that the generated image sufficiently resembles the prompt.
In simple terms, the GAN generates the images, while CLIP judges how well each image resembles the text prompt.
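Here is a heavily simplified sketch of that loop. The generator, the clip_score function, and the latent dimension are stand-ins for illustration, not the actual BigGAN/VQGAN or CLIP interfaces:
import torch

def clip_guided_generation(generator, clip_score, prompt, latent_dim=256, steps=500, lr=5e-2):
    # the latent vector itself is what we optimize, exactly as if it were a set of weights
    latent = torch.randn(1, latent_dim, requires_grad=True)
    optimizer = torch.optim.Adam([latent], lr=lr)

    for step in range(steps):
        image = generator(latent)          # GAN: latent vector -> image
        score = clip_score(image, prompt)  # CLIP: how well does the image match the prompt?
        loss = -score                      # maximizing the score = minimizing its negative
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return generator(latent).detach()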
Create Art Using GANs
There’s a wide array of GANs you can use to create art, so it can be a bit confusing to know where to start. In my experience, the easiest place to start is the BigSleep package. BigSleep is an easy-to-use wrapper module that makes Ryan Murdock‘s combination of BigGAN and CLIP accessible to everyone, even total newbies.
BigSleep
Install the BigSleep module
pip install big-sleep --upgrade
Generate images from text prompts
from tqdm.notebook import trange
from IPython.display import Image, display
from big_sleep import Imagine

# hyperparameters for the run
TEXT = 'dream of a thought'   # try other prompts too, e.g. 'upside down tree'
SAVE_EVERY = 100
SAVE_PROGRESS = True
LEARNING_RATE = 5e-2
ITERATIONS = 1000
SEED = 0

model = Imagine(
    text = TEXT,
    save_every = SAVE_EVERY,
    lr = LEARNING_RATE,
    iterations = ITERATIONS,
    save_progress = SAVE_PROGRESS,
    seed = SEED
)

# run the optimization and display the intermediate results as they are saved
for epoch in trange(20, desc = 'epochs'):
    for i in trange(1000, desc = 'iteration'):
        model.train_step(epoch, i)
        if i == 0 or i % model.save_every != 0:
            continue
        filename = TEXT.replace(' ', '_')
        image = Image(f'./{filename}.png')
        display(image)
You can play around with this code and create your own artwork in this notebook.
Here are a few things I was able to muster up in a few days of tinkering:
An interesting idea worth exploring is creating a video or GIF of the output images from the intermediate iterations, i.e., basically the whole creation process. Take a look at this GIF I created with the prompt “castle of glass”:
You can make something like this using the output images from BigSleep and the Pillow module:
import glob
from PIL import Image

# prompt is the text prompt used above, e.g. prompt = 'castle of glass'
name = "_".join(prompt.lower().split(" "))

# filepaths
fp_in = f"/path/to/{name}_*.png"
fp_out = f"/path/to/{name}.gif"

# https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html#gif
img, *imgs = [Image.open(f) for f in sorted(glob.glob(fp_in))]
img.save(fp=fp_out, format='GIF', append_images=imgs,
         save_all=True, duration=200, loop=0)
If the basic notebook doesn’t cut it for you, you can explore the original notebook that gives you more granular control over the whole process.
VQGAN+CLIP
If BigSleep doesn’t quench your thirst for generative art, you can give VQGAN+CLIP a try. While almost all BigGAN generators are trained on the ImageNet dataset, there are VQGAN-based text-to-image generators trained on bigger, more diverse datasets such as WikiArt and S-FLCKR. Architecture-wise, VQGAN leverages transformers to model more global dependencies in the images, allowing it to produce more coherent-looking images.
Another noteworthy thing about the VQGAN+CLIP generation paradigm is the ability to dictate the style of the final output. Say you want an oil-painting aesthetic: add that to your prompt and the GAN will take it into account. In addition, there is the option to guide the generation process with an init_image and a target_image.
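Conceptually, and with placeholder helpers rather than the notebook’s actual API, init_image seeds the latent optimization instead of random noise, while target_image adds a second similarity term to the loss:
import torch

def init_latent(vqgan_encode, init_image=None, latent_dim=256):
    # hypothetical vqgan_encode: image -> latent; fall back to noise if no init image is given
    if init_image is not None:
        return vqgan_encode(init_image).clone().requires_grad_(True)
    return torch.randn(1, latent_dim, requires_grad=True)

def vqgan_clip_loss(generator, clip_sim, latent, prompt, target_image=None):
    # hypothetical clip_sim: similarity between an image and a prompt (or another image)
    image = generator(latent)
    loss = -clip_sim(image, prompt)                  # match the text prompt
    if target_image is not None:
        loss = loss - clip_sim(image, target_image)  # also pull the output toward the target image
    return loss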
Take a look at this small video I created using a quote from one of my favorite games, Shogun 2:
There are a lot of VQGAN+CLIP variants out there; you can start with this notebook, or refer to this list of variants. There are also a bunch of no-code methods in the list, so if that’s your thing, I’ve got you covered 😉
Voila! You now have the basic tools to create something worthwhile. Sure, it will take a whole lot of experimenting and waiting, but if, like me, you can’t draw or create visual art on your own, this is a good way to create things.
Unless you’re rich enough for Ethereum layer 1 (read: millionaire), I recommend you stick to minting your NFTs on cheaper layer 1s like Solana or Terra, and if you really value Ethereum’s security, go for layer 2 chains such as Polygon or Arbitrum.
If you learned something in this article and want to get into computer vision, then check out our extensive computer vision courses HERE. Not only do we offer courses that cover state-of-the-art models like YOLOR, YOLOX, and SiamMask, but there are also guided projects such as pose/gesture detection and creating your very own smart glasses.
References:
[1] BigGAN GitHub
[2] CLIP Paper
[3] BigSleep GitHub
[4] BigSleep Notebook
[5] VQGAN Paper
[6] VQGAN+CLIP Notebook
[7] Illustrated VQGAN
[8] List Of VQGAN+CLIP variants