Photo by Diane Picchiottino on Unsplash

NFT rarity? That’s just aesthetics with extra steps

Ilan Rosen
Published in PopRank
12 min read · Mar 24, 2022


A few months ago, we published an article detailing our approach towards NFT rarity. As we’ve onboarded more and more collections, I’ve been noticing that our rarity algorithm seems to work better for some collections than others. This led me down a rarity rabbit hole, trying different calculations and looking at what our peers in the space do, and much like Alice, at times this rabbit hole seemed to raise more questions than it answered.

The bottom of this rabbit hole, the cake at the end of the tunnel, was the decision to open-source our rarity rankings, but let’s go on that journey together.

Interesting rarity discrepancies

Let’s look at the Azuki collection, whose aesthetic I absolutely adore. I noticed something surprising with one of the rarest items:

Rarity Sniper Azuki rank 5 — Azuki NFT Rarity Ranking | Rarity Sniper

Looking at its traits and their scores, it appears that “Type — Spirit” has a rarity score of 9845, and “Special — Lightning” has a rarity score of 7958, despite “Type — Spirit” being more common than “Special — Lightning”. What’s going on?

I then looked through the rarest Wicked Craniums (shoutout to the community whose hackathon helped kickstart PopRank). Let’s look specifically at the first NFT’s “Body — EmeraldO” trait and the second NFT’s “Eyes — VR” trait.

Wicked Craniums Rarity Sniper rank 6 — The Wicked Craniums NFT Rarity Ranking | Rarity Sniper
Wicked Craniums Rarity Sniper rank 29 — The Wicked Craniums NFT Rarity Ranking | Rarity Sniper
Wicked Craniums Rarity Sniper Body — Emerald trait rarity
Wicked Craniums Rarity Sniper Eyes — VR trait rarity

Once again, we see the same discrepancy.

I checked rarity.tools too, to see whether this was a one-off, and noticed the same phenomenon: “Eyes — VR” has a higher score than “Body — EmeraldO” (117.29 and 111.41 respectively), despite “Eyes — VR” being less rare. Rarity.tools released an article in May ’21 outlining their rarity calculation, which I loved. It provided an amazing level of transparency into their approach to rarity and spoke about the benefits and drawbacks of different methods. The article was meant to be the first of a series diving further into their rarity approach, but it was the only one released. It outlines an inverse frequency rarity score calculation, which contains nothing that would produce the behaviour we’re seeing.
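For reference, a pure inverse-frequency trait score is simple to sketch. The following is a minimal illustration of the general idea described in the rarity.tools article, not their exact implementation; the NFT shape and names are my own:

```typescript
// Minimal sketch of a pure inverse-frequency trait score, in the spirit of
// the rarity.tools article. The NFT shape and names here are illustrative,
// not taken from any platform's codebase.
interface Nft {
    id: string;
    traits: Record<string, string>; // e.g. { "Type": "Spirit", "Special": "Lightning" }
}

const inverseFrequencyScores = (nfts: Nft[]): Map<string, number> => {
    // Count how many NFTs carry each "traitType|value" pair.
    const counts = new Map<string, number>();
    for (const nft of nfts) {
        for (const [traitType, value] of Object.entries(nft.traits)) {
            const key = `${traitType}|${value}`;
            counts.set(key, (counts.get(key) ?? 0) + 1);
        }
    }

    // Score = 1 / (fraction of the collection with this trait value),
    // so a rarer value always scores higher than a more common one.
    const scores = new Map<string, number>();
    for (const [key, count] of counts) {
        scores.set(key, 1 / (count / nfts.length));
    }
    return scores;
};
```

Under a score like this, a more common trait value can never outscore a rarer one, which is exactly why the numbers above looked so strange.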

Wicked Craniums rarity.tools Body — Emerald trait rarity Wicked Cranium #1175 — The Wicked Craniums | rarity.tools
Wicked Craniums rarity.tools Eyes — VR trait rarity Wicked Cranium #1359 — The Wicked Craniums | rarity.tools

I thought about this for a while, and came up with these possible answers as to why this is happening:

Rarity platforms all have similar bugs in their rarity calculation

Unlikely — both because it isn’t localised to just one platform, and because I have confidence in the engineers of these rarity platforms.

There’s some manual “massaging” of rarities depending on the collection

Due to the existence of the “These rankings were verified by the collection owners” check, this seemed possible. What does “verification” entail? How does a collection verify that the rarity is correct, and what happens if they don’t think it looks good?

Rarity platforms are taking into account more than just the relative frequency of traits

What if the rarity score we see takes into account more than just the “1/N” frequency of the trait? It could look at the distribution of different traits within a trait type or the distribution of trait types within a collection. This would definitely explain the behaviour we saw above! This feels like the most likely answer, but I still wasn’t satisfied and I still had no idea what a collection’s “verification” of rarity rankings involved.

My question for you, the reader, is: have you noticed this before? Were you aware that a more common trait might sometimes be rated as rarer than a less common one? Do you know what a collection “verifying” the rarity rankings entails? I know I hadn’t noticed any of this before, and that concerned me. Some of my financial decisions had been informed by these rarity calculations, which I now realise I didn’t understand.

How does rarity differ across platforms?

My investigation originally ended here. After I’d written the conclusion but before I’d had a chance to publish our public rarity package, I was chatting with Vasa (@vasa_develop), an absolute pusher of product, from Gem (@gemxyz). They’d recently added rarity to Gem, and Vasa had had a similar experience to mine in doing so. He brought up yet another surprising aspect of the current rarity landscape — different rarity platforms all converging on the same rankings. If platforms are each using proprietary calculations that take into account more than the individual trait’s frequency, we would expect to see noticeably different rankings across platforms… right?

I thought so too, but different platforms seem to have converged on largely identical rarity rankings (when the rankings aren’t “verified”, more on this shortly).

As an example, here are the rarest NFTs in The Littles collection (I’m omitting the first row as all the 1/1s are rank 1).

Rarity Sniper The Littles rankings the littles NFT Rarity Ranking | Rarity Sniper
Rarity Sniffer The Littles rankings the littles NFT Rarity Scores | RaritySniffer

Wow — two different platforms, yet the exact same rankings.

The light at the end of the tunnel

After I’d thought of at least 4 different rarity conspiracy theories, Vasa, by way of nftsensei.xyz’s Trontor#2268 on Discord, discovered the rarity calculation that many different sites all seem to be using.

I’ve now implemented it in our public rankings package (which anyone is free to consume/fork/share a lovely dinner with), and as you can see, we now get the exact same rankings for The Littles too. Our rankings now match all non-verified collections on other sites such as Rarity Sniper and Rarity Sniffer, which strongly suggests this is the calculation in use.

PopRank The Littles rankings | PopRank

With Vasa’s OK, here’s the formula:

New trait rarity formula

EDIT: May 3rd — While working on our rankings package, I stumbled across Rarity Punks Content — An Open Standard for Rarity Rankings (storage.googleapis.com) by RarityPunks (https://raritypunks.io/), which seems to be the source of the above calculation.

Here’s a more verbose version of the trait score for trait j’s ith trait value, and an example:

Formula for an individual trait written in more detail
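The formula itself only appears as an image, so here is a rough reconstruction in notation, based purely on the description that follows; the symbols are my own and it may differ from the exact form in the image:

```latex
% One plausible reading of the trait score, reconstructed from the prose:
%   N       = total number of NFTs in the collection
%   n_{i,j} = number of NFTs whose trait j has value i
%   V_j     = number of unique values that trait j takes in the collection
%   W_j     = a per-trait weighting constant (discussed below)
\[
  \text{score}_{i,j} \;=\; W_j \cdot \frac{1}{n_{i,j}/N} \cdot \frac{1}{V_j}
\]
```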

This formula not only takes into account a trait’s inverse frequency; it also takes into account the number of unique values of each trait. Traits that have more unique values, i.e. there are 5 different “Hat”s but 42 different “Shirt”s, will have lower resultant scores. In my opinion, an improvement to the calculation would also be to take into account the number of “None” vs non-”None” values of a trait, but I digress.

Note: The formula is simpler if we treat “Trait Count” as just a normal trait, removing 4 of the variables described in the original image, which is what we do in our implementation.

Another note: The “None” value of a trait can also be treated as a trait value in and of itself.

The most interesting component of the above calculation is Wⱼ — a trait’s weighting, a constant by which we multiply the rest of the trait value’s rarity score. What we did with Wⱼ was: after we’d parsed a collection’s traits, we went through each trait and all of its values to find the trait value with the lowest rarity score in the collection. We then chose a Wⱼ that we’d apply to every trait in the collection, such that the lowest rarity score is 1. As we use the same constant Wⱼ (or just W) for every trait, all it’s doing is scaling the end result equally for all NFTs such that the numbers are more aesthetically pleasing — no one wants to see a trait rarity score of 0.02.
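To make the W step concrete, here is a small sketch of the whole calculation under the reading above, with “Trait Count” and “None” folded in as ordinary traits. The names and shapes are mine; this is a simplified sketch rather than a copy of the @poprank/rankings code:

```typescript
// Sketch: trait scores using inverse frequency and unique-value counts, plus
// a single collection-wide W chosen so the lowest trait score becomes 1.
// Names and data shapes are illustrative only.
interface Nft {
    id: string;
    traits: Record<string, string>; // "None" and "Trait Count" included as normal traits
}

const rarityScores = (nfts: Nft[]): Map<string, number> => {
    const valueCounts = new Map<string, number>();        // "traitType|value" -> count
    const uniqueValues = new Map<string, Set<string>>();  // traitType -> set of values

    for (const nft of nfts) {
        for (const [traitType, value] of Object.entries(nft.traits)) {
            const key = `${traitType}|${value}`;
            valueCounts.set(key, (valueCounts.get(key) ?? 0) + 1);
            if (!uniqueValues.has(traitType)) uniqueValues.set(traitType, new Set());
            uniqueValues.get(traitType)!.add(value);
        }
    }

    // Unweighted score: inverse frequency, damped by the number of unique
    // values the trait type has (more unique values => lower score).
    const unweighted = new Map<string, number>();
    for (const [key, count] of valueCounts) {
        const traitType = key.split('|')[0];
        const numValues = uniqueValues.get(traitType)!.size;
        unweighted.set(key, (nfts.length / count) / numValues);
    }

    // W: one constant for the whole collection, chosen so the lowest score is 1.
    const w = 1 / Math.min(...unweighted.values());

    const scores = new Map<string, number>();
    for (const [key, score] of unweighted) {
        scores.set(key, score * w);
    }
    return scores;
};
```

An NFT’s overall rarity is then typically the sum of its trait values’ scores, and the collection is ranked by that sum.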

While we treat Wⱼ as simply W, applying it equally to every trait across the whole collection, the j subscript means it could be used to apply arbitrary weights to different traits. Is this the mythical “verification” of a collection’s rarity rankings?

I noticed the exciting new collection, Muri, had been “verified” on Rarity Sniper. Here’s what the rankings look like on our site, with the recently discovered formula applied:

PopRank Muri rankings https://poprank.io/murixhaus
Rarity Sniper Muri rankings MURI by Haus NFT Rarity Ranking | Rarity Sniper

Interesting — when a constant W is used, it looks like 2 NFTs happen to score more highly than all the 1/1s in the collection, which all end up tied in 3rd place. Rarity Sniper, on the other hand, has all the 1/1s as equal 1st.

My theory was that the “verification” process involved a collection looking at the default rankings and, if they didn’t like them, assigning arbitrary weights to different traits such that the NFTs they want to see at the top are now at the top. I can understand how, in this case, a collection wouldn’t want 2 random-seeming NFTs in the 1st and 2nd spots, ahead of all the 1/1s. This default result feels bad. To test this, given each trait’s frequency and number of unique values, the collection size, and the final trait value rarity score, I worked backwards to calculate the Wⱼ component of the formula.
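Under the reading of the formula sketched earlier, this back-calculation is just a rearrangement: given an observed trait score, the trait value’s count, the collection size, and the trait’s number of unique values, W is the only unknown. A hypothetical sketch:

```typescript
// Rearranging score = W * (N / count) / numValues to solve for W.
// Inputs are read off the rarity platform's UI and the collection's trait
// data; the function name is hypothetical.
const solveForW = (
    observedScore: number,  // trait score shown on the rarity platform
    count: number,          // number of NFTs with this trait value
    collectionSize: number, // N
    numValues: number,      // unique values for this trait type
): number => observedScore / ((collectionSize / count) / numValues);
```

Doing this for several Muri traits on each platform gives the comparison in the image below.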

Manually calculate the weight for different Muri traits, comparing PopRank and Rarity Sniper

I’ll summarise the above image for those who don’t both feel like they peaked in year 11 maths class and are trying vainly to recapture the simplicity and surety that came with solving questions that had distinct answers. I mean, uh, here’s what’s happening in the image.

For both PopRank and Rarity Sniper’s rarity calculations, I worked backwards to solve for W and find the individual weighting of each trait. Unsurprisingly, and verifiably in our public rankings package, we have a constant W for all traits in a collection. Rarity Sniper, on the other hand, has a constant weight for every trait (326) except the “Legendary” trait (391.2), which is synonymous with 1/1s. With a constant W, we saw the 1/1s in 3rd place, so presumably the “Legendary” trait’s weight was bumped up just enough to ensure that all the 1/1s came first.

PopRank now had rarity rankings that were comparable (or identical) to many other players’ in the space, but I still had one final question: did that mean we now had good rarity rankings? Evidently, collections have an idea of good and bad rankings, and use that to inform individual trait weights. I was still unsure, though, what a good rarity ranking even meant.

How is rarity even judged?

What began as a question of “how do we make our rarity rankings better?” evolved into a deeper, more fundamental question — “How do you even judge the success of a rarity calculation?”. Individual trait weighting speaks to the fact that collections have their own idea of what the rarity should look like.

As someone that developed their own rarity calculations and rankings for PopRank, I felt this firsthand. My workflow was something like:

  1. Try a new calculation
  2. Update the rankings
  3. Look at the rankings and judge if they seem rare enough
  4. If they don’t seem rare enough, go back to 1.

Let’s run it back. I, and I’d argue every other developer of a rarity calculation, would look at the rankings and judge if they seem rare enough.

This might sound like a facile argument, but really think about it. We’re taking a statistical calculation, and judging it aesthetically. We’re not looking at spreadsheets of the scarcity of traits compared with the NFT’s rankings to evaluate a given rarity calculation. We look at the rankings and judge whether they’re rare or not. Even if we have some knowledge of the relative frequency of traits, when we evaluate the end result by looking, our evaluation is inescapably aesthetic.

I seem to have finally stopped hurtling down the rabbit hole, and next to a cake that says “eat me” was this conclusion:

Rarity is a proxy for aesthetics, masquerading as a statistical calculation.

How can an artist decide which traits to make rarer than others, which parameters to tweak, if not by the aesthetic impact they have on the piece? It makes perfect sense, and I reached out to some generative artists and collection owners to get their input. One of them shared a wonderful article that deep-dives into the patterns we can see in some of the biggest generative art collections, and unsurprisingly, it focuses on how creators should engineer their rarity based on the aesthetics of the pieces. If that’s not enough, rarity.tools even released an article advising creators that “You don’t want your top/most rarest NFTs to be a bunch of bad looking ones!”. The key takeaway for me was that of course artists are going to be aware of rarity dynamics and leverage them to create the best collection possible.

How does this impact PopRank’s roadmap?

What really struck me during the research for this article was the general lack of transparency. There’s very little information available to the users of rarity platforms to help them understand why one NFT is ranked higher than another. Rarity, being one of the most objective ways by which to rank NFTs within a collection, is the easiest to make transparent, yet it’s not.

I do want to mention how great it was that rarity.tools published an article on their rarity calculation. I can empathise with what I assume is the reason they never released the rest of the series they said they would. Writing articles takes a while, and can often feel less impactful than simply writing code. Additionally, if they were ever to tweak or change their rarity calculation, it would be a pain to publish a new article every time explaining why one constant changed from 1.4 to 1.5.

That’s why we’ve decided to make our rarity and aesthetic rankings completely open-sourced. If you’re using a calculation to decide which NFT to purchase, you should be able to see exactly how it works. This package is the source of truth for all the calculations on our site, as it’s the package we’ll be consuming. Any changes made will be well explained there, and everyone will have visibility into the full git history. Code is truth.

The first of the packages we open-sourced was @poprank/rankings. This package contains all of our rarity and aesthetics rankings code, as heck, it should all be transparent. We even added an example file that will calculate the rarity rankings for all NFTs in a collection and output a simple visual of the top 100. You just need to feed it a JSON object that has an array of all the NFTs in our desired shape. Our end goal here is to build a system where users and collections can actually build their own rarity rankings, and plug those into our site to customise their experience.

This still didn’t feel like quite enough for us though, so we also open-sourced the OpenSea API wrappers we used in the past to query OpenSea for NFTs: @poprank/opensea. This package has methods that will grab all the NFTs for a collection via OpenSea and then transform them into our desired shape, ready to use with our rankings package!

The @poprank/rankings and @poprank/opensea packages are public NPM packages with MIT licenses.

Conclusion

Our primary takeaway was that, now that our rarity calculation is on par with most other players in the space, we will continue focusing on our aesthetic-based offerings. Rarity feels like it’s trying to quantify the aesthetic impact of traits on the end result, without actually taking that aesthetic impact into account. What’s more, there’s no perfect rarity calculation, as every collection is different.

Some collections, such as MountVitruvius’s Mind The Gap, have traits that have no aesthetic impact on the piece. Other collections, like Bastard Gan Punks V2, have traits that are so wild that I can’t even begin to comprehend them (what does a C̣ͦͧU̺̱̫RͨͣͫŠ͎͞E᷿ͯ͂Dͩ͢ BASTARD even begin to look like?). A single, one-size-fits-all rarity calculation feels too homogenous, too impersonal, to properly capture the nuances of different collections. Why focus on a proxy for aesthetics when we can focus on the aesthetics themselves?

We believe that the value of a collection lies in the connection between the artist and the collectors. It’s a two-way relationship, where both parties project some of themselves and their experiences onto a piece and truly connect with it. It seems that the majority of other people agree.

When we asked our followers which they preferred, 77% of respondents said they value the aesthetics of a piece over its rarity, and we’re not the only ones who’ve asked:

PopRank on Twitter: “Help us settle the age-old question — which is more important, rarity or aesthetics? 100 $LOOKS to one lucky person that votes in the poll 💚” / Twitter

https://twitter.com/ape_g4ng/status/1454865240470999040


https://twitter.com/jakeudell/status/1440724226793611265

https://twitter.com/penguin_curator/status/1466547923139846156

As always, please stop by our Discord and have a chat with us! I love nothing more than hearing the experiences of others in this space.

LFG,

Ilan, Co-Founder of PopRank
