Recently a new method of estimating sales data for games on Steam showed up on the site barter.vg. They were looking at achievement data and working out how many players would be necessary to produce the percentages shown. For example, if an achievement had “50%” of players achieving it, that would imply at least 2 players. “33%” implies at least 3. If a game has both of those, it implies at least 6 players (3/6 for the 50%, 2/6 for the 33%).
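To make the 50%/33% example concrete, here’s a minimal sketch of that reasoning in Python. It assumes the percentages are exact fractions (1/2 and 1/3), which real rounded data isn’t, and the function name `min_players` is just something made up for illustration, not anything from barter.vg.

```python
from fractions import Fraction
from math import lcm  # Python 3.9+


def min_players(shares):
    """Smallest player count for which every share is a whole number of players."""
    # Fraction reduces to lowest terms, so each denominator is the smallest
    # audience that could produce that share on its own.
    return lcm(*(Fraction(s).denominator for s in shares))


# The example from the text: one achievement at 1/2, another at 1/3.
print(min_players([Fraction(1, 2), Fraction(1, 3)]))  # prints 6
```

The least common multiple of the denominators is the smallest audience that works for every achievement at once.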
This was super intriguing to me because I had thought about the same thing in the past when looking at achievement stats shortly after launch. But the Steam community page rounds everything to 1 decimal place, so I concluded that it was a futile endeavor: any multiple of 1000 would fit all possible achievement percentages. Yet barter.vg had really accurate numbers, which seemed impossible.
This was brought up with a dev group I’m in, and it was quickly pointed out that if you get achievement data through Steam’s API, you get 16 digits of precision instead! I set out to try and replicate barter.vg’s algorithm based on the description of it on their site: “Calculated by finding the lowest number of player that produces whole numbers of players for each achievement (percent achieved * all players)”.
So I got it working with a simple brute force. It checked every possible whole number of sales up to a cap and multiplied it by the achievement percentages. None of them exactly hit a whole number, so I had to set a threshold for what counts as a “whole number”. It worked for most games with fewer than a million sales, spitting out results that matched what was reported on barter.vg. But any big sellers like Terraria or PUBG just gave garbage results, and barter.vg didn’t report stats for those games either. I set out to try and improve it further and get it working on huge games as well.
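A rough sketch of that first brute force might look like the following. The cap, the tolerance value and the function name are all assumptions picked for illustration; they aren’t the values barter.vg or my original script used.

```python
def brute_force_players(percents, cap=1_000_000, tolerance=1e-6):
    """Return the first player count up to cap that fits every percentage, or None."""
    for players in range(1, cap + 1):
        for pct in percents:
            achieved = pct / 100 * players
            if abs(achieved - round(achieved)) > tolerance:
                break  # not a whole number of players for this achievement
        else:
            return players  # every achievement landed on a whole number
    return None


print(brute_force_players([50.0, 100 / 3]))  # prints 6
```

The tolerance is the weak spot: the bigger the player count, the more candidates land within the threshold of a whole number by accident, which would explain why huge games gave garbage.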
I figured that if you have the sales numbers and achievement stats for a game, you should be able to divide them and get exactly the numbers that Valve was reporting through their API. For example, according to this method the number of people who have gotten the Beat The Game Deathless achievement in my game The End is Nigh is 8 out of 62587 (1 legit and 7 cheaters). Dividing those gives .01278220716762267%, but the Steam API reports 0.012782207690179348%. They’re close but they don’t match exactly. Dividing them as 32-bit floats instead gives .012782207%, which again doesn’t match exactly.
You can’t just attribute this to floating point error. I mean, it literally is about floating point error, but floating point error doesn’t make stuff give random outputs. It’s still predictable. The same division should result in the same output, and doing the same calculation that Valve does on their backend should give the same output. After some more experimenting I found out that if you do the division with floats and then convert the result to a double, it EXACTLY equals what Valve is reporting through their API. Like, double==double exact. No epsilons needed.
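Here’s a small sketch of that float-then-double trick in Python, using the `struct` module to round values to 32 bits. Two things are assumptions on my part: the helper names `to_f32` and `api_percent`, and the order of operations (divide first, then multiply by 100, rounding to a float after each step). Valve’s actual backend code isn’t public, so treat this as one plausible reconstruction.

```python
import struct


def to_f32(x):
    """Round a value to the nearest 32-bit float, returned as a Python float (a double)."""
    return struct.unpack("f", struct.pack("f", float(x)))[0]


def api_percent(achieved, players):
    """Percentage computed in 32-bit float arithmetic, then widened to a double."""
    # Assumed order of operations: divide, then scale by 100.
    ratio = to_f32(to_f32(achieved) / to_f32(players))
    return to_f32(ratio * 100.0)


print(repr(api_percent(8, 62587)))  # float-precision result, shown as a double
print(repr(100 * 8 / 62587))        # plain double division, for comparison
```

Rounding a double result back to 32 bits gives the same answer as doing a single float operation on float inputs, which is why this emulation works without a float type in the language.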
So I went and made a new brute force checker that “goes in reverse”: it tries to find pairs of numbers that give the stats reported by the Steam API when divided in this way. This removes floating point error from the equation entirely, because whatever error is involved gets canceled out since both sides are doing the “same calculation”. This was EXTREMELY accurate, and even worked for games like PUBG and TF2, which have more players than can be accurately represented with a float.
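The reverse search can be sketched like this. It’s a simplified Python illustration, not the C++ from the repository: the helper names, the divide-then-multiply order, and the trick of only testing the nearest few achiever counts are all assumptions. The input is synthetic, generated by the same function, since I’m not pasting real API output here.

```python
import struct


def to_f32(x):
    """Round a value to the nearest 32-bit float, returned as a double."""
    return struct.unpack("f", struct.pack("f", float(x)))[0]


def api_percent(achieved, players):
    """Assumed backend calculation: float division, float multiply by 100."""
    ratio = to_f32(to_f32(achieved) / to_f32(players))
    return to_f32(ratio * 100.0)


def fits(reported, players):
    """True if some whole number of achievers reproduces `reported` exactly."""
    guess = round(reported / 100 * players)
    return any(
        0 <= k <= players and api_percent(k, players) == reported
        for k in (guess - 1, guess, guess + 1)
    )


def find_players(reported_percents, cap):
    """Smallest player count up to cap that reproduces every reported value."""
    for players in range(1, cap + 1):
        if all(fits(p, players) for p in reported_percents):
            return players
    return None


# Synthetic "API output" for 8 and 31000 achievers out of 62587 players.
reported = [api_percent(8, 62587), api_percent(31000, 62587)]
print(find_players(reported, cap=100_000))  # prints 62587
```

Note the comparison is a plain `==` with no tolerance, which is the whole point: a candidate either reproduces the reported double bit for bit or it doesn’t.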
The thing that’s interesting about this method is that it returns exact numbers. The old method SteamSpy was using required random sampling of user profiles and extrapolating from that to the whole Steam audience. It was accurate but included some pretty big error bars, especially for games with low numbers of sales. The new achievement-based method doesn’t have this weakness. There is no random sampling or uncertainty* here: the tool spits out the exact number Valve is using to calculate achievement data, and it does it instantly on a single snapshot of data instead of requiring data collected over time. You don’t need a server to use this method either. You can just run a simple script and get an answer in seconds.
(*For games with a small number of achievements, there’s a chance that all the achievement fractions reduce by a common factor, which makes the tool report a smaller number than the real one. This is mostly negligible: for games with 1 cheevo there’s a 30% chance it can reduce, and it drops exponentially with each additional achievement. You can just check a few days in a row and grab the highest number if you wanna be certain.)
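To illustrate the common-factor case: if the true audience is 12 and two achievements have 6 and 4 unlockers, the fractions are indistinguishable from 3/6 and 2/6, so a search for the smallest fit would report 6. A tiny sketch, with a made-up function name, shows how much a given set of counts would shrink:

```python
from math import gcd  # the multi-argument form needs Python 3.9+


def shrink_factor(players, achieved_counts):
    """Factor by which the smallest-fit search would undercount the real audience."""
    return gcd(players, *achieved_counts)


print(shrink_factor(12, [6, 4]))    # prints 2, so the search would report 6
print(shrink_factor(62587, [8]))    # prints 1, so nothing reduces
```

Every extra achievement adds another number that has to share the factor, which is why the odds fall off so quickly.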
The only caveat here is that this measures… whatever stat Valve is collecting with their achievement data. It’s not quite “owners” and it’s not quite “players”. It doesn’t quite match up with the stat called “players” or “downloads” on our sales reports; it overestimates it by various amounts. I’m unsure if it’s collecting data for pirated copies or family sharing or whatever. The stat is still close enough to be basically just as useful as old SteamSpy was, possibly more so, since players is more useful than owners anyway when trying to figure out what people like.
I passed the code along to SteamSpy last weekend once I got it working so they could integrate it with their site, since I have no interest in building a competitor or anything, and SteamSpy was a really valuable industry tool before it got neutered in April. SteamSpy got the code working and it’s now being used on their site. I’ve also decided that it’s only fair to just open source this code entirely, so here’s a repository for it on GitHub. It includes the C++ that does the heavy lifting, and a Python adapter for it that gets achievement data from Steam and passes it off to the C++ code. Feel free to play around with it.
UPDATE (7/4): Looks like Valve is rounding numbers on the API now, so this method no longer works.