Scraping information of all games from Steam with Python

mmmmmm44 · Published in CodeX · 10 min read · May 30, 2024

Abstract

Two lesser-known Steam APIs are introduced to retrieve all games and their information from Steam efficiently. Personal experience and techniques for efficient scraping, such as implementing checkpoints and utilizing external resources, are discussed. Extra resources for finding suitable APIs are listed at the end of this article, so that readers can rely on structured data responses for further data analysis instead of dealing with HTML web pages directly.

Foreword

Steam has long been the go-to platform for gamers worldwide, offering a vast collection of games ranging from popular triple-A titles to indie gems. The feedback provided by the gaming community plays a crucial role in shaping both the purchasing decisions of potential players and the game development process for developers. For instance, by analyzing player comments, developers can respond to players' needs swiftly by prioritizing fixes and feature development. Hence, there is ongoing interest in analyzing comments from game players.

Steam Icon

While multiple resources and articles mention different web-scraping approaches to establish a custom comment dataset, they are often scattered around the internet and rely on HTML processing instead of the available Steam APIs. Moreover, the messy official documentation of these publicly available Steam APIs lacks clarity and hinders their effective usage. In this series of articles, we will explore various useful yet lesser-known APIs within Steam and document their usage based on my personal experience of interacting with the platform.

In this article, our goal is to build a dataset regarding all the games on Steam for further data analysis. This dataset will enable us to perform data analysis tasks such as identifying popular game genres and categories, as well as understanding the overall distribution of categories on Steam. To achieve that, we will utilize two APIs with different namespaces, each serving a unique purpose.

Python will be used for this project, together with the requests package, which can be installed through pip or conda.

# pip
pip install requests

# conda
conda install anaconda::requests

Retrieving a List of All Games on Steam

To begin scraping information about all games on Steam, we need to retrieve a comprehensive list of available games. Thankfully, this can be achieved through a single convenient API: https://api.steampowered.com/ISteamApps/GetAppList/v2/.

According to the official documentation, the API accepts a GET request and returns a list of games, each with its Steam ID, namely appid, and the name of that game. The appid serves as a unique identifier for each game on Steam and will be used to retrieve detailed information.

Screenshot of running the request with Postman.
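The response body has the following shape (abridged here; the two entries shown are real Steam apps):

{
    "applist": {
        "apps": [
            { "appid": 10, "name": "Counter-Strike" },
            { "appid": 20, "name": "Team Fortress Classic" },
            ...
        ]
    }
}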

The API can be called with Python as below. Extra error handling is performed if it returns an unsuccessful status code.

req = requests.get("https://api.steampowered.com/ISteamApps/GetAppList/v2/")

# if the request is unsuccessful
if req.status_code != 200:
    print_log("Failed to get all games on steam.")
    return

After initiating the request, the body of the response can be accessed through the req object. Upon further inspection of the body, the list of appids can be extracted by visiting every JSON object in applist['apps']. The complete function to get all appids is listed below.

def get_all_app_id():
    # get all app ids through the GetAppList API
    req = requests.get("https://api.steampowered.com/ISteamApps/GetAppList/v2/")

    if req.status_code != 200:
        print_log("Failed to get all games on steam.")
        return []

    try:
        data = req.json()
    except Exception:
        traceback.print_exc(limit=5)
        return []

    apps_data = data['applist']['apps']

    apps_ids = []

    for app in apps_data:
        appid = app['appid']
        name = app['name']

        # skip apps that have an empty name
        if not name:
            continue

        apps_ids.append(appid)

    return apps_ids
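As a quick sanity check, the helper can be exercised as below (a minimal usage sketch; the count was close to 200K at the time of writing):

all_app_ids = get_all_app_id()
print(f"Total number of apps on Steam: {len(all_app_ids)}")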

Scraping Information for Each Game

With the list of appids from Steam, we can now proceed to scrape detailed information for each game. Fortunately, there is an internal API to perform this action: https://store.steampowered.com/api/appdetails. By sending a GET request to this API with a specific appid, information about a game, such as whether it is free-to-play, its categories, genres, release date, description, PC requirements, supported platforms, and more, can be retrieved from Steam. An appid is required to initiate the request; for instance, the following URL provides information about Apex Legends: https://store.steampowered.com/api/appdetails?appids=1172470

Screenshot of running the request with Postman. A successful result is returned, which means the game exists on Steam.
Screenshot of running the request with Postman. An unsuccessful result is returned, which means the game cannot be accessed for some reason, such as being removed from Steam or a regional lockout.

Performing the request and retrieving the information in Python is also simple.

# appid is a variable storing the appid of a game, retrieved in the previous section
appdetails_req = requests.get(f"https://store.steampowered.com/api/appdetails?appids={appid}")

# check whether a successful response is received
if appdetails_req.status_code == 200:
    appdetails = appdetails_req.json()
    appdetails = appdetails[str(appid)]
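The object keyed by the appid contains a success flag and, on success, the actual game information under data. Below is a minimal sketch of pulling out a few common fields; the field names follow the community documentation of /api/appdetails referenced at the end of this article, so treat them as assumptions should Steam change the schema.

if appdetails.get('success'):
    data = appdetails['data']
    name = data.get('name')
    is_free = data.get('is_free')
    # each genre entry is a dict like {'id': '1', 'description': 'Action'}
    genres = [g['description'] for g in data.get('genres', [])]
    release_date = data.get('release_date', {}).get('date')
    print(name, is_free, genres, release_date)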

It is important to note that there is a rate limit of around 200 successful requests per 5 minutes to prevent overloading the Steam server. Therefore, we need to implement additional condition handling when a request returns a status code of 429 (Too Many Requests) or 403 (Forbidden). In the case of a 429 status code, the program can attempt to resend the request every 10 seconds until a successful response is received. For a 403 status code, the program should wait for 5 minutes to comply with the rate limit duration.

The request is further wrapped in a try-except block to handle potential errors, such as failure to decode the JSON response body.

# apps_remaining_deque is a deque of appids
while len(apps_remaining_deque) > 0:
    appid = apps_remaining_deque.popleft()
    try:
        appdetails_req = requests.get(f"https://store.steampowered.com/api/appdetails?appids={appid}")

        if appdetails_req.status_code == 200:
            # the .json() method will fail if the body is not a JSON object,
            # for instance, when the game with the appid does not exist anymore
            appdetails = appdetails_req.json()
            appdetails = appdetails[str(appid)]

        elif appdetails_req.status_code == 429:
            print_log(f'Too many requests. Put App ID {appid} back to deque. Sleep for 10 sec')
            apps_remaining_deque.appendleft(appid)
            time.sleep(10)
            continue

        elif appdetails_req.status_code == 403:
            print_log(f'Forbidden to access. Put App ID {appid} back to deque. Sleep for 5 min.')
            apps_remaining_deque.appendleft(appid)
            time.sleep(5 * 60)
            continue

        else:
            print_log("ERROR: status code:", appdetails_req.status_code)
            print_log(f"Error in App Id: {appid}. Put the app to error apps list.")
            # error_apps_list is a list for storing appids with errors
            error_apps_list.append(appid)
            continue

    except Exception:
        print_log(f"Error in decoding app details request. App id: {appid}")

        traceback.print_exc(limit=5)
        # identical to an unsuccessful result with a successful response (HTTP code = 200)
        appdetails = {'success': False}
        print()

Implementing Checkpoints for Scraping Efficiency

Since there are nearly 200K games on Steam, scraping all of them at the rate-limited pace of 200 games per 5 minutes takes roughly 3 to 4 days (200,000 / 200 × 5 minutes ≈ 83 hours). To avoid starting over from scratch in case of unexpected program termination or errors, extra save-load (SL) features are implemented to save the current progress and reload from existing checkpoints.

In particular, during execution, the accessed appids are categorized into three collections.

  • apps_dict: A dictionary storing the information of scraped games, in which the key is the appid and the value is the information returned from the /api/appdetails API
  • excluded_apps_list: A list of appids which returned an unsuccessful body with a successful response (HTTP code = 200)
  • error_apps_list: A list of appids which failed to be accessed due to unresolvable errors

Every 2500 accesses (roughly equal to one hour of runtime), these three collections are serialized as pickle objects and saved to the hard disk. When the scraper (Python program) starts, it first checks for the presence of checkpoints and loads them into memory to continue from where it left off.

Source Code

from collections import deque
from datetime import datetime
import os
import time
import requests
import json

import pickle
from pathlib import Path

import traceback

def print_log(*args):
    print(f"[{str(datetime.now())[:-3]}] ", end="")
    print(*args)

def get_all_app_id():
    # get all app ids through the GetAppList API
    req = requests.get("https://api.steampowered.com/ISteamApps/GetAppList/v2/")

    if req.status_code != 200:
        print_log("Failed to get all games on steam.")
        return []

    try:
        data = req.json()
    except Exception:
        traceback.print_exc(limit=5)
        return []

    apps_data = data['applist']['apps']

    apps_ids = []

    for app in apps_data:
        appid = app['appid']
        name = app['name']

        # skip apps that have an empty name
        if not name:
            continue

        apps_ids.append(appid)

    return apps_ids


def save_checkpoints(checkpoint_folder, apps_dict_filename_prefix, exc_apps_filename_prefix, error_apps_filename_prefix, apps_dict, excluded_apps_list, error_apps_list):
    if not checkpoint_folder.exists():
        checkpoint_folder.mkdir(parents=True)

    save_path = checkpoint_folder.joinpath(
        apps_dict_filename_prefix + '-ckpt-fin.p'
    ).resolve()

    save_path2 = checkpoint_folder.joinpath(
        exc_apps_filename_prefix + '-ckpt-fin.p'
    ).resolve()

    save_path3 = checkpoint_folder.joinpath(
        error_apps_filename_prefix + '-ckpt-fin.p'
    ).resolve()

    save_pickle(save_path, apps_dict)
    print_log(f'Successfully create app_dict checkpoint: {save_path}')

    save_pickle(save_path2, excluded_apps_list)
    print_log(f"Successfully create excluded apps checkpoint: {save_path2}")

    save_pickle(save_path3, error_apps_list)
    print_log(f"Successfully create error apps checkpoint: {save_path3}")

    print()


def load_pickle(path_to_load: Path):
    with open(path_to_load, "rb") as handle:
        obj = pickle.load(handle)
    return obj

def save_pickle(path_to_save: Path, obj):
    with open(path_to_save, 'wb') as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)

def check_latest_checkpoints(checkpoint_folder, apps_dict_filename_prefix, exc_apps_filename_prefix, error_apps_filename_prefix):
    all_pkl = []

    # get all pickle files in the checkpoint folder (top level only)
    for root, dirs, files in os.walk(checkpoint_folder):
        all_pkl = list(map(lambda f: Path(root, f), files))
        all_pkl = [p for p in all_pkl if p.suffix == '.p']
        break

    # collect the checkpoint files for each object, then sort them;
    # the latest checkpoint file for each object is the last element in each list
    apps_dict_ckpt_files = [f for f in all_pkl if apps_dict_filename_prefix in f.name and "ckpt" in f.name]
    exc_apps_list_ckpt_files = [f for f in all_pkl if exc_apps_filename_prefix in f.name and "ckpt" in f.name]
    error_apps_ckpt_files = [f for f in all_pkl if error_apps_filename_prefix in f.name and 'ckpt' in f.name]

    apps_dict_ckpt_files.sort()
    exc_apps_list_ckpt_files.sort()
    error_apps_ckpt_files.sort()

    latest_apps_dict_ckpt_path = apps_dict_ckpt_files[-1] if apps_dict_ckpt_files else None
    latest_exc_apps_list_ckpt_path = exc_apps_list_ckpt_files[-1] if exc_apps_list_ckpt_files else None
    latest_error_apps_list_ckpt_path = error_apps_ckpt_files[-1] if error_apps_ckpt_files else None

    return latest_apps_dict_ckpt_path, latest_exc_apps_list_ckpt_path, latest_error_apps_list_ckpt_path

def main():
    print_log("Started Steam scraper process", os.getpid())

    apps_dict_filename_prefix = 'apps_dict'
    exc_apps_filename_prefix = 'excluded_apps_list'
    error_apps_filename_prefix = 'error_apps_list'

    apps_dict = {}
    excluded_apps_list = []
    error_apps_list = []

    all_app_ids = get_all_app_id()

    print_log('Total number of apps on steam:', len(all_app_ids))

    # path = project directory (i.e. steam_data_scraping)/checkpoints
    checkpoint_folder = Path('checkpoints').resolve()

    print_log('Checkpoint folder:', checkpoint_folder)

    if not checkpoint_folder.exists():
        print_log(f'Fail to find checkpoint folder: {checkpoint_folder}')
        print_log('Start at blank.')

        checkpoint_folder.mkdir(parents=True)

    latest_apps_dict_ckpt_path, latest_exc_apps_list_ckpt_path, latest_error_apps_list_ckpt_path = check_latest_checkpoints(checkpoint_folder, apps_dict_filename_prefix, exc_apps_filename_prefix, error_apps_filename_prefix)

    if latest_apps_dict_ckpt_path:
        apps_dict = load_pickle(latest_apps_dict_ckpt_path)
        print_log('Successfully load apps_dict checkpoint:', latest_apps_dict_ckpt_path)
        print_log(f'Number of apps in apps_dict: {len(apps_dict)}')

    if latest_exc_apps_list_ckpt_path:
        excluded_apps_list = load_pickle(latest_exc_apps_list_ckpt_path)
        print_log("Successfully load excluded_apps_list checkpoint:", latest_exc_apps_list_ckpt_path)
        print_log(f'Number of apps in excluded_apps_list: {len(excluded_apps_list)}')

    if latest_error_apps_list_ckpt_path:
        error_apps_list = load_pickle(latest_error_apps_list_ckpt_path)
        print_log("Successfully load error_apps_list checkpoint:", latest_error_apps_list_ckpt_path)
        print_log(f'Number of apps in error_apps_list: {len(error_apps_list)}')

    # remove app ids that are already scraped, excluded, or errored
    all_app_ids = set(all_app_ids) \
                  - set(map(int, apps_dict.keys())) \
                  - set(map(int, excluded_apps_list)) \
                  - set(map(int, error_apps_list))

    # queue up the remaining apps
    apps_remaining_deque = deque(set(all_app_ids))

    print_log('Number of remaining apps:', len(apps_remaining_deque))

    i = 0
    while len(apps_remaining_deque) > 0:
        appid = apps_remaining_deque.popleft()

        # test whether the game exists
        # by requesting the details of the app
        try:
            appdetails_req = requests.get(f"https://store.steampowered.com/api/appdetails?appids={appid}")

            if appdetails_req.status_code == 200:
                appdetails = appdetails_req.json()
                appdetails = appdetails[str(appid)]

            elif appdetails_req.status_code == 429:
                print_log(f'Too many requests. Put App ID {appid} back to deque. Sleep for 10 sec')
                apps_remaining_deque.appendleft(appid)
                time.sleep(10)
                continue

            elif appdetails_req.status_code == 403:
                print_log(f'Forbidden to access. Put App ID {appid} back to deque. Sleep for 5 min.')
                apps_remaining_deque.appendleft(appid)
                time.sleep(5 * 60)
                continue

            else:
                print_log("ERROR: status code:", appdetails_req.status_code)
                print_log(f"Error in App Id: {appid}. Put the app to error apps list.")
                error_apps_list.append(appid)
                continue

        except Exception:
            print_log(f"Error in decoding app details request. App id: {appid}")

            traceback.print_exc(limit=5)
            appdetails = {'success': False}
            print()

        # not success -> the game does not exist anymore;
        # add the app id to the excluded apps list
        if not appdetails['success']:
            excluded_apps_list.append(appid)
            print_log(f'No successful response. Add App ID: {appid} to excluded apps list')
            continue

        appdetails_data = appdetails['data']

        appdetails_data['appid'] = appid

        apps_dict[appid] = appdetails_data
        print_log(f"Successfully get content of App ID: {appid}")

        i += 1
        # save a checkpoint every 2500 successfully scraped apps
        if i >= 2500:
            save_checkpoints(checkpoint_folder, apps_dict_filename_prefix, exc_apps_filename_prefix, error_apps_filename_prefix, apps_dict, excluded_apps_list, error_apps_list)
            i = 0

    # save checkpoints at the end
    save_checkpoints(checkpoint_folder, apps_dict_filename_prefix, exc_apps_filename_prefix, error_apps_filename_prefix, apps_dict, excluded_apps_list, error_apps_list)

    print_log(f"Total number of valid apps: {len(apps_dict)}")
    print_log(f"Total number of skipped apps: {len(excluded_apps_list)}")
    print_log(f"Total number of error apps: {len(error_apps_list)}")

    print_log('Successful run. Program Terminates.')

if __name__ == '__main__':
    main()

Detailed usage of the scraper can be found on my GitHub.

Utilizing additional resources

Finding the right API can be tedious and troublesome due to the lack of comprehensive documentation from Steam. Fortunately, the community has been maintaining different services and documentation to guide future professionals in interacting with these APIs.

  • https://github.com/Revadike/InternalSteamWebAPI: A GitHub repo documenting different internal Steam web APIs in great detail, which is a valuable resource for professionals who want to extract information from Steam's database without going through its HTML pages. I referenced its documentation of the /api/appdetails API while developing this scraping program.
  • Steam Web API Documentation and Tester: A static web page built by @xPaw which lists all official Steam Web APIs available from their server.

Future works

Exploratory Data Analysis (EDA) can be performed on the list of games on Steam.

Since there is no single API to retrieve all available categories and genres on Steam, a custom list can be compiled by visiting every app in the dataset we created. The distribution of the number of games across different genres can then be explored, as sketched below.

In fact, there are over 40 official category tags and over 30 official genre tags on Steam.
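As a starting point, here is a minimal sketch for counting genre frequencies from a scraped checkpoint. The checkpoint path matches the one produced by the scraper above, and the genres field follows the /api/appdetails schema.

from collections import Counter
import pickle

# load the final checkpoint produced by the scraper above
with open('checkpoints/apps_dict-ckpt-fin.p', 'rb') as f:
    apps_dict = pickle.load(f)

genre_counter = Counter()
for appdetails_data in apps_dict.values():
    # each genre entry is a dict like {'id': '1', 'description': 'Action'}
    genre_counter.update(g['description'] for g in appdetails_data.get('genres', []))

# the 10 most common genres and their game counts
print(genre_counter.most_common(10))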

Closing

By leveraging the power of Python and Steam’s APIs, we can unlock valuable insights into the world of gaming and make data-driven decisions. Stay tuned for upcoming articles in this series, where we will delve into more advanced topics and explore further possibilities for analysis.

This is Part 1 of the series "Scraping Steam: A Comprehensive Guide (2024 ver)".

References

Multiple Stack Overflow threads have discussed how to access information of all games from Steam.

The official documentation of the GetAppList API from Steam: ISteamApps Interface (Steamworks Documentation)

Unofficial documentation of the /api/appdetails API maintained by the community. The API accepts extra parameters to control its response, such as language and countrycode, which are well documented there: https://github.com/Revadike/InternalSteamWebAPI/wiki/Get-App-Details
