Efficiently Scraping Steam Game Reviews with Python: A Comprehensive Guide

mmmmmm44 · Published in CodeX · 7 min read · Jun 15, 2024

Abstract

A Steam API for efficiently retrieving all reviews of a game, along with related information, is introduced. Personal experience interacting with the API is discussed to motivate the different error-handling procedures. Extra functionality is implemented to scrape reviews within a fixed time range.

Forewords

In our previous article, we discussed the process of using two lesser-known Steam APIs to create a customized dataset of all games available on Steam. In this article, we will delve into the effective scraping of comments using a Steam API to generate a personalized comment dataset for further analysis, such as computing the ratio of positive to negative reviews, the distribution of review lengths, and the distribution of playtime. To achieve this, we will scrape all the comments of a few games using the API described below.

Python will be used for this project. The only dependency is the requests package, which can be installed using pip or conda:

# pip
pip install requests

# conda
conda install anaconda::requests

API introduction

This internal API functions as a GET request, providing comments left by different players for a specific game identified by the appid field. The appid serves as a unique identifier for each game or application on Steam and can be found in the URL of every app page. For instance, the page for the game "Elden Ring" is at https://store.steampowered.com/app/1245620/ELDEN_RING/, where the number following "app", 1245620, is the appid of the game.
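As a side note, the appid can be extracted from any store URL programmatically. Below is a minimal sketch; the helper name appid_from_url is our own, not part of any Steam library:

```python
import re

def appid_from_url(url: str) -> int:
    """Extract the numeric appid from a Steam store URL."""
    match = re.search(r"/app/(\d+)", url)
    if match is None:
        raise ValueError(f"No appid found in URL: {url}")
    return int(match.group(1))

print(appid_from_url("https://store.steampowered.com/app/1245620/ELDEN_RING/"))  # 1245620
```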

According to the official documentation, the API provides different fields to control the response data format. Since we want to retrieve all comments of a game, we use the value "recent" for the "filter" option. We also want English-only comments, so we set the "language" option to "english". Please refer to the table in the official documentation for detailed usage of the API.

The final parameters for these fields are listed below. Default values are used for parameters not listed. The "json" parameter specifies the response format as JSON; otherwise, HTML is returned.

params = {
    'json': 1,            # set the return result format to JSON
    'language': 'english',
    'cursor': '*',        # set the cursor to retrieve reviews from a specific "page"
    'num_per_page': 100,  # retrieve more comments per request to reduce the number of API calls
    'filter': 'recent'
}

To take a first look at the result of the API with the parameters above, Postman is used to test it.

Successful result of calling the API in Postman.

If an invalid appid is inserted, the API will return the “default” response as shown below.

Empty successful result of the response of the API in Postman.
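To make the two cases above concrete without Postman, here is an abbreviated sketch of both response shapes. The field values are illustrative, but the keys (success, query_summary, reviews, cursor) are the ones the scraper below relies on:

```python
# A (heavily abbreviated) successful response for a valid appid.
valid_response = {
    'success': 1,
    'query_summary': {'num_reviews': 100},
    'reviews': [
        {'recommendationid': '100000000', 'review': 'Great game.', 'voted_up': True},
        # ... up to num_per_page entries
    ],
    'cursor': 'AoJw...',  # opaque token pointing at the next "page"
}

# The "default" response for an invalid appid: still success = 1, but empty.
default_response = {
    'success': 1,
    'query_summary': {'num_reviews': 0},
    'reviews': [],
}

# This is why checking the success field alone is not enough:
print(default_response['success'])                       # 1
print(default_response['query_summary']['num_reviews'])  # 0
```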

Scraping comments of a particular game on Steam

With a thorough understanding of the API, we can start developing the scraper for scraping reviews of different games. A function is defined to call the API and provide basic error handling, ensuring a dictionary is always returned for further processing in the main loop. Since the success field of the API returns 1 even if no user review is returned, a dictionary with success = 2 is returned to signal other errors.

import requests

def get_user_reviews(review_appid, params):
    user_review_url = f'https://store.steampowered.com/appreviews/{review_appid}'
    req_user_review = requests.get(
        user_review_url,
        params=params
    )

    # non-200 status code: signal failure with our own success = 2 code
    if req_user_review.status_code != 200:
        print(f'Failed to get response. Status code: {req_user_review.status_code}')
        return {"success": 2}

    # the body may not be valid JSON (e.g. an HTML error page)
    try:
        user_reviews = req_user_review.json()
    except ValueError:
        return {"success": 2}

    return user_reviews

Calling the API with the function above is straightforward.

review_appid = 1245620
params = {
    'json': 1,
    'language': 'english',
    'cursor': '*',  # set the cursor to retrieve reviews from a specific "page"
    'num_per_page': 100,
    'filter': 'recent'
}

reviews_response = get_user_reviews(review_appid, params)

However, we want to repeat the call until no further reviews are available. By testing the API on games with only a few reviews, it is observed that on the last page the cursor field is either absent or null. Therefore, we can make use of this property to create a loop that retrieves all the reviews of a game.

review_appid = 1245620  # the appid of the game "Elden Ring"
params = {
    'json': 1,
    'language': 'english',
    'cursor': '*',  # set the cursor to retrieve reviews from a specific "page"
    'num_per_page': 100,
    'filter': 'recent'
}

selected_reviews = []

while True:
    reviews_response = get_user_reviews(review_appid, params)

    # not a success?
    if reviews_response["success"] != 1:
        print("Not a success")
        break

    if reviews_response["query_summary"]["num_reviews"] == 0:
        print("No reviews.")
        break

    # extract each review in the response of the API call
    for review in reviews_response["reviews"]:
        # for brevity, the field extraction is not included here;
        # the full code in the next section builds my_review_dict field by field
        my_review_dict = review

        selected_reviews.append(my_review_dict)

    # go to the next page
    try:
        cursor = reviews_response['cursor']  # absent, or null, on the last page
    except KeyError:
        cursor = ''

    if not cursor:
        print("Reached the end of all comments.")
        break

    params["cursor"] = cursor
    print("To next page. Next page cursor:", cursor)
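Because cursor paging walks a list that can shift while we scrape (new reviews may arrive between requests), the same review can occasionally appear on two consecutive pages. A small post-processing step keyed on recommendationid guards against that; deduplicate_reviews is our own helper, not part of the API:

```python
def deduplicate_reviews(reviews):
    """Keep only the first occurrence of each recommendationid."""
    seen = set()
    unique = []
    for review in reviews:
        rec_id = review['recommendationid']
        if rec_id not in seen:
            seen.add(rec_id)
            unique.append(review)
    return unique

sample = [
    {'recommendationid': '1', 'review': 'a'},
    {'recommendationid': '2', 'review': 'b'},
    {'recommendationid': '1', 'review': 'a'},  # duplicate from page overlap
]
print(len(deduplicate_reviews(sample)))  # 2
```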

Retrieving reviews within a time range

One may want to focus on only recent reviews for review analysis, such as reviews left in this year (2024). With that in mind, we can make use of the timestamp_created field in the returned JSON of each comment to filter comments. By setting the time range of interest with end_time and start_time (where end_time > start_time), we can store the reviews falling within the range. Then we save the list of reviews of interest as a pickle object for further exploratory data analysis. The full code is included below.
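Note that timestamp_created is a Unix timestamp (seconds since the epoch, GMT+0), while start_time and end_time are datetime objects, so the filtering compares numeric timestamps. A tiny illustration, with a made-up timestamp value:

```python
from datetime import datetime

start_time = datetime(2024, 1, 1, 0, 0, 0)
end_time = datetime(2024, 12, 31, 23, 59, 59)

# an illustrative timestamp_created from a review (2024-06-15 00:00:00 UTC)
timestamp_created = 1718409600

# datetime.timestamp() converts a datetime to a Unix timestamp,
# so the window check is a plain numeric comparison
in_window = start_time.timestamp() <= timestamp_created <= end_time.timestamp()
print(in_window)  # True
```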

from datetime import datetime, timedelta
import requests
import pickle
from pathlib import Path

def get_user_reviews(review_appid, params):
    user_review_url = f'https://store.steampowered.com/appreviews/{review_appid}'
    req_user_review = requests.get(
        user_review_url,
        params=params
    )

    if req_user_review.status_code != 200:
        print(f'Failed to get response. Status code: {req_user_review.status_code}')
        return {"success": 2}

    try:
        user_reviews = req_user_review.json()
    except ValueError:
        return {"success": 2}

    return user_reviews

review_appname = "ELDEN RING" # the game name
review_appid = 1245620 # the game appid on Steam

# the params of the API
params = {
    'json': 1,
    'language': 'english',
    'cursor': '*',  # set the cursor to retrieve reviews from a specific "page"
    'num_per_page': 100,
    'filter': 'recent'
}

# time_interval = timedelta(hours=24) # the time interval to get the reviews
# end_time = datetime.fromtimestamp(1716718910) # the timestamp in the return result are unix timestamp (GMT+0)
end_time = datetime.now()
# start_time = end_time - time_interval
start_time = datetime(2024, 1, 1, 0, 0, 0)

print(f"Start time: {start_time}")
print(f"End time: {end_time}")
print(start_time.timestamp(), end_time.timestamp())

passed_start_time = False
passed_end_time = False

selected_reviews = []

while not passed_start_time or not passed_end_time:

    reviews_response = get_user_reviews(review_appid, params)

    # not a success?
    if reviews_response["success"] != 1:
        print("Not a success")
        print(reviews_response)
        break

    if reviews_response["query_summary"]['num_reviews'] == 0:
        print("No reviews.")
        print(reviews_response)
        break

    for review in reviews_response["reviews"]:
        recommendation_id = review['recommendationid']

        timestamp_created = review['timestamp_created']
        timestamp_updated = review['timestamp_updated']

        # skip the comments that are beyond end_time
        if not passed_end_time:
            if timestamp_created > end_time.timestamp():
                continue
            else:
                passed_end_time = True

        # exit the loop once a comment before start_time is detected
        if not passed_start_time:
            if timestamp_created < start_time.timestamp():
                passed_start_time = True
                break

        # extract the useful (to me) data
        author_steamid = review['author']['steamid']  # will automatically redirect to the profile URL if any
        playtime_forever = review['author']['playtime_forever']
        playtime_last_two_weeks = review['author']['playtime_last_two_weeks']
        playtime_at_review_minutes = review['author']['playtime_at_review']
        last_played = review['author']['last_played']

        review_text = review['review']
        voted_up = review['voted_up']
        votes_up = review['votes_up']
        votes_funny = review['votes_funny']
        weighted_vote_score = review['weighted_vote_score']
        steam_purchase = review['steam_purchase']
        received_for_free = review['received_for_free']
        written_during_early_access = review['written_during_early_access']

        my_review_dict = {
            'recommendationid': recommendation_id,
            'author_steamid': author_steamid,
            'playtime_at_review_minutes': playtime_at_review_minutes,
            'playtime_forever_minutes': playtime_forever,
            'playtime_last_two_weeks_minutes': playtime_last_two_weeks,
            'last_played': last_played,

            'review_text': review_text,
            'timestamp_created': timestamp_created,
            'timestamp_updated': timestamp_updated,

            'voted_up': voted_up,
            'votes_up': votes_up,
            'votes_funny': votes_funny,
            'weighted_vote_score': weighted_vote_score,
            'steam_purchase': steam_purchase,
            'received_for_free': received_for_free,
            'written_during_early_access': written_during_early_access,
        }

        selected_reviews.append(my_review_dict)

    # go to the next page
    try:
        cursor = reviews_response['cursor']  # cursor field does not exist in the last page
    except KeyError:
        cursor = ''

    # no next page: exit the loop
    if not cursor:
        print("Reached the end of all comments.")
        break

    # set the cursor to move to the next page and continue
    params['cursor'] = cursor
    print('To next page. Next page cursor:', cursor)


# save the selected reviews to a file
foldername = f"{review_appid}_{review_appname}"
filename = f"{review_appid}_{review_appname}_reviews_{start_time.strftime('%Y%m%d-%H%M%S')}_{end_time.strftime('%Y%m%d-%H%M%S')}.pkl"
output_path = Path(foldername, filename)
output_path.parent.mkdir(parents=True, exist_ok=True)

with open(output_path, 'wb') as f:
    pickle.dump(selected_reviews, f)
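Once saved, the pickle can be loaded back for the analyses mentioned in the introduction. A self-contained round trip, with sample data standing in for the scraped reviews:

```python
import pickle
from pathlib import Path

# sample data standing in for the scraped reviews
sample_reviews = [
    {'recommendationid': '1', 'voted_up': True},
    {'recommendationid': '2', 'voted_up': False},
]

path = Path('sample_reviews.pkl')
with open(path, 'wb') as f:
    pickle.dump(sample_reviews, f)

# load the reviews back for analysis
with open(path, 'rb') as f:
    loaded_reviews = pickle.load(f)

# e.g. the ratio of positive reviews, as mentioned in the introduction
positive_ratio = sum(r['voted_up'] for r in loaded_reviews) / len(loaded_reviews)
print(f"Positive ratio: {positive_ratio:.2%}")  # Positive ratio: 50.00%
```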

Detailed usage can be found on my GitHub.

Future works

It is noticed that a cursor string remains valid even after newer reviews are created (i.e., an existing cursor string can be reused to retrieve the same specific set of reviews). Consequently, it is possible to save the cursor as well and use it to identify newer reviews, thereby preventing the scraping of duplicate reviews.
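A minimal sketch of that idea, assuming we simply persist the last cursor per appid to a small JSON file. The file name and helper names are our own, not part of the API:

```python
import json
from pathlib import Path

CURSOR_FILE = Path('last_cursor.json')  # our own bookkeeping file

def save_cursor(appid: int, cursor: str) -> None:
    """Remember the last cursor seen for a game."""
    CURSOR_FILE.write_text(json.dumps({'appid': appid, 'cursor': cursor}))

def load_cursor(appid: int) -> str:
    """Return the saved cursor for a game, or '*' to start from the first page."""
    if CURSOR_FILE.exists():
        saved = json.loads(CURSOR_FILE.read_text())
        if saved.get('appid') == appid:
            return saved['cursor']
    return '*'

save_cursor(1245620, 'AoJwXYZ...')   # illustrative cursor value
print(load_cursor(1245620))          # AoJwXYZ...
print(load_cursor(9999))             # * (no saved cursor for this appid)
```

On the next run, params['cursor'] = load_cursor(review_appid) would resume paging from where the previous run stopped instead of re-scraping from the start.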

Ending/Conclusion

In this short article, we have explored how to efficiently and reliably use a single Steam API to retrieve reviews for a specific game. We have also implemented two additional features, namely proper error handling and date-range filtering, in our program. These features enable us to create a custom dataset for game review analysis.

This is Part 2 of the series of “Scraping Steam: A Comprehensive Guide (2024 ver)”.

Reference

The official documentation of the API: User Reviews — Get List (Steamworks Documentation)

An unofficial documentation of the API: Get App Reviews

Official available options for field “language” of the API (look for the column “API language code” instead of the “Web API language code”): Languages Supported on Steam (Steamworks Documentation)
