A Rough Analysis of Steam Sale Data from CheapShark

Abigail Chen
INST414: Data Science Techniques
4 min read · Feb 11, 2022

The non-obvious insight I hoped to draw from the data I extracted is whether the price drops in game promotions correlate with game ratings. In particular, do games that received high scores from authoritative media keep their original prices, or take only a slight discount? Since my INST 414 group has agreed on a research direction for this semester’s group project, I hope that examining this question will help our team with the rest of the assignments. Moreover, even though I am a gamer, I generally follow the mainstream: I buy and play whatever game is popular. I believe this insight can also give me better guidance when buying games.

The API data I grabbed came from a site called CheapShark, which provides information on game promotions. According to the site’s description, it provides sales data for PC games from major stores such as Steam, GreenManGaming, and Humble Bundle. Because of Steam’s overwhelming dominance, I only grabbed Steam’s data for analysis. The main values I refer to are the Steam user rating level, the Metacritic score, the original price of the game, and the sale price. By visualizing these variables, I hoped to find the answers I wanted from the graphs.

Since I am new to retrieving data from APIs, my knowledge in this area mainly comes from the instructor’s lectures in this module and the code examples provided. For this reason, I only know how to retrieve this data by using a simple HTTP request library.
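A minimal sketch of such a request might look like the following. It assumes the request library is Python's `requests`, that the deals endpoint lives at the URL shown, and that Steam corresponds to store ID `1`; none of these details appear in the article, and the helper names `build_deals_url` and `fetch_deals` are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint and Steam store ID; neither is stated in the article.
DEALS_ENDPOINT = "https://www.cheapshark.com/api/1.0/deals"
STEAM_STORE_ID = "1"


def build_deals_url(store_id=STEAM_STORE_ID, page_size=60):
    """Build the query URL for one page of deals from a single store."""
    query = urlencode({"storeID": store_id, "pageSize": page_size})
    return f"{DEALS_ENDPOINT}?{query}"


def fetch_deals(store_id=STEAM_STORE_ID, page_size=60):
    """Download one page of deals and return the decoded JSON."""
    # Third-party library, imported here so build_deals_url works without it.
    import requests

    response = requests.get(build_deals_url(store_id, page_size), timeout=10)
    response.raise_for_status()  # raise on HTTP errors instead of parsing an error page
    return response.json()
```

Calling `fetch_deals()` would then return the decoded response, which under these assumptions is a list with one dictionary per deal.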

Although I had limited data to work with, the amount of data returned by the API exceeded my expectations. Therefore, I first sorted through the large number of values and looked at each variable in turn to see whether it matched my requirements. Since most variables are numerical, except the Steam rating text, which is categorical, I decided to build a data frame to organize the data and to analyze it with common statistical graphics such as a bar chart, a scatter plot, and a line plot.
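One way to prepare the records for a data frame is sketched below. It assumes the API returns numeric values as strings and uses the field names shown, which are assumptions about the response rather than details from the article; the two sample records are made up for illustration.

```python
# Assumed numeric field names in each deal record.
NUMERIC_FIELDS = (
    "normalPrice",
    "salePrice",
    "metacriticScore",
    "steamRatingPercent",
    "dealRating",
)


def to_number(value):
    """Return value as a float, or None when it is missing or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def clean_deal(raw):
    """Keep the columns of interest and convert the numeric ones to floats."""
    row = {
        "title": raw.get("title"),
        "steamRatingText": raw.get("steamRatingText"),  # categorical column
    }
    for field in NUMERIC_FIELDS:
        row[field] = to_number(raw.get(field))
    return row


# Made-up records in the assumed response shape.
sample = [
    {"title": "Game A", "steamRatingText": "Very Positive", "normalPrice": "19.99",
     "salePrice": "4.99", "metacriticScore": "85", "steamRatingPercent": "92",
     "dealRating": "9.1"},
    {"title": "Game B", "steamRatingText": None, "normalPrice": "9.99",
     "salePrice": "9.99", "metacriticScore": "0", "steamRatingPercent": "0",
     "dealRating": None},
]
rows = [clean_deal(deal) for deal in sample]
```

Passing `rows` to `pandas.DataFrame(rows)` would then give a frame with one numeric column per field and one text column for the rating label.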

The most common problem I encountered was finding the right data in the first place. Although the provided GitHub link lists a large number of API categories, the APIs I tried either required keys or did not support JSON. Since my current knowledge in this area is limited, I tried to grab data from APIs on various sites and repeatedly found that JSON was not supported. Even when some data could be extracted successfully, it could not be cleaned up and formatted into a data frame. I also tried various seaborn plot types while visualizing the data. The most common situation I ran into was that the variables I wanted to compare were not suited to a particular seaborn plot type, or that the values were densely clustered along the x-axis or y-axis. For example, “Failed to convert value(s) to axis” was the error message I saw most often. In addition, too much data overcrowds the generated graphs, so too much information is presented visually at once.

Firstly, I generated a bar chart to check whether the Steam rating text correlates with the original price of the game. Because of how Steam works, users tend to treat the rating text as a primary reference before purchasing a game. Games tagged Very Positive or Overwhelmingly Positive are more likely to get attention. As we can see from this bar chart, there is no clear correlation between Steam ratings and the original price of the game. However, some games with relatively negative ratings are among the more expensive in the overall data.
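The comparison behind such a bar chart can be approximated by averaging the original price within each rating label. The sketch below assumes already-cleaned rows with the hypothetical keys `steamRatingText` and `normalPrice`; the sample rows are made up and are not the article's data.

```python
from collections import defaultdict


def mean_price_by_rating(rows):
    """Average the original price for each Steam rating label."""
    totals = defaultdict(lambda: [0.0, 0])  # label -> [price sum, row count]
    for row in rows:
        label = row.get("steamRatingText")
        price = row.get("normalPrice")
        if label is None or price is None:
            continue  # skip rows without a rating label or a price
        totals[label][0] += float(price)
        totals[label][1] += 1
    return {label: total / count for label, (total, count) in totals.items()}


# Made-up rows for illustration only.
sample_rows = [
    {"steamRatingText": "Very Positive", "normalPrice": 20.0},
    {"steamRatingText": "Very Positive", "normalPrice": 40.0},
    {"steamRatingText": "Mixed", "normalPrice": 60.0},
    {"steamRatingText": None, "normalPrice": 10.0},
]
means = mean_price_by_rating(sample_rows)
```

Each key of `means` corresponds to one bar, and its value to that bar's height.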

Secondly, I used a scatter plot to look at the similarity between Steam’s user ratings and those of the authoritative site Metacritic. In general, Steam users and professional reviewers have relatively consistent tastes at the two ends of the scale: games rated highly by one group tend to be rated highly by the other, and the same holds for low ratings. Meanwhile, there seems to be a difference of opinion between the professional media and ordinary users about the games that are rated Mixed on Steam.
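The agreement visible in a scatter plot can also be summarized with a Pearson correlation coefficient, which is not something the article computes. The sketch below assumes cleaned rows with the hypothetical keys `steamRatingPercent` and `metacriticScore`, and assumes that a score of 0 means "no score available"; the sample rows are made up.

```python
import math


def rated_pairs(rows):
    """Keep (Steam percent, Metacritic score) pairs where both scores exist."""
    return [
        (row["steamRatingPercent"], row["metacriticScore"])
        for row in rows
        # Assumption: 0 marks a missing score, so such rows are dropped.
        if row["steamRatingPercent"] > 0 and row["metacriticScore"] > 0
    ]


def pearson(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant values")
    return cov / math.sqrt(var_x * var_y)


# Made-up rows for illustration only.
sample_rows = [
    {"steamRatingPercent": 90.0, "metacriticScore": 85.0},
    {"steamRatingPercent": 70.0, "metacriticScore": 65.0},
    {"steamRatingPercent": 50.0, "metacriticScore": 45.0},
    {"steamRatingPercent": 0.0, "metacriticScore": 80.0},
]
pairs = rated_pairs(sample_rows)
r = pearson(pairs)
```

A value of `r` near 1 would indicate that the two rating sources move together, while a value near 0 would indicate little linear relationship.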

Finally, this line plot shows that there is no clear relationship between the Steam user rating percentage and the game’s deal rating. Neither a 0 percent user rating nor a rating above 90 percent gives the game’s deal rating a clear improvement.

To summarize, since this is the first time I have analyzed data retrieved from an API, there are significant shortcomings in data selection, cleaning, and setup. The data visualization and analysis are also limited because the data cleaning is not standardized enough. I believe that by gradually learning and practicing throughout the semester, I can build this skill into something that benefits my future career.
