Simple Data Analysis on Google Play Store App

Muhammad Kemal Hernandi
Published in Analytics Vidhya
6 min read · Sep 5, 2021

Hi! Well, this is my first Medium post. In this post, I would like to share my experience with data analysis, which is useful for extracting insights from a raw dataset. This analysis is one of my learning projects, since I want to enhance my analytical skill set. For this project, I used several tools and programming languages: Python, SQL, Tableau, Orange, and infogram.com (for the word clouds).


Data Source Details

We can find the data source and details here. The dataset consists of 2 tables. The first table, ‘googleplaystore.csv’, has 13 columns, including app, category, rating, reviews, size, installs, type, price, content rating, and genres. The second table, ‘googleplaystore_user_reviews.csv’, has 5 columns (app, translated review, sentiment, sentiment polarity, sentiment subjectivity).

Data source description

In this article, we will be focusing on insights based on application categories.

Preprocessing Data

After getting the dataset, the next step is preprocessing the data. Preprocessing aims to clean the raw data into tidy data that is ready for use. It can involve many things: changing the data type of a column, filling or otherwise handling empty values, deleting duplicate rows, and so on.

In the ‘googleplaystore1.csv’ data, several application names were found to appear in more than one row. Ideally, the decision on which row to keep is made together with the person who collected the dataset. In cases like ‘8 Ball Pool’ below, we will reduce each application to a single row by keeping the row with the highest number of reviews.

8 Ball Pool duplicate rows
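The article does not show the code used for this step, so the following is only a minimal sketch of the "keep the row with the most reviews" rule in plain Python. The sample rows, their review counts, and the function name `keep_most_reviewed` are all made up for illustration.

```python
# Hypothetical sample rows; the review counts are invented, not taken from the dataset.
rows = [
    {"app": "8 Ball Pool", "category": "GAME", "reviews": 100},
    {"app": "8 Ball Pool", "category": "GAME", "reviews": 300},
    {"app": "8 Ball Pool", "category": "GAME", "reviews": 200},
    {"app": "Example Notes", "category": "TOOLS", "reviews": 50},
]


def keep_most_reviewed(rows):
    """Return one row per app name, keeping the row with the most reviews."""
    best = {}
    for row in rows:
        current = best.get(row["app"])
        if current is None or row["reviews"] > current["reviews"]:
            best[row["app"]] = row
    return list(best.values())


deduplicated = keep_most_reviewed(rows)
for row in deduplicated:
    print(row["app"], row["reviews"])
```

With real data, the review counts would first need to be converted from text to integers before they can be compared.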

We can also check in advance which application names appear in more than one row and count the total number of duplicates. From the data below, a total of 1,182 duplicate rows was found.

Duplicate count for each app

After removing the 1,182 duplicate rows, we can check whether the deletion was successful.

Checking duplicate rows
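Counting duplicates and then confirming that none remain could look roughly like the sketch below. It assumes that "duplicate rows" means every row beyond the first one for a given app name, which the article does not state explicitly. The app names are hypothetical.

```python
from collections import Counter

# Hypothetical app-name column before deduplication.
app_names = [
    "8 Ball Pool", "8 Ball Pool", "8 Ball Pool",
    "Example Notes",
    "Example Chat", "Example Chat",
]

counts = Counter(app_names)

# Apps that appear more than once, with the number of rows for each.
duplicates_per_app = {app: n for app, n in counts.items() if n > 1}

# Assumption: a "duplicate row" is any row beyond the first for the same app.
extra_rows = sum(n - 1 for n in duplicates_per_app.values())


def has_duplicates(names):
    """Return True if any name appears more than once."""
    return len(set(names)) != len(names)


# After deduplication only the unique names remain, so the check should pass.
unique_names = list(counts)
print(duplicates_per_app, extra_rows, has_duplicates(unique_names))
```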

After confirming that no duplicate rows remain, we have to check again whether the data is ready to be visualized. In this analysis, we will only use the app, category, rating, reviews, installs, and type columns. The check reveals one row whose values do not match their columns, so we delete that row.

Weird values
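One way to catch a row like this is to test each field against the shape its column should have. The sketch below is an illustration only: it assumes category names are made of letters and underscores and that type is either "Free" or "Paid", and the sample rows are hypothetical.

```python
# Assumption: these are the only legitimate values of the type column.
VALID_TYPES = {"Free", "Paid"}


def is_valid(row):
    """Check that category and type look like values of their own columns."""
    category_ok = row["category"].replace("_", "").isalpha()
    type_ok = row["type"] in VALID_TYPES
    return category_ok and type_ok


# Hypothetical rows; in the last one the values are shifted into the wrong columns.
rows = [
    {"app": "Example Game", "category": "GAME", "type": "Free"},
    {"app": "Example Editor", "category": "ART_AND_DESIGN", "type": "Paid"},
    {"app": "Example Frame", "category": "1.9", "type": "0"},
]

clean_rows = [row for row in rows if is_valid(row)]
rejected = [row["app"] for row in rows if not is_valid(row)]
print(rejected)
```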

The same preprocessing steps also apply to the ‘googleplaystore_user_reviews.csv’ file. After preprocessing, the ‘googleplaystore1.csv’ file contains 9,659 rows with 6 columns (app, category, rating, reviews, installs, type). Meanwhile, the ‘googleplaystore_user_reviews.csv’ file contains 30,626 rows with 5 columns (app, translated review, sentiment, sentiment polarity, sentiment subjectivity).

Findings

Family is the category with the highest number of applications. It has about 98% more applications than the second-highest category, games. The three categories with the most applications are family, games, and tools.

Number of apps by category
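A "percent more" figure like the one above is the difference between the two counts divided by the smaller count. As a small sketch with made-up category counts (not the real ones):

```python
from collections import Counter

# Hypothetical category column: 6 family apps, 3 game apps, 2 tools apps.
categories = ["FAMILY"] * 6 + ["GAME"] * 3 + ["TOOLS"] * 2

counts = Counter(categories)
(top, top_n), (second, second_n) = counts.most_common(2)

# How many percent more apps the top category has than the runner-up.
percent_more = (top_n - second_n) / second_n * 100
print(f"{top} has {percent_more:.0f}% more apps than {second}")
```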

From the installs column, we can take the value 1,000,000,000+ to see which applications have been installed the most. All 20 applications in the 1,000,000,000+ group are free, and the communication category contributes the largest number of applications to the group. Apps produced by Google also dominate these 20.

Apps with 1,000,000,000+ installs
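Because the installs column stores text such as 1,000,000,000+, it has to be turned into a number before apps can be ranked or filtered by it. A minimal sketch, with a hypothetical helper `parse_installs` and made-up apps:

```python
def parse_installs(text):
    """Convert a string such as '1,000,000,000+' to the integer 1000000000."""
    return int(text.rstrip("+").replace(",", ""))


# Hypothetical apps for illustration only.
apps = [
    {"app": "Example Chat", "category": "COMMUNICATION", "type": "Free",
     "installs": "1,000,000,000+"},
    {"app": "Example Game", "category": "GAME", "type": "Free",
     "installs": "500,000,000+"},
    {"app": "Example Notes", "category": "TOOLS", "type": "Paid",
     "installs": "10,000+"},
]

top_installed = [a for a in apps if parse_installs(a["installs"]) >= 1_000_000_000]
all_free = all(a["type"] == "Free" for a in top_installed)
print([a["app"] for a in top_installed], all_free)
```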

Looking at the number of reviews, game is the category with the most reviews, at 588,992,954. One likely reason is simply the large number of game applications: in the ranking of total applications by category, games come second after family. We can check this by dividing the number of reviews by the number of apps in each category.

The social and communication categories both exceed the game category in reviews per app. This suggests that users tend to review social and communication applications more often than applications in other categories.

We can also see that although the family category has the largest number of applications, its reviews per app figure is only around 103,118, which is very small compared to the others.

Number of reviews
Number of reviews per app
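The reviews-per-app comparison is a total divided by a count for each category. The sketch below uses invented numbers chosen so that one category has the most reviews in total while another has the most reviews per app, which mirrors the pattern described above.

```python
from collections import defaultdict

# Hypothetical rows: (category, reviews for one app).
apps = [
    ("GAME", 300), ("GAME", 200), ("GAME", 100),
    ("SOCIAL", 500),
    ("FAMILY", 10), ("FAMILY", 10), ("FAMILY", 10), ("FAMILY", 10),
]

totals = defaultdict(lambda: {"apps": 0, "reviews": 0})
for category, reviews in apps:
    totals[category]["apps"] += 1
    totals[category]["reviews"] += reviews

total_reviews = {c: t["reviews"] for c, t in totals.items()}
reviews_per_app = {c: t["reviews"] / t["apps"] for c, t in totals.items()}

most_reviewed = max(total_reviews, key=total_reviews.get)
most_reviewed_per_app = max(reviews_per_app, key=reviews_per_app.get)
print(most_reviewed, most_reviewed_per_app)
```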

From the ‘googleplaystore_user_reviews.csv’ data, we can take the sentiment of each application's reviews. Because we focus on categories, we can join the review data with the ‘googleplaystore.csv’ file on the app name to get the category of each application.

app, review, and sentiment column example
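Conceptually, this is a join on the app name. A plain-Python sketch is shown here, assuming an inner join (reviews for apps without a known category are dropped); the article does not say which join type was used, and the rows are hypothetical.

```python
# Hypothetical lookup built from the apps table: app name -> category.
app_category = {
    "Example Game": "GAME",
    "Example Chat": "COMMUNICATION",
}

# Hypothetical rows from the reviews table.
reviews = [
    {"app": "Example Game", "sentiment": "Negative"},
    {"app": "Example Chat", "sentiment": "Positive"},
    {"app": "Unknown App", "sentiment": "Positive"},
]

# Inner join: keep only reviews whose app has a category.
joined = [
    {**review, "category": app_category[review["app"]]}
    for review in reviews
    if review["app"] in app_category
]
print(joined)
```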

We can compare the positive and negative sentiment of each category by summing the positive and negative review counts of its applications. The game category has the highest percentage of negative sentiment, at 38.20%.

Positive vs Negative sentiment on Categories
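As a rough sketch, the negative percentage per category can be computed as below. It assumes the percentage is taken over positive and negative reviews only, ignoring any other label; the article does not spell out the denominator, and the labelled reviews are made up.

```python
from collections import Counter

# Hypothetical (category, sentiment) pairs after joining reviews with categories.
labelled = [
    ("GAME", "Positive"), ("GAME", "Negative"),
    ("GAME", "Negative"), ("GAME", "Positive"),
    ("TOOLS", "Positive"), ("TOOLS", "Positive"),
    ("TOOLS", "Positive"), ("TOOLS", "Negative"),
    ("TOOLS", "Neutral"),
]

counts = Counter(labelled)


def negative_share(category):
    """Percentage of negative reviews among positive and negative ones."""
    positive = counts[(category, "Positive")]
    negative = counts[(category, "Negative")]
    return negative / (positive + negative) * 100


shares = {category: negative_share(category) for category in ("GAME", "TOOLS")}
print(shares)
```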

We can use a word cloud to find out which words appear most often in reviews for each sentiment. We will create word clouds for the game category because it has the largest percentage of negative sentiment.

In the word clouds, some keywords appear in both negative and positive reviews, such as ‘spend money’, ‘real money’, and ‘pay win’. Although both sentiments contain the same keywords, the context may be different.

The word cloud shows that negative reviews contain complaints about time (loading time, waste of time, takes a long time), advertisements (annoying ads, ads that run too long, too many ads), money (not worth paying for, pay-to-win game), and the game itself (really hard, stuck on a level, etc.).

Meanwhile, positive reviews contain many compliments (good, time killer, great, fun, addictive, etc.) and praise for the game itself (good graphics, challenging).

Reviews wordcloud for positive and negative sentiment (game)
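The word clouds in this article were made with infogram.com, and the exact text preparation is not described. Keywords like ‘pay win’ suggest that common words were dropped before word pairs were counted, so the sketch below illustrates that idea only. The stop-word list and the sample reviews are made up.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real one would be much longer.
STOPWORDS = {"to", "the", "a", "i", "is", "it", "you", "this", "and", "have"}


def bigrams(review):
    """Lowercase a review, drop stop words, and return adjacent word pairs."""
    words = [w for w in re.findall(r"[a-z']+", review.lower()) if w not in STOPWORDS]
    return list(zip(words, words[1:]))


# Hypothetical reviews.
reviews = [
    "You have to spend money to win",
    "Pay to win, I will not spend money",
    "Too many ads",
]

pair_counts = Counter(
    " ".join(pair) for review in reviews for pair in bigrams(review)
)
print(pair_counts.most_common(3))
```

Running the same count separately on positive and negative reviews would give the input for one word cloud per sentiment.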

Conclusion:

  1. Family is the category with the most applications, followed by game and tools.
  2. There are 20 applications that have been installed more than 1 billion times (the highest value), and 6 of them are in the communication category.
  3. The game category has the most reviews in total, but the social category has the most reviews per app. This suggests that users are more likely to review social applications than applications in other categories.
  4. Although the family category has the most applications, its number of reviews per app is very small.
  5. The game category has the highest percentage of negative sentiment.
  6. Negative sentiment in the game category consists of complaints about time, advertisements, money, and the game itself.
  7. Positive sentiment in the game category consists of many compliments and praise for the game itself.

I will be very happy to discuss and accept any suggestions about this analysis, since I’m still learning and have a long way to go. Please reach me through https://www.linkedin.com/in/muhammad-kemal-hernandi/. Thank you so much!


BSc in Telecommunication Engineering | Passionate about Data Analytics and Business Intelligence | LinkedIn: https://www.linkedin.com/in/muhammad-kemal-hernandi/