Data Analysis (DA) with Python
Concepts • Tools • Coding
Python3 and Jupyter Notebook: defined in the above image at high-level, if you’re beginner/new to the field and to ensure all in same page prior to its deep dive. DA is performed programmatically or via GUI tools using the historical/existing data. The process of getting to know your data is also known as EDA (Exploratory Data Analysis).
Scenarios of DA - for business, investment & research purposes and for building prediction models (Machine Learning/Deep Learning).
- Descriptive Analysis: what has happened based on historical data.
- Diagnostic Analysis: to know why something has happened.
- Predictive Analysis: analyze to know what would happen next.
- Prescriptive Analysis: helps for what should/could be done next.
Essential steps: Data Wrangling or Data cleaning - process of converting data’s raw format into another format, to make it ready for analysis and clean it by identifying & handling missing values, data formatting, data normalization, data binning and converting categorical values into numerical variables especially to build prediction models. These steps depends on datasets and scenarios (business / research / predictive modelling etc).
This article covers:
- Environment & Prerequisites
- Collecting & Importing dataset
- View the rows/data
- Understand data using methods:
- Deleting unwanted column/s
- Renaming column/s
- Formatting/Converting data types
- Subsetting DataFrame
- Saving the DataFrame
- Data Visualization
- Answer questions
Based on scenarios, DA will be performed programmatically with any of the GUI tools.
SQL and R are alternative to Python, scripted via Jupyter Lab or IDEs (PyCharm, VS Code, Atom etc) and examples of tools: Tableau, Excel, Alteryx, Microsoft Power BI & Google Data Studio for Data Visualization (one of essential part of DA).
Another flavour of Jupyter Notebook is Jupyter Lab. FYI, Lab provides navigation menu and allows to open multiple .ipynb notebooks in the same window!
To set-up the environment for DA - install Python3 and Jupyter Notebook or Lab on your PC or Mac OS. Run these commands via terminal/CMD for Notebook or Lab:
# on PC/Windows OS:
> pip install jupyterlab
> pip install notebook# on Mac OS:
$ pip3 install jupyterlab
$ pip3 install notebook
To start the server: if you install jupyterlab, run jupyter lab command!
# Same command for Mac or PC:
$ jupyter lab
$ jupyter notebook
Its homepage will open, in the default web-browser of your laptop/machine or copy the displayed (in terminal) localhost URL in the browser of your choice!
Alternatively, install them via Anaconda or Miniconda. However Anaconda installs Python, Notebook, Lab and few IDEs. Miniconda is small version, both are open-source!
Let’s begin the DA coding - dataset’s link can be found in the below screenshot or Resources section.
Objective: To answer the questions about Android mob apps.
- Prerequisites: Python modules - Pandas, Numpy, Matplotlib & Seaborn.
# To install via Jupyter Notebook:
!pip install pandas
!pip install numpy
!pip install matplotlib
!pip install seaborn# Note: Don't include '!' if you install them via terminal/CMD, rest are same.
Once executed in terminal or Notebook’s code cell, they get installed on your machine!
- Now, import dataset into workspace after downloading/collecting the required dataset.
Jupyter supports wide-range of file formats. Provide the full location path of your data file in quotes. I have saved it in the root folder of the project thus just file name will suffice.
# To read & store dataset in Pandas object(variable):
playstore_df = pd.read_csv('googleplaystore.csv')
Pandas will fill the missing value as NaN (Not a Number). It can be removed if that row/column doesn’t impact on output!
Object is named as playstore_df here. You can name it accordingly! Now, find its number of rows & columns. Let’s begin:
# To get number of rows & columns:
playstore_df.shape()# Outputs it in tuple:
(10841, 13)# Number of rows:10841 & columns:13
- To view the rows:
Identify which variables/columns we need, to meet our objective of this analysis.
- Understand data for its types and descriptive statistics:
Then check, if Pandas has translated all the variables into correct data types else they need to be transformed to its respective data type for the good results!
Pandas will translate non-numeric columns as object by default, as per its version 1.1.3: execute
pd.__version__ to know which version you’ve got!
Observations after executing
- Price column to float and Last Updated to datetime type can be transformed.
- Rating, Type, Content Rating, Current Ver & Android Ver columns have got null values.
df.dtypesto get just data types of columns. dtypes is an attribute not a method!
df.describe() returns descriptive statistical values for numerical columns by default. Here yet, we have only one such column i.e., Rating.
df.describe(include=[object]) for categorical columns:
- Genres and Category have equivalent info so Genres can be removed and will save it to playstore_updated_df:
- To rename below columns for better readability! Fetch, few rows to confirm the changes:
- To format the data type of Number of Installs to numeric, remove + and , characters:
Now, convert Number of Installs by using
to_numeric(), however by default it’s converted to float.
Next, convert it to int type. Everytime no need to save output, to a variable!
Q: Which are the top 3 or 5 categories, based on number of apps:
Q: Which is the most used version:
Sometimes it’s challenging to know the accurate result. For instance, above output’s Varies with device is not a specific version!
Q: To know how many apps are paid and free:
Q: How many number of installs for each app:
pivot_table()comes in handy. This may be familiar if you’ve worked with data in MS Excel.
Type has got 1 null value, It can be removed with
pivot() doesn’t reshape the DataFrame if the specified column in the index param has repeated value.
- Subsetting DataFrame: for further analysis and saved it to playstore_df_sub:
- Save the DF or for later use, Jupyter let’s to save .ipynb file in various formats. Can be saved to CSV or JSON formats based on requirements. This is to save it to .csv file:
File playstore_df_subset.csv gets saved to root directory of the project.
- Let’s dive into, Data Visualization (graphical representation of data):
%matplotlib inline to display the graphs in same .ipynb file! rcParams attributes and
set_style() to change the default values.
Following are Histogram, Bar graph and Count plot. Choose the graphs, based on the data and requirements, these are few basic ones to get started:
Another aspect is Wordcloud, especially for this type of data:
wordcloud module as below:
$ pip3 install wordcloud # Mac OS> pip install wordcloud # PC
Alternative code to generate the above list:
names = list(playstore_updated_df["Category"].value_counts().index)
- Finally, answer the other questions from this analysis:
Q1: Which is the most used Android version? = 4.1and+
Q2: How many mob apps categories are there and which are they? = count is 34 and names are below.
Q3: Which are the top 3 categories depending on number of apps = FAMILY, GAME and TOOLS (check WordCloud image or code)!
Q4: Which apps have got 5-stars rating? = Highlighted below.
Tip: You can utilize the Google Colab or Kaggle. They offer free service for GPU powered processor for notebooks. Kaggle also provides a large repository of community published datasets. Below are the links, for further reading. They come in handy for learning & sharing purpose especially for study groups!
- Kaggle | Kaggle open source datasets
- Google Playstore Dataset
- Jupyter Notebook or Lab
- Google Colab
- Anaconda | Miniconda
Summary: sometimes brief on each column, data type and their units of measurement will be available!
If you’re looking to get started on DA using MySQL, in the articles:
- EDA with SQL, I have described its concepts & coding
- To Interact with MySQL Database to connect to MySQL server and database.
Hope you enjoyed the article as much as I enjoyed articulating it.. feel free to clap!!