Published in Geek Culture

Data Analysis (DA) with Python

Concepts • Tools • Coding

DA and its Tools

Python 3 and Jupyter Notebook are outlined at a high level in the image above, for readers who are new to the field, so that we're all on the same page before the deep dive. DA is performed programmatically or via GUI tools on historical/existing data. The process of getting to know your data is also known as EDA (Exploratory Data Analysis).

Scenarios of DA: for business, investment, and research purposes, and for building prediction models (Machine Learning/Deep Learning).

DA categories:

  1. Descriptive Analysis: describes what has happened, based on historical data.
  2. Diagnostic Analysis: explains why something happened.
  3. Predictive Analysis: forecasts what is likely to happen next.
  4. Prescriptive Analysis: suggests what should/could be done next.

Essential steps: Data Wrangling (or Data Cleaning) is the process of converting data from its raw format into a format ready for analysis. It involves identifying and handling missing values, data formatting, data normalization, data binning, and converting categorical values into numerical variables (especially when building prediction models). The exact steps depend on the dataset and the scenario (business, research, predictive modelling, etc.).
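The wrangling steps above can be sketched with a toy DataFrame; the column names and values here are illustrative, not from the article's dataset:

```python
import pandas as pd

# A toy frame with the kinds of issues wrangling addresses
df = pd.DataFrame({"price": ["$10", "$20", None],
                   "type": ["Free", "Paid", "Free"]})

# Data formatting: strip the currency symbol and parse the strings as numbers
df["price"] = pd.to_numeric(df["price"].str.replace("$", "", regex=False))

# Handling missing values: fill the gap with the column mean
df["price"] = df["price"].fillna(df["price"].mean())

# Converting a categorical value into a numerical variable
df["is_paid"] = (df["type"] == "Paid").astype(int)
```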

This article covers:

  • Environment & Prerequisites
  • Collecting & Importing dataset
  • View the rows/data
  • Understand data using: df.shape, df.columns, df.info() or df.dtypes, df.describe(), df.head(), df.tail(), df.sample()
  • Deleting unwanted column/s
  • Renaming column/s
  • Formatting/Converting data types
  • Subsetting DataFrame
  • Saving the DataFrame
  • Data Visualization
  • Answer questions

Depending on the scenario, DA is performed either programmatically or with one of the GUI tools.

Other languages and GUI tools for DA - author’s image

SQL and R are alternatives to Python, scripted via Jupyter Lab or IDEs (PyCharm, VS Code, Atom, etc.). Examples of GUI tools include Tableau, Excel, Alteryx, Microsoft Power BI, and Google Data Studio for Data Visualization (an essential part of DA).

Another flavour of Jupyter Notebook is Jupyter Lab. FYI, Lab provides a navigation menu and allows opening multiple .ipynb notebooks in the same window!

To set up the environment for DA, install Python 3 and Jupyter Notebook or Lab on your Windows PC or Mac. Run these commands via terminal/CMD for Notebook or Lab:

# on PC/Windows OS:
> pip install jupyterlab
# OR
> pip install notebook
# on Mac OS:
$ pip3 install jupyterlab
# OR
$ pip3 install notebook

To start the server: if you installed jupyterlab, run the jupyter lab command!

# Same command for Mac or PC:
$ jupyter lab
# OR
$ jupyter notebook

Its homepage will open in the default web browser of your machine, or copy the localhost URL displayed in the terminal into the browser of your choice!

Alternatively, install them via Anaconda or Miniconda. Anaconda installs Python, Notebook, Lab, and a few IDEs; Miniconda is a smaller version. Both are open-source!

Let's begin the DA coding: the dataset's link can be found in the screenshot below or in the Resources section.

Objective: To answer questions about Android mobile apps.

  • Prerequisites: Python modules - Pandas, NumPy, Matplotlib & Seaborn.
# To install via Jupyter Notebook:
!pip install pandas
!pip install numpy
!pip install matplotlib
!pip install seaborn
# Note: Don't include '!' if you install them via terminal/CMD; the rest is the same.

Once executed in a terminal or a Notebook code cell, they get installed on your machine!

Import the Pandas module, aliased as pd.
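The import, with the conventional alias:

```python
import pandas as pd  # the alias used throughout the rest of this article
```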
  • Now, import the dataset into the workspace after downloading/collecting it.

Jupyter supports a wide range of file formats. Provide the full path of your data file in quotes. I have saved it in the root folder of the project, so just the file name will suffice.

# To read & store dataset in Pandas object(variable):
playstore_df = pd.read_csv('googleplaystore.csv')

Pandas fills missing values with NaN (Not a Number). A row/column can be removed if it doesn't impact the output!

The object is named playstore_df here; you can name it as you like! Now, find its number of rows & columns. Let's begin:

# To get number of rows & columns (shape is an attribute, not a method):
playstore_df.shape
# Outputs a tuple:
(10841, 13)
# Number of rows: 10841 & columns: 13
  • To view the rows:
View the first 5 rows of the dataset: df.head()
View the last 5 rows: df.tail()
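A quick sketch of these row-viewing methods, using a small stand-in frame (the real dataset has 10841 rows and 13 columns):

```python
import pandas as pd

# Small stand-in DataFrame for illustration
df = pd.DataFrame({"App": [f"app{i}" for i in range(10)],
                   "Rating": [4.1, 3.9, 4.5, 4.0, 4.7, 3.8, 4.2, 4.4, 4.6, 4.3]})

first = df.head()       # first 5 rows by default; pass n for a different count
last = df.tail()        # last 5 rows
random3 = df.sample(3)  # 3 randomly picked rows
```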

Identify which variables/columns we need to meet the objective of this analysis.

  • Understand the data's types and descriptive statistics:
df.info() displays column names, the count of non-null values for each column, and their data types.

Then check whether Pandas has translated all the variables into the correct data types; otherwise, they need to be transformed into their respective data types for good results!

Pandas translates non-numeric columns to object by default, as of version 1.1.3 (execute pd.__version__ to check which version you've got)!

Observations after executing playstore_df.info():

  1. The Price column can be transformed to float, and Last Updated to datetime.
  2. The Rating, Type, Content Rating, Current Ver & Android Ver columns have null values.

Execute df.dtypes to get just the data types of the columns. Note that dtypes is an attribute, not a method!
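A minimal sketch of info() versus the dtypes attribute, on a miniature frame mimicking a few of the dataset's columns:

```python
import pandas as pd

# Illustrative rows only, not the real dataset
df = pd.DataFrame({"Price": ["0", "$4.99"],
                   "Last Updated": ["January 7, 2018", "March 1, 2018"],
                   "Rating": [4.1, None]})

df.info()           # prints column names, non-null counts, and dtypes
dtypes = df.dtypes  # the dtypes alone; an attribute, so no parentheses
```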

df.describe() returns descriptive statistics for numerical columns by default. Here, we have only one such column, i.e., Rating.

It shows the count of non-null values, mean/average, std (standard deviation), min and max values, and readings at the 3 quartiles (25%, 50%, 75%) of Rating.

df.describe(include=[object]) for categorical columns:

It shows the count of non-null values and the number of unique values; top is the most frequent value and freq is the frequency of the top value.
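Both describe() variants can be sketched on a tiny frame with one numeric and one categorical column:

```python
import pandas as pd

# Illustrative values only
df = pd.DataFrame({"Rating": [4.0, 4.5, 5.0, 3.5],
                   "Type": ["Free", "Free", "Paid", "Free"]})

num_stats = df.describe()                  # numeric columns: count, mean, std, quartiles
cat_stats = df.describe(include=[object])  # object columns: count, unique, top, freq
```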
  • Genres and Category carry equivalent information, so Genres can be removed; save the result to playstore_updated_df:
Output: after executing the drop() method on Genres.
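The drop step shown in the screenshot would look like this; the stand-in rows here are illustrative:

```python
import pandas as pd

# Stand-in for the DataFrame loaded earlier
playstore_df = pd.DataFrame({"App": ["Chess", "Maps"],
                             "Category": ["GAME", "TOOLS"],
                             "Genres": ["Board", "Tools"]})

# Drop the redundant Genres column and keep the result in a new variable
playstore_updated_df = playstore_df.drop(columns=["Genres"])
```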
  • Rename the columns below for better readability! Fetch a few rows to confirm the changes:
Updated DataFrame after renaming.
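The exact renames aren't visible without the screenshot; the mapping below is an assumption based on how the column is referred to later in the article (Installs becomes Number of Installs):

```python
import pandas as pd

playstore_updated_df = pd.DataFrame({"App": ["Chess"],
                                     "Installs": ["1,000+"]})

# Rename for readability; this mapping is illustrative
playstore_updated_df = playstore_updated_df.rename(
    columns={"Installs": "Number of Installs"})
```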
  • To format Number of Installs as a numeric type, first remove the + and , characters:
First 5 rows to check updated Number of Installs.

Now, convert Number of Installs using to_numeric(); note that by default it's converted to float.

Updated Number of Installs with float after this execution.

Next, convert it to the int type. There's no need to save the output to a variable every time!

Now Number of Installs is int data type.
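The whole conversion pipeline from the last three steps can be sketched as follows, on illustrative values:

```python
import pandas as pd

playstore_updated_df = pd.DataFrame(
    {"Number of Installs": ["1,000+", "500,000+", "100+"]})

# Remove the '+' and ',' characters so the strings can be parsed
col = (playstore_updated_df["Number of Installs"]
       .str.replace("+", "", regex=False)
       .str.replace(",", "", regex=False))

# to_numeric() yields float by default; cast to int afterwards
playstore_updated_df["Number of Installs"] = pd.to_numeric(col).astype(int)
```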

Q: Which are the top 3 or 5 categories, based on the number of apps?

FAMILY, GAME and TOOLS.
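A sketch of how such a question is typically answered with value_counts(); the category counts here are made up for illustration:

```python
import pandas as pd

playstore_updated_df = pd.DataFrame(
    {"Category": ["FAMILY"] * 4 + ["GAME"] * 3 + ["TOOLS"] * 2 + ["BEAUTY"]})

# value_counts() sorts by frequency, so head() gives the top categories
top3 = playstore_updated_df["Category"].value_counts().head(3)
```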

Q: Which is the most used Android version?

4.1+, 4.0.3+ and 4.0+

Sometimes it's challenging to get an accurate result. For instance, Varies with device in the above output is not a specific version!

Q: How many apps are paid and how many are free?

10039 apps are free and 800 are paid, but one app is counted with zero since it doesn't have a valid entry!

Q: How many installs does each app have? pivot_table() comes in handy. This may be familiar if you've worked with data in MS Excel.

Displays first 20 apps based on index.

Type has 1 null value; it can be removed with the dropna() method.

pivot() doesn't reshape the DataFrame if the column specified in the index param has repeated values.
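These last two steps can be sketched together; note the repeated "Chess" rows, which pivot_table() aggregates but pivot() would reject (the rows are illustrative):

```python
import pandas as pd

playstore_updated_df = pd.DataFrame(
    {"App": ["Chess", "Chess", "Maps", "Notes"],
     "Number of Installs": [1000, 2000, 500, 100],
     "Type": ["Free", "Free", "Free", None]})

# Drop the row whose Type is null, then total the installs per app
cleaned = playstore_updated_df.dropna(subset=["Type"])
installs = cleaned.pivot_table(index="App",
                               values="Number of Installs",
                               aggfunc="sum")
```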

  • Subsetting the DataFrame for further analysis; save it to playstore_df_sub:
Subsetting the original DataFrame (DF).
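Subsetting is just column selection with a list; the columns chosen below are illustrative, since the screenshot with the actual selection is not shown here:

```python
import pandas as pd

playstore_updated_df = pd.DataFrame({"App": ["Chess", "Maps"],
                                     "Category": ["GAME", "TOOLS"],
                                     "Rating": [4.5, 4.0],
                                     "Current Ver": ["1.0", "2.1"]})

# Keep only the columns needed for further analysis
playstore_df_sub = playstore_updated_df[["App", "Category", "Rating"]]
```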
  • Save the DF for later use; Jupyter lets you save the .ipynb file in various formats. The DataFrame itself can be saved to CSV or JSON format based on requirements. This saves it to a .csv file:
playstore_df_sub.to_csv("playstore_df_subset.csv", index=False)

The file playstore_df_subset.csv gets saved to the root directory of the project.

  • Let's dive into Data Visualization (the graphical representation of data):
Import these modules to generate graphs.

%matplotlib inline displays the graphs in the same .ipynb file! Use rcParams attributes and set_style() to change the default values.

The following are a Histogram, a Bar graph, and a Count plot. Choose graphs based on the data and requirements; these are a few basic ones to get started:

Histogram: graphical representation of the distribution of numerical data.
Bar graph: to visualize count of categorical values.
Count plot: to view the app counts of each category.
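A sketch of the histogram and the per-category counts using plain matplotlib (the article also uses seaborn's countplot, which gives the same picture); the data and file names here are illustrative, and the Agg backend stands in for %matplotlib inline outside a notebook:

```python
import os

import matplotlib
matplotlib.use("Agg")  # headless backend; in a notebook use %matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

plt.rcParams["figure.figsize"] = (8, 4)  # tweak defaults via rcParams

# Illustrative stand-in data
playstore_updated_df = pd.DataFrame(
    {"Rating": [4.1, 4.5, 3.8, 4.7, 4.2, 4.4],
     "Category": ["GAME", "FAMILY", "GAME", "TOOLS", "FAMILY", "FAMILY"]})

# Histogram: distribution of a numerical column
playstore_updated_df["Rating"].plot(kind="hist", title="Rating distribution")
plt.savefig("rating_hist.png")
plt.clf()

# Bar graph of app counts per category
playstore_updated_df["Category"].value_counts().plot(
    kind="bar", title="Apps per category")
plt.tight_layout()
plt.savefig("category_counts.png")

saved = os.path.exists("rating_hist.png") and os.path.exists("category_counts.png")
```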

Another aspect is Wordcloud, especially for this type of data:

Install wordcloud module as below:

$ pip3 install wordcloud   # Mac OS
> pip install wordcloud    # PC
Import the wordcloud module and generate the list of required categorical values.

Alternative code to generate the above list:

names = list(playstore_updated_df["Category"].value_counts().index)
Code snippet to generate the cloud of words, and the resulting WordCloud image. The higher the frequency, the larger the font.
  • Finally, answer the other questions from this analysis:

Q1: Which is the most used Android version? = 4.1+

Q2: How many mobile app categories are there, and which are they? = The count is 34, and the names are below.

Q3: Which are the top 3 categories based on the number of apps? = FAMILY, GAME and TOOLS (check the WordCloud image or code)!

Q4: Which apps have a 5-star rating? = Highlighted below.

Tip: You can utilize Google Colab or Kaggle. They offer free GPU-powered notebooks. Kaggle also provides a large repository of community-published datasets. Below are the links for further reading. They come in handy for learning & sharing, especially for study groups!

Resources:

Summary: sometimes a brief description of each column, its data type, and its units of measurement will be available!

If you're looking to get started on DA using MySQL, see these articles:

  1. EDA with SQL, where I have described its concepts & coding
  2. To Interact with MySQL Database, on connecting to the MySQL server and database.

Hope you enjoyed the article as much as I enjoyed articulating it... feel free to clap!!

Thank you!

A new tech publication by Start it up (https://medium.com/swlh).

Ramya N

Data Analysis, Web Development, Tech Writer & ML/DL/NLP Enthusiast | OSS/community projects from ideas to MVP launch | Code Mentor | Health & Fitness Influencer
