Installation of Jupyter Notebook & getting started with Machine Learning
So, what software do we need in order to code?
We recommend using the Anaconda distribution to install Jupyter Notebook.
Before this, please make sure Python (with IDLE) is installed on your system. For computational work, Python 2.7 and 3.6 are recommended here because they provide a compatible environment for importing libraries without issues. Note that Python 2.7 reached its end of life in January 2020, so prefer a 3.x version for new projects. Navigate to the link below to install Python 2.7.
Python Release: Python 2.7.15 (release date: May 1, 2018), a bugfix release in the Python 2.7 series.
Go to Downloads > Windows / macOS > download the 2.7.x version.
You can also select the Python version directly using Anaconda:
Anaconda is a free and open-source distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.). It aims to simplify package management and deployment, and it lets us create separate Python environments, each with its own settings.
Below is the link to download the Anaconda application:
Creating a conda environment:
conda create --name myenvironment python=2.7
conda activate myenvironment
Running these commands in the Anaconda command prompt creates a conda environment with Python 2.7 and then activates it (-n is the short form of --name).
A few applications come bundled with Anaconda Navigator, as shown below. Which application we use for machine learning depends on our requirements:
Jupyter Notebook can run code in many programming languages, but Python is required to install it (recommended: Python 2.7 or Python 3.6). It aims to create a better working experience for data scientists.
“The name Jupyter is an acronym which stands for the three languages it was designed for: JUlia, PYThon, and R.”
To learn more about Jupyter Notebook, visit https://jupyter.org/
Steps to open Jupyter Notebook:
After the installation, let us take our first step towards machine learning: data collection.
Data collection is the gathering of data in one place. The data should be gathered from as many sources as possible.
Collecting data lets us capture a series of past events so that we can analyse them and detect recurring patterns. These patterns can be recognised with the data visualization tools that become available when we import predefined Python libraries. From those patterns we build a trained predictive model, whose performance is then tested using machine learning algorithms chosen according to the inputs and outputs of the dataset.
Naturally, the models will only be as good as the data collected. Data collection practices play a vital role in developing high-performing models.
The data should be saved as a spreadsheet in .csv format. Then, by observing the data carefully, separate the inputs from the output. For instance, in the cancer data set the inputs are all the parameters taken into consideration to predict the result, and the output is the type of cancer: Malignant or Benign.
Link to the data sets is given below:
For instance you can start with cancer data set : Download “cancer.csv” to
You can download the above cancer file in .csv format to your desktop.
To run the code, press Shift+Enter.
[*] on the left shows that the cell is still executing.
A number in brackets, such as [1], on the left shows that the cell executed successfully.
(Try running each line one at a time so that you notice the changes.)
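NOTE: To install a library before importing it:
conda install library_name : type this in the Anaconda command prompt and press Enter
e.g. conda install opencv (a computer vision library, used for tasks such as face recognition)
!pip install library_name : type this in a Jupyter Notebook cell and run it
e.g. !pip install opencv-python (the name under which OpenCV is published for pip)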
The following code needs to be implemented line by line to analyse the data:
→ import os : Imports the os library, which ships with Python.
This module provides functions for interacting with the operating system in a portable way.
→ os.getcwd() : Gives the current working directory.
→ To change the directory : os.chdir("C:/Users/file name/Desktop")
→ Put the location of your own data in place of this path. Windows paths are written with ‘\’, which Python treats as an escape character inside a string, so change every ‘\’ to ‘/’ when you write the path.
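The steps above can be sketched as follows. This is only an illustration: a temporary folder stands in for the Desktop path, and the variable names (start_dir, data_dir, new_dir) are made up for this example.

```python
import os
import tempfile

# Where are we right now? Relative file names are resolved from here.
start_dir = os.getcwd()
print("Started in:", start_dir)

# A temporary folder stands in for the folder that holds your data,
# e.g. "C:/Users/<your user name>/Desktop" on Windows (assumed path).
with tempfile.TemporaryDirectory() as data_dir:
    os.chdir(data_dir)          # change the working directory
    new_dir = os.getcwd()
    print("Now in:", new_dir)
    # Move back before the temporary folder is deleted.
    os.chdir(start_dir)
```

In your own notebook you would call os.chdir() once with the folder that contains cancer.csv, so that the file can later be opened by name alone.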
→ import numpy as np : Imports NumPy (Numerical Python) for arithmetic operations. It stores and manipulates data as arrays and makes advanced mathematical computations much easier.
→ import pandas as pd : This library is used to create data frames and to read .csv files.
- Pandas stores the data in its own data structures (Series and DataFrame).
- It is designed to work with relational and labelled data, which makes importing and analysing data sets much easier, and it supports manipulating and rearranging the data.
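→ data1 = pd.read_csv("cancer.csv") : Reads the data from the cancer.csv file into a dataframe so that it can be displayed.
→ dataframe.shape : In our example, data1.shape. It gives the number of rows and columns as a tuple. Note that shape is an attribute, not a method, so it takes no parentheses. Then simply write the dataframe name (data1) to print the dataset.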
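Here is a minimal sketch of reading a .csv file and checking its shape. Since the real cancer.csv is not reproduced here, the code first writes a tiny made-up file; the column names (radius_mean, texture_mean, diagnosis) and the values are assumptions for illustration only and may not match your file.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for cancer.csv: two input columns and one output column.
sample = pd.DataFrame(
    {
        "radius_mean": [17.99, 13.54, 20.57, 12.45],
        "texture_mean": [10.38, 14.36, 17.77, 15.70],
        "diagnosis": ["M", "B", "M", "B"],  # M = Malignant, B = Benign
    }
)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "cancer_sample.csv")
    sample.to_csv(csv_path, index=False)

    # Read the .csv file into a dataframe, as you would with cancer.csv.
    data1 = pd.read_csv(csv_path)

print(data1.shape)   # (rows, columns); an attribute, so no parentheses
print(data1.head())  # the first rows of the dataset

# Separate the inputs from the output column.
X = data1.drop(columns=["diagnosis"])
y = data1["diagnosis"]
```

With the real file, only the pd.read_csv("cancer.csv") line and the name of the output column need to change.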
CHECKING FOR VALUES:
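Real-world data usually arrives in a raw format that cannot be analysed directly. We may find missing values in the process of data extraction. These need to be handled because they lower the quality of the data and the performance of the model as measured by its performance metric. They can also lead to wrong analysis and wrong conclusions, and are therefore a threat to the model’s correctness.
→ dataframe.describe() : It summarises each numeric column with its count, mean, standard deviation, min, max and the 25%, 50% and 75% percentiles.
The meaning of each attribute in the output is given below:
count : the number of non-null values in the column
mean : the average value
std : the standard deviation
min, max : the smallest and largest values
25%, 50%, 75% : the percentiles (50% is the median)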
→ The following syntax is used to check for NaN values in the data set.
Syntax : dataframe.apply(lambda x: sum(x.isnull()), axis=0)
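To see the NaN check and describe() in action, here is a small sketch on made-up data (the column names and values are invented for this example). It also shows dataframe.isnull().sum(), a shorter form that gives the same counts.

```python
import numpy as np
import pandas as pd

# Made-up data with two missing values.
df = pd.DataFrame(
    {
        "radius_mean": [17.99, np.nan, 20.57, 12.45],
        "texture_mean": [10.38, 14.36, np.nan, 15.70],
        "area_mean": [1001.0, 566.3, 1326.0, 477.1],
    }
)

# The syntax from the text: count the nulls column by column.
nan_counts = df.apply(lambda x: sum(x.isnull()), axis=0)

# An equivalent, shorter form.
nan_counts_short = df.isnull().sum()

print(nan_counts)

# describe() leaves the missing values out of its count.
summary = df.describe()
print(summary)
```

A column whose count in describe() is lower than the number of rows is a quick hint that it contains missing values.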
→ Pandas dataframe.cov() is used to compute the pairwise covariance of the dataframe’s columns.
Null (NaN) values are excluded from the calculation automatically. Only numeric columns can take part; depending on your pandas version, non-numeric columns are either skipped or have to be excluded first (for example with numeric_only=True).
→ Pandas dataframe.corr() is used to find the pairwise correlation of all the columns in the dataframe. Here too, null values are excluded and only numeric columns are used.
NOTE: The correlation of any variable with itself is always 1.
A correlation measures the statistical relationship between two variables (i.e. the dependent and independent variables).
A correlation can be:
→ Positive: As the value of one variable increases, the value of the other also increases, and vice versa (they move in the same direction).
→ Negative: As the value of one variable increases, the value of the other decreases, and vice versa (they move in opposite directions).
→ Why do we need covariance and correlation? Where do we use them? 💭
These insights play a crucial role in finding statistical relationships between variables, and later on they will be passed as parameters to the functions that plot graphs.
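The following sketch computes covariance and correlation on made-up data in which y rises with x and z falls as x rises. The column names are invented, and select_dtypes() is used here as one version-independent way to keep only the numeric columns.

```python
import pandas as pd

# Made-up data: y rises with x, z falls as x rises, label is text.
df = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0, 5.0],
        "y": [2.0, 4.0, 6.0, 8.0, 10.0],
        "z": [10.0, 8.0, 6.0, 4.0, 2.0],
        "label": ["a", "b", "a", "b", "a"],
    }
)

# Keep only the numeric columns before computing the matrices.
numeric = df.select_dtypes(include="number")

cov_matrix = numeric.cov()    # pairwise covariance
corr_matrix = numeric.corr()  # pairwise (Pearson) correlation

print(cov_matrix)
print(corr_matrix)
```

In the correlation matrix, x and y come out as a positive correlation, x and z as a negative one, and every entry on the diagonal is 1.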
DATA VISUALIZATION TOOLS
import matplotlib.pyplot as plt
matplotlib.pyplot is a collection of command-style plotting functions for visualizing data that makes Matplotlib work like MATLAB. Each pyplot function makes some change to a figure, i.e. it creates a figure, sets up a plotting area, assigns labels to the x and y axes, plots lines and bar graphs, changes the colors of the plotted lines, etc.
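As a minimal sketch of those steps (creating a figure, plotting a line, labelling the axes and changing the line color), consider the example below. The data is made up, and the "Agg" backend line is only there so that the code also runs without a display; inside Jupyter Notebook you can leave it out.

```python
import matplotlib

matplotlib.use("Agg")  # draw off-screen; not needed inside Jupyter Notebook
import matplotlib.pyplot as plt

# Made-up data for illustration.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

fig, ax = plt.subplots()                 # create a figure and a plotting area
ax.plot(x, y, color="green", marker="o", label="y = x squared")
ax.set_xlabel("x")                       # label the x axis
ax.set_ylabel("y")                       # label the y axis
ax.set_title("A simple line plot")
ax.legend()
```

In a notebook the figure appears under the cell once it has run; in a plain script you would call plt.show() at the end.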
→ A boxplot is a graphical summary of the distribution of a variable.
The middle line in the boxplot figure denotes the median of the data. The lower and upper hinges (the edges of the box) denote the quartiles:
First quartile (Q1), the 25th percentile: the middle number between the minimum value and the median.
Median (Q2), the 50th percentile: the middlemost value of the dataset.
Third quartile (Q3), the 75th percentile: the middle value between the median and the maximum value of the dataset.
Interquartile range (IQR): the distance from the 25th to the 75th percentile (i.e. Q3 - Q1).
→ Boxplots are a standard way of pictorially representing the distribution of data in a dataset.
→ They are also used to identify outliers in the data.
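The numbers behind a boxplot can be worked out by hand, as in this sketch on made-up data. The 1.5 × IQR rule used to flag outliers is an assumption added here: it is the usual convention (and Matplotlib's default whisker length), but the text above does not name a specific rule.

```python
import numpy as np

# Made-up data with one obviously extreme value.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Quartiles: the 25th, 50th and 75th percentiles.
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Assumed convention: points more than 1.5 * IQR beyond the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print("Q1:", q1, "median:", median, "Q3:", q3, "IQR:", iqr)
print("Outliers:", outliers)
```

Calling plt.boxplot(data) on the same array draws the box from Q1 to Q3 with the median line inside it, and shows the extreme value as a separate point.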
import seaborn as sns
The seaborn library is built on top of Matplotlib and adds features that help us visualize data more broadly. With it we can create more attractive and informative graphical representations.
If you work with statistics, seaborn is a good choice because it contains many built-in functions that make statistical tasks easier.
By default, seaborn infers the x-axis label and its range. It also chooses a default bin size when making a histogram, and it can plot a density curve on top.
The older seaborn function for making a histogram is distplot(), short for “distribution plot”. It has been deprecated since seaborn 0.11 in favour of histplot() and displot(). These functions take a column of a pandas dataframe and construct a histogram from it.
A heatmap represents the values of a matrix as a grid of color-coded cells, so that different values show up as different colors. We can also customise the colors used to represent the data.
When it is combined with clustering (for example, seaborn’s clustermap), the rows and columns are rearranged so that similar values sit side by side.
Heatmaps are often used in analytics to show user behaviour on specific webpages or webpage templates.
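As a sketch of how the earlier correlation matrix can feed a heatmap, the example below plots dataframe.corr() with seaborn. The data is randomly generated, the column names are made up, and the "coolwarm" color map is just one possible choice.

```python
import matplotlib

matplotlib.use("Agg")  # draw off-screen; not needed inside Jupyter Notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in for three numeric columns.
rng = np.random.default_rng(1)
df = pd.DataFrame(
    rng.normal(size=(50, 3)),
    columns=["radius_mean", "texture_mean", "area_mean"],
)

corr = df.corr()  # pairwise correlation matrix

fig, ax = plt.subplots()
# annot=True writes each value in its cell; cmap picks the color scheme.
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
```

Fixing vmin and vmax at -1 and 1 keeps the color scale comparable between different datasets, since a correlation always lies in that range.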
😃 Well, then that’s it for this blog.
Hope you liked it 👍
If you have any queries, please feel free to ask Renish Sundrani or Kiran Lakhani.
We will soon come up with the next blog, in which we will extend this knowledge to Machine Learning Algorithms and much more..✌
About me: This is my very first blog on machine learning using python.
As a data science enthusiast, I felt the urge to share my knowledge, starting from scratch, with others in the domain. I would like to share my journey, my experiences and the steps I followed in analyzing a data set. There are numerous blogs and websites available for data science, but I aim to reduce the complexity of the steps involved and the code implemented, so that people have fun learning machine learning, understand it easily and don’t give up.
You can also connect with us via LinkedIn: https://www.linkedin.com/in/renish-sundrani-6a748317a & https://www.linkedin.com/in/kiran-lakhani-20