A Beginner’s Guide To Scikit Learn — Implement Scikit Learn In Logistic Regression

Published in Edureka, Dec 4, 2017

In this article, we will discuss Scikit learn in Python. Before talking about Scikit learn, one must understand the concept of machine learning. With machine learning, you don’t have to extract every insight manually: you choose an algorithm, and the machine learns the patterns from the data for you. Isn’t this exciting? Scikit learn is one of the most popular ways to implement machine learning in Python. It is a free machine learning library that contains simple and efficient tools for data analysis and data mining. I will take you through the following topics, which will serve as fundamentals for the upcoming blogs:

What Is Machine Learning?
Overview Of Scikit Learn
Installation
Use Case - Logistic Regression

What is Machine learning?

Machine learning is a type of artificial intelligence that allows software applications to learn from data and become more accurate at predicting outcomes without being explicitly programmed. But how does that happen? The machine is trained on some data and, based on that, it detects patterns and builds a model. This process of gaining knowledge from data and turning it into useful insights is what machine learning is all about. Refer to the image below to get a better understanding of how it works:

Working of Machine Learning- Sci-Kit Learn Tutorial

Using the training data, a learning algorithm builds a predictive model. Later on, we tune the model to improve its accuracy using feedback data, and then use the tuned model to make predictions on new data. Later in this article, we will walk through a use case with one such algorithm, where we train a model and test it on held-out data; this will give you a better sense of whether the approach is a good fit for your particular problem.

Next, there are three types of machine learning:

Supervised Learning:

Supervised learning is the process of an algorithm learning from a labeled training dataset. You have input variables (X) and an output variable (Y), and you use an algorithm to learn a mapping function from the input to the output. It is also known as predictive modeling, which refers to the process of making predictions using data. Some of the algorithms include Linear Regression, Logistic Regression, Decision Tree, Random Forest, and the Naive Bayes classifier. Later in this article, we will discuss a use case of supervised learning where we train the machine using logistic regression.

Unsupervised Learning:

Unsupervised learning is a process where a model is trained using data that is not labeled. It can be used to cluster the input data into groups on the basis of their statistical properties. A common form of unsupervised learning is cluster analysis, which means grouping objects based on the information found in the data that describes the objects or their relationships. The goal is that objects in one group should be similar to each other but different from objects in another group. Some of the algorithms include K-means clustering and hierarchical clustering.
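As a rough illustration of clustering, the sketch below groups six made-up 2D points into two clusters with Scikit learn’s KMeans. The points, the choice of two clusters, and the variable names are assumptions made for this example only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six made-up points: three near (1, 1) and three near (8, 8)
points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
                   [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])

# No labels are passed in: the algorithm groups the points on its own
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print(labels)                   # cluster index assigned to each point
print(kmeans.cluster_centers_)  # centre of each discovered cluster
```

The cluster indices themselves are arbitrary; what matters is that the first three points end up in one group and the last three in the other.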

Reinforcement learning:

Reinforcement learning is learning by interacting with an environment. An RL agent learns from the consequences of its actions rather than from being taught explicitly. It selects its actions on the basis of its past experiences (exploitation) and also by trying new choices (exploration).
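To make exploitation and exploration a little more concrete, here is a minimal sketch of an epsilon-greedy agent in plain Python. The environment (three actions with hidden reward probabilities), the value of epsilon, and all names are hypothetical and chosen only for illustration; Scikit learn itself does not provide reinforcement learning.

```python
import random

random.seed(0)

# Hypothetical environment: three actions, each paying a reward of 1
# with a different (hidden) probability
reward_probability = [0.2, 0.5, 0.8]

estimates = [0.0, 0.0, 0.0]   # agent's running estimate of each action's value
counts = [0, 0, 0]            # how many times each action was tried
epsilon = 0.1                 # fraction of steps spent exploring

for step in range(5000):
    if random.random() < epsilon:
        action = random.randrange(3)              # exploration: try a random action
    else:
        action = estimates.index(max(estimates))  # exploitation: best action so far
    reward = 1 if random.random() < reward_probability[action] else 0
    counts[action] += 1
    # Incremental average of the rewards seen for this action
    estimates[action] += (reward - estimates[action]) / counts[action]

best_action = estimates.index(max(estimates))
print(best_action, counts)
```

After enough steps, the agent’s estimates should single out the action with the highest reward probability, even though it was never told which one that is.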

Overview of Scikit Learn

Scikit learn is a library used to perform machine learning in Python. It is an open source library licensed under BSD and is reusable in various contexts, encouraging academic and commercial use. It provides a range of supervised and unsupervised learning algorithms in Python. Scikit learn is built on top of, and commonly used together with, the following packages:

NumPy
Matplotlib
SciPy (Scientific Python)

To work with Scikit learn, we first need to install the above packages. You can download them using the command line or, if you are using PyCharm, you can install them directly from the settings in the same way you do for other packages.

Next, in a similar manner, you have to install Sklearn. Scikit learn is built upon SciPy (Scientific Python), which must be installed before you can use Scikit learn. You can refer to this website to download the same. If SciPy and the wheel package are not present, you can install SciPy with the command below:

pip install scipy

I have already downloaded and installed it; you can refer to the screenshot below in case of any confusion.
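One quick way to confirm that the installation worked is to import the packages and print their versions. This is only a small sanity check; the version numbers you see will depend on your own environment.

```python
import numpy
import scipy
import sklearn

# Each package exposes its installed version as a string
print("NumPy:", numpy.__version__)
print("SciPy:", scipy.__version__)
print("Scikit learn:", sklearn.__version__)
```

If any of these imports fails with an ImportError, the corresponding package is not installed in the interpreter you are running.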

After installing the above libraries, let’s dig deeper and understand how exactly Scikit learn is used.

Scikit learn comes with sample datasets, such as iris and digits. You can import the datasets and play around with them. After that, you have to import SVM, which stands for Support Vector Machine. SVM is a supervised machine learning algorithm that can be used for classification and regression.

Let us take an example where we use the digits dataset, in which each sample is labeled with the number it represents, for example 0 1 2 3 4 5 6 7 8 9. Refer to the code below:

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn import svm

digits = datasets.load_digits()
print(digits.data)

Output –

[[ 0. 0. 5. ..., 0. 0. 0.]
 [ 0. 0. 0. ..., 10. 0. 0.]
 [ 0. 0. 0. ..., 16. 9. 0.]
 ..., 
 [ 0. 0. 1. ..., 6. 0. 0.]
 [ 0. 0. 2. ..., 12. 0. 0.]
 [ 0. 0. 10. ..., 12. 1. 0.]]

Here we have just imported the libraries (SVM and datasets) and printed the data. digits.data is a long array in which each row holds the pixel values of one digit image. It gives access to the features that can be used to classify the digit samples. Next, you can also try some other attributes such as target, images etc. Consider the example below:

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn import svm

digits = datasets.load_digits()
print(digits.target)
print(digits.images[0])

Output –

[0 1 2 ..., 8 9 8]                  // target of the data
[[ 0. 0. 5. 13. 9. 1. 0. 0.]         // image of the data
 [ 0. 0. 13. 15. 10. 15. 5. 0.]
 [ 0. 3. 15. 2. 0. 11. 8. 0.]
 [ 0. 4. 12. 0. 0. 8. 8. 0.]
 [ 0. 5. 8. 0. 0. 9. 8. 0.]
 [ 0. 4. 11. 0. 1. 12. 7. 0.]
 [ 0. 2. 14. 5. 10. 12. 0. 0.]
 [ 0. 0. 6. 13. 10. 0. 0. 0.]]

As you can see above, the target digits and the image of the first digit are printed. digits.target gives the ground truth for the digits dataset, that is, the number corresponding to each digit image. Next, data is always a 2D array with shape (n_samples, n_features), although the original data may have had a different shape. In the case of the digits, each original sample is an image of shape (8, 8) and can be accessed using digits.images.
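The relationship between the two shapes can be checked directly. The short sketch below prints the shapes of data, images and target, and then flattens each 8 by 8 image into a row of 64 values to compare it with digits.data; the variable name flattened is just for this example.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

print(digits.data.shape)     # (n_samples, n_features)
print(digits.images.shape)   # (n_samples, 8, 8)
print(digits.target.shape)   # one label per sample

# Flattening each 8x8 image into a row of 64 values gives back digits.data
flattened = digits.images.reshape(len(digits.images), -1)
print(np.array_equal(flattened, digits.data))
```

So the 64 features of each sample are simply the 64 pixel intensities of the corresponding image.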

Learning and Predicting

Next, in Scikit learn, we have used a dataset (samples of 10 possible classes, the digits from zero to nine) and we need to predict the digit when an image is given. To predict the class, we need an estimator, which predicts the classes to which unseen samples belong. In Scikit learn, an estimator for classification is a Python object that implements the methods fit(X, y) and predict(T). Let’s consider the example below:

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn import svm

digits = datasets.load_digits()                      # dataset
clf = svm.SVC(gamma=0.001, C=100)
print(len(digits.data))
x, y = digits.data[:-1], digits.target[:-1]          # train on all but the last sample
clf.fit(x, y)
# predict() expects a 2D array, so slice with [-1:] to keep the last sample as one row
print('Prediction:', clf.predict(digits.data[-1:]))  # predict the last sample
plt.imshow(digits.images[-1], cmap=plt.cm.gray_r, interpolation="nearest")
plt.show()

Output –

1797
Prediction: [8]

In the above example, we first printed the length of the dataset, which contains 1797 samples. Next, we used all of the samples except the last one (1796 examples, selected with the negative index [:-1]) as training data and kept the last sample for testing. We also need to check whether the machine has predicted the right digit. For that, we used Matplotlib to display the image of the last digit. So to conclude, you have the digits data, you got the target, you fit the model and predict with it, and hence you’re good to go! It’s really quick and easy, isn’t it?
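Testing on a single image says little about how good the model is. As a slightly more informative sketch, the variation below keeps the last 100 samples aside for testing and reports the fraction predicted correctly. The size of the test set (n_test) is an arbitrary assumption made for this example.

```python
from sklearn import datasets
from sklearn import svm

digits = datasets.load_digits()

# Assumption for this sketch: keep the last 100 samples for testing
n_test = 100
x_train, y_train = digits.data[:-n_test], digits.target[:-n_test]
x_test, y_test = digits.data[-n_test:], digits.target[-n_test:]

clf = svm.SVC(gamma=0.001, C=100)
clf.fit(x_train, y_train)

predictions = clf.predict(x_test)
# score() returns the fraction of test samples that were classified correctly
accuracy = clf.score(x_test, y_test)
print(predictions[:10], y_test[:10])
print(accuracy)
```

Comparing the first few predictions with the true labels gives a quick visual check, while the score summarizes the whole test set in one number.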

You can also visualize the target labels with an image, just refer to the below code:

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn import svm

digits = datasets.load_digits()
# Join the images and target labels in a list
images_and_labels = list(zip(digits.images, digits.target))
# for every element in the list
for index, (image, label) in enumerate(images_and_labels[:8]):
    # initialize a subplot of 2X4 at the i+1-th position
    plt.subplot(2, 4, index + 1)
    # Display images in all subplots
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    # Add a title to each subplot
    plt.title('Training: ' + str(label))
# Show the plot
plt.show()

Output –

As you can see in the above code, we have used the ‘zip’ function to join the images and target labels in a list and saved it into a variable, images_and_labels. After that, we have placed the first eight elements in a grid of 2 by 4, one at each position. Finally, we have displayed the images with the help of Matplotlib and given each one the title ‘Training: ’ followed by its label.

Use Case — Prediction using Logistic Regression

Problem Statement: A car company has released a new SUV in the market. Using the previous data about the sales of their SUVs, they want to predict the category of people who might be interested in buying this.

For this, let us look at a dataset with UserId, gender, age, estimated salary and purchased as columns. This is just a sample dataset; you can download the entire dataset from here. Once we import the data in PyCharm, it looks somewhat like this.

Now let us understand this data. As you can see in the above dataset, we have columns such as id, gender, age etc. Based on these columns, we are going to train our machine and predict whether a user purchases the SUV. So here, we have ‘age’ and ‘estimated salary’ as independent variables and ‘purchased’ as the dependent variable. Now we will apply supervised learning, i.e. the logistic regression algorithm, to predict the purchase outcome using the existing data.

First, let’s get an overview of logistic regression.

Logistic Regression: Logistic regression is used to predict the outcome of a categorical dependent variable. It is most widely used when the dependent variable is binary, i.e. when the number of available categories is two. The model estimates the probability of one of the two categories and converts it into a binary result. The usual outputs of logistic regression are:

Yes and No
True and False
High and Low
Pass and Fail
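To show how a binary answer comes out of the model, here is a minimal sketch in plain Python. It computes a linear score from two features, passes it through the sigmoid function to get a probability, and applies a threshold of 0.5. The weights, bias, feature values and the function name predict_purchase are all hypothetical; in practice Scikit learn learns the weights from the training data.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for two scaled features (age, estimated salary)
weights = [2.0, 1.0]
bias = -0.5

def predict_purchase(features):
    # Linear score, squashed into a probability by the sigmoid
    score = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(score)
    # Threshold at 0.5 to get the binary outcome (1 = Yes, 0 = No)
    return probability, int(probability >= 0.5)

print(predict_purchase([1.2, 0.3]))    # positive score: predicted 1
print(predict_purchase([-1.0, -0.5]))  # negative score: predicted 0
```

A positive score gives a probability above 0.5 and therefore a “Yes”, while a negative score gives a “No”.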

Now to begin with the code, we will first import these libraries: NumPy, Matplotlib, and Pandas. It is pretty easy to install Pandas in PyCharm by following the steps below:

Settings -> Add Package ->  Pandas -> Install

After this, we will import the dataset and separate the dependent variable (purchased) and the independent variables (age, salary) by:

import pandas as pd

dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values   # columns 2 and 3: age and estimated salary
y = dataset.iloc[:, 4].values        # column 4: purchased
print(X)
print(y)

The next step is to split the data for training and testing. A common strategy is to take all the labeled data and split it into training and testing subsets, usually with a ratio of 70–80% for the training subset and 20–30% for the testing subset. Hence, we create training and testing sets using train_test_split. It lives in the sklearn.model_selection module; the older sklearn.cross_validation module was removed in scikit-learn 0.20.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

We can also scale the input values for better performance using StandardScaler as shown below:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
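To see what the scaling step actually does, the sketch below applies StandardScaler to a few made-up (age, estimated salary) rows. The numbers are invented for illustration and are not taken from the dataset used in this article.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up (age, estimated salary) rows; the two columns are on very different scales
X_train = np.array([[25.0, 20000.0], [35.0, 50000.0], [45.0, 80000.0]])
X_test = np.array([[30.0, 35000.0]])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learn mean and std from the training data
X_test_scaled = sc.transform(X_test)        # reuse the same mean and std on the test data

print(X_train_scaled.mean(axis=0))  # close to 0 for each column
print(X_train_scaled.std(axis=0))   # close to 1 for each column
print(X_test_scaled)
```

Note that the scaler is fitted on the training data only and then reused on the test data, so that no information from the test set leaks into training.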

Now we will create our Logistic Regression model.

from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

We can use this and predict the results of our test set.

y_pred = classifier.predict(X_test)

Now, we can check how many predictions were correct and how many were not using a confusion matrix. Let us define Y as positive instances and N as negative instances. The four possible outcomes are arranged in a 2*2 confusion matrix, which we compute as shown below:

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)

Output –

[[65 3]
 [ 8 24]]

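Next, based on our confusion matrix, we can calculate the accuracy. In the matrix above, the rows are the actual classes and the columns are the predicted classes, so TN = 65, FP = 3, FN = 8 and TP = 24. So in our above example, the accuracy would be:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

= (24 + 65) / (24 + 65 + 3 + 8)

= 89 / 100 = 89%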
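The same arithmetic can be sketched in plain Python. The layout assumed here is the one Scikit learn’s confusion_matrix uses, where rows are the actual classes and columns are the predicted classes; the variable names are chosen for this example.

```python
# Confusion matrix from the article: rows are actual classes, columns are predicted
cm = [[65, 3],
      [8, 24]]

tn, fp = cm[0]   # actual negatives: correctly rejected, wrongly accepted
fn, tp = cm[1]   # actual positives: wrongly rejected, correctly accepted

correct = tp + tn      # predictions that matched the actual class
incorrect = fp + fn    # predictions that did not
accuracy = correct / (correct + incorrect)

print(correct, incorrect)  # 89 11
print(accuracy)            # 0.89
```

The correct predictions sit on the diagonal of the matrix, so accuracy is simply the diagonal sum divided by the total number of test samples.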

We have done this manually! Now let us see how the machine calculates the same for us. For that, we have a built-in function ‘accuracy_score’, which calculates the accuracy, as shown below:

# import the function accuracy_score
from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, y_pred)*100)     # prints the accuracy

Output –

89.0

Hurray! We have thus successfully implemented logistic regression using Scikit learn with an accuracy of 89%.
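If you do not have the CSV file at hand, the whole workflow can still be tried end to end. The sketch below follows the same steps on synthetic data generated with make_classification, which stands in for the (age, estimated salary) and purchased columns. The synthetic data and its parameters are assumptions made for this example, so the resulting numbers will differ from the 89% obtained above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

# Synthetic stand-in for the (age, estimated salary) -> purchased data
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           random_state=0)

# 75% of the rows for training, 25% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Scale the features using statistics from the training set only
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Train the model and predict the test set
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Evaluate the predictions
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
print(cm)
print(accuracy * 100)
```

Swapping the make_classification call for the pd.read_csv step shown earlier gives back the original use case.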

With this, we have covered just one of the many popular algorithms Python has to offer. We have covered all the basics of the Scikit learn library, so you can start practicing now. The more you practice, the more you will learn.


Originally published at www.edureka.co on December 4, 2017.