Day (9) — Machine Learning — Using LinearRegression with scikit-learn

Keith Brooks
6 min read · Mar 21, 2018


“A person reading a book with a magnifying glass and a pen in hand” by João Silas on Unsplash

This article covers work from the Python for Data Science and Machine Learning Bootcamp course on Udemy by Jose Portilla, along with helpful tips picked up along the way. The course was very helpful for gaining a base understanding of the topic.

We’re back!!! But this will not be a dinosaur story. We left off with learning how to use choropleth maps with plotly. Now it is time to get our hands dirty with this thing called machine learning. One way to view machine learning is as a method that allows computers to discover insights about a topic without being explicitly programmed. It is a method of data analysis that automates analytical model building. Some of its use cases include the following:

  • Fraud detection
  • Equipment failure prediction
  • Sentiment analysis
  • Image recognition and much more.

Overview of Machine Learning Process:
This process may be split up into five phases. The first is the acquisition of data. Think of the old saying “Garbage in, garbage out”.

Photo by Hermes Rivera on Unsplash

There are free and paid services that offer datasets. Kaggle has some great datasets to get started with, plus competitions for hands-on learning. The next step is data cleaning. Here we can use skills from pandas, numpy and others to prepare the data for training. Once the data has been cleaned, we split it into a training set and a test set. We then fit (i.e. train) the model on the training set. Once the model has been trained, we use the test set to evaluate it and iterate over the process until the desired results are achieved. When we have a satisfactory model, we deploy it.

Main types of Machine Learning:
These include supervised, unsupervised and reinforcement learning. Supervised learning consists of having labeled data to train on and trying to predict a label from known features. The algorithm learns by comparing its predicted outputs with the correct outputs to identify errors. Some methods are classification, regression and prediction. Use cases are where historical data predicts a future event (i.e. stock prices). Unsupervised learning is having unlabeled data and trying to group similar data points together to gain insights. Here the right answer is not provided. The main purpose is to explore the data and find attributes that separate segments. Typical techniques include self-organizing maps, nearest neighbor mapping and singular value decomposition (i.e. think: if one guy with a red shirt bought this, maybe another will also; a very simple example, I know…). Reinforcement learning is where an algorithm learns an action from experience (i.e. practice). This is used for robotics, gaming and navigation. The algorithm discovers which actions yield the best results through trial and error.

Linear Regression:
The purpose of regression is to fit a line that is as close as possible to every data point. The residual is the difference between a data point (i.e. observation) and the line of best fit.
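To make that concrete (a standard definition, not specific to the course): if y_i is an observed value and ŷ_i is the value the fitted line predicts for it, the residual is e_i = y_i − ŷ_i, and ordinary least squares chooses the line that minimizes the sum of squared residuals, Σ e_i².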

Topics:
* How to conduct Data Exploration
* How to train our model using linear regression

The Setup:
* The example uses Python 3.6 within a Jupyter notebook with the below dependencies
Matplotlib 2.1.2
Numpy 1.14.1
Pandas 0.20.3
Seaborn 0.7.1
scikit-learn 0.19.1

Warning:
* Feel free to review the docs for additional arguments for the methods.

How to conduct Data Exploration:
Before we start splitting data and training models we will need to do a little data exploration to understand the information that we have. Obtaining a clear understanding of the data we are working with provides a better foundation to analyze and apply machine learning techniques later.

Step 1 — Conduct Imports:
To begin we must conduct the necessary imports. We will use pandas and numpy for data manipulation and aggregation. We will use seaborn and matplotlib for data visualization.
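A minimal sketch of what that import cell might look like (the %matplotlib inline magic assumes we are working inside a Jupyter notebook):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter magic so plots render inline in the notebook
%matplotlib inline
```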

Step 2 — Obtaining Data
We will use the pandas method .read_csv() to read in the data. The data file resides in the same directory as my working Jupyter notebook.
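Something like the following, where the file name USA_Housing.csv is an assumption based on the housing columns referenced later in this article; substitute your own file:

```python
# File name is an assumption; adjust the path to your own dataset
df = pd.read_csv('USA_Housing.csv')
df.head()  # peek at the first five rows
```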

Step 3 — Investigating Data
To investigate, we will use the .info() method to identify row and column details. The .describe() method provides statistical insights on the dataset, while the .columns attribute returns all of the column names.
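A quick sketch of those three calls:

```python
df.info()      # row count, column names, dtypes and non-null counts
df.describe()  # count, mean, std, min/max and quartiles per numeric column
df.columns     # an Index of all column names (an attribute, not a method)
```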

Step 4 — Visualize the Data
Seaborn’s .pairplot() method displays the pairwise relationships of the numeric data for a quick high-level view. The seaborn .distplot() method displays the distribution of a desired feature (i.e. column), while the seaborn .heatmap() method provides a clearer picture of feature correlations.
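For example (the 'Price' column is an assumption carried over from the housing dataset):

```python
sns.pairplot(df)                    # pairwise scatter plots across numeric columns
sns.distplot(df['Price'])           # distribution of the assumed target column
sns.heatmap(df.corr(), annot=True)  # correlation matrix as an annotated heatmap
```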

How to train our model using linear regression:

Step 1 — Split dataset into “X” features and “y” labels:
This step prepares us for fitting (i.e. training) the model later. The “X” variable is a collection of all the features. Think of this as a house’s age, number of bedrooms, number of rooms, etc. The “y” variable is the target label, which in this case is the price. Our goal is to learn how these features relate to the price so the model can predict it.
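A sketch of the split; the column names below are assumptions modeled on the usual USA housing practice dataset, so adjust them to whatever your data actually contains:

```python
# Feature matrix: every column we want the model to learn from
X = df[['Avg. Area Income', 'Avg. Area House Age',
        'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms',
        'Area Population']]

# Target label: the value we want to predict
y = df['Price']
```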

Step 2 — Split the data into a training and test set:
This allows us to train our model on the training set and evaluate the built model against the test set to identify errors. I was still trying to fully understand the random_state argument; in short, it seeds the random shuffling so the split is reproducible, and here is some help (https://stackoverflow.com/questions/45089858/what-does-the-random-state-parameter-do-in-sklearns-parametersampler).
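A sketch using scikit-learn’s train_test_split (the test_size and random_state values here are illustrative):

```python
from sklearn.model_selection import train_test_split

# Hold out 40% of the rows for testing; random_state seeds the shuffle
# so the same split is reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=101)
```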

Step 3 — Create and Train the Model:
Here we create a LinearRegression object and use the .fit() method to finally train the model. Upon completion, the notebook cell echoes the fitted LinearRegression object and its parameters, confirming that training has completed.
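For example:

```python
from sklearn.linear_model import LinearRegression

lm = LinearRegression()
lm.fit(X_train, y_train)  # the cell output echoes the fitted estimator
```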

Step 4 — Evaluate the Model by reviewing the coefficients:
By reviewing the coefficients, we are able to evaluate the model. The Avg. Area Income element in the figure below can be interpreted as: for every 1 unit that feature increases, the price of the house increases by 21.53, holding the other features fixed. Obviously, this is a practice dataset and these numbers do not pass the real-world test. However, there are others out there to gain practice with.
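A sketch of how the coefficients can be labeled by feature name for review (coeff_df is just an illustrative variable name):

```python
# One coefficient per feature, indexed by column name
coeff_df = pd.DataFrame(lm.coef_, index=X.columns, columns=['Coefficient'])
print(coeff_df)
print(lm.intercept_)  # the model's intercept term
```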

Tips, Tricks and Lessons Learned:
- Web scraping: great for creating your own custom datasets…
- (Book) An Introduction to Statistical Learning: great for understanding the math behind the algorithms
- (Book) The Elements of Statistical Learning
- (Data Exploration Tools) Using the .info() method to investigate data
- (Data Exploration Tools) Using the .describe() method to investigate data
- (Data Exploration Tools) Using distribution plots
- (Data Exploration Tools) Using the .pairplot() method for correlations
- (Data Exploration Tools) Using the .heatmap() method for correlations
~ Guilty Pleasures
a) Transcending Time
b) Easy Listening

Well, until next time…

It’s not because things are difficult that we dare not venture. It’s because we dare not venture that they are difficult. ~ Seneca the Younger
