The Data Analysts’ Toolkit: Models

Karen McEwen · Published in The Startup · Aug 3, 2020

You’ve cleaned up your data and done some exploratory data analysis. Now what? As data analysts we have a lot of tools in our toolkit, but just as a screwdriver can be used to hammer in a nail, a tool that works is not always the best tool for the job. Our tools are models, or if you prefer the mathematical term, algorithms. They allow us to make sense of the data we have collected and to make predictions.

There are three basic types of models, depending on the type of data. For continuous numerical data we have a variety of regression techniques. These are our screwdrivers and wrenches. Fairly simple to understand and use, they bring data together to fit them to some sort of line or multidimensional plane. For categorical or discrete data, we have clustering and classification models. These are our saws and knives. They separate the data into different pieces of like versus unlike. With so many choices, it may be difficult to know which tool to use under which circumstance. So, let’s look at each in turn.

Numerical regression models seek to find the best line to fit continuous numerical data. They can be linear, in which the dependent variable (usually called y) is modeled as a polynomial function of one or more independent variables. Nonlinear regression instead fits one or more independent variables to a logarithmic, exponential, or sigmoid function.

Linear regressions include:

1) Simple Linear Regression: one independent variable fit to a straight line:

  • y = mx + b, where m is the slope of the line and b is the value of y at x=0

2) Multiple Linear Regression: two or more independent variables fit to a plane (or hyperplane) of order 1:

  • y = mx + nz + c, where m and n are the slopes in the x and z directions, and c is the value of y at x = z = 0

3) Polynomial Regression: both simple and multiple linear regressions are actually special cases of polynomial regression, in which one or more independent variables are fit to a polynomial of order greater than 1 (see the code sketch after this list):

  • y = m₀ + m₁x + m₂x² + m₃x³ + …
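
To make these fits concrete, here is a minimal sketch using NumPy’s polyfit; the data points below are invented purely for illustration:

```python
import numpy as np

# Hypothetical data: x is the independent variable, y the dependent one
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Simple linear regression: fit y = mx + b (a polynomial of order 1)
m, b = np.polyfit(x, y, deg=1)
print(f"slope m = {m:.2f}, intercept b = {b:.2f}")

# Polynomial regression: fit y = m0 + m1*x + m2*x^2 (order 2)
coeffs = np.polyfit(x, y, deg=2)  # highest-order coefficient comes first
print("polynomial coefficients:", coeffs)

# Use the fitted line to predict y for a new x
print(f"predicted y at x = 6: {m * 6.0 + b:.2f}")
```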

Nonlinear regressions include:

1) Logarithmic Regression

  • y = a·log(x) or y = b·ln(x)

2) Exponential Regression (fitted in the code sketch after this list)

  • y = a·e^(bx)

3) Sigmoidal Regression: uses functions that create an S-curve, such as the logistic function or the hyperbolic tangent

  • y = a / (1 + e^(-b(x - c))) or y = a·tanh(bx) + c
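
Nonlinear fits like these are often done with SciPy’s curve_fit, which searches for the parameter values that minimize the squared error. A minimal sketch, fitting the exponential form above to invented data:

```python
import numpy as np
from scipy.optimize import curve_fit

# The nonlinear model to fit: y = a * e^(bx)
def exponential(x, a, b):
    return a * np.exp(b * x)

# Hypothetical data roughly following an exponential trend
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 8.2, 21.5, 60.3])

# curve_fit returns the best-fit parameters (and their covariance, ignored here)
params, _ = curve_fit(exponential, x, y)
a, b = params
print(f"fitted curve: y = {a:.2f} * e^({b:.2f}x)")
```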

In each of these cases, a line (or plane) is fit to continuous data. Note that it is also possible to split up your data into sections and fit different lines to each section. There are various techniques that you can use to determine the best fit line, but that is for another article.

What if you don’t have continuous data? What if you have only two or three discrete values: yes/no, for instance, or small/medium/large? Or perhaps twenty options, each apparently independent of the others. From a business standpoint, you may be asking which customers are likely to default on a loan, or which demographics are buying a particular product. In these cases you would find it difficult to fit a linear or nonlinear regression to your data. Instead we have other types of tools that sort data rather than fit it: classification models and clustering models. While similar, the chief difference is that with classification models, you already have predefined classes into which you sort your data. With clustering models, the data is sorted into like categories, without knowing what those categories are ahead of time. (Note that these models can also be used on continuous data, but you will need to bin the continuous data into discrete units.) While regressions fit a line to the data, classification and clustering draw lines or planes between the data, separating them into categories of like versus unlike.

Classification Models include:

  • Decision Trees: Here, the data is first split into two groups by a boolean test, True or False. At each subsequent branch, a new boolean test is applied, until the like data has been separated into distinct categories and can be split no further. This technique can get cumbersome once you go beyond a handful of branches.
  • Random Forest: Builds many decision trees, each on a random subset of the data and features, then combines their votes into a single, more robust prediction.
  • K-Nearest Neighbors (KNN): In this classification technique, a new data point is assigned the most common label among the K labeled points nearest to it. Despite the similar name, it differs from K-Means Clustering (below): KNN requires labeled training data, and the analyst chooses only K, the number of neighbors to consult.
  • Logistic Regression: The name sounds like this should be similar to logarithmic regression, but it is actually entirely different. In fact, it isn’t even a regression, but a classification algorithm. It is used to determine the probability of success or failure, or the probability of one outcome over another. (Both KNN and logistic regression appear in the code sketch after this list.)
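
To make this concrete, here is a minimal scikit-learn sketch of KNN and logistic regression. The loan-default data below is entirely invented, and the features are hypothetical:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: [age, income in $1000s] -> defaulted? (1 = yes)
X_train = [[25, 40], [30, 60], [45, 80], [50, 30], [35, 50], [60, 90]]
y_train = [1, 0, 0, 1, 1, 0]

# KNN: label a new customer by majority vote of the 3 nearest known customers
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print("KNN prediction:", knn.predict([[40, 55]]))

# Logistic regression: estimates the probability of each outcome
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print("P(no default), P(default):", logreg.predict_proba([[40, 55]]))
```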

Clustering Models include:

  • Hierarchical (divisive) clustering: Generally used with smaller data sets, as it quickly becomes unwieldy with too much data. Starts with a single cluster containing the entire data set and, with each iteration, splits into more clusters until every data point has been assigned to a branch that no longer changes. Similar to a decision tree, except that you do not know the categories ahead of time. Usually shown on a dendrogram.
  • Agglomerative clustering: A special case of hierarchical clustering that works from the bottom up. Each data point begins in its own cluster, and with each iteration, similar clusters are merged together. Like divisive hierarchical clustering, this works best with smaller data sets because of its time and memory costs.
  • K-Means: A method of partitioning observations into k clusters, where the data within each cluster is more closely related to one another than to the data outside the cluster. It works iteratively: at each round, each cluster center moves until all points have been assigned and the clusters no longer change (see the code sketch after this list). K-Means clustering can be used with both large and small data sets, and works best with data that forms roughly spherical groups.
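
Here is a short K-Means sketch in scikit-learn, again with made-up points; the algorithm finds the cluster labels on its own:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical, unlabeled 2-D data: two loose groups of points
X = np.array([[1, 2], [1, 4], [2, 3],
              [8, 8], [9, 10], [8, 9]])

# Partition into k = 2 clusters; centers move each round until stable
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("cluster assignments:", labels)
print("cluster centers:", kmeans.cluster_centers_)
```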

Classification and clustering models can be used with numeric data, or with non-numeric data that has been converted to numbers. With label encoding, each discrete value gets an arbitrary integer: for three clothing sizes, you might encode Small as 1, Medium as 2, and Large as 3. These numbers are merely labels with no arithmetic meaning; in this case 1 + 2 != 3. With one-hot encoding, each category instead becomes its own binary column, which avoids implying any order at all.
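
A quick sketch of both encodings in pandas, using the clothing-size example above:

```python
import pandas as pd

sizes = pd.DataFrame({"size": ["Small", "Medium", "Large", "Medium"]})

# Label encoding: arbitrary integers with no arithmetic meaning (1 + 2 != 3)
sizes["size_label"] = sizes["size"].map({"Small": 1, "Medium": 2, "Large": 3})

# One-hot encoding: one binary column per category, implying no order at all
one_hot = pd.get_dummies(sizes["size"], prefix="size")

print(sizes.join(one_hot))
```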

Like the regression models above, these models can be used both to describe your current set of data and to make predictions about new data. Using machine learning, you can train these models on data you already know in order to predict data that you do not. The mechanics of that are beyond the scope of this article, but there are many great resources on machine learning.

Conclusion:

We have many tools for modeling data in our data analyst toolkit. Regression models are the screwdrivers and wrenches of our kit, pulling continuous data together and fitting it to some sort of line or plane in one or more dimensions. Classification and clustering models are our saws and knives, cutting the data apart and separating it into groups or clusters of like versus unlike. These are the most basic models in our toolkit, and it is important to understand when we can use one type of model or another, and which is the best model for our data.

For Further Learning:

For a great background in data science, try Confident Data Skills, by Kirill Eremenko, a data scientist out of Australia who is head of SuperDataScience. You can check out his online courses on Udemy as well. He is very enthusiastic about data science and his courses are well plotted and easy to follow.

For a really in-depth look at the mathematics behind these models and other machine learning models, look at Machine Learning: A Concise Introduction, by Steven Knox. Steve is the head of data analytics at the NSA, and a former colleague of mine. His book won the award for best prose in a textbook, and is straightforward and easy to follow, with a depth of mathematical rigor that most data analysts tend to gloss over.

For a great online course, try IBM’s data science track on Coursera, a series of nine courses using Python for data science that covers everything from the basics of data analysis up through machine learning models. It is especially well done, with lots of labs, assignments, and projects, including a final capstone project to complete the data science certificate.

And, of course, there is the data science section of Medium, which offers a wide variety of data science topics from beginner to advanced, and has been a wealth of information for me as a career changer.

About me: I am a lifelong user of data, originally as an environmental engineer, then (surprisingly) in the field of ministry. Having left that world, I have relearned old data analytic techniques and the wealth of new tools, to become a freelance data analyst. You can find me on LinkedIn.
