Data Science Glossary

Emeka Boris Ama · Published in 100daysofdscode · Mar 23, 2019

Let’s talk about the terminology used by Data Scientists and Machine Learning Engineers.

Getting started in data science can be daunting, especially when you consider the diversity of theories and techniques a data scientist needs to understand in order to do the job well. Even the term “data science” can be somewhat vague, and as the domain gains popularity, it seems to lose a clear definition.

To help those new to the field stay on top of industry jargon and terminology, we’ve put together this glossary of data science terms. We hope it will serve as your handy quick reference whenever you’re working on a project, or reading an article and find you can’t quite remember what “ETL” means.

Algorithms

An algorithm is a set of rules we give a computer so it can take values and manipulate them into a usable form. This can be as simple as giving someone directions to your apartment or as complex as developing an equation that predicts the salary of a new employee.

Big Data

Big data refers to data sets too large or complex to handle with everyday tools, and more importantly to the strategies and tools that help computers do complex analysis of very large (read: 1+ TB) data sets. The problems we must address with big data are categorized by the 4 V’s: volume, variety, veracity, and velocity.

Classification

Classification deals with categorizing a data point based on its similarity to other data points. It’s a supervised learning technique. You take a set of data where every item already has a category and look at common traits between each item. You then use those common traits as a guide for what category the new item might have.
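
A minimal sketch of this idea, assuming scikit-learn and its built-in iris flower data set:

```python
# Classification sketch: predict a category from labeled examples (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)          # every flower already has a known species label
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3)  # categorize new points by their most similar labeled neighbors
clf.fit(X_train, y_train)

print(clf.predict(X_test[:5]))             # predicted categories for a few unseen flowers
print(clf.score(X_test, y_test))           # fraction of test items categorized correctly
```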

Regression

Regression is another supervised machine learning approach, one that focuses on how a target value changes as other values within a data set change. Regression problems generally deal with continuous variables, like predicting how much a customer will spend on their next purchase.
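
A hedged sketch with scikit-learn and made-up numbers, predicting a continuous target (revenue) from another value (advertising spend):

```python
# Regression sketch: model how a continuous target changes with another value (made-up data).
import numpy as np
from sklearn.linear_model import LinearRegression

spend = np.array([[10], [20], [30], [40], [50]])   # advertising spend
revenue = np.array([25, 47, 70, 96, 118])          # continuous target value

reg = LinearRegression().fit(spend, revenue)
print(reg.coef_, reg.intercept_)   # how revenue changes as spend changes
print(reg.predict([[60]]))         # predicted revenue for a new spend value
```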

Clustering

Clustering techniques typically collect and categorize sets of data points into groups that are “sufficiently similar,” or “close,” to one another. “Close” varies depending on how you choose to measure distance, and complexity increases as more features are added to a problem space.
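
For example, k-means (one of many possible clustering techniques) groups points by their distance to a set of cluster centers; a minimal sketch with scikit-learn:

```python
# Clustering sketch: k-means groups points that are "close" in feature space (assumes scikit-learn).
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 2], [1, 4], [1, 0],
                   [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # which group each point was assigned to
print(kmeans.cluster_centers_)  # the "center" of each discovered group
```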

Supervised Machine Learning

With supervised learning, you give the computer a well-defined set of data: all the columns are labeled, and the computer knows exactly what it needs to predict. It’s similar to a professor handing you a syllabus and telling you what to expect on the final.

Unsupervised Machine Learning

With unsupervised learning, the computer builds its own intuition about a set of unlabeled data. Unsupervised machine learning finds patterns within data and usually deals with grouping items based on shared traits.

ETL (Extract, Transform, Load)

ETL describes the three stages of bringing data from numerous sources, in raw form, to a destination where it is ready for analysis. ETL pipelines are ordinarily built for us by data engineers and run behind the scenes.
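
A hedged sketch of the three stages with pandas; the file names and columns are hypothetical:

```python
# ETL sketch: extract raw data, transform it, load it somewhere analysis-ready (hypothetical files/columns).
import pandas as pd

# Extract: pull raw data from a source
raw = pd.read_csv("raw_sales.csv")

# Transform: clean and reshape it into a usable form
clean = (raw.dropna(subset=["amount"])
            .assign(amount=lambda df: df["amount"].astype(float)))

# Load: write the result to its destination (a file here; often a database or warehouse)
clean.to_csv("sales_clean.csv", index=False)
```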

Data Mining

The process of extracting actionable insight from a set of data and putting it to good use. This includes everything from cleaning and organizing the data, to analyzing it to find meaningful patterns and connections, to communicating those findings to stakeholders.

Data Exploration

This is the process by which a data scientist asks basic questions that help her understand the context of a data set, using visual exploration tools to learn what is in the data and what its characteristics are, rather than relying on traditional data management systems.
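
A quick sketch of what those basic questions look like in code, assuming pandas and matplotlib and a hypothetical customers.csv file:

```python
# Data exploration sketch: basic questions about a hypothetical data set.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customers.csv")

print(df.shape)    # how many rows and columns are we dealing with?
print(df.dtypes)   # what kind of data is in each column?
print(df.head())   # what do a few raw rows actually look like?

df["age"].hist()   # a quick visual look at how one column is distributed
plt.show()
```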

Variance

The variance of a set of values measures how spread out those values are. Mathematically, it is the average squared difference between individual values and the mean of the set. The square root of the variance gives us the standard deviation, which is more intuitively useful.
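
For example, computed by hand with made-up numbers and checked against NumPy:

```python
# Variance and standard deviation: spread of values around the mean.
import numpy as np

values = np.array([2, 4, 4, 4, 5, 5, 7, 9])
mean = values.mean()

variance = ((values - mean) ** 2).mean()   # average squared difference from the mean
std_dev = variance ** 0.5                  # square root of the variance

print(variance, np.var(values))   # 4.0 both ways
print(std_dev, np.std(values))    # 2.0 both ways
```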

Correlation

Correlation is a measure of how much one set of values depends on another. If values increase together, they are positively correlated. If values from one set increase as the other decreases, they are negatively correlated. There is no correlation when a change in one set has nothing to do with a change in the other.
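
A small illustration with NumPy and made-up numbers:

```python
# Correlation sketch: positive, negative, or none (made-up numbers).
import numpy as np

hours_studied = np.array([1, 2, 3, 4, 5])
exam_score    = np.array([52, 60, 65, 74, 80])   # rises as hours_studied rises
hours_gaming  = np.array([9, 8, 6, 4, 2])        # falls as hours_studied rises

print(np.corrcoef(hours_studied, exam_score)[0, 1])    # close to +1: positively correlated
print(np.corrcoef(hours_studied, hours_gaming)[0, 1])  # close to -1: negatively correlated
```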

Data Visualization

The art of communicating significant data visually. This includes infographics, traditional plots, or even full data dashboards.

Data Journalism

This deals with telling fascinating and significant stories with a data-focused approach. It has come about naturally with more information becoming available as data. A story may be about the data or informed by data. There’s a full handbook if you’d like to learn more.

Business Intelligence (BI)

Similar to data analysis, but more narrowly focused on business metrics. The technical side of BI involves learning how to effectively use software to generate reports and find important trends. It’s descriptive, rather than predictive.

Training and Testing

These are two core stages of the machine learning workflow. When constructing a predictive model, you first offer it a set of training data so it can build understanding; then you pass the model a test set, where it applies that understanding and tries to predict a target value.
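
A minimal sketch of that split, assuming scikit-learn and its built-in wine data set:

```python
# Training/testing sketch: learn from one subset, evaluate on another (assumes scikit-learn).
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)          # build understanding from the training set
print(model.score(X_test, y_test))   # apply that understanding to unseen test data
```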

Overfitting

Overfitting happens when a model learns its training data too closely, fitting the noise along with the signal. It’s like asking a person to read a sentence while looking at the page through a microscope: the patterns that enable understanding get lost in the noise.

Underfitting

Underfitting happens when you don’t offer a model sufficient information. An example of underfitting would be asking someone to graph the change in temperature over a day while only giving them the high and the low. Instead of the smooth curve one might expect, you only have enough information to draw a straight line.

Data Engineering

Data engineering is all about the back end. These are the people who build systems that make it easy for data scientists to do their analysis. On smaller teams, a data scientist may also be a data engineer. In larger groups, engineers can focus solely on speeding up analysis and keeping data well organized and easy to access.

Quantitative Analysis

This field is highly focused on using algorithms to gain an edge in the financial sector. These algorithms either recommend or make trading decisions based on huge amounts of data, often executing in tiny fractions of a second. Quantitative analysts are often called “quants.”

Mean (Average, Expected Value)

A calculation that gives us a sense of a “typical” value for a group of numbers. The mean is the sum of a list of values divided by the number of values in that list. It can be deceiving when used on its own, so in practice we use the mean alongside other statistical values to gain intuition about our data.
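
For example, with a handful of made-up numbers:

```python
# The mean by hand: sum the values, divide by how many there are.
values = [4, 8, 15, 16, 23, 42]
mean = sum(values) / len(values)
print(mean)   # 18.0
```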

Summary Statistics

Summary statistics are the measures we use to communicate insights about our data in a simple way. Examples of summary statistics are the mean, median and standard deviation.
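
For example, with pandas and a few made-up prices:

```python
# Summary statistics sketch (made-up numbers).
import pandas as pd

prices = pd.Series([12.5, 14.0, 13.2, 55.0, 13.8, 12.9])

print(prices.mean())      # a "typical" value
print(prices.median())    # the middle value, less affected by the 55.0 outlier
print(prices.std())       # how spread out the values are
print(prices.describe())  # several summary statistics at once
```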

Time Series

A time series is a set of data that are ordered by when each data point happened. Think of stock market prices over the course of a month, or the temperature throughout a day.
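
A tiny sketch of the temperature example with pandas and made-up readings:

```python
# Time series sketch: values indexed by when they happened (made-up temperatures).
import pandas as pd

temps = pd.Series(
    [61, 58, 64, 70, 73, 68],
    index=pd.date_range("2019-03-23 06:00", periods=6, freq="3H"),
)

print(temps)                         # temperature readings throughout a day, ordered by time
print(temps.resample("6H").mean())   # average over 6-hour windows
```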

Residual (Error)

The residual is a measure of how much a real value differs from some statistical value we calculated based on the set of data. So given a prediction that it will be 20 degrees Fahrenheit at noon tomorrow, when noon hits and it’s only 18 degrees, we have an error of 2 degrees. This is often used interchangeably with the term “error,” even though, technically, error is a purely theoretical value.
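
The temperature example in code, with made-up forecasts and measurements:

```python
# Residuals: how far each actual value is from what we predicted.
predicted_temps = [20, 22, 19]   # the forecast, in degrees Fahrenheit
actual_temps    = [18, 22, 21]   # what was actually measured

residuals = [actual - pred for pred, actual in zip(predicted_temps, actual_temps)]
print(residuals)   # [-2, 0, 2]: the first forecast was off by 2 degrees
```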

Data Wrangling (Munging)

The process of taking data in its original form and “taming” it until it works better in a broader workflow or project. Taming means making values consistent with a larger data set, replacing or removing values that might affect analysis or performance later, etc. Wrangling and munging are used interchangeably.
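
A hedged sketch with pandas; the columns and cleaning rules are hypothetical:

```python
# Data wrangling sketch: tame inconsistent raw values (hypothetical columns and rules).
import pandas as pd

raw = pd.DataFrame({
    "city":  ["NYC", "nyc", "Boston", None],
    "sales": ["1,200", "950", "n/a", "700"],
})

tidy = (raw.dropna(subset=["city"])                          # drop rows we can't use
           .assign(city=lambda df: df["city"].str.upper(),   # make labels consistent
                   sales=lambda df: pd.to_numeric(
                       df["sales"].str.replace(",", "", regex=False),
                       errors="coerce")))                    # turn "n/a" into a proper missing value

print(tidy)
```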

Feature Engineering

The process of taking the knowledge we have as humans and translating it into a quantitative value that a computer can understand. For example, we can translate our visual understanding of the image of a mug into a representation of pixel intensities.
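
A small sketch of the same idea on tabular data, turning a raw timestamp into numbers a model can use (the column names are hypothetical):

```python
# Feature engineering sketch: encode human knowledge about timestamps as quantitative features.
import pandas as pd

orders = pd.DataFrame({"ordered_at": pd.to_datetime(
    ["2019-03-01 09:15", "2019-03-02 18:40", "2019-03-03 23:05"])})

# human intuition ("weekends and evenings matter") translated into numeric columns
orders["day_of_week"] = orders["ordered_at"].dt.dayofweek
orders["hour"]        = orders["ordered_at"].dt.hour
orders["is_weekend"]  = (orders["day_of_week"] >= 5).astype(int)

print(orders)
```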

Feature Selection

The process of identifying what traits of a data set are going to be the most valuable when building a model. It’s especially helpful with large data sets, as using fewer features will decrease the amount of time and complexity involved in training and testing a model. The process begins with measuring how relevant each feature in a data set is for predicting your target variable. You then choose a subset of features that will lead to a high-performance model.
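
One way to sketch this, assuming scikit-learn, is to score each feature against the target and keep the best few:

```python
# Feature selection sketch: keep only the most predictive columns (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)          # 4 features per flower

selector = SelectKBest(score_func=f_classif, k=2)
X_reduced = selector.fit_transform(X, y)   # keep only the 2 features most relevant to the target

print(selector.scores_)                    # how relevant each original feature is
print(X_reduced.shape)                     # (150, 2)
```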

Coefficient

“A number or algebraic symbol prefixed as a multiplier to a variable or unknown quantity (e.g., x in x(y + z), 6 in 6ab)” [Webster’s]. When graphing an equation such as y = 3x + 4, the coefficient of x determines the line’s slope. Discussions of statistics often mention specific coefficients for specific tasks, such as the correlation coefficient, Cramer’s coefficient, and the Gini coefficient.

Cross-Validation

When using data with a machine learning algorithm, this is “the name given to a set of methods that split up the data set into training sets and test sets. The training set is given to the algorithm, along with the correct answers, and becomes the set used to make predictions. The algorithm is then asked to make predictions for each item in the test set. The answers it gives are compared to the correct answers, and an overall score for how well the algorithm did is calculated.”
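
A hedged sketch using scikit-learn’s cross_val_score, which repeats the split-and-score process several times:

```python
# Cross-validation sketch: several train/test splits, several scores (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

scores = cross_val_score(model, X, y, cv=5)   # 5 different splits, 5 scores
print(scores, scores.mean())                  # a more stable estimate than a single split
```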

Dependent and Independent Variables

The value of a dependent variable (y) “depends” on the value of the independent variable (x). If you’re measuring the effect of different sizes of an advertising budget on total sales, then the advertising budget is the independent variable and total sales is the dependent variable.

Dimension Reduction

Dimension reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction. We can use a technique called principal component analysis to extract one or more dimensions that capture as much of the variation in the data as possible. Dimensionality reduction is most useful when your data set has a large number of dimensions and you want to find a small subset that captures most of the variation.
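
A minimal sketch of the principal component analysis mentioned above, assuming scikit-learn and its iris data set:

```python
# Dimension reduction sketch: PCA projects 4 features down to 2 (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)      # 150 flowers, 4 dimensions each

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)            # project down to the 2 principal components

print(X_2d.shape)                      # (150, 2)
print(pca.explained_variance_ratio_)   # how much variation each component captures
```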

Model

“A specification of a mathematical (or probabilistic) relationship that exists between different variables.”[grus] Because “modeling” can mean so many things, the term “statistical modeling” is often used to more accurately describe the kind of modeling that data scientists do.

Predictive Analytics

The analysis of data to predict future events, typically to aid in business planning. This incorporates predictive modeling and other techniques. Machine learning might be considered a set of algorithms to help implement predictive analytics. The more business-oriented spin of “predictive analytics” makes it a popular buzz phrase in marketing literature.
