How Much Math Do Data Scientists Need?
Why study linear algebra, probability theory, statistics and optimization methods
Mastering the basic methods of machine learning and being able to interpret the results of built models is a must-have for a data scientist. However, in order to solve non-standard problems, it is important to understand the laws of mathematics and statistics. Let’s figure out exactly how mathematics helps data scientists and what topics you need to know.
Knowing the mathematics behind machine learning and neural networks helps you understand how these methods actually work. It also helps you process data correctly and train a model, that is, an algorithm that finds the best solution to a problem.
At the end of this post, I’ll share some great sources for learning essential math skills for data science!
The most important sections of mathematics for Data Science are:
• Linear Algebra
• Theory of Probability and Mathematical Statistics
• Mathematical analysis and Optimization methods
• Time Series
Linear Algebra
A large branch of mathematics that deals with scalars, ordered lists of scalars (vectors), rectangular arrays of numbers (matrices), and higher-dimensional arrays such as stacks of matrices (tensors).
Almost any information can be represented using a matrix. For example, an MRI scan of the brain is a stack of flat images, one per layer of the brain. Each flat image can be represented as a table of gray intensities, and the entire scan is then a tensor. For a square matrix derived from such data you can compute the spectrum, which is the set of all its eigenvalues. Spectra can then serve as features for classifying data as normal or pathological, for example to help identify whether a person has brain cancer.
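To make the idea of a tensor and a spectrum concrete, here is a minimal NumPy sketch. The "scan" and the matrix A are made-up toy values, not real MRI data, and the matrix is chosen to be symmetric only so that the eigenvalues are easy to verify by hand.

```python
import numpy as np

# A toy "scan": 3 layers, each a 4x4 table of gray intensities (made-up values).
scan = np.arange(48, dtype=float).reshape(3, 4, 4)
print(scan.shape)  # (layers, height, width)

# A small symmetric matrix chosen by hand so the eigenvalues are easy to check.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigvalsh is meant for symmetric matrices and returns eigenvalues in ascending order.
spectrum = np.linalg.eigvalsh(A)
print(spectrum)  # the eigenvalues of A are 1 and 3
```

For a general (non-symmetric) square matrix, np.linalg.eigvals would be the function to reach for, and the eigenvalues may then be complex.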
Now let’s take a business task: analyzing and predicting the profit of a store chain. An individual store can be described by a set of numbers such as its profit, the number of goods, the number of working hours per week, and its opening and closing times. This set of numbers is a vector. For the entire chain, the vectors stacked together form a table of numbers, that is, a matrix.
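The store example can be sketched in NumPy as follows. The three stores and all of their numbers are invented for illustration, and only three of the features mentioned above are included to keep the table small.

```python
import numpy as np

# Each row is one store: [weekly profit, number of goods, working hours per week].
# All values are made up for illustration.
stores = np.array([
    [12000.0, 350.0, 60.0],
    [ 8000.0, 200.0, 48.0],
    [15000.0, 500.0, 72.0],
])

one_store = stores[0]                # the vector describing the first store
column_means = stores.mean(axis=0)   # average of each feature across the chain
total_profit = stores[:, 0].sum()    # sum of the profit column

print(stores.shape)    # (number of stores, number of features)
print(column_means)
print(total_profit)
```

Operations like mean(axis=0) work on whole columns at once, which is one reason matrix thinking maps so directly onto NumPy code.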
Linear algebra is also used, in part, at large companies that develop recommender systems (for example, Instagram). Knowing matrices, their properties, and the operations on them will help you understand how NumPy library methods work and how important statistical values are calculated for big data.
Theory of Probability and Mathematical Statistics
Statistical studies are the forerunner of data science: they, too, were carried out in order to find patterns.
For example, suppose you need to determine which of two commercials is more successful. To do this, you run ads with both videos and compare the results. Suppose 1,000 users clicked on the first one and 1,100 clicked on the second. Once you also know how many users saw each ad, probability theory and statistics can help you decide whether the difference is a coincidence or a pattern. Statistical methods also make it possible to identify correlation (dependence) between variables, such as the day of the week and the number of purchases on a marketplace.
To calculate probabilities and tell which fluctuations and relationships are random and which carry meaning, you need to know about random variables, their characteristics, and their distributions. You also need to be able to test statistical hypotheses.
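As a rough illustration of hypothesis testing, the commercials example can be checked with a two-proportion z-test. The click counts come from the example above, but the number of views per ad (10,000 each) is an assumption made here, since the example does not state it. The function name two_proportion_z_test is also just a name picked for this sketch.

```python
import math

def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two-sided p-value) for the difference in click rates."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled click rate under the null hypothesis that both ads perform equally.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Click counts are from the example; 10,000 views per ad is an assumption.
z, p_value = two_proportion_z_test(1000, 10_000, 1100, 10_000)
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```

A small p-value suggests the difference is unlikely to be pure chance, but the conclusion depends heavily on the assumed view counts: the same 100 extra clicks spread over far fewer or far more views would give a different answer.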
Mathematical Analysis and Optimization Methods
Mathematical analysis is the branch of mathematics that includes differential and integral calculus.
In data analysis, it is used mainly (although by no means only) for optimization: choosing the system parameters that minimize or maximize an objective function. Almost every machine learning algorithm aims to minimize an estimation error, often under various constraints; this is an optimization task. For example, people who work on transport optimization minimize travel time, toll costs, fuel, and vehicle operating costs.
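Here is a minimal sketch of what "minimizing the estimation error" can look like in practice: gradient descent fitting a one-parameter model y = w * x by minimizing the mean squared error. The data, the learning rate, and the number of steps are all made-up choices for illustration, and gradient descent is just one of many optimization methods.

```python
# Toy data that follows y = 2 * x exactly (made-up values).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def mse(w):
    """Mean squared error of the model y = w * x on the toy data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_gradient(w):
    """Derivative of the mean squared error with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.0               # initial guess for the parameter
learning_rate = 0.01  # step size, chosen by hand for this toy problem
for _ in range(200):
    # Step against the gradient to reduce the error.
    w -= learning_rate * mse_gradient(w)

print(w, mse(w))  # w should end up very close to 2
```

The derivative in mse_gradient is where differential calculus enters: it tells the algorithm in which direction, and how strongly, to adjust the parameter.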
How Deep Do You Need to Know Math?
An Applied Data Science degree from MIT and a 4.0 GPA are definitely not required to become a data scientist. Basic knowledge may be enough for a junior specialist, but in order to grow in the industry, you will have to go deeper.
Online Sources
Here are some of the best resources for learning essential math skills for data science:
- Statistics & Mathematics for Data Science on Udemy
- Data Science Math Skills on Coursera by Duke University
- Fundamental Math for Data Science on Codecademy
Ultimately, how deep you need to go depends on how much you want to earn in this industry and how high you want to climb the data science ladder.