Why Data Analysts should learn a little Data Science

Chris Bruehl
Learning Data
7 min read · Feb 29, 2024


Photo by Jametlene Reskp on Unsplash

Data analysts and data scientists work together frequently, but where exactly is the line that separates them?

Most folks would say data scientists have stronger backgrounds in coding, math, and statistics, and I would agree.

But is it fair to say that analysts (or data professionals as a whole) could benefit from an understanding of data science techniques?

In my opinion, absolutely! While the average analyst may not have the same depth of knowledge of math and statistics as the average data scientist, I would argue analysts have a huge incentive to augment their skills with data science techniques.

Let’s take a look at a few data science techniques that analysts can use to increase the impact of their research.

Isolating Impact With Linear Regression

Of all ML algorithms, linear regression is likely the most well known. If you took a stats or econ class in college, you likely built a linear regression model.

Linear regression takes the generic form:

y = β0 + β1x1 + β2x2 + … + βnxn

y is the variable of interest, or target variable, and the right-hand side of the equals sign is the equation that predicts our target.

Our first term, β0, represents the intercept of our line: the predicted value of y when all variables are equal to zero. The next term, β1x1, applies the slope (beta) our model estimated for the feature x1.

While building a statistically sound linear regression can take a ton of effort, analysts don’t necessarily need to be as rigorous. The beautiful thing about linear regression is that it allows us to isolate the impacts of variables on a given metric of interest, after accounting for other features.

So when analysts get stuck on a question like “I have 10 variables that are all positively related to sales, but which one is most impactful?”, linear regression is a powerful tool for finding the answer.

Below, we have fitted model coefficients for predicting insurance premiums based on demographic variables.

A customer’s age, BMI (body mass index), and number of children were all positively correlated with price. But linear regression allows us to understand how each relates to price after accounting for the others. In our model output, we can see our coefficient estimates (in the ‘coef’ column) and their p-values (in the ‘P>|t|’ column), which tell us whether each estimate is statistically significant.

This allows us to understand that a one-year increase in age corresponds to a premium increase of roughly $240, and a one-unit increase in BMI corresponds to an increase of roughly $332, holding the other variables constant.
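For reference, fitting a model like this takes only a few lines of Python. Here’s a minimal sketch using statsmodels; the file name and column names are assumptions based on the example above:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed: an insurance dataset with a premium column ('charges')
# and demographic columns ('age', 'bmi', 'children')
df = pd.read_csv("insurance.csv")  # hypothetical file name

# Fit an OLS model predicting charges from the demographic features
model = smf.ols("charges ~ age + bmi + children", data=df).fit()

# The summary includes the 'coef' and 'P>|t|' columns discussed above
print(model.summary())
```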

Being able to confidently isolate the impact of variables will sharpen any analysis and allow you to make better informed decisions.

Feature Importance:

Feature importance has a similar use case to linear regression, but it is especially useful for screening large numbers of features.

Have you ever been in a situation where you needed to understand what impacts a given metric, like sales, but had dozens or more variables that could be relevant?

You don’t need to be a master of data science to fit a simple Random Forest or Gradient Boosted Machine (GBM) and get feature importances. Unlike linear regression, feature importance doesn’t measure the average size or direction of a variable’s impact, but it still tells us which features in our data are most useful for predicting the target.

This allows us to quickly home in on a handful of key features rather than getting buried in endless EDA.

Below, I’m going to fit a basic GBM, but a Random Forest from sklearn works just as well. The use case for analysts is not fine-tuning one of these models to perfection like a data scientist might; even very rough tree-based models are quite accurate, and feature importance helps us home in on the columns that matter most for predicting our target variable.

In the code below, we’re predicting whether or not someone makes over 50K USD in income. There are a ton of variables in this data. By fitting a Random Forest or GBM, we can identify which features were most useful for predicting income. We can then use this knowledge to home in on the handful of most impactful variables using our traditional exploratory analysis.
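Here’s a rough sketch of what that might look like; the file name and column layout are assumptions (any copy of the UCI Adult census data will do):

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Assumed: a local copy of the UCI Adult census data with an 'income' column
df = pd.read_csv("adult.csv")  # hypothetical file name

# One-hot encode categorical features; the target is whether income > 50K
X = pd.get_dummies(df.drop(columns="income"))
y = (df["income"] == ">50K").astype(int)

# A rough, untuned GBM is plenty for screening features
gbm = GradientBoostingClassifier(random_state=42)
gbm.fit(X, y)

# Rank features by importance to focus follow-up exploratory analysis
importances = pd.Series(gbm.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```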

Clustering:

Clustering is an unsupervised learning technique that helps reveal natural groups within our data.

Below, we have just two variables, x and y. K-Means clustering requires us to choose K, the number of clusters, or groups. Here, K=3, so if the analyst is looking for three distinct groups, K-Means will assign each point to a cluster based on how close it is to that cluster’s center.

This can help us identify groups in our data, like customer segments. For example, if we work as an analyst at a grocery chain, we might want to know what different groups of customers buy. We might be able to market to our customers more effectively if we know, for example, that one group frequently buys deli products and frozen foods, while another buys milk and household products like paper towels and laundry detergent.

Below is a 5-cluster solution for a grocery store customer base. The clusters I describe above appear as clusters ‘4’ and ‘2’, respectively.
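A quick pass at this in Python might look like the sketch below; the file name and spend-by-department columns are assumptions:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Assumed: per-customer annual spend by department (hypothetical file)
df = pd.read_csv("grocery_customers.csv")

# Scale features so no single category dominates the distance calculation
X = StandardScaler().fit_transform(df)

# Ask K-Means for 5 clusters, matching the solution described above
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
df["cluster"] = kmeans.fit_predict(X)

# Average spend per category within each cluster gives a quick profile
print(df.groupby("cluster").mean().round(1))
```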

While a well-trained data scientist or statistician may end up with a more finely tuned model, an analyst can do a quick pass at clustering and still end up with a reasonable understanding of basic customer profiles.

A/B Testing:

Finally, A/B testing is a great skill that isn’t part of most analytics curricula. While some tests can be quite complex, most A/B tests require only introductory-level statistics.

A/B tests allow us to apply a scientific framework to business decisions. We might be trying to optimize a web page design, for example. A brash decision would just be to change the site overnight and hope everyone is happy and that our sales improve. But if our site is poorly designed, we might lower sales and frustrate our customers.

The other problem is that measuring changes in performance in this type of scenario is difficult. What if marketing is running a new ad campaign? Did we recently change our product pricing?

There are many external factors that can impact sales. This makes it almost impossible to say with confidence that the new website is better or worse than our old one, because there are many other factors that affect the metric we’re focused on.

A/B tests allow us to set up a test and control approach. We might set up an experiment where only 10% of website visitors see our new page, while 90% visit the old one.

This allows us to measure the relative difference in performance while controlling for any external factors that may impact sales outside of our website change.

Below is a tool that can be used to help set up A/B tests. It relies on statistical power analysis to determine how much data we need to collect before we can confidently draw conclusions from our experiment.

https://www.evanmiller.org/ab-testing/sample-size.html
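If you’d rather run a similar calculation in Python, here’s a minimal sketch using statsmodels’ power analysis tools; the baseline conversion rate and the lift we want to detect are just assumptions:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Assumed: a 10% baseline conversion rate, and we want to detect a lift to 12%
effect_size = proportion_effectsize(0.10, 0.12)

# Sample size per group for 80% power at a 5% significance level
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, power=0.80, alpha=0.05
)
print(f"Visitors needed per group: {n_per_group:.0f}")
```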

Conclusion

All in all, learning even one of these skills can help set you apart as an analyst. The line between data analyst and data scientist is blurry, so even if something seems out of your wheelhouse, taking the time to add a new skill to your data toolbelt almost always opens new doors.

If you are interested in learning more about these topics, check out our statistics, data science, and Python courses on the Maven Analytics website or Udemy.

Ready to build practical, job-ready data skills of your own?

Spring Savings: Up to 40% off at Maven Analytics!

Create your custom learning plan today, and save up to 40% on all-access memberships when you upgrade to a paid account.

All Maven memberships include:

✓ Unlimited access to ALL courses & paths

✓ Customized learning plans

✓ Skills assessments

✓ Free practice data sets

✓ Guided projects

✓ Portfolio builder & Showcase

✓ Private student dashboard

✓ Live instructor chat support

Join today and see why we’ve earned 50,000+ perfect 5-star reviews from students around the world.

This is a limited-time deal; take advantage of the savings today!
