Linear Regression Made Easy: Python vs BigQuery Comparison!

Do you want to create your own Machine Learning algorithms, but you're not sure where to start? How hard is it to set up a basic algorithm? Read on to learn how to set up a small linear regression model in Python or with SQL in BigQuery.

But first

This article has overlap with my previous article. This time we’ll look at a different Machine Learning algorithm and use different data, which will add a couple of challenges that the previous article did not run into. Curious about the previous article? Check it out here.

Linear Regression

Linear regression, that’s the thing with the straight line, right? Well yes, it indeed has something to do with a straight line. In short: linear regression is a way to find the straight line that best represents the relationship between two or more things (this line is called ‘the line of best fit’). It can help predict one thing based on another. For example, based on someone’s age I can try to predict that person’s height (up to a certain age, at least).
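To make the age-and-height example concrete, here is a minimal sketch of fitting a line of best fit with NumPy (the ages and heights are made-up toy values):

```python
import numpy as np

# Toy data (made up): ages in years and heights in cm
ages = np.array([2, 4, 6, 8, 10, 12])
heights = np.array([86, 102, 115, 128, 139, 150])

# Fit a straight line: height = slope * age + intercept
slope, intercept = np.polyfit(ages, heights, deg=1)

# Use the fitted line to predict the height of a 7-year-old
predicted = slope * 7 + intercept
```

The two fitted numbers fully describe the line; predicting for a new age is just plugging it into the line's equation.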

In this blog we will look at data about used cars, and try to predict the price we can sell our used car for.

Data analysis

All right, data about cars. I have to admit, I know very little about cars and quite frankly don’t care much about them either (sorry to all car fanatics), so if I state something wrong, please forgive me. Anyway, the dataset we are using has information on 7,819 used cars with some attributes. Once again I took this data from Kaggle; you can download and play with it yourself over here. (Note: I simplified the data slightly, taking only a couple of columns for demo purposes and filtering out some rows as well.) The data looks like this.
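To give you an idea of the shape of the data, here are a couple of illustrative rows (the car names and values are made up; only the column layout follows the dataset):

```python
import pandas as pd

# A mock-up of the dataset's structure: name, km_driven, mileage,
# engine and selling_price (values here are invented for illustration)
sample = pd.DataFrame({
    "name": ["Maruti Swift Dzire VDI", "Hyundai i20 Sportz"],
    "km_driven": [145500, 127000],
    "mileage": ["23.4 kmpl", "18.6 kmpl"],
    "engine": ["1248 CC", "1197 CC"],
    "selling_price": [450000, 330000],
})
print(sample)
```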

You’ll see that we have a bunch of car names, with some information about them: the number of kilometers they’ve driven, the mileage, and the size of the engine. Lastly, we see the price they’ve been sold for.

Now, even I can see that something weird is going on with the selling_price: the cars seem very expensive. The reason is that this dataset lists prices in Indian Rupees, which currently exchange at roughly 90 Rupees to 1 Euro. We’ll fix this in the next section.

Python

Okay, this is the moment you’ve all been waiting for. Python code! Again, the same disclaimer holds that was stated in the previous article: During these technical blocks I’ll generalize and simplify a lot.

So, let’s load the data into Python and fix the selling_price issue. With an exchange rate of 90 Rupees to 1 Euro, we can simply divide selling_price by 90. While we’re at it, you might notice that the mileage and engine columns contain both numbers and text. Generally, if possible, it’s easier to work with only the numbers. So, let’s remove the ‘kmpl’ and the ‘CC’ and then save these columns as numeric columns.

import numpy as np
import pandas as pd

data = pd.read_csv("Car_data.csv")
# Convert the price from Rupees to Euros
data['selling_price'] = data['selling_price'] / 90

# Strip the units and set mileage and engine to floats
data['mileage'] = data['mileage'].str.replace('kmpl', '')
data['mileage'] = data['mileage'].astype(np.float64)
data['engine'] = data['engine'].str.replace('CC', '')
data['engine'] = data['engine'].astype(np.float64)

Before we continue, let’s step back and think about what we’re going to do. The goal is to predict the price of a used car. We’re going to assume that the columns we use all play a role in determining the price of a used car. This includes the column name, which, well, contains the names of the cars. You might notice that it’s the only column without numbers in it.

And this creates a challenge: generally, Machine Learning models work with fancy mathematics under the hood, a.k.a. numbers. Numbers are nice; they make sense. Text is (much) harder for machines to interpret. Even worse, most ML models don’t even accept text as input. So how do we get around this, since we still want to include the name of the car as an input variable?

There are multiple ways to handle this. For simplicity’s sake, we’re going to use a fairly easy way to transform the text into numbers: One Hot Encoding. I could write a whole blog about what One Hot Encoding is and how it works, but that’s not the scope of this blog. So for now, all I’m going to give you on this topic is that it transforms text into vectors of zeros and ones in a way that ‘makes sense’. (I encourage you to check out on your own what One Hot Encoding exactly is, or you can let me know if you want me to make a blog post about it :D). In Python it’s very easy: we just import a package and write two lines of code.

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)
X_encoded = encoder.fit_transform(data[['name']])

The result of this code is a variable called X_encoded, which is going to replace the data['name'] column. X_encoded is a so-called ndarray (or N-dimensional array), which you could interpret as a table full of zeros and ones. Here’s a small snippet.

Small snippet of X_encoded

All right, that was a hassle, but from this point on we’re done with the pre-processing steps (which often are a larger chunk of the work than actually training a Machine Learning model). Time to bring everything together and start training a model!

In the previous article, I had two datasets: one for training and one for testing the model. This time we don’t have a separate test set, so we’ll have to create it ourselves. In this block, we gather all our input (the numeric columns and X_encoded) and randomly take out 20% of the rows. We’ll use that 20% to test, after the model has trained on the remaining 80%, how well it performs on unseen data. (Note: the 80% is called X_train and y_train; the 20% is called X_test and y_test.)

import numpy as np
from sklearn.model_selection import train_test_split

# Define the numeric columns used
X_numeric = data[['km_driven', 'mileage', 'engine']]

# Combine the encoded features with the numeric features
X = np.hstack((X_encoded, X_numeric.values))
y = data['selling_price']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Now we can finally import a Python package for a Linear Regression model, train it, and test against the 20% of the unseen data.

from sklearn.linear_model import LinearRegression

# Create a linear regression model
model = LinearRegression()
# Train the model on the training data
model.fit(X_train, y_train)

Let’s conclude by evaluating the result. As in the last article, we’ll just look at the RMSE and MAE. Keep in mind that these two metrics alone are generally not enough to determine whether your model is good (enough) for a real-life application. Anyway, let’s predict the monetary value for the 20% of held-out rows and compare it with their actual prices.

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Make predictions on the test data
y_pred = model.predict(X_test)

# Calculate performance metrics
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

This results in an MAE of 1182 and an RMSE of 3037. In other words, on average, our model predicts a price that is roughly 1,182 euros off the actual sold value. Pretty decent, right? Let’s head to BigQuery now!

Google BigQuery

The good news: this section will be significantly shorter (in a good way). Once again, we’ll upload our data to BigQuery, which you can do easily via the UI (or with some SQL/Python code, whatever you fancy). To train the model, we only need a very simple SQL query (don’t forget to divide the selling_price by 90 to get it into Euros).
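As a sketch, the training query could look something like this (the dataset and table names, `my_dataset.car_data` and `car_price_model`, are placeholders; in BigQuery ML a linear regression is requested with model_type = 'linear_reg'):

```sql
-- Train a linear regression model directly on the table
CREATE OR REPLACE MODEL `my_dataset.car_price_model`
OPTIONS (
  model_type = 'linear_reg',
  input_label_cols = ['selling_price']
) AS
SELECT
  name,
  km_driven,
  mileage,
  engine,
  selling_price / 90 AS selling_price  -- Rupees to Euros
FROM `my_dataset.car_data`;
```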

Here we see a huge benefit of Google BigQuery coming to life. I’m going to give you a minute to see what we did here differently (or rather, did not do at all), compared to our Python code.

That’s right: we did nothing with mileage and engine (no removing ‘kmpl’ and ‘CC’, no converting the columns to numeric types). But wait, we also did nothing (!!) with the name column. No One Hot Encoding, or any other tool to transform text into numbers. BigQuery takes care of all of this for us, without even bothering us about it. This makes BigQuery very nice, especially for first-timers in the world of Machine Learning, who might not even have heard of One Hot Encoding (or similar methods) or realize it is needed.

Okay, but is BigQuery then also any good when looking at the results? Sure, it performs pretty much the same as our Python logic, check it out:
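As a sketch, assuming the model name from the training query above, the evaluation could look like this (ML.EVALUATE returns, among other metrics, the mean absolute error and the mean squared error):

```sql
-- Evaluate the trained model; take the square root of MSE to get RMSE
SELECT
  mean_absolute_error,
  SQRT(mean_squared_error) AS rmse
FROM ML.EVALUATE(MODEL `my_dataset.car_price_model`);
```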

Which gives us an MAE of 1192 and RMSE of 2392. Awesome, that’s pretty much the same as Python!

(Careful) Conclusion

I’m going to repeat the first part of the conclusion as stated in the previous article:

This conclusion is not going to call a winner (this blog never intended to do so, and frankly, it depends on your use case). It would also be impossible based on the above, as it is just one test case, with frankly very little data. But there are some (comparative) takeaways we can draw. (The last point is copied from the previous article, as it generally holds regardless of the ML model you want to train.)

  • Both Python and SQL code in BigQuery ML can train and predict with Linear Regression models. This is nice, since it allows a wide range of people with different skills to set something up.
  • BigQuery ML does a lot in the background, without the user even needing to have any knowledge of some steps you generally take when using Python. The main one visible in this blog is the encoding of the name column, but also removing and cleaning mileage and engine are tasks that BigQuery did for you. This makes it powerful as a starting/learning point for Machine Learning, without knowing much about it.
  • One negative note: in my test, BigQuery took longer (about a minute) to actually train the model. This is partly because it’s ‘fixing’ the data for us first, which we explicitly did ourselves in Python. Still, all of the above Python steps took my machine about 5 seconds in total.
  • Assuming your data is already stored in BigQuery, BigQuery ML offers a huge benefit: no data movement is needed, no compute engine has to be set up to run your Python script, and much more. It’s generally far less of a hassle, architecture-wise, to get a running Machine Learning model in BigQuery ML. In most cases it could also be the cheaper option.

Want to learn more about BigQuery or BigQuery ML? Via Incentro I offer different (online) classes around SQL and BigQuery for all levels. Interested? Check out this link.
