Using Databases to Help Feed the Machine Learning Revolution

Charles Laughlin, PhD
12 min read · Feb 18, 2024


Harvesting data for machine learning. Original image by author. (SDXL 1.0 via Amazon Bedrock)

A central goal of any supervised machine learning (ML) effort is a model that generalizes well, serving its intended purpose effectively, sometimes over long periods of time. Achieving this goal requires that the data used to train the model is carefully selected and balanced so the model works well on novel, unseen data in times ahead. When training data is not properly sourced or shaped, the result can be a biased or overfit model that underperforms relative to its true potential.

Give or take, proper rigor in data preparation accounts for roughly 80% of the success of any supervised machine learning outcome. This post focuses on data preparation because of its outsized influence on modeling outcomes.

This article addresses data preparation techniques for tabular, supervised ML problem types. The main idea is to demonstrate how to use the power and scale of data lake and data warehouse platforms to sample and prepare data for ML model training. This article does not address model training, model inference, or ML operations. Ideally, the reader should already have a basic understanding of the phases of the ML lifecycle, specifically the data collection and data preparation (wrangling) phases.

Data setup

To create teachable examples of using a database to prepare machine learning (ML) data, this post uses New York City’s open-source taxi dataset as inspiration. First, Apache Parquet data files containing the taxi data were copied to Amazon S3 storage. Next, Amazon Athena table definitions were created on top of the Parquet files, effectively building a serverless data lake.
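
For reference, the sketch below shows what one such Athena table definition might look like. The column list and S3 location are illustrative assumptions, not the actual schema behind this article's examples.

-- Illustrative Athena DDL: an external table over Parquet files in S3
-- (hypothetical columns and bucket path; adapt to your own data)
CREATE EXTERNAL TABLE nyctaxi.ride_fare (
ride_id string,
fare_amount double
)
STORED AS PARQUET
LOCATION 's3://your-bucket/nyctaxi/ride_fare/';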

The SQL syntax shown within is based on Amazon Athena’s capabilities. However, the concepts can be translated to match the syntax of other database engines. Be aware of any limitations of your particular database platform. In this context, the term database means data lakes and data warehouses suitable for feeding ML systems, not transactional databases already busy serving applications.

Rather than recreating the specific NYC taxi dataset examples, readers are encouraged to apply the principles and learnings to their own data schemas and ML ambitions.

Maximizing Compute Resources in the ML Pipeline

Converting raw data into a format suitable for training machine learning (ML) models requires compute power. Up to the point of model training, an ML pipeline typically has three distinct tiers: the database tier, the data wrangling tier, and the model training tier. Ideally, these tiers should be decoupled and scale independently to provide the right compute capacity and the right chip selection at each stage of the pipeline.

Databases efficiently process very large datasets and provide basic data preparation functions like joining, filtering, sorting, and aggregating. Rather than transferring huge data volumes from the database to the data wrangling tier, a best practice is to use the database tier to complete as much of the initial data selection and preparation as reasonably possible before passing the refined dataset to later pipeline stages. This allows the data wrangling stage to operate on smaller amounts of data and focus on more advanced ML data transformations, like feature encoding and dimensionality reduction, for which databases are less well suited.

With this idea in mind, the sections that follow describe how the database can be used to remove bias and properly balance and select data for subsequent data wrangling and ML training functions.

First — Know your target and exclude the outliers

At step zero of your journey, for each of your ML use cases, I encourage you to focus squarely on the single data feature you are predicting. Whether you call it your dependent variable, y-value, target variable, or otherwise, start with your attention on this feature.

In this article, I will refer to it as the target value. The goal here is to observe statistics about the target value and exclude outliers early, before the data reaches the wrangling or modeling stages.

For example, here I will simulate predicting taxi fare amounts, a continuous numeric variable, framed as a regression problem. By querying a database of over 34 million historic taxi rides, we can jointly observe the range, average, and standard deviation of fares. As seen in Figure 1, fares range from -$450 to over $943,000, with most fares centered around an average of $13. However, the variance is high, as seen in the large standard deviation of $220.

With this exploratory data analysis (EDA), we have now identified extreme fare amount outliers. Removing extreme outliers early in our ML workflow, directly in the database query, will help improve model training and predictive performance later.

SELECT 
count(1) as observation_count,
min(fare_amount) as min_fare,
max(fare_amount) as max_fare,
round(avg(fare_amount),2) as avg_fare,
round(stddev_pop(fare_amount),2) as sd_fare,
round(skewness(fare_amount),2) as skewness_fare,
round(kurtosis(fare_amount),2) as kurtosis_fare
FROM (
SELECT
rf.fare_amount
FROM nyctaxi.ride_fare rf
JOIN nyctaxi.ride_info ri
on (rf.ride_id=ri.ride_id)
WHERE pickup_at >= date_add('year', -5, current_date)
)
Figure 1 - Amazon Athena screenshot depicting key statistics about fare amount

Business Rule Outlier Exclusion Method

You may notice there are negative fare amounts in the data. This seems unlikely to reflect real taxi trips, since passengers do not typically receive refunds for trips. Before excluding this data from your machine learning model, consider whether it could represent valid scenarios, like adjustments made after complaints. Think through the implications of removing records with negative fares; doing so may introduce bias. The main consideration is whether business logic can appropriately filter anomalous records before model training, without hampering model performance or fairness for production use cases. This requires thoughtful analysis of the data’s quirks and their impacts. If you elect to filter negatively valued records, you can do so by simply adding a new predicate to your SQL WHERE clause, as shown below.

...
WHERE pickup_at >= date_add('year', -5, current_date)
and fare_amount>0
...

Statistical Outlier Exclusion Method

Figure 1 showed there was at least one fare amount over $943,000 for a single trip. This is likely an error or outlier. A common practice for handling outliers is to remove values more than 3 standard deviations from the mean. For example, the following SQL code computes the lower and upper fare amount boundaries. Figure 2 reveals lower and upper limits of -$669.33 and $696.02, respectively. Any fares outside this range would be considered outliers and potentially removed from the data extracted for model training.

SELECT
avg(fare_amount) - (3 * stddev_pop(fare_amount)) as outlier_lower_limit,
avg(fare_amount) + (3 * stddev_pop(fare_amount)) as outlier_upper_limit
FROM nyctaxi.ride_fare rf
JOIN nyctaxi.ride_info ri
on (rf.ride_id=ri.ride_id)
WHERE pickup_at >= date_add('year', -5, current_date)
Figure 2 - Outlier boundaries from the mean ±3 sigma statistical method

Now that outlier boundaries are available, as shown in Figure 2, you can add extra conditions to your SQL WHERE clause, as demonstrated below. You should carefully select which treatments to apply to the upper and lower boundaries. Here, a combination of business and statistical rules is used, and the two overlap on the lower boundary. Rather than hard-coding these values, you should develop a plan to calculate and use them dynamically, as variables or as inline views, in all subsequent queries inside your workflow; one way to do this is sketched after the snippet below.

...
WHERE pickup_at >= date_add('year', -5, current_date)
and fare_amount > 0 -- business rule outlier removal
and fare_amount > -669.33 -- mean - 3sigma outlier removal
and fare_amount < 696.02 -- mean + 3sigma outlier removal
...
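
As mentioned above, hard-coding boundary values invites drift as new data arrives. A minimal sketch of the inline-view approach follows; it computes the boundaries in a common table expression (CTE) and applies them in the same query, assuming the same tables and date filter used throughout this article.

WITH outlier_bounds AS (
SELECT
avg(fare_amount) - (3 * stddev_pop(fare_amount)) as lower_limit,
avg(fare_amount) + (3 * stddev_pop(fare_amount)) as upper_limit
FROM nyctaxi.ride_fare rf
JOIN nyctaxi.ride_info ri
on (rf.ride_id=ri.ride_id)
WHERE pickup_at >= date_add('year', -5, current_date)
)
SELECT rf.fare_amount, ri.trip_distance
FROM nyctaxi.ride_fare rf
JOIN nyctaxi.ride_info ri
on (rf.ride_id=ri.ride_id)
CROSS JOIN outlier_bounds b -- single-row CTE, safe to cross join
WHERE pickup_at >= date_add('year', -5, current_date)
and rf.fare_amount > 0 -- business rule outlier removal
and rf.fare_amount > b.lower_limit -- mean - 3sigma, computed dynamically
and rf.fare_amount < b.upper_limit -- mean + 3sigma, computed dynamically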

A Recap on Outlier Handling

Using SQL, we have now handled outliers in the target variable, such as the extremely high $943,000 fare. This prepares the data for training an ML model, so the model will focus predictions on reasonable fares, around $700 or less, rather than being skewed by outlier fares. With outliers addressed, the rest of the article demonstrates methods for carefully selecting appropriate data records to train the ML model.

Data sampling strategies

Your organization may have years of accumulated data containing millions to many billions of rows. For any tabular ML model you are building, the goal is for the model to generalize well on future, unseen data. Even if you could train a model with all your data, doing so may be cost-prohibitive or computationally prohibitive. One common practice is to use a sampling technique to strategically reduce the available data and build an ML model from the subset. The hope is that your sample accurately represents the entire population, allowing effective models to be built with less time and computational investment. In the subsections that follow, I will cover random and stratified sampling techniques.

Simple Random Sampling

As the name implies, this sampling method arbitrarily produces a sample from the entire dataset, giving each observation an equal opportunity to be included in the downsampled set. Your first question may be: how large should my randomized sample be? In truth, there is no exact answer that universally fits all use cases. Start by picking an arbitrary point, perhaps 1% to 10% or more of your dataset, and scale up to a larger sample only as needed.

During the model training phase, the sampled data is split further into training and test datasets. If you have a proper sample size, loss metrics from the training set should be fairly close to their counterparts from the test set. If there is a large divergence in these loss metrics, a larger randomized sample population may help close the gap. At some point, when the training and test measures converge, you have a good indication the sample size is adequate. You can also perform model predictions against other randomized slices of historical data, as a proxy, to help determine model fitness.
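
Incidentally, nothing prevents you from producing that split in the database tier as well. The following is a minimal sketch, assuming an 80/20 split; the data_split column name and 0.8 threshold are illustrative choices, and because random() is evaluated per query run, you should persist the result if you need a stable split.

SELECT
rf.fare_amount, ri.trip_distance,
case when random() < 0.8 then 'train' else 'test' end as data_split -- approximate 80/20 split
FROM nyctaxi.ride_fare rf
JOIN nyctaxi.ride_info ri
on (rf.ride_id=ri.ride_id)
WHERE pickup_at >= date_add('year', -5, current_date)
and fare_amount > 0 -- business rule outlier removal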

In the example SQL that follows, four independent ideas are depicted. Together, these concepts can produce a stronger outcome than any of the ideas alone.

  • First, the TABLESAMPLE method uses Bernoulli sampling to identify a percentage of rows (here 5%) from the base tables. This method reduces the number of rows joined, which lessens the computational burden, though it does not avoid block scans. Note that sampling both sides of a join compounds: with 5% per table and a one-to-one match, only about 0.25% of ride pairs survive the join. If your database does not support this function, your outcome will not be impacted; this feature only helps optimize runtime.
  • Second, the candidate rows are sorted in random order using the random() function shown in the ORDER BY clause. Randomizing the order of rows can remove biases that might exist in the physical ordering of the data.
  • Third, you should add a LIMIT clause, which restricts the final number of rows to the value specified (here 100,000 rows). Keeping the LIMIT in the same query block as the ORDER BY helps ensure the engine honors the random ordering when selecting rows. This LIMIT feature is especially important when you do not already sample your data by other means, such as with the TABLESAMPLE method. Without TABLESAMPLE, this becomes your primary tool for controlling the sampled dataset size.
  • Last, add appropriate WHERE clauses to filter out data observations from your original tables, according to business rules. If you have no specific business rules, consider the relevancy of your data. Ask yourself: “do old records help your cause of predicting tomorrow’s outcome?” For example, in the NYC taxi cab use case, fare amounts from 2009 might not be as relevant as fare amounts from more recent years due to inflation and other factors. In this example, the WHERE clause limits your dataset to observations from the last 5 years. Filters for extreme outliers are included here too.
SELECT fare_amount, trip_distance, ...
FROM (
SELECT
rf.fare_amount, ri.trip_distance, ...
FROM nyctaxi.ride_fare rf TABLESAMPLE bernoulli(5)
JOIN nyctaxi.ride_info ri TABLESAMPLE bernoulli(5)
on (rf.ride_id=ri.ride_id)
WHERE pickup_at >= date_add('year', -5, current_date)
and fare_amount > 0 -- business rule outlier removal
and fare_amount > -669.33 -- mean - 3sigma outlier removal
and fare_amount < 696.02 -- mean + 3sigma outlier removal
ORDER BY random()
LIMIT 100000 -- keep LIMIT with ORDER BY so the random ordering is honored
)

Stratified Sampling

Stratified sampling is a strategy used to ensure an ML model’s training data represents all categories or classes evenly. This is especially important for classification problems where certain classes are underrepresented or rare compared to others. For example, consider building a fraud detection model where only 1% of observations are fraudulent while 99% are normal transactions. Using stratified sampling, you could undersample the majority normal class or oversample the minority fraud class to balance the training data. This helps prevent bias toward the majority class.

To implement stratified sampling, you first analyze the distribution of the target classes in the full dataset. For instance, the SQL query that follows computes statistics on the “rate_code_id” variable from the taxi dataset, showing the count and average fare for each code. The results in Figure 3 reveal an imbalanced distribution: one rate code dominates with over 32 million observations, while other codes have far fewer.

When building a classification model to predict rate codes, stratified sampling can help construct a training set that better represents all the rate code classes. Overall, stratified sampling is a useful technique for creating fair, balanced training data. It helps ensure your model learns equally well across all classes of interest.

SELECT 
rate_code_id,
count(fare_amount) as observation_count,
round(avg(fare_amount),2) as avg_fare,
round(stddev_pop(fare_amount),2) as sd_fare
FROM nyctaxi.ride_fare rf
JOIN nyctaxi.ride_info ri
on (rf.ride_id=ri.ride_id)
WHERE pickup_at >= date_add('year', -5, current_date)
and fare_amount > 0 -- business rule outlier removal
and fare_amount > -669.33 -- mean - 3sigma outlier removal
and fare_amount < 696.02 -- mean + 3sigma outlier removal
GROUP BY rate_code_id
Figure 3 - Information about rate code identifier (rate_code_id)

Majority Class Downsampling
The following SQL code assigns a row number in random order within each “rate_code_id”, aliased as “group_row_limit”. After this step, a WHERE clause limits the number of rows per class to 75,000 observations. This downsampling causes four of the six classes to contain approximately equal numbers of observations.

Figure 4 displays statistics on the downsampled dataset. Notably, “rate_code_id=1” has a mean of $11.76 and a standard deviation of $8.48 among the 75,000 randomized samples. These statistics closely match the original full dataset in Figure 3 (32M samples), which had a mean of $11.81 and a standard deviation of $8.52. This suggests that a model trained on the subset will produce results similar to a model trained on the full dataset, while requiring fewer computational resources.

select
rate_code_id,
count(1) group_observation_count,
min(fare_amount) as min_fare_amount,
max(fare_amount) as max_fare_amount,
avg(fare_amount) as avg_fare_amount,
round(stddev_pop(fare_amount),2) as sd_fare_amount
from
(select
fare_amount, vendor_id, trip_distance, rate_code_id,
row_number() over (partition by rate_code_id order by random()) AS group_row_limit -- random order within each class
from
(select
fare_amount, vendor_id, trip_distance, rate_code_id
FROM
(SELECT
rf.fare_amount, ri.vendor_id, ri.trip_distance, ri.rate_code_id
FROM nyctaxi.ride_fare rf
JOIN nyctaxi.ride_info ri
on (rf.ride_id=ri.ride_id)
WHERE pickup_at >= date_add('year', -5, current_date)
and fare_amount > 0 -- business rule outlier removal
and fare_amount > -669.33 -- mean - 3sigma outlier removal
and fare_amount < 696.02 -- mean + 3sigma outlier removal
)
))
where group_row_limit<=75000 -- arbitrary downsample
group by rate_code_id
order by rate_code_id
Figure 4 - Information about rate code identifier (rate_code_id), after downsampling

Minority Class Oversampling
When working with imbalanced datasets, where some classes have many more examples than others, it can be helpful to synthetically increase the number of rare-class examples. One basic SQL approach, shown here, simply duplicates existing rare-class rows multiple times. The original data, as an inline view, is cross joined with the UNNEST and SEQUENCE functions to duplicate each rare-class row 100 times, a Cartesian product by design.

While this approach can help artificially balance the class distribution, it does not add any new information. More advanced techniques like SMOTE create synthetic examples by interpolating between existing examples in your data wrangling tier. Though more computationally expensive, SMOTE adds diversity that simple duplication lacks. Data augmentation techniques can improve model performance on imbalanced datasets, but they should be applied thoughtfully.

select
fare_amount, vendor_id, trip_distance, rate_code_id
from
(SELECT
rf.fare_amount, ri.vendor_id, ri.trip_distance, ri.rate_code_id
FROM nyctaxi.ride_fare rf
JOIN nyctaxi.ride_info ri
on (rf.ride_id=ri.ride_id)
WHERE pickup_at >= date_add('year', -5, current_date)
and fare_amount > 0 -- business rule outlier removal
and fare_amount > -669.33 -- mean - 3sigma outlier removal
and fare_amount < 696.02 -- mean + 3sigma outlier removal
and rate_code_id=6
) inline_view CROSS JOIN UNNEST(SEQUENCE(1,100,1)) AS cartesian_effect (copy_number) -- duplicates each row 100 times
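
To assemble the final balanced training extract, the downsampled majority classes and the oversampled minority class can be stitched together with UNION ALL. A minimal sketch follows, assuming each of the two queries above has been materialized as a view; the view names are hypothetical placeholders.

-- hypothetical view names standing in for the two queries above
SELECT fare_amount, vendor_id, trip_distance, rate_code_id
FROM nyctaxi.majority_downsampled
UNION ALL
SELECT fare_amount, vendor_id, trip_distance, rate_code_id
FROM nyctaxi.minority_oversampled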

Conclusion

This article discussed strategies for preparing data to train machine learning models, with a focus on using databases to handle large datasets. It emphasized the importance of data preparation, which can account for up to 80% of the success in supervised machine learning. It demonstrated techniques like excluding outliers, sampling data, and balancing classes to remove bias and improve model generalization. Specific methods covered include detecting and filtering outlier values, simple random sampling to create subsets, and stratified sampling to balance classes.

The article aims to inspire readers with proven patterns for effectively shaping large datasets early, before passing them to later stages in an ML pipeline. Proper data preparation can reduce computational complexity and time to market, and can help machine learning models achieve their full potential.

Author’s comments

This is my first article on Medium. I hope it provides value to you in some way along your own journey. I welcome comments with additional questions or differing perspectives.

My next article discusses handling date datatypes during data preparation. If my writing has helped you, please consider connecting on LinkedIn or following on Medium so you are notified as new writings are released.

Disclaimers

  • Opinions and thoughts within are the author’s own and do not reflect the values or endorsements of any organization, implied or otherwise.
  • All suggestions within are offered generically, as the art of the possible. The reader should understand the implications of each recommended technique and approach them as A/B tests to determine whether the treatment improves efficiency without compromising model accuracy or business outcomes. Correct testing can help determine thresholds, split points, and the quantity of data needed.
  • The reader may still need to extract slices of raw data to perform exploratory data analysis (EDA) to help formulate the exact SQL treatments proposed within. EDA may also help determine which treatments are used and, importantly, their order of operation.
  • Consider any path to success as a series of ordered and meaningful steps, with progress made through hypothesis testing.

