
DATA ENGINEERING
Randomizing Data In SQL
Generate randomized test/train datasets in SQL using NASA JPL dataset
Sequel can be used to fetch random records from a given dataset. One can easily use the rand()
function to generate random floating point values between 0 and 1. I recently wrote about how to generate test data using just SQL. With this function, a lot of useful things can be done. For example,
- Disordering a dataset (or randomly ordering a dataset)
- Generating a random number & a random number within a given range
- Splitting a dataset into two or more parts (for testing & training)
- Generating random strings (when used in conjunction with
md5()
There are a lot of other interesting use cases where rand()
fits in. For the scope of this writeup, we’ll use a planet & satellite dataset hosted by Devstronomy. Here’s the DDL for the two tables —
Picking a Random Satellite from a Random Planet
A simple example showing the use of rand()
in a subquery — in this example, the purpose is to pick a random satellite from a random planet every time the query is run. This, however, doesn’t guarantee that two subsequent results of this query aren’t the same. The query obviously doesn’t have any memory of what the result on the last run was.
Pick A New Planet/Satellite Is Every time (Random & Unique)
The answer is simple, create a database object which remembers the last result. In this case, we’re creating a table called temp_l_satellite
. You can create this table on disk or in memory. See that I have used two defensive coding techniques wherein I have defined a primary key in the temporary table and I have also used insert ignore
.
With the not exists
condition in the where
clause, it’s not possible to get duplicates in the table. Still, it’s always best to make the query as fool proof as possible. insert ignore
won’t throw an error if we try to insert a duplicate record into the table. Duplicate will be checked only based on the id
as it is the primary key.
Generating Test/Train Data From A Given Dataset
A split would require rand()
operation on the complete dataset. For the sake of having the data really, truly random, we’ve used two queries where data is being filtered based on the output of rand()
. The result of union
of both the datasets gives us a subset of the complete dataset, which is then ordered by rand()
to give it another degree of randomness. To make sure that all the records are unique, we’ve used union all
.
The clause limit 100
is hardcoded in the above query but it can be parameterized if we use a stored procedure. MySQL doesn’t allow passing a @limit_num
variable to the limit
clause.
Beware
The use of rand()
comes with a cost. Ordering by rand()
in a production environment is never recommended as it is one of the slowest queries you can run on a SQL database. A random order implies that the function would have to go through all records and assign them a random number based on some internal assignment mechanism. But as mentioned above, it does solve for some very specific use cases. Use it, but be careful.