Randomizing Data In SQL
Generate randomized test/train datasets in SQL using NASA JPL dataset
Sequel can be used to fetch random records from a given dataset. One can easily use the
rand() function to generate random floating point values between 0 and 1. I recently wrote about how to generate test data using just SQL. With this function, a lot of useful things can be done. For example,
- Disordering a dataset (or randomly ordering a dataset)
- Generating a random number & a random number within a given range
- Splitting a dataset into two or more parts (for testing & training)
- Generating random strings (when used in conjunction with
Picking a Random Satellite from a Random Planet
A simple example showing the use of
rand() in a subquery — in this example, the purpose is to pick a random satellite from a random planet every time the query is run. This, however, doesn’t guarantee that two subsequent results of this query aren’t the same. The query obviously doesn’t have any memory of what the result on the last run was.
Pick A New Planet/Satellite Is Every time (Random & Unique)
The answer is simple, create a database object which remembers the last result. In this case, we’re creating a table called
temp_l_satellite. You can create this table on disk or in memory. See that I have used two defensive coding techniques wherein I have defined a primary key in the temporary table and I have also used
not exists condition in the
where clause, it’s not possible to get duplicates in the table. Still, it’s always best to make the query as fool proof as possible.
insert ignore won’t throw an error if we try to insert a duplicate record into the table. Duplicate will be checked only based on the
id as it is the primary key.
Generating Test/Train Data From A Given Dataset
A split would require
rand() operation on the complete dataset. For the sake of having the data really, truly random, we’ve used two queries where data is being filtered based on the output of
rand(). The result of
union of both the datasets gives us a subset of the complete dataset, which is then ordered by
rand() to give it another degree of randomness. To make sure that all the records are unique, we’ve used
limit 100 is hardcoded in the above query but it can be parameterized if we use a stored procedure. MySQL doesn’t allow passing a
@limit_num variable to the
The use of
rand() comes with a cost. Ordering by
rand() in a production environment is never recommended as it is one of the slowest queries you can run on a SQL database. A random order implies that the function would have to go through all records and assign them a random number based on some internal assignment mechanism. But as mentioned above, it does solve for some very specific use cases. Use it, but be careful.