How to get a random line from a file? — One Pass Solution Using Python
When a random line is to be selected from a small file, we can do it pretty quickly, even with a naive algorithm. As the small file can be loaded into RAM, the selection of a random line is straightforward. But for the files of the magnitude of terabytes and petabytes, such naive approach doesn’t work. RAM can’t load the whole file at once so we need to read it from the storage devices.
Reading the file from the storage devices, one obvious approach would be to do two passes. The first pass would count the number of lines in the file. Then one line number at random is selected. After the line number is selected at random, we can do the second pass to select that line. This way we can select the random line on an average of one and a half passes. But, can we do better?
One Pass Solution
As it turns out, we can do better. We can get a random line from a file in just one pass.
In the only pass, we probabilistically choose the line such that the probability of choosing all the lines read until now is the same i.e 1 / number of lines at current.
The algorithm is as follows:
- Initialize the random line and the line number as an empty string and 0 respectively.
- Traverse lines one by one and if the random decimal number between 0 and the line number is less than or equal to 1, substitute the current line as the random line. Increase the line number in each iteration.
- Return the random line.
Proof of Correctness
Let’s see the proof of correctness of the algorithm. We do the proof using the method of induction.
For n = 1, the first line, the random number generated between 0 and 1 is always less than or equal to 1. So, it would be selected as the random line. As it is the only line in the file, it is the correct random line.
Let’s suppose the algorithm gives the correct random line after m lines. That means coming to the mth line, all the ‘m’ lines have equal probability to be picked as the random line i.e 1 / m.
Now, the (m+1)th line is picked if the random number between 0 and m+1 is less than or equal to 1. So, the probability of the (m+1)th line being picked is equal to 1 / (m+1).
The previous m lines have an equal probability of being picked i.e 1 / m. The random line picked before, l, which has 1 / m probability of being picked, will be picked again if the (m+1)th line is not picked. As the probability of (m+1)th line not being picked is m / (m+1), the probability of the line l being picked now becomes:
1 / m * m / (m+1) = 1 / (m+1)
As all the lines have the probability of being picked equal i.e (1 / m+1) coming to (m+1)th line, the correctness of the algorithm is proved.
We saw the algorithm and its proof of correctness. Let’s now see the code solution using Python.
Isn’t it awesome?
Mathematical vs Algorithmic Beauty
The mathematical beauty pervades the algorithmic beauty here. The algorithm is fairly simple but the mathematical trick applied here to make the probability of picking all the lines the same is wonderful. Also, the proof of correctness is easy to follow and intuitive. So, for someone wanting to learn proofs by induction, this can be a really good problem to stumble upon.