How to get a random line from a file? — One Pass Solution Using Python

Suraj Regmi
Dec 7, 2019 · 3 min read

When a random line is to be selected from a small file, we can do it pretty quickly, even with a naive algorithm. As the small file can be loaded into RAM, the selection of a random line is straightforward. But for the files of the magnitude of terabytes and petabytes, such naive approach doesn’t work. RAM can’t load the whole file at once so we need to read it from the storage devices.

Photo by Safar Safarov on Unsplash

Obvious Approach

Reading the file from the storage devices, one obvious approach would be to do two passes. The first pass would count the number of lines in the file. Then one line number at random is selected. After the line number is selected at random, we can do the second pass to select that line. This way we can select the random line on an average of one and a half passes. But, can we do better?

One Pass Solution

As it turns out, we can do better. We can get a random line from a file in just one pass.
In the only pass, we probabilistically choose the line such that the probability of choosing all the lines read until now is the same i.e 1 / number of lines at current.

The algorithm is as follows:

  • Initialize the random line and the line number as an empty string and 0 respectively.
  • Traverse lines one by one and if the random decimal number between 0 and the line number is less than or equal to 1, substitute the current line as the random line. Increase the line number in each iteration.
  • Return the random line.

Proof of Correctness

Let’s see the proof of correctness of the algorithm. We do the proof using the method of induction.

For n = 1, the first line, the random number generated between 0 and 1 is always less than or equal to 1. So, it would be selected as the random line. As it is the only line in the file, it is the correct random line.

Let’s suppose the algorithm gives the correct random line after m lines. That means coming to the mth line, all the ‘m’ lines have equal probability to be picked as the random line i.e 1 / m.

Now, the (m+1)th line is picked if the random number between 0 and m+1 is less than or equal to 1. So, the probability of the (m+1)th line being picked is equal to 1 / (m+1).

The previous m lines have an equal probability of being picked i.e 1 / m. The random line picked before, l, which has 1 / m probability of being picked, will be picked again if the (m+1)th line is not picked. As the probability of (m+1)th line not being picked is m / (m+1), the probability of the line l being picked now becomes:
1 / m * m / (m+1) = 1 / (m+1)

As all the lines have the probability of being picked equal i.e (1 / m+1) coming to (m+1)th line, the correctness of the algorithm is proved.

Python Solution

We saw the algorithm and its proof of correctness. Let’s now see the code solution using Python.

Isn’t it awesome?

Mathematical vs Algorithmic Beauty

The mathematical beauty pervades the algorithmic beauty here. The algorithm is fairly simple but the mathematical trick applied here to make the probability of picking all the lines the same is wonderful. Also, the proof of correctness is easy to follow and intuitive. So, for someone wanting to learn proofs by induction, this can be a really good problem to stumble upon.

The Startup

Medium's largest active publication, followed by +566K people. Follow to join our community.

Suraj Regmi

Written by

Data Scientist, The World Bank

The Startup

Medium's largest active publication, followed by +566K people. Follow to join our community.

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade