Collatz sequence lengths of the first 10 million integers in Apache Spark

Chi Yan Tang
2 min readNov 23, 2021

--

I am deeply fascinated by intractable mathematical problems, and the Collatz conjecture is one of them.

Take an integer n. If n is even, divide it by two (n ← n/2). If n is odd, multiply this number by 3 and add 1 (n ← 3n+1). The result of this operation is the input of the next iteration.

For example, if the initial value of n is 3, the Collatz sequence is [3, 10, 5, 16, 8, 4, 2, 1]. The Collatz sequence length is 8.

The unproven Collatz conjecture specifies that the sequence will always return to 1, for any integer n. It is widely speculated that this conjecture is true, but no formal mathematical proof exists.

I was aware of this problem even as a teenager, and really enjoyed checking if the sequences always reached 1 from a random number (both in paper and in BASIC), and it always did! The problem continues to fascinate me even today.

Paul Erdos has famously said, “Mathematics may not be ready for such problems.”

I have plotted a frequency distribution of the length of Collatz sequences in Apache Spark. This is a link to the Jupyter Notebook.

import pyspark 
import random
import time
from pyspark import SparkContext
sc = SparkContext()
start_time = time.time() MAXNUMBER = 10000000 def collatz_length(x):
length = 1
while x > 1:
if x % 2 == 0:
x = x/2
else:
x = 3*x + 1
length += 1
return length
for partitions in [1,2,4]: #testing code execution time over different partition sizes
# Collatz sequence length frequencies
frequenciesRDD = sc.parallelize(range(1,MAXNUMBER), partitions)
frequencies = frequenciesRDD.map(lambda x:(collatz_length(x),1)) \ .reduceByKey(lambda x, y: x + y).collect()
print("Execution time: %s seconds - %s partition(s)" % (time.time() - start_time,partitions))
start_time = time.time()

The execution times drops with an increase in the number of partitions.

Execution time: 546.4677586555481 seconds - 1 partition(s)
Execution time: 337.4857029914856 seconds - 2 partition(s)
Execution time: 257.45519518852234 seconds - 4 partition(s)

The code plots a frequency distribution of the Collatz sequence lengths for the first 10 million integers.

The distribution has an interesting but unexplained pattern…

Do I think the Collatz conjecture is true? Does the sequence always return to 1? My gut feeling is that it always does… But I am an ordinary mortal and I can’t prove this mathematically.

Ref: This fun video explains the Collatz Conjecture.

Thank you for reading! You can drop me some feedback or reach out to me at chi@chiyan.info

UPDATE: The notebook and a draft of this blog post was authored in 2017. I am aware that the length distributions can be calculated more efficiently with dynamic programming, rather than “brute-forcing” the calculations in Apache Spark clusters.

--

--