Day 100: Segmented Eratosthenes sieve

Tomáš Bouda
100 days of algorithms
5 min read · Jul 2, 2017

--

For the 100th algorithm I chose a segmented Eratosthenes sieve for primes up to 10⁹, implemented in Cython. And this time the goal is to make the implementation as fast as possible.

There are several reasons why I diverged from pure Python and from the usual style of implementation. Among others, I like breaking rules and I like optimizations. But here is what’s most important…

In most of the articles in this series I have tried to provide as short and clean an implementation as possible. I was often aware that the algorithm could be made faster, but I didn’t do it, since it would have cost another 5 or 10 lines. And here is my reasoning.

I have seen many people talk about time complexity and witnessed a common [mis]understanding of how asymptotic notation works. I have also seen many who understand it very well, and [unlike textbooks] their experience matches mine.

In theory, Big-O analysis starts with a computational model beneath the algorithm in order to study its behaviour as the data grows. With respect to algorithms, f(n) = O(g(n)) states that for large enough inputs, algorithm f requires at most as many steps as algorithm g does, up to a positive constant factor.
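Spelled out [this is the standard textbook definition, nothing specific to this article]:

f(n) = O(g(n)) \iff \exists\, c > 0\ \exists\, n_0\ \forall n \ge n_0:\ f(n) \le c \cdot g(n)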

If you are interested, check my Amortized Algorithm Analysis to see a small example of complexity analysis.

In practice, however, Big-O notation works a different way. Given two algorithms solving the same problem, there is [quite commonly] data on which the O(n log n) algorithm performs worse than the O(n²) algorithm. And it is not unlikely that this is the case for most of your data.

I have already touched on this topic in this series. It is [almost] impossible to write a complex algorithm that performs best under every scenario. Standard libraries are usually written to perform well in expected situations and to survive the unexpected ones.

If, for example, you need to sort data on a regular basis and your array has a few tens of items, the built-in sort is not the best algorithm. You can do many times better, but the total time is negligible either way, so you don’t have to care. That is what performing well means.
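If you want to check this on your own machine, here is a small timing harness [my sketch, not part of the original notebook]. Keep in mind that in pure Python the interpreter overhead usually favours the built-in sort regardless of size; in a compiled language an insertion sort often wins on tiny arrays, which is exactly why Timsort itself switches to insertion sort for short runs.

from timeit import timeit
import random

def insertion_sort(a):
    # classic O(n^2) sort, competitive on very small inputs
    a = a[:]
    for i in range(1, len(a)):
        key, j = a[i], i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return a

data = [random.random() for _ in range(20)]

print('built-in :', timeit(lambda: sorted(data), number=100000))
print('insertion:', timeit(lambda: insertion_sort(data), number=100000))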

If you need to sort millions of records on a regular basis, calling the built-in sort is likely the worst thing you can do. You are better off using a different data structure, such as a binary search tree, to avoid batch sorting altogether.
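The standard library has no binary search tree, but bisect.insort gives the same flavour of incremental maintenance [again my sketch, with made-up data]:

import bisect

records = []   # kept sorted at all times

def insert(record):
    # binary search for the slot, then shift; no batch sorting needed
    bisect.insort(records, record)

for value in (42, 7, 99, 7, 13):
    insert(value)

print(records)   # [7, 7, 13, 42, 99]

Each insert costs O(log n) to find the slot plus O(n) to shift the tail; a real binary search tree or a B-tree keeps the insert itself logarithmic.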

Or maybe not. My own experience shows that the best-performing algorithm is usually the one you would least expect. You simply have to try it and give it a chance.

In practice, complexity analysis always starts with a profiler.

Write a reference implementation and let the profiler show you the bottlenecks. Think about improvements, improve, and profile again. And again.
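In Python, the standard cProfile module is enough to get started. A minimal sketch, profiling a plain reference sieve [my example, not the notebook’s code]:

import cProfile

def reference_sieve(n):
    # straightforward Eratosthenes sieve to serve as the baseline
    sieve = bytearray([1]) * n
    sieve[0:2] = b'\x00\x00'
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i*i::i] = bytearray(len(sieve[i*i::i]))
    return sum(sieve)

# report per-function timings, sorted by cumulative time
cProfile.run('reference_sieve(10 ** 7)', sort='cumulative')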

That’s how I implemented the segmented Eratosthenes sieve. I identified the critical functions, wrote several implementations of each one, and compared the results.

The main idea behind the segmented sieve is that a modern CPU can’t address a single byte, or even a single integer, in isolation. The CPU has several layers of cache, and whenever you touch any address in memory, it first has to fetch the whole cache line and store it in the cache.

This takes a lot of time, and an Eratosthenes sieve sweeping across wide consecutive areas of memory is simply wasteful. Fortunately, the algorithm can be rewritten to sieve primes in local segments that fit inside the CPU cache.
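Before diving into the optimized Cython version below, here is the same idea as a plain-Python sketch [the segment size and the names are my choice]:

def segmented_count(n, segment=32768):
    # count primes strictly below n, sieving one cache-sized block at a time
    if n <= 2:
        return 0

    # base primes up to sqrt(n) are enough to cross off
    # composites in every later segment
    limit = int(n ** 0.5) + 1
    base = bytearray([1]) * limit
    base[0:2] = b'\x00\x00'
    for i in range(2, int(limit ** 0.5) + 1):
        if base[i]:
            base[i*i::i] = bytearray(len(base[i*i::i]))
    primes = [i for i in range(limit) if base[i]]
    count = len(primes)

    # sieve [limit, n) block by block
    for low in range(limit, n, segment):
        high = min(low + segment, n)
        block = bytearray([1]) * (high - low)
        for p in primes:
            start = max(p * p, (low + p - 1) // p * p)
            block[start - low::p] = bytearray(len(block[start - low::p]))
        count += sum(block)

    return count

print(segmented_count(10 ** 6))   # 78498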

The gain is enormous: my implementation is able to find the number of primes below 10⁶ in about 0.6 ms, and the number of primes below 10⁹ in about 0.8 s. For reference, my machine is a MacBook Pro 2014 with a 2.2 GHz i7.

If you are using the Anaconda distribution, the notebook will work fine. To run the algorithm outside of a notebook, follow these steps.

  1. make sure you have Cython installed
  2. create a file day100.pyx and put the code inside; do not forget to remove the %%cython directive
  3. create and run a file main.py with the following code
import pyximport; pyximport.install()
import day100
print(day100.eratosthenes(10 ** 9))
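Alternatively [not from the original article, just the standard Cython workflow], you can compile the module ahead of time with a setup.py and drop the pyximport lines:

# setup.py
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize('day100.pyx'))

Run python setup.py build_ext --inplace once, and import day100 then works directly.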

https://github.com/coells/100days

https://notebooks.azure.com/coells/libraries/100days

algorithm

%%cython

from libc.stdlib cimport malloc, free

DEF LIMIT = 1024 * 31   # first segment; works for n up to LIMIT**2
DEF PRIME = 1024 * 4    # capacity for the base primes below LIMIT
DEF SIEVE = 1024 * 32   # segment size, small enough to fit in CPU cache

cdef inline int imin(int a, int b) nogil:
    return a if a < b else b

cdef inline int memset(char *p, int n) nogil:
    # counts surviving candidates in the buffer and resets it to the
    # pattern 0, 1, 0, 1, ...; only odd indices are prime candidates,
    # with index 1 standing in for the even prime 2
    # [the short trick assumes a little-endian CPU]
    cdef:
        short *q = <short *>p
        int i, j = 0

    for i in range((n + 1) >> 1):
        j += q[i]
        q[i] = 0x0100

    return j >> 8

cdef int naive_sieve(char *sieve, int *primes, int *offsets,
                     int n) nogil:
    # classic sieve on the first segment; stores each prime together
    # with the offset where its crossing-off resumes in later segments
    cdef int i, j

    memset(sieve, n)

    for i in range(3, n, 2):
        if sieve[i]:
            j = i * i
            while j < n:
                sieve[j] = 0
                j += i << 1
            primes[0] = i
            offsets[0] = j
            primes += 1
            offsets += 1

    primes[0] = 0    # zero-terminate the prime and offset lists
    offsets[0] = 0

    return memset(sieve, n)

cdef int segmented_sieve(char *sieve, int *primes, int *offsets,
                         int k, int n) nogil:
    # crosses off multiples of the stored primes inside segment [k, k + n)
    cdef int i

    while primes[0]:
        i = offsets[0] - k
        while i < n:
            sieve[i] = 0
            i += primes[0] << 1
        offsets[0] = i + k
        primes += 1
        offsets += 1

    return memset(sieve, n)

cpdef int eratosthenes(int n) nogil:
    cdef:
        char *sieve
        int *primes
        int *offsets
        int k, total

    if n > LIMIT * LIMIT:
        return -1

    sieve = <char *>malloc(SIEVE)
    primes = <int *>malloc(PRIME * sizeof(int))
    offsets = <int *>malloc(PRIME * sizeof(int))

    # the first segment also collects the base primes
    total = naive_sieve(sieve, primes, offsets, imin(n, LIMIT))
    memset(sieve, SIEVE)

    k = LIMIT
    n -= LIMIT

    while n > 0:
        total += segmented_sieve(sieve, primes, offsets,
                                 k, imin(n, SIEVE))
        k += SIEVE
        n -= SIEVE

    free(sieve)
    free(primes)
    free(offsets)

    return total

run

for i in range(1, 10):
    print('primes below 10**%d: %d' % (i, eratosthenes(10**i)))

>>>
primes below 10**1: 4
primes below 10**2: 25
primes below 10**3: 168
primes below 10**4: 1229
primes below 10**5: 9592
primes below 10**6: 78498
primes below 10**7: 664579
primes below 10**8: 5761455
primes below 10**9: 50847534
