A Function of Parts

J Paul Daigle
Published in Perplexinomicon
Jul 18, 2020 · 6 min read

I recently lost my full-time job. I’m working as a contractor, but one of the things about my life is that I’m legally obligated to have health insurance at all times, so I’m looking for a new full-time job. That means I’m also looking at old code I have hanging around on GitHub, to see if I even know what it does or how it does it.

Happily, for the most part, I do.

Except for one function, which for reasons that are actually reasonable I named partiphify. partiphify/2 is part of my Morphix library, a tiny library of functions to help you deal with Enumerables in Elixir. I started writing it as a port of another library that I wrote several years ago in Ruby, but it has since grown into a fairly nice library that a surprising (read: greater than zero) number of people download and use.

Morphix has functions for getting rid of nil values, converting string keys to atoms (with safe options to keep from creating new atoms), converting atom keys to strings, flattening, checking whether two lists have all the same items even if they are in different orders, and of course partiphify. Most of these functions have a recursive form, which is nice if you need to convert a bunch of atom keys to strings in a nested data structure, or you want to know whether a nested map has all the same elements at every level and one of those levels holds a list that isn’t strictly ordered.

partiphify/2 takes a list and divides it into k sublists. Here’s an example:

iex> Morphix.partiphify!([1,2,3,4,5,6], 4)  
[[3], [4], [5, 1], [6, 2]]

The key thing about partiphify is that the sublists should be as balanced as possible. If I divide a list of 14 items into 4 sublists, I don’t want three lists of size 4 and one list of size 2. I want two lists of size 3 and two lists of size 4. For example:

iex(70)> Morphix.partiphify!([:a, :b, :c, :d, :e, :f, :g, :h, :i, :j, :k, :l, :m, :n], 4)
[[:g, :h, :i], [:j, :k, :l], [:m, :a, :b, :c], [:n, :d, :e, :f]]

Why do I want balanced lists? That goes back to what I was doing when I wrote this function. The system I was building was drawing data from a number of sources, and comparing that data to a sink. This was an audit system, essentially, checking the correctness of one system against another. Because there were a number of sources, we wanted to distribute the work across several servers, but because the number of sources wasn’t fixed, we wanted to distribute this work flexibly and equally across all servers. Oh, and the number of servers was also not fixed.

So, given a list of sources, divide them up evenly across the number of servers in operation and let each server begin its auditing work. In a word, partiphify/2.

That’s what we wanted to do. So how were we going to do it? I remember having the problem. I remember solving the problem. I don’t remember HOW I solved the problem.

Let’s start with the admission that it is entirely possible that there exists a function in Elixir that can take a list of 6 items and output four balanced sublists. I may just be very bad at reading documentation and missed something super obvious. I will say that I looked, and I looked hard, and I couldn’t find a simple way to do this with Enum.chunk*.

With that admission out of the way, let’s look at the basic algorithm.

The algorithm runs in a few steps.

  1. Find the size of the smallest sublist.
  2. Divide the list into a number of sublists based on the smallest sublist size.
  3. If there are extra sublists, distribute their items across the other sublists in round-robin fashion.

Is there an easier algorithm? How about:

  1. Make k empty sublists.
  2. Distribute the items of the original list across the k sublists in round-robin fashion.
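For concreteness, here is one way that simpler two-step algorithm could look. This is a hypothetical sketch, not Morphix’s implementation: it deals items into k buckets like cards, keyed by index modulo k.

```elixir
defmodule RoundRobinSketch do
  # Hypothetical round-robin partition, not the Morphix implementation.
  def partition(list, k) when k > 0 do
    list
    |> Enum.with_index()
    # Bucket each item by its index modulo k.
    |> Enum.group_by(fn {_item, i} -> rem(i, k) end, fn {item, _i} -> item end)
    # Sort by bucket index so the result is deterministic.
    |> Enum.sort()
    |> Enum.map(fn {_bucket_index, bucket} -> bucket end)
  end
end

RoundRobinSketch.partition([1, 2, 3, 4, 5, 6], 4)
# => [[1, 5], [2, 6], [3], [4]]
```

Note the bucket sizes differ by at most one, which is exactly the balance property we want, though the items land in different buckets than the original implementation produces.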

We’ll take a look at the original implementation, and then at an alternate implementation.

Find the smallest sublist

Finding the smallest sublist is done with some Integer math:

list |> Enum.count() |> Integer.floor_div(k)

Where k is the desired number of partitions. If the result of that is greater than zero, we move on:

with chunk_size when chunk_size > 0 <-
       list
       |> Enum.count()
       |> Integer.floor_div(k),
     true <-
       list
       |> Enum.count()
       |> Integer.mod(k)
       |> ceil_div.(chunk_size) > 0

ceil_div is a helper function that just takes the ceiling of dividing two numbers. For an input of ([1,2,3,4,5,6], 4), our value for chunk_size at this point is 1, and the second with condition is true, because 6 modulo 4 is 2, and the ceiling of 2/1 is 2, which is greater than zero. What we’re establishing with this code is that we’re going to have uneven sublists, and that there will be extra sublists to redistribute.

I think the ceil_div call is unnecessary here, because the only time the value will be less than or equal to zero is when the Integer.mod result is zero. chunk_size can’t be negative unless k is negative, and we shouldn’t allow that as an input. We don’t need to be able to divide lists into negative numbers of sublists.
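The article never shows ceil_div itself. A plausible definition (my assumption, not necessarily the library’s actual code) is integer division that rounds up instead of down:

```elixir
# Hypothetical ceil_div helper: ceiling of a / b for positive integers.
ceil_div = fn a, b -> Integer.floor_div(a + b - 1, b) end

ceil_div.(2, 1)
# => 2
```

With the example inputs, that gives ceil_div.(2, 1) == 2, matching the walkthrough above.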

With our smallest sublist in hand, we move to the next line:

...do
  list
  |> into_buckets(k, chunk_size)
  |> distribute_extra()

These are steps two and three of the algorithm, dividing the list and distributing the extra.

Divide the list into sublists of the smallest sublist size

This is done using the following function:

defp into_buckets(list, k, chunk_size) do
  chunks = Enum.chunk_every(list, chunk_size, chunk_size, [])
  extra_buckets = Enum.take(chunks, -(Enum.count(chunks) - k))
  k_buckets = chunks -- extra_buckets
  {extra_buckets, k_buckets}
end

We use the Enum.chunk_every/4 function to split the original list into sublists of the smallest chunk size, then we split the result into two lists: the extra_buckets are the additional lists that we need to distribute, and the k_buckets are the sublists that we are distributing into. At this point, for the inputs ([1,2,3,4,5,6], 4), we have the values:

extra_buckets = [[5], [6]]
k_buckets = [[1], [2], [3], [4]]

It took me a minute looking at this code to figure out what:

extra_buckets = Enum.take(chunks, -(Enum.count(chunks) - k))

meant, but it just means: “take the additional buckets off the end of the list of buckets and assign them to a variable”.
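To see those intermediate values concretely, the chunking step can be replayed by hand with the same inputs:

```elixir
list = [1, 2, 3, 4, 5, 6]
k = 4
chunk_size = 1

# Split into chunks of the smallest sublist size.
chunks = Enum.chunk_every(list, chunk_size, chunk_size, [])
# chunks == [[1], [2], [3], [4], [5], [6]]

# A negative count to Enum.take takes from the end of the list,
# so this peels the surplus chunks off the back...
extra_buckets = Enum.take(chunks, -(Enum.count(chunks) - k))
# extra_buckets == [[5], [6]]

# ...leaving the first k chunks as the buckets to distribute into.
k_buckets = chunks -- extra_buckets
# k_buckets == [[1], [2], [3], [4]]
```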

Distribute the extra sublists

So we’ve returned a tuple to our initial function that contains extra lists and sublists. We want to distribute the extra lists efficiently across the sublists.

defp distribute_extra({lists, buckets}) do
  with false <- Enum.empty?(lists) do
    [current_list | rest] = lists
    new_buckets = distribute(current_list, buckets)
    distribute_extra({rest, new_buckets})
  else
    _ -> buckets
  end
end

lists is an unfortunate name for the first variable in this function, because those are the extra lists. If there aren’t any, we return the buckets and we’re done. If there are extra lists, we take the first one and distribute it across all the buckets, then call this function recursively with any remaining extra lists. In our case of making 4 sublists from the list [1,2,3,4,5,6], we call this function with {[[5],[6]], [[1],[2],[3],[4]]}, so the first call to distribute will be with [5] and [[1],[2],[3],[4]].

defp distribute(list, buckets) do
  Enum.reduce(list, buckets, fn item, buckets ->
    [current_bucket | rest_of_buckets] = buckets
    new_bucket = [item | current_bucket]
    rest_of_buckets ++ [new_bucket]
  end)
end

Step one of the reduction is to grab the first bucket off of our list of buckets. In the example we’re using, the first bucket is [1]. Then we take the current item (in our example, 5) and add it to the front of the bucket. That bucket is moved to the back of the list, so on the next round of the reduction, the accumulator will be [[2], [3], [4], [5, 1]]. The function returns the accumulator when it has processed all the items in the list, and the recursive distribute_extra function calls itself back with any remaining extra lists and the new buckets. The first recursion will be called with:

{[[6]], [[2], [3], [4], [5, 1]]}

and the last call will be made with:

{[], [[3], [4], [5, 1], [6, 2]]}

That last call will terminate the recursion and return our completed buckets.
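Putting the pieces from the walkthrough together gives a runnable sketch that reproduces the result above. This is assembled from the snippets in this post, with the with/ceil_div guard plumbing omitted, so it assumes the list has at least k elements:

```elixir
defmodule PartiphifySketch do
  # Sketch assembled from the article's snippets; guard logic omitted,
  # so this assumes length(list) >= k and k > 0.
  def partiphify(list, k) when k > 0 do
    chunk_size = list |> Enum.count() |> Integer.floor_div(k)

    list
    |> into_buckets(k, chunk_size)
    |> distribute_extra()
  end

  defp into_buckets(list, k, chunk_size) do
    chunks = Enum.chunk_every(list, chunk_size, chunk_size, [])
    extra_buckets = Enum.take(chunks, -(Enum.count(chunks) - k))
    k_buckets = chunks -- extra_buckets
    {extra_buckets, k_buckets}
  end

  defp distribute_extra({lists, buckets}) do
    with false <- Enum.empty?(lists) do
      [current_list | rest] = lists
      distribute_extra({rest, distribute(current_list, buckets)})
    else
      _ -> buckets
    end
  end

  defp distribute(list, buckets) do
    Enum.reduce(list, buckets, fn item, [current_bucket | rest_of_buckets] ->
      # Prepend the item to the first bucket, then rotate it to the back.
      rest_of_buckets ++ [[item | current_bucket]]
    end)
  end
end

PartiphifySketch.partiphify([1, 2, 3, 4, 5, 6], 4)
# => [[3], [4], [5, 1], [6, 2]]
```

Running it on the 14-item example from earlier also yields two sublists of size 3 and two of size 4, as promised.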

Why do things this way?

Looking back, we had a couple of motives for writing the code this way. One was just that this is how it evolved and we never got around to refactoring it. We started out trying to use Enum.chunk_by, and a lot of decisions resulted from that initial idea. Another is that ++ is an expensive operation, and we wanted to minimize its use. In the simple example we just went through, ++ is called twice. If we naively used ++ to implement a fully round-robin solution based around distribute and distribute_extra, there would always be as many calls to ++ as there are items in the original list. That is probably a premature optimization, though, because with practical numbers of lists that’s just not enough calls to matter.

It’s probably worth re-examining this code and trying to refactor it. If I were to encounter this library code as a third party, I don’t know that I’d be impressed with this implementation at all. I wrote it, and looking back, I have questions.
