Monoids

1+(2+3)=(1+2)+3

Jesse Bridgewater
3AM Data Questions
2 min readFeb 24, 2014

--

What are semi-groups and monoids and why should data scientists care?

Algebra is a big subject and most data-loving people that I have met (who are not Mathematicians) get a look of terror if you say the word semi-group. Most people can safely ignore fields, rings, groups and other topics (although they are easy too) but if you work with data you should know what semi-groups and monoids are and why your data analysis code already uses them extensively.

A lot of data science is counting really big things. And to make this scalable you want to be able to parallelize your operations. Semi-groups have associativity and that increases parallelism exponentially on the size of the input data. The dominant example is addition. Addition is associative: (1+2)+3 = 1+(2+3)=6 in the sense that you can add adjacent pairs in any order you like. Adding numbers with zero (as the identity element) gives you a monoid (is commutative as well).

Lots of cool things are semi-groups or monoids:

  • Addition of non-negative numbers
  • OR
  • MIN/MAX
  • Concatenating strings
  • lots more complicated things are are built on +,min, max, etc (hyperloglog, CM Sketch, …)

When the things you want to compute are monoids you have a lot of flexibility. For instance if you are computing something in hadoop, you can put any monoidal compontents into map-side aggregations. This is obvious for counting words, but is less obvious for hyperloglog. Being able express your problem in terms of monoids simplifies any logic for parallelizing code.

So when someone wakes you in the middle of the night and wants to know why monoids are important you just say “associativity” and go back to bed.

--

--