Cheating at a Company Group Activity Using Unix Tools

Boris Churzin
Fundbox Engineering
7 min read · Dec 8, 2021

The tools you didn’t know you need

Unix tools. I don’t need them all the time. Sometimes months can pass, and they’ll just sit there in the background, untouched. But when I do need them, it’s just like with that $10 screwdriver set I have in some forsaken drawer at home: suddenly it’s the best impulse purchase I ever made.

Of course, even if I didn’t have these tools, I would still be able to deal with the task at hand. I can always write some Python code that reads, transforms, filters, and eventually writes the data the way I want. But programming feels like an effort; it becomes a task, a job - something that has data structures, abstraction levels, quality. Yes, I can write some dirty code, but I wouldn’t enjoy it, so I keep it as a last resort. With Unix tools, on the other hand, it’s all about the flow: getting this specific thing done, enjoying the process, and forgetting all about it a minute after it’s finished.

The tools

The only tools you need for data manipulation:

  • A search engine of your choice
  • Basic regex skills
  • Knowledge of how Unix I/O works
  • The basic commands: ls, cat, echo, head/tail
  • Not so basic commands: grep, sed, cut, xargs
  • Advanced commands whose names you kinda remember but not how to use them: curl/wget, awk, jq, ImageMagick, GraphViz

I’ll assume you know how to use a search engine and have at least some knowledge of basic regex (it’s a big assumption, I know, but regex is so ubiquitous and valuable that if you don’t know it yet, you should learn it).

First of all, let’s cover the basics (skip the parts you already know).

An example of simple tools with pipes

Let’s say we have a bunch of CSV files formatted as: product,calories,addictiveness and we want to retrieve the values for cakes only:

ls | grep '.csv$' | xargs cat | grep 'cake' | cut -d, -f2,3 > cakes.csv

Let’s break it down (a tiny made-up example follows the breakdown):

  • |: streams data from one command to another
  • ls: prints file names
  • grep '.csv$': filters the names only for those that end with .csv
  • xargs cat: prints the contents of each such file, line by line
  • grep 'cake': filters the lines for those that include cake
  • cut -d, -f2,3: splits each line into columns using , delimiter and prints only the second and the third columns
  • >: writes the result into a file
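
To make this concrete, here’s a tiny made-up example (the file name snacks.csv and its values are hypothetical). Say snacks.csv contains:

product,calories,addictiveness
chocolate cake,450,9
apple,52,2
carrot cake,300,7

then the pipeline above leaves this in cakes.csv:

450,9
300,7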

Let’s cover most of the tools you’ll ever need; there aren’t many:

Unix I/O

The only I/O controls you need to know: |, >, >>

  • |: AKA the pipe, if placed between any commands it takes the output of the first and “pipes” it line by line into the other
  • >: AKA the redirect, if placed between a command and a file name - it takes the output of the command and writes it into the file
  • >>: AKA the append, same as a redirect but appends to the file instead of overwriting it
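
A throwaway illustration of all three (header.csv is a made-up file):

echo "product,calories" > header.csv   # overwrite header.csv with this one line
echo "apple,52" >> header.csv          # append a second line to it
cat header.csv | grep apple            # pipe the file into grep; prints apple,52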

Basic commands

  • ls: print file names in a directory
  • cat: print contents of a file
  • echo: print some text
  • head/tail: print first/last lines
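
A few throwaway examples, reusing the hypothetical cakes.csv from before:

cat cakes.csv       # print the whole file
head -3 cakes.csv   # print only its first three lines
echo hello          # print the text "hello"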

Not so basic commands

  • grep: filters input using regex
  • sed: applies regex on input (e.g., search and replace)
  • cut: cuts input into columns using a delimiter
  • xargs: runs a given command on each line of input
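
And a few hypothetical one-liners to get a feel for them (app.log is made up):

grep ERROR app.log                # keep only the lines that contain ERROR
sed 's/ERROR/WARN/' app.log       # print the file with ERROR replaced by WARN
cut -d, -f1 cakes.csv             # print only the first comma-separated column
ls | grep '.csv$' | xargs wc -l   # count lines in every CSV file in the folder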

Now let’s move on to a much more practical example:

Cheating at a supposedly fun group challenge at work

Our (the best!) HR department came up with this challenge:

  • Take your first name, last name, and birthdate
  • Put it into some given URL e.g. http://some.throwaway.host/{firstname}_{lastname}_{month}_{day}.png
  • Download the image, it’s a jigsaw piece, and it has some color scheme
  • Find other participants with images of the same color scheme
  • Solve a jigsaw puzzle, scan a QR code, get a present

Simple, borderline trivial, right? Right?… No. Some people are busy; some don’t read emails; some are not interested. My moral code tells me it’s OK to cheat in this situation.

Getting the data

First, we need to get our hands on the missing data, i.e., names and dates. Like many companies, we have a portal to check information about employees. After some digging, it was easy to find the HTTP request that pulled this information from the server. And it even had all the birthdates in the JSON! We are good to go!

The JSON:

{"employees": [
{
"firstName": "John",
"lastName": "Smith",
"personal": {
"shortBirthDate": "12/31/1980"
}
},
...
]}

When it comes to JSON, we need the “advanced commands whose names you kinda remember but not how to use them”, in this case: jq.

It’s the most complicated command in this post, and it takes some time to dig around for the right syntax you need. The result:

cat employees.json | jq -r '.employees | .[] | (.firstName + "_" + .lastName + "_" + .personal.shortBirthDate)'

Let’s break it down:

  • cat employees.json | jq -r: print the file and pipe it into jq; the -r flag makes jq output plain strings instead of quoted JSON strings
  • .employees |: (this is where jq syntax starts, it’s similar to pipes in Unix) take the contents of the employees field and pipe it onward; the result is the array []
  • .[] |: take the array and pipe it line-by-line, the results are the {} entries inside the array one-by-one
  • .firstName + ...: retrieve the data we need and concatenate it with _

The result of this command is lines that look like: John_Smith_12/31/1980
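
By the way, a handy way to arrive at a filter like this is to build it up piece by piece; for example, this prints just the first employee object so you can see which fields are there:

cat employees.json | jq '.employees | .[0]'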

Manipulating the date

The date is not in the format we need it to be. That sounds like a good use for a regex:

jq ... | sed -E 's/([0-9][0-9]).([0-9][0-9]).[0-9]*$/\2_\1/'

Let’s break it down:

  • sed -E: use extended regex (extended means special characters like () work without needing to be escaped)
  • ([0-9][0-9]): find two digits one after another, call it a match \1
  • .([0-9][0-9]): skip a character and do the same, call it a match \2
  • .[0-9]*$: skip a character and the rest of the digits
  • /\2_\1/: replace the matched part of the line with \2 then _ then \1

The result is lines that look like this: John_Smith_31_12
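
If you want to sanity-check the regex on a single line first, you can feed it one value by hand:

echo 'John_Smith_12/31/1980' | sed -E 's/([0-9][0-9]).([0-9][0-9]).[0-9]*$/\2_\1/'

which prints John_Smith_31_12.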

It’s time to download some images

Now that we have all the names with the dates in the correct format, we can start downloading:

sed ... | xargs -I{} wget http://some.throwaway.host/{}.png

Let’s break it down:

  • xargs -I{}: on each line in the input, run the given command and replace {} with the contents of the line
  • wget http://some.throwaway.host/{}.png: download a file from a URL

The result will be all of the images downloaded into the current folder.
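
If you’d rather not hammer the server while testing, you can dry-run the substitution first by swapping wget for echo:

echo 'John_Smith_31_12' | xargs -I{} echo http://some.throwaway.host/{}.png

which just prints http://some.throwaway.host/John_Smith_31_12.png instead of downloading it.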

The full command

cat employees.json |\
jq -r '.employees | .[] | (.firstName + "_" + .lastName + "_" + .personal.shortBirthDate)' |\
sed -E 's/([0-9][0-9]).([0-9][0-9]).[0-9]*$/\2_\1/' |\
xargs -I{} wget http://some.throwaway.host/{}.png

Now we can just look at the images, search them by eye for the color scheme we need, and copy them somewhere else.

Just kidding, manual labor is too hard!

Sorting by color

Let’s put the images into folders according to their primary color.

It sounds like we need another command we don’t remember how to use: ImageMagick (or specifically its convert command). It knows how to manipulate images and maybe how to query for colors? Digging around the web points in this direction:

convert file.png -colorspace RGB -format %c histogram:info:-

I don’t even want to know what all these arguments mean; all I care about is the result. Running this on a random image produces around 140 thousand lines of these:

11: (255,252.731,255) #FFFDFF rgb(100%,99.1102%,100%)
1: (255,255,252.731) #FFFFFD rgb(100%,100%,99.1102%)
344: (255,255,255) #FFFFFF rgb(255,255,255)

The first value is the number of pixels of a specific color; the hex value (#FFFFFF in the last line) is the color itself.

I won’t get into the details of this part, but let’s go over it quickly:

ls | xargs -I{} sh -c "convert {} -colorspace RGB -format %c histogram:info:- | grep -v FFFFFF | grep -v 000000 | sort -n | tail -1 | grep -Eo '#[^ ]*' | sed 's/#//' | xargs -III mv {} II"

  • ls: list the file names
  • xargs -I{} sh -c: on each line, run the script (sh -c is a useful way to run multiple commands without creating a script file)
  • convert: get the colors
  • grep -v FFFFFF | grep -v 000000: -v is for inverse, exclude lines with these patterns (we don’t want any white or black colors)
  • sort -n: sort numerically
  • tail -1: take only the last line (the most frequently seen color)
  • grep -Eo '#[^ ]*' | sed 's/#//': find only the color value without #
  • xargs -III mv {} II: run the mv (move) command; notice -III instead of -I{}, {} is already taken, so we use another placeholder instead

Or in other words: for each file | find the colors | pick only the most frequent one that is not white or black | move the file into a folder with the color name
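
One caveat if you try this yourself: mv {} II only drops the file into a folder if that folder already exists; otherwise it just renames the file to the color value. A rough, untested variant that creates the folder on the fly is to end the inner pipeline with:

... | sed 's/#//' | xargs -III sh -c 'mkdir -p II && mv {} II'

(the single quotes still sit happily inside the outer double quotes).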

Was it worth it?

This whole process took 21 minutes:

  • 10 minutes to figure out the right place to get the data from (we have two HR portals, one of them didn’t have the data)
  • 5 minutes to scrape the images, out of which 3 minutes to write the jq query
  • 5 minutes to wrap the ImageMagick line found on the web
  • 1 minute to zip the results and distribute them among the (willing to cheat) activity participants

Would I be able to do the same with Python or Ruby? Yes, although it would take much more time, and I would enjoy it much less.

The speed comes from the fact that all these tools are simple: each knows how to do one thing and do it well. You just need to know that they exist, plus some experience to know when to use which.

So was it worth it? The result is useless in any sense. But I enjoyed the process, and this is what counts.
