Manipulating data like a jellybean

Jon Kleiman
4 min readOct 15, 2020

--

There are five ways to manipulate data*. Let's dive into them.

If you had a jar of different coloured jellybeans and I asked you how many are red. What would you do?

After emptying them out on the table you would FILTER out only the red ones and discard (or eat) the rest.

Then you would COUNT how many are left. Seeing that these are all red you have your answer.

Don't worry too much about why I put some words in capitals. This will become clearer later. All you need to know is that you performed two data manipulation methods - filtering data and then counting them.

Let's move onto level 2. Same scenario, but now you have two jars of jellybeans. I want to know how many red beans in each jar - let's call the jars Jar_A and Jar_B.

You would do the same thing as before once for each jar and have a count for Jar_A and then for Jar_B. What you're doing is GROUPING the red count BY each jar.

Let's look at that one another way. Another way to solve this is you call two of your friends. Bob and Stacy come over and because you're lazy you give Bob Jar_A and you give Stacy Jar_B. You tell them to each count the red jellybeans in their jars and give you the count.

Let's say Bob comes back with 75 red jellybeans and Stacy with 85.

You could then draw up a little table:

Amazing, how does that feel? You may not see it but you just did something incredible!

You have in front of you your first data summary.

A short recap before we move on. We learnt about filtering (ie choosing only the red jellybeans); counting; and thirdly about grouping by something (here we grouped by the jar's name).

Let's move onto level 3. After seeing your jellybean skills, you land a job at Patricks Sweet Factory. Your boss brings you a table like the one we've seen before:

He wants to know why there are so few red jellybeans in Jar_D. You suspect whoever did the calculation for Jar_D ate some of the red jellybeans first!

With one table of information we can't do much. You go speak to the factory forman and ask for a list of who worked on each jar. This is what you get back:

You don’t have a list of who did each jar but you now know where each employee sits and their names.

You're getting somewhere detective! The forman agrees to bring you another table to your office.

An hour later you get a table that you can finally use to solve the mystery.

Can you solve the puzzle?

Using the two tables you JOIN them to create one big table.

You can now see that Zack worked on Jar_D. Before you go tell your boss and get that promotion let’s just dive deeper into what you did.

You saw that both tables have seat number as a common column. The key to joining these tables correctly is then to match based on seat number. You could have started with either table. We'll begin with the floorplan with the names. You look at the seat number in the first row. It says A2. You then look on the other table for A2 and see that the jar is Jar_A. Doing the same thing for all seat numbers gives you the column for the jar name.

Congrats, you're onto the last level for this lesson. Let's move onto level 4.

The final piece* of data manipulation logic to understand is ordering.

Going back to this table we had above:

You could ORDER or SORT the table by which jar had the most red jellybeans.

Simple right?! We just put the rows in an order that’s easier to read.

And that's it! You have the foundation to move on to learning a programming language. In the next post we're going to take these learnings and translate them into SQL, a popular database querying language. That just means it's a way data scientists tell the computer how to manipulate data.

See you then!

*This is a series based on a need to know basis. I progressively add more info with each step. Data science can be overwhelming, and learning in baby steps helps the process.

--

--