All you need is less. Rethinking big data

Published in Sanoma Technology Blog, May 29, 2017

PyGrunn in Groningen

Python is alive and kicking in Groningen. PyGrunn is a Python conference in the beautiful city of Groningen. The conference describes itself as the largest Python Conference in the Netherlands. I had to be there, since I really enjoy writing Python. And I used to study in Groningen, so I’ll take any excuse to go there.

The talk I want to highlight was given by Berco Beute, the CTO of Paylogic. It wasn’t a technical talk in the traditional sense. It was more of a meta talk on the use of data in software systems.

In this article, I’ll first walk you through the talk and then give you my take on it.

Data is not information

Berco started off with a great Frank Zappa quote to illustrate that data without context is meaningless.

Data is not Information. (BB)
Information is not knowledge.
Knowledge is not wisdom.
Wisdom is not truth.
Truth is not beauty.
Beauty is not love.
Love is not music.
Music is the best...

At this moment, data is being generated in enormous quantities. There are web applications and mobile apps. There is the ML/AI movement, where enormous amounts of data are being rearranged for processing. IoT adds a new dimension to it all by scattering local devices all over the place. And of course, there are the brick-and-mortar stores with their in-house databases.

Data specific problems

We live in a time where it has become the norm to hoard data.

Berco named a few problems that are inherent to data. First of all, the costs: it takes a lot of money to collect, process, store and maintain all this data. Secondly, data is vulnerable. It can be stolen, abused via ransomware or manipulated. On top of this, having access to so much data can impair your vision. Where is that needle in the haystack?

Shifting context can blur your vision as well. Remember the game you used to play in kindergarten? A classroom of children sits in a circle. One of them whispers a word in the ear of the child beside them, who passes it on to the next child. The last child never ends up with the same word.

The point to take away here is that data is a separate entity and needs to be handled with care.

What would be a better way to deal with data?

Telecosm makes the following argument: given enough bandwidth and low enough latency, storage is not needed. We should stop copying data, build systems with as little information as possible, and optimize for processes instead of being data-centric.

Think in terms of contract-based systems instead of data-based systems. For example: when you buy alcohol, a bartender needs to know whether you are 18 years or older. He doesn’t need to see your entire ID, with your place of birth, social security number and so on. Now let’s say you do this online. You can give a shop permission to look up whether you are old enough, so they don’t need to store this data.

A few advantages: there are no local copies of personal information, and end users get more ownership of their data. They can retract that permission at any time.
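To make the bartender example a bit more concrete, here is a minimal sketch in Python. It is not something shown in the talk. The `IdentityProvider` class, its method names and the shop and user names are all hypothetical, made up for illustration. The idea is that the shop only ever receives a yes/no answer, and only for as long as the user's permission stands.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class IdentityProvider:
    """Hypothetical data container that holds personal data on the user's behalf."""

    _birth_dates: dict = field(default_factory=dict)
    _grants: set = field(default_factory=set)  # (user, shop) pairs with permission

    def register(self, user, birth_date):
        self._birth_dates[user] = birth_date

    def grant(self, user, shop):
        self._grants.add((user, shop))

    def revoke(self, user, shop):
        self._grants.discard((user, shop))

    def is_adult(self, user, shop, today, minimum_age=18):
        """Answer only the question asked; the birth date never leaves the provider."""
        if (user, shop) not in self._grants:
            raise PermissionError(f"{shop} has no permission to query {user}")
        born = self._birth_dates[user]
        # Subtract one year if this year's birthday has not happened yet.
        age = today.year - born.year - ((today.month, today.day) < (born.month, born.day))
        return age >= minimum_age


provider = IdentityProvider()
provider.register("alice", date(1990, 5, 29))
provider.grant("alice", "webshop")

# The shop learns a single boolean and has nothing worth storing.
print(provider.is_adult("alice", "webshop", today=date(2017, 5, 29)))  # True

# The user retracts the permission; further lookups raise PermissionError.
provider.revoke("alice", "webshop")
```

A real system would of course need authentication, auditing and a network protocol around this. The sketch only shows the shape of the contract: a narrow question, a narrow answer, and a permission the user can withdraw.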

My take

What I see this coming down to is having containers of data. We, as end users, have a dashboard and control who gets access. This sounds like a great goal. I would like to be able to revoke access via one dashboard! It would make all those invisible data streams apparent.

So let’s try and see what this would look like. The internet is a system without borders, which means an American company does not necessarily have to comply with Dutch law. At this moment, Google probably has way more data about Dutch citizens than the Dutch government has. In America there is pretty much no privacy legislation, whereas Europe has pretty strict privacy legislation. In other words, you’ll need to align different policies in different countries. Or we need to stop using services that do not comply with this contract-based thinking.

Let’s assume you overcome that hurdle and the internet agrees that data should be stored in silos. Now you need to convince businesses not to store any data that passes through their software systems. That can be hard for a number of reasons.

Marketing firms who mine personal data have little incentive to participate.

Another group would be data science firms, which require even more data than current web applications. Let’s say you run a map reduce job with Spark. You’ll need lots of data as input. You can collect it via APIs, but you’ll need to store it locally before you can process those terabytes. All that data will enter your Spark machine. How do you deal with temporarily stored data there? On an even lower level, that data will be stored in physical memory. It’ll take a very different attitude towards software systems to prevent this kind of leakage.
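One possible direction, sketched below in plain Python instead of Spark, is to process records as they stream in and keep only the aggregate you actually need. This is my own illustration under simplifying assumptions, not something from the talk. The `fetch_records` generator is a hypothetical stand-in for data arriving from a remote API, and the record fields are made up.

```python
from collections import Counter
from typing import Iterable, Iterator


def fetch_records() -> Iterator[dict]:
    """Hypothetical stand-in for records arriving one by one from a remote API."""
    yield {"user": "alice", "country": "NL"}
    yield {"user": "bob", "country": "NL"}
    yield {"user": "carol", "country": "DE"}


def count_by_country(records: Iterable[dict]) -> Counter:
    """Consume records one at a time and keep only the per-country totals."""
    counts = Counter()
    for record in records:
        # Only the aggregate survives; the record itself is not stored anywhere.
        counts[record["country"]] += 1
    return counts


totals = count_by_country(fetch_records())
print(totals)  # Counter({'NL': 2, 'DE': 1})
```

Note that this does not solve the problem raised above. Each record still passes through physical memory, and many jobs cannot be expressed as a single streaming pass. It only shows that a local copy of the full dataset is not always required.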

Conclusion

I think it’s a good idea to think in terms of contracts. It makes websites more API-driven, lets companies focus on their own business and gives end users control over their own data. To kickstart the contract-based approach, I’d say raising awareness is the first priority. Secondly, a broadly supported manifesto is needed as a guideline for businesses and developers. Thirdly, start building a better tomorrow.