The Coming Unification of Big Data

Gabriel Abraham Garrett
Oct 6, 2015

Bob’s Omelette

Today, data is everywhere. That statement is vacuous, yet true. Data has always been everywhere, but in recent years we have become more determined in collecting it and more creative in our methods of analyzing it. In fact, there is so much data around us that we still aren’t collecting that we can hardly begin to know what we don’t know.

In a perfect world, with complete collection of all available data and the proper algorithms, an individual with access to all of that data could realistically predict thousands of events of lilliputian scale all around the globe. Today, it is easier to predict whether the leader of one country will command his military to go to war with another country than it is to predict whether Bob will choose an omelette or oatmeal for breakfast tomorrow morning.

Naturally, this is a result of the predictive models we’re using, and of the fact that our access to data is extraordinarily limited in both cases. As for the leader of the country, we can look at his domestic political situation, his nation’s needs, and how much going to war would benefit him versus how much it could cost him. He will typically have a cabinet of advisors, and we can generally expect him to make the decision that most rationally benefits him at that time. As for Bob, Bob’s rather indecisive. He had an omelette yesterday, a croissant today, but he’s not sure about tomorrow. ‘I’ll cross that bridge when I get there,’ Bob thinks to himself.

Bob has heart problems and shouldn’t have much cholesterol, but he just loves his eggs. The next morning, his child feels ill, so Bob skips breakfast and takes his child to the doctor.

There are a million variables at play, but given our increasing capacity to account for them, we could much more closely approximate what might happen to us on any given day. For instance, imagine Bob’s child walking through a crowded city ten years from now, with ubiquitous sensors and the total surrender of privacy. Wearables are commonplace and detect whether their wearers have become ill. Those sick individuals’ paths through the city are tracked via their cell phone GPS, while security cameras note where they cough and sneeze. Meanwhile, Bob’s child has his own walking path tracked, and given where it crosses the paths of sick individuals and the airborne pathogens from their coughing and sneezing, along with his health records and predisposition to illness, a probability of Bob’s child becoming sick could be generated at the end of the day.
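
To make that arithmetic concrete, here is a minimal sketch in Python, with entirely hypothetical numbers, field names, and formulas, of how a daily probability might be composed from tracked exposure events and a personal susceptibility factor; a real system would rely on far more sophisticated epidemiological modeling.

```python
# Hypothetical sketch: combine tracked exposure events into a daily
# probability of illness. All numbers and names are illustrative.

from dataclasses import dataclass

@dataclass
class Exposure:
    distance_m: float      # closest distance to a sick individual
    duration_s: float      # seconds spent within range
    pathogen_risk: float   # assumed base transmissibility, 0..1

def exposure_probability(e: Exposure) -> float:
    """Crude per-encounter infection probability: closer and longer is worse."""
    proximity = max(0.0, 1.0 - e.distance_m / 5.0)   # zero beyond ~5 meters
    dose = min(1.0, e.duration_s / 600.0)            # saturates at 10 minutes
    return e.pathogen_risk * proximity * dose

def daily_illness_probability(exposures: list[Exposure],
                              susceptibility: float) -> float:
    """P(sick) = 1 - P(no encounter causes infection), scaled by health records."""
    p_escape_all = 1.0
    for e in exposures:
        p_escape_all *= 1.0 - susceptibility * exposure_probability(e)
    return 1.0 - p_escape_all

# Example: two encounters logged by the city's sensors during the day.
day = [Exposure(distance_m=1.0, duration_s=120, pathogen_risk=0.3),
       Exposure(distance_m=3.0, duration_s=600, pathogen_risk=0.2)]
print(f"Estimated risk today: {daily_illness_probability(day, susceptibility=0.8):.1%}")
```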

This sounds Orwellian. That’s because it is. But Orwell focused intensely on how omnipresent surveillance tools could hurt the citizenry and help oppressive regimes. These tools and sensors aren’t going away, so we should begin to focus on how these mass-monitoring data-collection systems can begin to help everybody. A first step in that direction is to end the compartmentalization of big data.

What I mean by this is that currently, each major company has its own subset of data, accessible only to itself and its advertising partners. The benefits it reaps from this data end at the capabilities of its machine learning team and the type of data it actually has. Translation is a good example: Google has handled billions of translation requests, yet with languages significantly different from English, it can struggle to get the message across. However, one can also leave feedback on how good or bad the translation is, and even suggest a better one.

At the same time, Facebook has made leaps and bounds with their translation feature, with machine learning tools that can quickly learn to identify slang, which is often a source of mistranslation. They also allow users to rate the quality of a translation on a scale of one to five. Their models are beginning to make sense of figures of speech, and their researchers claim to be making strides in understanding the context of a caption when it sits next to a picture in a Facebook post.

Then there are companies like Unbabel, translation-as-a-service providers that use machine learning to translate from the source language to the target language, much like Facebook and Google. However, they then use human translators to correct and perfect the translation. Basically, the computer does the easier 90% of the legwork, while the human does the final 10% that requires actual intelligence to fully bring the translated text into their native language. They claim a turnaround of 80 words in under ten minutes, and naturally, we can expect much better translation from this service than we would from Facebook or Google alone.
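
As a rough illustration of that division of labor, a machine-first, human-second pipeline looks something like the sketch below. This is not Unbabel’s actual API; every function and the toy lexicon are invented placeholders.

```python
# Structural sketch of translation-as-a-service: a machine draft first,
# then a human post-editing pass. All names and data are illustrative.

def machine_translate(text: str, source: str, target: str) -> str:
    """Placeholder MT step: a real system would call an MT model here."""
    toy_lexicon = {("hello", "pt"): "olá", ("world", "pt"): "mundo"}
    return " ".join(toy_lexicon.get((w.lower(), target), w) for w in text.split())

def human_post_edit(draft: str, target: str) -> str:
    """Placeholder for routing the draft to a native-speaking editor's queue."""
    # In production this would await a human reviewer; here it passes through.
    return draft

def translate(text: str, source: str = "en", target: str = "pt") -> str:
    draft = machine_translate(text, source, target)   # the easier ~90%
    return human_post_edit(draft, target)             # the judgment-heavy ~10%

print(translate("Hello world"))  # -> "olá mundo"
```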

But my overarching question in all of this is: does competing with this compartmentalized translation data actually benefit us? Shouldn’t the goal of all of these companies be to unite us, rather than just competing with each other so that one’s data-based product is marginally better than the other’s? I’m not saying that these companies shouldn’t compete with each other in general, but I think that when it comes to serving society, they fall short in how they treat the data we produce for them.

My solution is radical, and I wouldn’t expect it to be embraced any time soon, but I believe companies that engage in this practice with the right methodology will see unprecedented benefits. Using the translation example: imagine if Facebook, Google, and Unbabel pooled all of their translation data into a conglomerate: a totally separate company that combines all of their data and draws on the machine learning teams of all three.

Facebook would be able to provide their one-to-five feedback on millions of unique phrases seen only in the arena of social media. Google would be able to provide their helpfulness feedback on millions of translation requests. Unbabel would be able to provide legitimate translations of millions of phrases produced by actual humans. A super-model could be trained on that data to translate the sentences of others with almost perfectly human accuracy. Language barriers would be removed globally, and gradually, cultural barriers could be as well.
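
A crude sketch of what pooling that heterogeneous feedback might look like, with invented schemas and no real model training, is below; it shows only the normalization step that a shared corpus would require before any super-model could be trained on it.

```python
# Hypothetical sketch of pooling heterogeneous translation feedback into one
# training corpus. Schemas and field names are invented for illustration.

from typing import Iterator

def from_facebook(rows) -> Iterator[dict]:
    # One-to-five star ratings on social-media phrases -> normalize to 0..1.
    for r in rows:
        yield {"source": r["src"], "translation": r["mt"],
               "quality": (r["stars"] - 1) / 4.0, "origin": "facebook"}

def from_google(rows) -> Iterator[dict]:
    # Helpful / not-helpful feedback, plus occasional user-suggested rewrites.
    for r in rows:
        yield {"source": r["query"], "translation": r["suggestion"] or r["mt"],
               "quality": 1.0 if r["helpful"] else 0.0, "origin": "google"}

def from_unbabel(rows) -> Iterator[dict]:
    # Human post-edited output is treated as gold-standard quality.
    for r in rows:
        yield {"source": r["source_text"], "translation": r["human_edit"],
               "quality": 1.0, "origin": "unbabel"}

def unified_corpus(fb, gg, ub) -> list[dict]:
    """Merge the three feeds into one schema a shared model could train on."""
    return [*from_facebook(fb), *from_google(gg), *from_unbabel(ub)]

corpus = unified_corpus(
    fb=[{"src": "tbh idk", "mt": "para ser honesto, não sei", "stars": 5}],
    gg=[{"query": "good morning", "mt": "bom dia", "suggestion": None, "helpful": True}],
    ub=[{"source_text": "See you soon", "human_edit": "Até breve"}],
)
print(len(corpus), "examples ready for a shared translation model")
```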

Sounds lovely, doesn’t it? Idealistic, but possible; and that was enough of a basis to get us to the moon and back. But that can’t happen as long as we let our short-sightedness get in the way of our humanity. A movement of this scale requires us to look beyond our immediate selves and our immediate companies. It requires seeing how helping everybody globally with data everyone produces will also help ourselves, and foster a new paradigm of cooperation.

Many cities in America today make much of the data and statistics they collect publicly available. Part of the reasoning is that because city governments are paid for by the public, the data they collect in order to function well is also owned by the public. Another line of reasoning, less prevalent, is that because much of the data they collect is produced by the actions of the public, the public is entitled to some form of it as well. If this latter line of reasoning is combined with the notion that we could all benefit from new efforts to combine and analyze the different types of data we all produce for separate companies, what we have is a compelling argument to give the unification of big data a trial.

Given the potential for public good, the way forward is controversial, uncertain, and necessary.
