What’s Challenging in Big Data Now: Integration and Privacy

It’s been said many times before, but it’s worth stating again: big data presents large opportunities for improving business and society, but it also involves sizable computing challenges, as well as moral ones. A panel of renowned professors in the field expounded on the obstacles blocking big data’s path forward during a recent meeting of the Association for Computing Machinery (ACM). Privacy and integration issues led the way.

In celebration of the 50th anniversary of the A.M. Turing Award, which is sometimes called the “Nobel Prize of computing,” the ACM convened a gathering of some of the brightest minds in computing, including Michael Stonebraker, an adjunct professor at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and the 2014 winner of the Turing Award.

When asked what challenges big data represents in a super-connected world with 25 billion data-generating devices, Stonebraker commented that two of the three “Vs” have basically been solved.

“If you have a volume problem and you’re interested solely in running SQL-style business intelligence on a lot of data, in the data warehouse market, there are at least a few dozen production petascale warehouses that do exactly this day in and day out,” said Stonebraker, who created the Vertica MPP database now sold by HPE. “[T]he volume problem is basically solved and shouldn’t get much harder in the future.”

Similarly, the data velocity problem is under control. “If you want to process a million messages a second, current stream processing engines can do this quite easily,” he said. “I’m not aware of anybody that wants to go faster than that….I don’t consider the velocity issue to be all that difficult.”

But the third “V,” the one that pertains to variety, is a potential deal-breaker, according to Stonebraker, who labeled it the 800-lb. gorilla in the corner of the room.

“As near as I can tell, [data variety] is what is causing problems for nearly every major enterprise on the planet,” he said. “I think what is going to kill everybody isn’t necessarily the number of connected devices, but the variety of independently-constructed data sources that enterprises are going to want to put together. Whether you’re talking about healthcare, manufacturing, or financial services, all of these independently structured databases are going to be a killer.”

David Blei, a professor at Columbia University and a winner of the ACM-Infosys 2013 Foundation Award, said there are great opportunities to benefit from big data, but also some unmet challenges.

“If you take the example of genes and diseases, it’s an important computer science and statistics problem that’s unsolved,” Blei said. “Data scientists are looking to answer how we take data that we observe from the world and use it to identify causal connections between two variables.”

Dealing with the uncertainty and biases that can arise from basing conclusions on correlations is something that all big data practitioners must tackle. It’s also something that Daphne Koller, an adjunct professor of computer science at Stanford University and a winner of the 2007 ACM-Infosys Foundation Award, brought up during the panel.

“Bias will always be a challenge, and there isn’t a single, magic solution,” Koller said. “The bigger question is, How do we disentangle correlation from causation?…I’ll turn to healthcare for an example: the gold standard in the medical state is that of randomized case control.”

Posted on 7wData.be.