BIG DATA
Digital footprints and the limits of human-generated programming
Every day, the world creates 2.5 quintillion bytes of data. Emails, phone calls, text messages, tweets, Facebook posts, sensor data, log files, and media streams, both structured and unstructured, make up this volume of daily information. More data has been generated in the past two years than in all of prior human history. And we’re just getting started. Every day more data sources come online as mankind figures out new ways to capture more: more Fitbits, more iPhones, more Nests. In the next four years, an additional 40 billion connected devices are expected to come online as part of the Internet of Things (IoT), putting sensors in every device in the household: the fridge, the toaster, the TV, even the bed will be collecting data and reporting it back to third-party data centers, along with billions of data points from other users and devices around the globe. These large datasets lead to a particular type of problem, namely, how to process and extract useful information from them in a meaningful amount of time. This is the problem of Big Data.
As the world wide web took off in the late 1990s, Google and other companies struggled with how to search it efficiently. While companies like AltaVista and Yahoo focused on maintaining categorized directories of websites, Google’s spiders crawled every website and used its PageRank algorithm to return relevant results from hundreds of thousands of sites in a fraction of a second. Google would go on to crush its competitors and become one of the largest corporations in the world, and it started a data processing revolution in 2004 when it released a paper on a process called MapReduce.
The problem with Big Data is that traditional data processing applications, namely relational databases, are inadequate for analyzing such huge volumes of data. To process and query these massive datasets, the job has to be split up among multiple computing nodes working in parallel, each operating on a smaller chunk of the larger set. MapReduce described a processing architecture to accomplish this; it was later combined with a distributed file system and brought together in the Apache Hadoop framework. Today, Yahoo has over 40,000 servers running Hadoop, the largest cluster of which contains 4,500 nodes.
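The core idea is simple enough to sketch in a few lines. Below is a minimal word-count example, the canonical MapReduce demonstration, written in plain Python with the standard multiprocessing module standing in for a cluster; the function names are illustrative and nothing here is part of the Hadoop API.

```python
# A minimal word-count sketch of the MapReduce idea in plain Python.
# The function names and the use of multiprocessing.Pool are illustrative;
# a real Hadoop job would distribute these phases across many machines.
from collections import Counter
from multiprocessing import Pool

def map_words(chunk):
    """Map phase: count each word in one chunk of the input."""
    return Counter(chunk.split())

def reduce_counts(partials):
    """Reduce phase: merge the per-chunk counts into a single total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    documents = [
        "the quick brown fox",
        "the lazy dog",
        "the quick dog jumps",
    ]
    with Pool() as pool:                     # each chunk is mapped in parallel
        partial_counts = pool.map(map_words, documents)
    print(reduce_counts(partial_counts))     # Counter({'the': 3, 'quick': 2, ...})
```

In a real Hadoop job the map and reduce phases run on different machines, with the distributed file system shuttling the intermediate results between them.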
Hadoop was introduced in 2006, and by 2011 “big data” was the focus of another venture capital hype cycle. The FANG companies (Facebook, Amazon, Netflix, and Google), plus Twitter and Yahoo, were awash in a gigantic volume of data, according to venture capitalist and Big Data blogger Matt Turck:
“Those companies […] had no legacy infrastructure and were able to recruit some of the best engineers around, so they essentially started building the technologies they needed. The ethos of open source was rapidly accelerating and a lot of those new technologies were shared with the broader world. Over time, some of those engineers left the large Internet companies and started their own Big Data startups. Other “digital native” companies, including many of the budding unicorns, started facing similar needs as the large Internet companies, and had no legacy infrastructure either, so they became early adopters of those Big Data technologies. Early successes led to more entrepreneurial activity and more VC funding, and the whole thing was launched.”
According to Turck, the hype has died down a bit since 2015; he notes that Big Data has moved past the early adopter phase and into adoption by the broader business community. He continues:
“Another key thing to understand: Big Data success is not about implementing one piece of technology (like Hadoop or anything else), but instead requires putting together an assembly line of technologies, people and processes. You need to capture data, store data, clean data, query data, analyze data, visualize data. Some of this will be done by products, and some of it will be done by humans. Everything needs to be integrated seamlessly. Ultimately, for all of this to work, the entire company, starting from senior management, needs to commit to building a data-driven culture, where Big Data is not “a” thing, but “the” thing.
In other words: lots of hard work.”
Big Data doesn’t have a formal definition; when UC Berkeley asked 43 industry professionals for one in 2014, it got back 43 different answers. One thing most experts do agree on is the “3Vs” model of Big Data: volume, velocity, and variety. Volume refers to the amount of data. It’s something of a moving target that depends on the needs of an organization and the growing capacities of hardware and software. A person trying to process an Excel spreadsheet with several hundred thousand rows could be said to have a Big Data problem; a company might have a Big Data problem with several hundred gigabytes of data in a SQL database, while another might have no problem processing several terabytes. Velocity refers to the speed at which data is generated (think of the 270,000 tweets sent every minute), and variety refers to the different data types that an individual or organization generates: phone call metadata, log files, pictures, video, audio files, text, sensor readings, and so on. Over the years, many in the field have sought to expand the definition past the 3Vs model, noting that Big Data also encompasses the following:
· Digital footprint: the free byproducts of other digital interactions, created either actively or passively; e.g., a blog entry, an Instagram post, or readings from a smart scale or heart rate monitor.
· n = N: Statistical analysis used to require sampling, that is, selecting a smaller group from the larger population. For many applications this is no longer the case: with Big Data, researchers are able to analyze the entire data set, the entire population.[6] One example, from political campaigns, is the voter ID list. In years past, a campaign might rely on a small list of volunteers and supporters gathered by hand, representing a small segment of the voting population. These days, the political parties maintain lists of every registered voter in the territory and are able to profile individuals based on phone calls, canvassing efforts, and even Facebook and Twitter posts to gauge their support for a candidate and their likelihood to turn out and vote. This data is used for ‘Get Out the Vote’ targeting efforts on election day.
· Data-fusion: the ability to process multiple data streams into a coherent picture of a real-world object, analogous to using data from multiple geospatial sensors to create a map. The point is to combine information from two sources to create a new record that is more valuable than either source on its own (a brief sketch follows this list).
· Machine learning: Since Big Data resists the kind of comprehensive analysis that has been the norm, data scientists are relying more on machine learning algorithms to test hypotheses and extract insights. In fact, the more data you throw at a machine learning system, the better its predictions tend to become. Big Data has led to an explosion in artificial intelligence and machine learning in the past few years.
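On the data-fusion point, a toy sketch may help: two hypothetical streams, a smart scale and a heart rate monitor, are merged by user ID into a single richer record. The field names and values are invented for illustration.

```python
# A toy data-fusion sketch: merge two hypothetical streams into one record
# per user. The keys, field names, and values are invented for illustration.
scale_readings = {
    "user_42": {"weight_kg": 81.3},
    "user_99": {"weight_kg": 64.0},
}
heart_readings = {
    "user_42": {"resting_hr": 58},
    "user_99": {"resting_hr": 72},
}

def fuse(*sources):
    """Combine records that share a key into one richer record per key."""
    fused = {}
    for source in sources:
        for key, fields in source.items():
            fused.setdefault(key, {}).update(fields)
    return fused

print(fuse(scale_readings, heart_readings))
# {'user_42': {'weight_kg': 81.3, 'resting_hr': 58}, 'user_99': {...}}
```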
Perhaps one of the earliest and most famous examples of how these insights are discovered is the story of how Google learned to predict the spread of the flu. During the early days of the swine flu (H1N1) panic of 2009, the Centers for Disease Control and Prevention (CDC) needed a way to track cases of the disease in order to slow its spread. To do that, it needed to know where the cases were, so it asked doctors to report new cases. The problem with this approach was the delay in information getting back to the CDC. A person with the flu might wait several days before contacting a doctor, it would take another few days for the information to be relayed to the CDC, and another week for the agency to process the results. During a potential epidemic, this delay was unacceptable.
Months before H1N1 became news, Google had discovered that it could correlate certain search terms with the spread of the flu down to specific regions. How? By feeding search query data and flu case data from 2007 and 2008 into a system that tested 450 million different models until it found 45 search terms that correlated with actual flu cases. Once the model was refined, Google was able to predict, in near real-time, where the flu was spreading. This gave public health officials valuable information that they used to prevent H1N1 from becoming a public health crisis.
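Google has only described its approach at a high level, but the underlying idea can be sketched: rank candidate search terms by how strongly their weekly query volume correlates with reported flu cases, and keep the strongest. The terms and numbers below are made up; the real system tested millions of models against years of query logs.

```python
# A hedged sketch of the idea behind Google Flu Trends: rank candidate search
# terms by how well their weekly query volume tracks reported flu cases.
# All numbers below are invented for illustration.
import numpy as np

weekly_flu_cases = np.array([120, 140, 180, 260, 310, 290, 220, 160])

candidate_terms = {
    "flu symptoms":      np.array([ 90, 110, 150, 240, 300, 270, 200, 140]),
    "cough medicine":    np.array([ 80,  85, 120, 200, 260, 250, 190, 130]),
    "basketball scores": np.array([200, 180, 210, 190, 205, 195, 215, 185]),
}

def correlation_with_cases(series):
    """Pearson correlation between a term's query volume and flu cases."""
    return np.corrcoef(series, weekly_flu_cases)[0, 1]

ranked = sorted(candidate_terms.items(),
                key=lambda item: correlation_with_cases(item[1]),
                reverse=True)

for term, series in ranked:
    print(f"{term:20s} r = {correlation_with_cases(series):.2f}")
# Terms like "flu symptoms" correlate strongly; "basketball scores" does not.
```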
Not all Big Data insights are welcomed, though. A 2012 story in Forbes details how Target statistician Andrew Pole was able to determine, based on their shopping behavior, how likely a customer was to be pregnant. Pole called it a customer’s “pregnancy score”.
“As Pole’s computers crawled through the data, he was able to identify about 25 products that, when analyzed together, allowed him to assign each shopper a “pregnancy prediction” score. More important, he could also estimate her due date to within a small window, so Target could send coupons timed to very specific stages of her pregnancy.
One Target employee I spoke to provided a hypothetical example. Take a fictional Target shopper named Jenny Ward, who is 23, lives in Atlanta and in March bought cocoa-butter lotion, a purse large enough to double as a diaper bag, zinc and magnesium supplements and a bright blue rug. There’s, say, an 87 percent chance that she’s pregnant and that her delivery date is sometime in late August.”
Target would then use this data to send targeted coupons for baby and maternity clothes, cribs, and other related items. The targeted mailings became famous after it was discovered that Target had figured out that one particular customer, a teenage girl from Minnesota, was pregnant before her father did.
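Target’s actual model has never been published, but the scoring idea can be sketched with a toy logistic model: each predictive product carries a weight, and a weighted basket total is converted into a probability. The products, weights, and baseline below are entirely invented.

```python
# A toy "pregnancy score" in the spirit of the Forbes story. The products,
# weights, and baseline are invented; Target's real model and its ~25
# predictive products have never been published.
import math

# Hypothetical weights: positive values push the score toward "likely pregnant".
PRODUCT_WEIGHTS = {
    "cocoa-butter lotion":  1.4,
    "oversized purse":      0.9,
    "zinc supplement":      0.7,
    "magnesium supplement": 0.7,
    "bright blue rug":      0.3,
    "razor blades":        -0.5,
}
BIAS = -2.0  # baseline log-odds before any purchases are observed

def pregnancy_score(basket):
    """Turn a shopping basket into a probability via a logistic function."""
    log_odds = BIAS + sum(PRODUCT_WEIGHTS.get(item, 0.0) for item in basket)
    return 1.0 / (1.0 + math.exp(-log_odds))

basket = ["cocoa-butter lotion", "oversized purse",
          "zinc supplement", "magnesium supplement", "bright blue rug"]
print(f"pregnancy score: {pregnancy_score(basket):.2f}")  # ~0.88 with these made-up weights
```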
The amount of personal data being gathered by Big Data companies such as Google and Facebook may frighten some people because of the potential for abuse and loss of privacy, while others blithely share their most private personal details on Instagram or Facebook. The incessant cookie tracking and sharing between the FANG companies may be annoying enough when it’s an ad for a product you searched for once following you around the web, but what happens when these databases and algorithms are used to make decisions that really affect an individual’s life? Lenders are already using Big Data to verify identities and assess credit default risk for low-income, unbanked persons and businesses. It’s one thing when Netflix, Amazon, or Hulu makes a shoddy recommendation; it’s another entirely when you’re turned down for a loan or put on a terrorist watch list because of some black-box computer system.
And these black boxes will only become more opaque over the coming months and years as Big Data analytics becomes more and more reliant on artificial intelligence. As Turck notes, the algorithms behind deep machine learning were created decades ago, but only recently did data become plentiful enough and computation cheap enough to make them useful. The big paradigm shift of the next decade is that it will no longer be data scientists coming up with the models, but unsupervised machine intelligence. And as neural networks, Bayesian networks, and evolutionary algorithms take over the heavy lifting of our data systems, it is becoming less and less possible for their human masters to comprehend what’s going on under the hood.
In “The Programs that Become the Programmers”, author David Auerbach and interviewee Pedro Domingos argue that the traditional programming paradigm of strict control over the details of an algorithm doesn’t scale to Big Data sizes, and that machine learning deals with this problem in three ways (a minimal sketch follows the list):
(1) It uses probabilities rather than the true/false binary.
(2) Humans accept a loss of control and precision over the details of the algorithm.
(3) The algorithm is refined and modified through a feedback process.
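To make those three points concrete, here is a minimal sketch: a hand-rolled online logistic regression that outputs probabilities rather than true/false answers, encodes no hand-written decision rule, and refines its weights as feedback arrives. The data, the learning rate, and the hidden rule it learns are all invented for illustration.

```python
# A minimal online logistic regression illustrating the three points above:
# it outputs probabilities (1), no human writes the decision rule (2), and the
# weights are refined by feedback as each labeled example arrives (3).
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

weights = [0.0, 0.0]   # learned, not hand-coded
bias = 0.0
learning_rate = 0.1

def predict_proba(x):
    """Return a probability rather than a hard true/false answer."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def update(x, label):
    """Feedback step: nudge the weights toward the observed outcome."""
    global bias
    error = predict_proba(x) - label            # gradient of the log loss
    for i, xi in enumerate(x):
        weights[i] -= learning_rate * error * xi
    bias -= learning_rate * error

random.seed(0)
for _ in range(2000):                           # a stream of labeled examples
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    label = 1 if x[0] + x[1] > 0 else 0         # hidden rule the model must infer
    update(x, label)

print(f"P(positive | [0.8, 0.5]) = {predict_proba([0.8, 0.5]):.2f}")  # close to 1
```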
Unlike traditional programming, where bugs have to be dealt with before release, machine learning algorithms update on the fly, in real time, adjusting themselves as more and more data is added to their inputs. In this new world, computers aren’t programmed, they are trained. Today’s data scientists find themselves refining the starting inputs for the evolutionary neural networks, which then go on to develop algorithms that no human mind can understand. Facial recognition, language translation, even the systems that Facebook uses to populate a user’s news feed are all being run off of these semi-autonomous systems, and the results of such generative systems are as unfathomable to their human designers as are the contents of their own heads.
One thing in the Big Data landscape remains clear, though: the need for data scientists and subject matter experts will grow beyond what the labor force can supply. As IoT takes off over the next few years, organizations from small businesses to large enterprises will struggle to handle the sheer amount of data coming from the additional 24 billion devices expected to be deployed by 2020.[10] The networking and security infrastructure of the IoT landscape will dwarf anything we’ve seen so far. Considering the current shortage of candidates in the cybersecurity industry (17,000 unfilled positions in the state of Virginia alone), it seems that jobs for new data scientists will be secure through the next decade. As one Gartner research analyst put it, “Who’s going to do this stuff?”
Big Data has made it through the venture capital hype cycle and past the early adopter stage, and is now to the point where it is entrenched in business operations, much the way people have become accustomed to the world wide web and internet technology: it is essentially invisible. The lack of qualified people to fill these positions is causing a bit of a crisis in the industry, as many companies lose competitive ground simply because they cannot find suitable candidates.
In the short term, Big Data will no doubt lead to a renaissance of opportunity for former business analysts proficient with Python, R, and SQL. Competition between experienced professionals and a new crop of students with nothing beyond a degree will continue to define the industry as Fortune 500 companies seek competitive advantage over one another. But as long as the market remains split among the various startups and firms involved, there should be room for all. Since no company has demonstrated the ability to provide a ‘one-stop shop’ solution for the various Big Data applications available to the public, it is only logical to presume that a mixed ecosystem of public and private entities will be responsible for the future data landscape. As the job title of ‘data scientist’ becomes more overused and abused, hiring managers will be required to test candidates on their capabilities, not their titles.