vin sharma
6 min read · Sep 16, 2016

Machine Learning is the solution to the big data problem whose root cause is the Internet of Things

Yes, the title is a precious mess. But “Big Data” was always defined as a problem statement. For some, it represented the challenge of acquiring data from new sources. For others, it meant the herculean task of building a scalable infrastructure that could manage all the data. For a brave few, it meant the arcane art (or presumptive science) of extracting value from data using advanced data analysis techniques and tools.

For cloud service providers (CSPs) whose business depends upon solving these challenges, the scale of online user-generated data inspired the development of radically different hardware for datacenter infrastructure and a new kind of software for orchestrating workloads intelligently and efficiently on that infrastructure. When these cloud computing technologies — designed to increase datacenter automation — were released to the open source community, they spawned projects such as Docker, Kubernetes, and Apache Mesos. At the same time, CSPs developed data storage and processing software that could handle the scale of human-generated data. Apache Hadoop and Apache Spark are the children those ideas begat. The recent rise of Data Science as a profession stems from the acute need to detect signal in the noise of this ever-increasing flood of data. Traditional enterprise IT, already burdened by server sprawl, cost overruns, glacial procurement processes, and monolithic applications resting on data silos, lumbered like mastodons into the tar pits of stasis. The adoption of open source, cloud computing, and big data technologies, along with the adaptation of processes and people, was a tectonic shift.

Today, we face another seismic change — a new eruption of data that is several orders of magnitude greater than the accumulated tracks of surfers, shoppers, and their social networks that once seeded clouds. We look with awe upon data generated by smart phones, driverless cars, industrial drones, cube satellites, smart meters, surveillance cameras, and millions of other things that now populate the Internet. And after the shock subsides, we realize that the level of automation that allowed us to manage data in the cloud era must now scale several-fold in order to analyze data in the era of the Internet of Things (IoT).

Automation is the key to unlocking the potential of IoT while locking down its complexity. We need things to become smarter automatically while responding to their environment. We need systems to become more intelligent based on the history of their interactions with users. We need technologies and tools that can help these devices and systems learn from their experience. Where once we asked analysts to generate “insights” from data and make decisions that drove eventual changes in system operation, now we must ask systems to learn from data automatically and respond appropriately. In short, the IoT needs automation and automation needs Machine Learning.

Machine Learning — the study, development, and application of algorithms that improve their performance at tasks based on prior interactions — is the key to making things that learn from experience and get smarter with use.

Consider autonomous vehicles, one of my favorite projects at work. The vehicle is an embodied autonomous agent — an AI with wings or wheels. To build one, you have to help it construct a model of the world based on data from millions of miles of driving by test cars equipped with RADAR, LIDAR, IR, and cameras. These vehicles use maps to plan paths, but they are not, and cannot be, programmed explicitly with rules for every scenario they might encounter in the real world. For cars to operate autonomously, they must be trained, much like human student drivers, to recognize objects such as other vehicles, highway signs, lane markers, trees, and pedestrians in the visual field. They must learn to navigate and control the movement of the vehicle in response to dynamic conditions. And much like a student driver, they learn by making mistakes and improving their accuracy with practice. At first, a trainer — the data scientist — annotates the training data to label correct responses and supervises the learning process of the algorithms that make up the model. But eventually, the model learns to recognize objects, localize them in space, and track their movement well enough to operate in the real world.
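
To make the supervised pattern concrete, here is a minimal sketch in Python. The feature vectors, the three object classes, and the scikit-learn model are all stand-ins for real sensor data and a production perception stack:

```python
# Toy sketch of supervised learning: a "trainer" supplies labeled
# examples, the model improves with data, then generalizes to new inputs.
# Features and classes are hypothetical stand-ins for real perception data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic "annotated" data: 64-dim feature vectors labeled
# 0 = vehicle, 1 = pedestrian, 2 = highway sign (illustrative classes).
X = rng.normal(size=(3000, 64))
hidden_rule = rng.normal(size=(64, 3))
y = (X @ hidden_rule).argmax(axis=1)   # labels from a hidden ground truth

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy on unseen examples:", model.score(X_test, y_test))
```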

Another space that I’m tracking very closely is the use of machine learning to drive systematic trading strategies at hedge funds. Having talked to dozens of hedge fund quants, I’m now convinced that hedge funds are like heat-seeking missiles when it comes to extracting value from data. Where hedge funds were once dominated by discretionary approaches, with money managers relying on a wealth of knowledge, experience, insight, and a fair bit of luck to execute strategies, we now see the growing use of algorithmic or systematic strategies. And the algorithms are getting smarter faster. They’re feeding on treasure troves of diverse data sets, from stock market data to satellite images, to make complex decisions. Not necessarily fast — as is the case in high-frequency trading — but rather methodically. What were once simple rules and heuristics are now complex mechanisms embodied in “AI agents” that may remain dormant until they detect a pattern in the conditions and trigger a trade at just the right moment.
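
As a toy illustration of such a dormant agent, consider the sketch below. The momentum features, the weights (pretend they were learned offline from historical data), and the trigger threshold are all hypothetical; this is not a real strategy:

```python
# Toy sketch of a "dormant" systematic agent: it scores incoming market
# data with a learned rule and acts only when the score crosses a
# threshold. All features, weights, and thresholds here are hypothetical.
import numpy as np

def features(prices):
    """Two simple momentum features: short- and long-window returns."""
    return np.array([
        prices[-1] / prices[-5] - 1.0,    # 5-tick return
        prices[-1] / prices[-20] - 1.0,   # 20-tick return
    ])

learned_w = np.array([4.0, -2.0])  # pretend these were learned offline
THRESHOLD = 0.02

def maybe_trade(prices):
    score = features(prices) @ learned_w
    if score > THRESHOLD:
        return "BUY"     # pattern detected: trigger a trade
    if score < -THRESHOLD:
        return "SELL"
    return "HOLD"        # stay dormant

prices = 100 + np.cumsum(np.random.default_rng(1).normal(0, 0.5, 100))
print(maybe_trade(prices))
```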

Interestingly, the vast diversity of such machine learning problems rests on the foundation of relatively simple but powerful algebraic operations on matrices. The challenge lies in being able to handle matrices that are often large but sparse, or dense but “tall and skinny”. And the troublesome fact is that these operations need to be performed at massive scale in milliseconds. Often this is constrained by algorithmic complexity — how the running time grows with the size of the input, regardless of how well the algorithm is coded. But inevitably, we need a storage and compute infrastructure that is up to the challenge.
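
Those two shapes look like this in code. A minimal sketch with NumPy and SciPy; the sizes and densities are only illustrative:

```python
# The two matrix shapes above: large but sparse (most entries zero),
# and dense but "tall and skinny" (many rows, few columns).
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)

# Large but sparse: 100k x 100k with ~1M nonzeros, stored in CSR form.
n, nnz = 100_000, 1_000_000
rows = rng.integers(0, n, size=nnz)
cols = rng.integers(0, n, size=nnz)
vals = rng.normal(size=nnz)
A = sp.coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsr()
y = A @ np.ones(n)             # sparse matrix-vector product: O(nnz) work

# Dense but tall and skinny: 200k samples, 50 features.
T = rng.normal(size=(200_000, 50))
G = T.T @ T                    # 50 x 50 Gram matrix, the core of least squares
print(y.shape, G.shape)
```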

So how do you get there? To build a machine learning solution, you need sensors and systems that collect data from the edge, whether that means an autonomous vehicle on a highway or a point-of-sale device at a retail store. The sensors need to relay some of that data to a cloud platform designed to handle data at scale. You need models based on machine learning that can learn from the data to make inferences. You need the machine learning algorithms to be implemented for speed at scale. You need the mathematical operations — the computational kernels of these algorithms — to take advantage of processor, memory, network, and storage features to get the best performance out of the system hardware. You need systems equipped with processors with multiple integrated cores, faster memory subsystems, and architectures that can parallelize the processing.
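
The payoff from such kernels is easy to demonstrate. The sketch below compares an interpreted Python loop against a vectorized call that dispatches to an optimized BLAS kernel. The exact speedup varies by machine, but the gap is typically orders of magnitude:

```python
# Why optimized computational kernels matter: the same dot product via
# an interpreted loop versus a vectorized, BLAS-backed NumPy call.
import time
import numpy as np

n = 2_000_000
a = np.random.default_rng(0).normal(size=n)
b = np.random.default_rng(1).normal(size=n)

t0 = time.perf_counter()
s_loop = 0.0
for i in range(n):             # one scalar multiply-add per iteration
    s_loop += a[i] * b[i]
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
s_vec = a @ b                  # dispatches to an optimized kernel
t_vec = time.perf_counter() - t0

print(f"loop: {t_loop:.3f}s  vectorized: {t_vec:.5f}s  "
      f"speedup: ~{t_loop / t_vec:.0f}x")
```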

And then, as your data grows, you need scalable clusters of these systems that allow you to train a complex model based on machine learning over a big data set distributed across a large number of systems.
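
Here is a minimal sketch of that idea, synchronous data-parallel training, with the cluster simulated in-process. Each "node" holds a shard of the data, computes a gradient on its shard, and the gradients are averaged into a single update; a real cluster would do the averaging over a network:

```python
# Sketch of synchronous data-parallel training (simulated in-process):
# shard the data, compute per-shard gradients, average, apply the update.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8000, 10))
true_w = rng.normal(size=10)
y = X @ true_w + rng.normal(0, 0.1, size=8000)

n_nodes = 4
shards = list(zip(np.array_split(X, n_nodes), np.array_split(y, n_nodes)))

w = np.zeros(10)
lr = 0.1
for step in range(200):
    # Each node computes the least-squares gradient on its local shard.
    grads = [2 * Xs.T @ (Xs @ w - ys) / len(ys) for Xs, ys in shards]
    w -= lr * np.mean(grads, axis=0)   # "all-reduce": average, then apply

print("distance from true weights:", np.linalg.norm(w - true_w))
```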

These requirements are the very characteristics of systems based on the new, second-generation Intel® Xeon Phi™ processor. As a server-class product designed for high performance computing, the Intel Xeon Phi processor delivers the performance needed by machine learning algorithms. It is optimized for a subset of machine learning known as deep learning, where the model takes the form of a multi-layered neural network composed of non-linear functions. A cluster of 128 Intel Xeon Phi processors, each equipped with 64 cores, can reduce the time to train a neural network topology like AlexNet by a factor of 50 compared to a single client-grade processor.

Developers can extract maximum performance from Intel Xeon processors by using Intel's libraries of math kernels and optimized algorithms: the Intel® Data Analytics Acceleration Library (Intel® DAAL) and the Intel® Math Kernel Library (Intel® MKL). These libraries include implementations of the fast Fourier transform (FFT), general matrix multiplication (GEMM), statistical tests, and several classic machine learning algorithms that improve the performance of a wide range of higher-level ML algorithms and deep learning topologies.
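
One low-effort way to tap MKL from Python, assuming NumPy was built against it (as in the Intel Distribution for Python): large matrix products in NumPy dispatch to whatever BLAS backend it was compiled with, so on such a build the line below runs MKL's multithreaded GEMM. np.show_config() reports which backend is actually in use:

```python
# GEMM through NumPy: on a build linked against Intel MKL, this single
# matrix product dispatches to MKL's multithreaded dgemm kernel.
import numpy as np

np.show_config()               # reports the BLAS/LAPACK backend in use

rng = np.random.default_rng(0)
A = rng.normal(size=(2048, 2048))
B = rng.normal(size=(2048, 2048))
C = A @ B                      # one GEMM call: ~17 GFLOP of work
print(C[0, 0])
```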

Also, developers working with deep learning frameworks such as Caffe and Theano can benefit from the work of Intel’s software developers, who integrated Intel MKL into these frameworks. By using the Intel-optimized frameworks backed by Intel MKL, I’ve seen customers get performance on deep learning network topologies — convolutional neural networks (CNN) and recurrent neural networks (RNN) — that is 30 times better than running these frameworks unoptimized on client CPUs. Intel’s code modifications to the DL frameworks are open source. You can get the Caffe and Theano code optimized for Intel® Architecture from GitHub today.

Machine learning is what it takes to survive in the IoT era. To build and sell products and services to customers and to retain their loyalty, organizations must make things smarter. And the best known technology for making machines smart is machine learning.

I’m part of the team at Intel that can help enterprises in the transportation, financial services, healthcare, energy, and other vertical industries build and deploy large-scale solutions based on machine learning models. We can provide development platforms based on Intel Xeon and Intel Xeon Phi processors, software, tools, and training, as well as reference architectures and blueprints to accelerate the deployment of enterprise-grade machine learning solutions. I’m easy to find and always happy to talk ML to strangers.
