WiFi Has a Complexity Problem. Machine Learning Might Fix It.

How Cisco engineers apply machine learning expertise to anticipate outages, troubleshoot issues autonomously, and track down root causes.

Enterprise WiFi is undergoing a large disruption thanks to AI and machine learning. While it is a huge productivity and even a morale booster for employees, and holds promising opportunities for deeper and more targeted customer interactions, WiFi remains one of the most challenging and even frustrating domains of an IT operation — thanks mostly to complexity.

Many of these frustrations stem from the challenges in managing a highly dynamic environment, providing a good experience for customers and employees, and chasing down a root cause when things go wrong. The last of these is especially difficult.

For IT, finding a root cause means cobbling together an ever-expanding toolset for onboarding, deploying, managing and troubleshooting.

An almost endless number of variables factor into a bad WiFi experience. What type of device is it? What building is the employee in? What application is she using? What time of day is it? Did she just change locations? Is it an authentication issue? A throughput issue? A noisy channel? The route to finding a cause is never direct, and that’s if wireless is the problem at all.

A study on wireless experiences in the workplace from ZK Research uncovered two important findings regarding WiFi. One, when it comes to nearly any performance issue, whether slowness or a bad connection, wireless often gets the blame (even when it’s not actually the cause); and two, IT professionals spend a lot of their time troubleshooting it.

“With wireless networks becoming the norm for the large enterprise, the research shows that collaboration and remote troubleshooting have a long way to go,” said Zeus Kerravala, principal analyst at ZK Research, “and that administrators are spending too much of their time dealing with the connections.”

That same study from ZK Research suggests bad WiFi can even be a productivity killer as much as it is a booster. Poor network and application performance accounts for an estimated average 14 percent loss in productivity.

While WiFi has its current challenges, it also has even greater future challenges — namely the increasing demand for bandwidth. As more and more companies allow employees to roam around the office and access WiFi on their own devices, and as more connected devices join the network, more bandwidth is needed to keep pace.

The technology world entered the so-called “zettabyte era” some time ago, and the rate of data creation and consumption has not slowed since. Cisco’s Visual Networking Index, a forecast report on global Internet traffic, shows some eye-popping figures specifically for wireless. By 2021, the report says, smartphone traffic will surpass PC traffic, growing at nearly double the rate of PC traffic. Wireless and mobile devices will account for “63 percent of total IP traffic.” And the number of devices connected to IP networks will be more than three times the global population, meaning there will be 3.5 networked devices for every human being on the planet. Yet, by that year only 58 percent of the world’s population will be using the Internet, leaving plenty of room for even more growth beyond 2021.

For all these reasons, network managers and IT professionals will face a much bigger challenge if they are using outdated tools to manage and troubleshoot WiFi.

These challenges are exactly why many companies, including Cisco, are looking at ways to apply AI and machine learning to wireless environments. The wireless portion of the network provides rich opportunities for automation, insight from analytics, better experiences for customers and employees, and the ability to better equip IT staff from the helpdesk to top-level network engineers to find and fix problems when they arise.

J.P. Vasseur, Cisco VP and engineering fellow, has been working with his team of engineers and data scientists to solve many of the toughest problems using a machine learning engine they have been developing since 2016.

Cisco DNA Analytics is a cloud-based analytics platform for users of Cisco DNA Center, which is Cisco’s central management platform for the network. Vasseur and his team identified machine learning as a potential solution to many of the common frustrations of running wireless in an enterprise environment, and it’s a perfect fit for Cisco’s ongoing focus on bringing simplicity, machine learning, and analytics to the network.

Vasseur is an elite engineer with a storied history at Cisco. He’s the author of some 450 patents and 35 RFCs in his nearly 20-year tenure with the company. Having worked on core technologies and protocols, Vasseur says his passion is in driving teams of talented engineers to launch new products. And wireless has been in his sights for the last couple of years.

Specifically, he and his team at Cisco want to provide a machine learning platform that would analyze and manage wireless behavior (roaming/joining failure rates), application behavior (per-application throughput, global throughput), network resiliency (noise and interference, link quality, node and link failures), as well as other networking applications like WAN performance (4G/5G, proactive routing/QoS, traffic engineering) and security (endpoint device classification, for example). It’s a long laundry list for one highly complicated technology.

By applying a collection of machine learning algorithms to wireless traffic flows, according to Vasseur, network engineers and IT staff can get much deeper insights and cut out a great amount of unhelpful system noise, redundant alerts, and generic KPI metrics to truly understand complex network dynamics. The power of the analytics engine, he says, is its ability to autonomously identify issues, trim troubleshooting time by multiple hours, and even do root cause analysis. Specifically, the Cisco DNA Analytics platform focuses on use cases where machine learning is the only viable answer, both because of the complexity involved, and because existing solutions fall short.

Vasseur said in an interview that aside from requiring Cisco DNA Center to run the platform, the model does not require any particular software or hardware to run. It does, however, make use of an ocean-sized pool of data to do its learning.

Right now, Vasseur says, the model is running on the wireless networks of 20 pilot customers, which span nearly every industry including retail, education, finance and manufacturing. One customer in particular, a large US retailer, sees more than 200,000 clients or devices weekly. The platform also captures a staggering 16 million join and roam events every week.

“Every week we are gathering more than 1 million hours of telemetry traffic for WiFi, which is absolutely enormous,” Vasseur said. “And this is for 20 customers, and only a slice of the network [meaning only WiFi]. If you translate it into the number of years of traffic per client, we are almost at 5,000 years of telemetry traffic per client. At some point, we will have so much data that it’s not going to matter. What matters more is the quality, because we do a lot of cleanup,” such as “denoising.” That requires a great deal of both networking and machine learning expertise.

A dashboard view of Cisco’s DNA Analytics platform. The top view shows the reported issues in red, along with a band of “normal” onboarding performance in green that the machine learning model learns on its own. The blue line is the actual performance of the AP, which is host to 4,389 wireless clients. The graph at the bottom shows a pathway of events, failures, and causes, all of which are learned and diagnosed autonomously by the Cisco DNA Analytics platform.

That cleanup, Vasseur says, is essential for the model, because “it’s not just the algorithm that makes it important. It’s the access to data no one has access to.” And being able to make use of that data includes making it digestible by the network operators who will be seeking insights from it.

Gathering such huge swaths of data, anonymizing them, and putting them into the cloud enables the analytics platform to do its own autonomous learning and define what is normal and what is abnormal inside a network. It is essential to know the difference, Vasseur says, because the point is not to provide a monitor built on generic metrics, as is the common method now. Instead, the model is designed to learn the characteristics of each network individually, because every network is unique.

By contrast, generic metrics and KPIs provide only a limited view of network performance. Take CPU usage as an example. “People will say, ‘I want a log alert when my CPU usage is high.’ Just use a simple threshold when CPU goes above 90 percent, and that works fine,” Vasseur said. But that threshold cannot explain why CPUs are running high, or handle dynamic requests like, “alert me when my WiFi onboarding experience is not good.”
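To make the difference between a fixed threshold and a learned notion of "normal" concrete, here is a minimal Python sketch. It is not Cisco's model, which the article does not describe in detail. It assumes a deliberately simple stand-in technique (a per-hour median with a band based on the median absolute deviation), and the function names, the sample onboarding times, and the 3-second static threshold are all made up for illustration.

```python
from statistics import median

def learn_baseline(history, k=3.0):
    """Learn a 'normal' band per hour of day from (hour, value) samples.

    Illustrative stand-in only: band = median +/- k * scaled MAD.
    """
    by_hour = {}
    for hour, value in history:
        by_hour.setdefault(hour, []).append(value)

    bands = {}
    for hour, values in by_hour.items():
        center = median(values)
        mad = median([abs(v - center) for v in values])
        spread = k * 1.4826 * mad  # 1.4826 scales MAD to a std-dev equivalent
        bands[hour] = (center - spread, center + spread)
    return bands

def static_alert(value, threshold):
    """Classic fixed-threshold check: fire when the value exceeds the limit."""
    return value > threshold

def baseline_alert(bands, hour, value):
    """Fire when the value falls outside the learned band for that hour."""
    low, high = bands[hour]
    return value < low or value > high

# Hypothetical onboarding times in seconds: slow-ish at 9 AM (busy), fast at 3 AM.
history = (
    [(9, v) for v in (1.8, 2.0, 2.2, 2.1, 1.9)]
    + [(3, v) for v in (0.4, 0.5, 0.6, 0.45, 0.55)]
)
bands = learn_baseline(history)

# A 2-second onboarding is ordinary at 9 AM but unusual at 3 AM.
normal_at_9 = baseline_alert(bands, 9, 2.0)
odd_at_3 = baseline_alert(bands, 3, 2.0)
missed_by_static = static_alert(2.0, threshold=3.0)
```

In this toy data, the same 2-second reading is flagged at 3 AM and ignored at 9 AM, while the fixed threshold stays silent in both cases. A production system would draw on far more context than the hour of day, but the contrast in approach is the same.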

Vasseur says metrics and thresholds still have their useful applications. What’s crucial for his team, and for Cisco DNA Analytics, is knowing where an application of machine learning makes the most sense.

A machine learning model, pulling in contextual information from a large variety of indicators, can do just that. But not all models, not all platforms, are the same. One of the hardest jobs of an IT manager or even a CIO is to evaluate the actual usefulness of a vendor’s tool.

“Something that we’ve been struggling with is when you say ‘machine learning,’ you won’t be sure how smart the system is,” Vasseur said. “I made that mistake before. You start [a demo] of some fancy stuff, and the customer says, ‘I’m not interested. You guys do the math, but show me what’s wrong. That’s what matters to me.’”

Responses like that are why Vasseur and his team had a requirement for simplicity in mind from day one. Cisco DNA Center is designed to manage an entire network and take very complex processes, such as network segmentation, and turn them into simple tasks requiring only a few clicks to push policy changes across an entire network. Vasseur’s platform is designed exactly the same way.

In the future, the combination of cognitive and predictive analytics will be able not only to tell you when and why a problem happened, but also to predict outages, potential roaming issues, onboarding trouble, and other common wireless problems before they happen.

The Cisco DNA Analytics platform, Vasseur says, is also capable of taking these analytical insights further, painting a picture that would be nearly impossible to do with the current network operator toolset: a problem growing over time. Whether it is the degradation of an AP’s performance or repeated throughput issues, Cisco DNA Analytics can show a performance map that highlights why issues have been popping up and trace them back to where they started.

The Cisco DNA Analytics platform is capable of tracing single issues back to a root cause or original occurrence, displayed in the form of a timeline for any given AP.

It’s a way of providing a complete picture where network engineers previously may only have a hunch based on a few touchpoints of disparate data. Instead, the user gets a historical map of how a single AP has performed over time along with a color representation of the number of issues it has seen on its pathway to bad health.

“We are using machine learning to find out what is a deviation from the norm,” Vasseur said. “What is really cool is now we can also step back in time. It’s not just a view of real-time, which you also need, but now you can correlate issues over time.”

For example, if an access point is consistently exhibiting poor throughput, a network admin can use Cisco DNA Analytics to quickly view a historical map of that AP’s performance over time and trace the poor performance back to a specific date. Maybe it started with something simple like a firmware upgrade, or that new security analyst doing an 800-gigabyte packet capture first thing in the morning, or that new 11 AM meeting when everyone simultaneously downloads a four-gigabyte presentation file. Pinpointing the origin of poor performance gives the admin a concrete starting point, narrowing down the problem before anyone opens Wireshark and starts troubleshooting aimlessly.
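The idea of tracing poor performance back to a specific date can be illustrated with a very small change-point search. The Python sketch below is an assumption-laden toy, not how Cisco DNA Analytics works: it uses a basic least-squares split to find the single day on which an AP's average throughput shifted, and the function name, dates, and throughput values are invented for the example.

```python
def find_change_point(series, min_segment=3):
    """Find the single split that best separates a series into two levels.

    series: list of (label, value) pairs in time order.
    Returns (label_of_first_point_after_change, mean_before, mean_after).
    Illustrative technique: minimize the summed squared error around the
    mean of each side of the split.
    """
    values = [v for _, v in series]
    n = len(values)
    if n < 2 * min_segment:
        raise ValueError("series too short for the requested segment size")

    best = None
    for split in range(min_segment, n - min_segment + 1):
        left, right = values[:split], values[split:]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        cost = (sum((v - left_mean) ** 2 for v in left)
                + sum((v - right_mean) ** 2 for v in right))
        if best is None or cost < best[0]:
            best = (cost, split, left_mean, right_mean)

    _, split, left_mean, right_mean = best
    return series[split][0], left_mean, right_mean

# Hypothetical daily average throughput (Mbps) for one AP; dates are made up.
daily_throughput = [
    ("2018-03-01", 95), ("2018-03-02", 97), ("2018-03-03", 94),
    ("2018-03-04", 96), ("2018-03-05", 95), ("2018-03-06", 60),
    ("2018-03-07", 58), ("2018-03-08", 62), ("2018-03-09", 59),
    ("2018-03-10", 61),
]

start_date, before, after = find_change_point(daily_throughput)
print(f"Throughput dropped from about {before:.0f} to {after:.0f} Mbps "
      f"starting {start_date}")
```

With a date like this in hand, an admin could check what changed on the network that day, such as a firmware upgrade or a new recurring meeting, instead of starting from a blank slate.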

Vasseur says these features are the type of powerful insight that only a machine learning platform, learning from a very large amount of relevant data, can provide.

The future will show how these techniques develop over time, and how a competitive market implements them to make the job of managing wireless easier for network engineers. Vasseur’s team is focusing on applying machine learning to a number of other networking applications like SD-WAN and endpoint device recognition that are showing promising results. He expects they will be demoing some of these applications in the near future.

To learn more about Cisco DNA Analytics, check out the informational page here. J.P. Vasseur recorded a live demo with Tech Field Day, found here.

Also, hit the follow button for future articles on Shifted, and follow Owen Lystrup on Twitter here.