ML-Powered Anomaly Detection in IoT: Why Is It So Important for Businesses?
The Internet of Things is on the way to becoming the standard in Manufacturing, Healthcare, Agriculture, Energy, and other industries. Businesses that use IoT devices extensively have access to an enormous amount of information. But collected data is hard to benefit from without proper processing and analysis.
One of the most popular ways to benefit from IoT is to leverage Machine Learning to detect abnormal events or suspicious changes in the pool of collected data. This process is called Anomaly Detection, and it makes it possible to detect events and patterns that human experts would miss. Compared to a team of human experts, Machine Learning offers better automation, real-time analysis, sensitivity to the smallest changes in datasets, precision, and speed of self-learning from experience. Machine Learning algorithms can find patterns of abnormal behavior that your experts wouldn’t even think about in the first place.
Anomalies in equipment operation can lead to machinery breakdowns and increased production downtime, or even be a sign of hacker attacks. You can prevent this by introducing an ML-powered system for Anomaly Detection. But what information do you need to collect to make such a project possible? How can the quality, volume, velocity, and variety of data affect the final result? Is more information always better, or are there nuances in quantity? We will cover all of these questions and more in this article.
How Anomaly Detection with Machine Learning works
The Anomaly Detection process combines both automatic and human-assisted stages and usually consists of the following steps:
- Collecting the data.
- Determining the problems, the challenges, and the goals.
- Conducting exploratory data analysis.
- Feeding the data into the ML system.
- Training the ML algorithms on the data.
- The ML system alerts on any deviation from the model.
- A human expert decides whether this deviation can be considered an anomaly.
- The ML system learns from this decision and takes it into account in future predictions.
- The ML system continues to accumulate patterns.
The order of these steps may change depending on the case, and it may be necessary to return to a previous step. For example, exploratory data analysis may reveal that the collected data is insufficient and more needs to be gathered.
Before feeding information to the system, human experts can label data to provide more context for Machine Learning models. Data labeling is the process of identifying raw data and placing it into categories. Labeled data is essential for the Supervised Machine Learning technique, which requires both normal and abnormal labeled sets for building a predictive model. In contrast, Unsupervised Machine Learning does not require labeled data.
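To make the distinction concrete, here is a minimal Python sketch of how labeled and unlabeled data drive the two techniques. The sensor readings, labels, and thresholds are invented for illustration; real models are far more sophisticated than these toy rules.

```python
from statistics import mean, stdev

# Hypothetical labeled sensor readings: (value, label).
labeled = [(19.9, "normal"), (20.1, "normal"), (20.0, "normal"),
           (19.8, "normal"), (20.2, "normal"), (20.0, "normal"),
           (19.9, "normal"), (20.1, "normal"), (20.3, "normal"),
           (19.7, "normal"), (80.0, "anomaly")]

# Supervised sketch: the labels themselves define a decision boundary
# (here, the midpoint between the highest normal and lowest anomalous value).
normal_vals = [v for v, lbl in labeled if lbl == "normal"]
anomalous_vals = [v for v, lbl in labeled if lbl == "anomaly"]
threshold = (max(normal_vals) + min(anomalous_vals)) / 2

def supervised_predict(value):
    return "anomaly" if value > threshold else "normal"

# Unsupervised sketch: the labels are ignored; a reading is flagged
# when it falls more than two standard deviations from the mean.
values = [v for v, _ in labeled]
mu, sigma = mean(values), stdev(values)

def unsupervised_predict(value):
    return "anomaly" if abs(value - mu) > 2 * sigma else "normal"
```

Note how the unsupervised rule finds the outlier without ever seeing a label, while the supervised rule needs at least some anomalous examples to learn its boundary.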
But before the algorithms can do their magic, the information needs to be collected properly. Here is what an abstract data pipeline looks like:
Collection stage — information is collected from IoT devices, sensors, files, application logs, databases, or cloud storage. This process happens between the data source or producer and the data consumer.
Ingestion stage — at this stage, the raw data is transferred into the system; it can also be stored here for further processing and analysis.
Preparation stage — the information is converted from raw data into a format appropriate for a particular system.
Analytics and Visualization stages — Visualization and Analytics are usually part of an application, but in the case of ML, they can be a vital part of the model development flow.
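As a rough illustration, the stages above can be sketched as a few Python functions. Everything here (the message format, the field names, the trivial analytics step) is an assumption made for the sake of the example, not a prescription for a real pipeline.

```python
import json

# Hypothetical raw messages as they might arrive from IoT sensors.
raw_messages = [
    '{"sensor": "temp-01", "value": "20.4", "ts": 1700000000}',
    '{"sensor": "temp-01", "value": "20.7", "ts": 1700000060}',
    'not-valid-json',  # transport noise the ingestion stage must survive
]

def ingest(messages):
    """Ingestion: accept raw payloads, drop ones that cannot be parsed."""
    records = []
    for msg in messages:
        try:
            records.append(json.loads(msg))
        except json.JSONDecodeError:
            continue  # a real pipeline would route this to a dead-letter queue
    return records

def prepare(records):
    """Preparation: convert raw fields into the types analysis expects."""
    return [{"sensor": r["sensor"], "value": float(r["value"]), "ts": int(r["ts"])}
            for r in records]

def analyze(rows):
    """Analytics: a trivial stand-in, here just the mean sensor value."""
    return sum(r["value"] for r in rows) / len(rows)

rows = prepare(ingest(raw_messages))
```

Each stage hands a cleaner view of the data to the next, which is exactly where an ML model would later plug in.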
As you can see, Anomaly Detection with Machine Learning is a sophisticated, multi-step process. Today we will focus on the first prerequisite for an Anomaly Detection project to succeed — the proper collection of information from IoT devices. Here is some advice based on our own practical experience.
Key Tips on Data Collection and Storage for Anomaly Detection
Collect as much data as you can
If you want to build an Anomaly Detection project, it makes sense to keep other possibilities for Machine Learning projects in mind. Before starting the data collection process, consider what types of information you can obtain from your units and get as much as you can, because the direction of a project may change based on the data that was collected. The obtained information may not suit Anomaly Detection, for example, but be perfect for Predictive Maintenance or Fraud Detection. Getting as much information as possible empowers you to explore all possible scenarios with your ML team. Sometimes the result is not the planned ML solution but a complete change of your initial goal.
Raw Data is better than Aggregated Data
It is better to store raw data because you can always apply various aggregations to it later, whereas it is impossible to recover the original data from an aggregated set. Aggregation tends to smooth out irregularities, and these exact irregularities can be crucial for precise Anomaly Detection. For example:
- A temperature sensor detects a value significantly higher than the norm at one particular moment.
- The aggregated data only shows the median value for the period, making it impossible to figure out when exactly the anomaly happened and what the actual temperature was at that moment.
- To prevent this, you need to store all values from the set, so that the ML algorithm can detect the peak value and identify the abnormal behavior.
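A toy Python example of this scenario, with invented readings and an assumed alarm threshold, shows how aggregation hides the spike:

```python
from statistics import median

# Hypothetical raw temperature readings for one period; one reading spikes.
raw = [20.1, 20.3, 19.9, 20.2, 95.0, 20.0, 20.1, 19.8, 20.2, 20.0]

# Aggregated storage keeps only the median for the period:
# the spike disappears entirely.
aggregated = median(raw)

# With the raw data, the spike is trivially detectable.
threshold = 50.0  # assumed alarm level for this sensor
spikes = [v for v in raw if v > threshold]
```

The aggregated median sits comfortably below the alarm level even though one raw reading was far above it, which is exactly the information an Anomaly Detection model needs.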
Your company may already have a big collection of aggregated data. Does that mean it is not suitable for Machine Learning? You can still make an effort to use aggregated data for an Anomaly Detection project and get some results, but the solution will most likely be less precise than one built on raw data. Aggregated data is better than nothing: a team of Machine Learning and Data Science experts can still use it for exploratory analysis and for providing suggestions, insights, and recommendations. However, data can be aggregated in ways that make it completely unusable.
Label the data properly
If you are in control of labeling, make sure to do it right:
- We had a use case where the same label indicated multiple behaviors or statuses of a device. Don’t make the mistake of labeling statuses like “broken” and “in maintenance” the same way. It is very important for Machine Learning to work with correct, detailed labels that match the precise behavior of the device.
- Label the data according to the focus of your project.
- If possible, it is better to label data automatically to eliminate human bias. Human experts sometimes can’t label anomalous behavior in time and right on the spot, which results in incorrect information for the ML algorithm.
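A small sketch of automatic labeling, assuming hypothetical telemetry fields like `error_code` and `maintenance_flag`: a fixed rule derives distinct, precise statuses instead of relying on a human to pick the right label on the spot.

```python
# Hypothetical automatic labeling rule: instead of one vague "not working"
# label, derive precise statuses from the device's own telemetry fields.
def auto_label(record):
    if record.get("maintenance_flag"):
        return "in_maintenance"
    if record["error_code"] != 0:
        return "broken"
    return "operational"

records = [
    {"error_code": 0, "maintenance_flag": False},
    {"error_code": 7, "maintenance_flag": False},
    {"error_code": 0, "maintenance_flag": True},
]
labels = [auto_label(r) for r in records]
```

Because the rule is deterministic, "broken" and "in maintenance" can never be confused, which is precisely the mistake from the use case above.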
Store the historical data
In one of our cases, the client deleted their information every six months, which limited the possibility of implementing an ML solution. Half a year is a very short period for Machine Learning algorithms, especially in the time of the Coronavirus pandemic. Why did the client do that? Because storing this information was expensive. What can you do in the same situation? You have two options: you can store the data in the cloud or on your own storage. Both options are equally usable. When you keep information on your own storage, you are responsible for its safety and in complete control of ensuring it is not deleted or damaged in any way. Cloud storage offers multiple copies of your data and takes responsibility for security.
The main difference is in the price. There are cheap ways to store information in the cloud if you pick “cold” storage. “Cold storage” means that your data is archived, and when you request access to it, a certain amount of time must pass, from a couple of hours to several days. “Hot storage”, on the contrary, means that you can get your information instantly or almost instantly. The longer it takes to access your data, the lower the storage price. For companies that plan to use Machine Learning solutions in the future, it makes sense to leverage “cold” storage in the cloud.
Keep in mind that an ML solution needs at least one year of information. That much is required to obtain a complete understanding of a system, including seasonal patterns, how the system operates on holidays, and its behavior during failures or accidents. You can always get rid of unnecessary information later, but having the most precise raw data for only a short period probably won’t work for Machine Learning.
Select data storage with ML in mind
When you are at the design stage of application development and plan to use ML in the future, think through what data types your application and ML algorithms will most probably deal with and choose data storage appropriately:
- For structured data (data collected from sensors, weblogs, network data, tabular data sources, etc.), consider relational databases over NoSQL ones, as multiple ML algorithms work best with tabular data by design. You might still store structured data in NoSQL databases, but keep in mind that the structure of the information is not always the same in schema-less databases. That means some features might be missing for a subset of historical data records, affecting ML algorithm performance as a result.
- For unstructured data, where information doesn’t have a predefined data model (e.g., schema-less information, text, images, audio, etc.), consider using NoSQL databases or data lakes. Be prepared to spend additional efforts later if you decide to transform data into a data warehouse to make more sense of it for a specific business purpose.
- The majority of data in IoT software systems comes in unstructured formats. In these cases, experts need to apply data mining techniques to prepare the data for ML algorithms. This can be a non-trivial and time-consuming task because unstructured data is hard to analyze. Making sense of it often involves examining individual pieces of data to pick out potential features and then checking whether those features occur in other data entries within the entire dataset.
- If you plan to implement real-time Machine Learning in your system, i.e., you need to handle a continuous flow of transactional data on the fly, consider storage that supports the streaming of real-time events.
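For intuition, here is a minimal streaming-style detector in Python that scores each event against a sliding window of recent values as it arrives. The window size, the z-score rule, and the sample stream are all assumptions for illustration; production systems would sit on top of dedicated streaming storage and far more robust models.

```python
from collections import deque
from statistics import mean, stdev

class StreamingDetector:
    """Flags a reading that deviates strongly from a sliding window
    of recent values -- a minimal stand-in for real-time detection."""

    def __init__(self, window=10, z=3.0):
        self.window = deque(maxlen=window)
        self.z = z

    def observe(self, value):
        """Score one incoming event, then add it to the window."""
        is_anomaly = False
        if len(self.window) >= 3:
            mu, sigma = mean(self.window), stdev(self.window)
            if sigma > 0 and abs(value - mu) > self.z * sigma:
                is_anomaly = True
        self.window.append(value)
        return is_anomaly

detector = StreamingDetector()
stream = [20.0, 20.1, 19.9, 20.2, 20.0, 90.0, 20.1]
flags = [detector.observe(v) for v in stream]
```

Because each event is scored the moment it arrives, the spike is flagged on the fly rather than discovered hours later in a batch job.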
Conclusion
Data is one of the most impactful business drivers nowadays, which means that collecting significant amounts of quality data may be the key to ML-powered solutions your company can benefit from.
If you’re not sure your software system takes that into account and accumulates enough quality data, or if you’re just planning to start collecting data for future ML-powered solutions, consider reaching out to in-house or external ML experts. They can help software engineers build a data collection pipeline that best suits your business objectives. Proficient ML experts will also provide valuable insights on what information needs to be collected in your particular case to benefit from Anomaly Detection or any other ML solution.