
Use of Machine Learning in System Security

Security systems are natural targets for malicious tampering because there are often obvious gains for attackers who manage to bypass them. Systems powered by machine learning expose a new attack surface that adversaries with even basic knowledge of the field can exploit. In this article, I will take you through the use of Machine Learning in System Security.

Hacking system environments by exploiting design or implementation flaws is nothing new, but fooling the statistical models themselves is a different matter altogether.

Security Vulnerabilities in Machine Learning Algorithms

To understand the vulnerabilities of machine learning algorithms, let’s take a look at how the environment in which these techniques are applied affects their performance. By analogy, consider a swimmer who has learned and practised swimming in pools all their life.

They will likely be a good swimmer in pools, but if they are suddenly thrown into the sea, they may not be equipped to cope with strong currents and a hostile environment, and they are likely to struggle.

Machine learning techniques are generally developed under assumptions of data stationarity, feature independence and low stochasticity. The training and testing data sets are assumed to be drawn from populations whose distributions do not change over time, and samples are assumed to be independent and identically distributed.
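To make the stationarity assumption concrete, here is a minimal sketch, using scikit-learn on synthetic two-dimensional data (the class means, sample sizes and drift amount are illustrative assumptions, not drawn from any real system), of how a model’s accuracy collapses once the test distribution drifts away from the training distribution:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Training data: two well-separated Gaussian classes (the "swimming pool").
X_train = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(3, 1, (500, 2))])
y = np.array([0] * 500 + [1] * 500)

clf = LogisticRegression().fit(X_train, y)

# A test set drawn from the SAME distributions: the stationarity assumption
# holds, so accuracy stays high.
X_same = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(3, 1, (500, 2))])
print("in-distribution accuracy:", clf.score(X_same, y))

# A test set after drift (both class means shifted toward each other): the
# model is unchanged, yet accuracy collapses toward chance.
X_drift = np.vstack([rng.normal(1.5, 1, (500, 2)), rng.normal(1.5, 1, (500, 2))])
print("post-drift accuracy:", clf.score(X_drift, y))
```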

Machine learning algorithms are generally not designed to be effective in adversarial environments where these assumptions are broken. Fitting a descriptive yet durable model to detect adaptive adversaries who actively try to evade correct classification is a difficult task. Adversaries will try to break any assumption practitioners have made, as long as doing so is the path of least resistance into a system.

A large class of machine learning vulnerabilities results from the fundamental problem of imperfect learning. A machine learning algorithm attempts to fit a hypothesis function that maps points drawn from some data distribution space into discrete categories or onto a continuous spectrum.

As a simple thought experiment, let’s say you want to train a statistical learning agent to recognize cross-site scripting (XSS) attacks on web applications. The ideal result is an agent capable of detecting all possible permutations of XSS input, missing nothing and raising no false positives.

In reality, we will never be able to produce perfectly effective systems for significantly complex problems, because the learner can never receive perfect information. We are unable to provide the learner with a dataset drawn from the entire distribution of all possible XSS inputs.
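As a toy illustration of this imperfect-information problem, the sketch below trains a character n-gram classifier on a handful of XSS strings (the benign and malicious payloads are invented for illustration and are nowhere near a real training corpus). An obfuscated permutation the learner never saw may or may not be caught, which is exactly the gap the thought experiment describes:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A handful of invented payloads; a real corpus would be vastly larger and
# still could not cover the full distribution of possible XSS inputs.
benign = ["hello world", "search?q=shoes", "user=alice&page=2", "nice article!"]
xss = [
    "<script>alert(1)</script>",
    "<img src=x onerror=alert(1)>",
    "<svg onload=alert(1)>",
    "javascript:alert(document.cookie)",
]

# Character n-grams plus logistic regression: a simple statistical learner.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    LogisticRegression(),
)
model.fit(benign + xss, [0] * len(benign) + [1] * len(xss))

# An obfuscated permutation the learner never saw; depending on which n-grams
# it happened to learn, it may slip through undetected.
probe = "<ScRiPt >confirm`1`</ScRiPt>"
print("flagged as XSS:", bool(model.predict([probe])[0]))
```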

Therefore, there is a segment of the distribution that we intend the learner to capture, but about which we have not provided enough information for it to learn. Modelling error is another phenomenon that contributes to the adversarial space of a statistical learner. Statistical learning forms abstract models that describe real data, and modelling error arises from the natural imperfections in these trained models.

Even perfect learners can exhibit vulnerabilities, because the Bayes error rate may be non-zero. The Bayes error rate is the lower bound on the error achievable by a given combination of statistical classifier and feature set. This error rate is useful for evaluating the quality of a feature set, as well as for measuring the efficiency of a classifier.

The Bayes error rate represents a theoretical limit on the performance of a classifier, which means that even when we provide a classifier with a complete representation of the data, eliminating every source of imperfect learning, there remains a finite set of adversarial samples that can cause classification errors.
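For simple assumed distributions, the Bayes error can be computed directly. The sketch below (NumPy/SciPy; the two overlapping Gaussian class-conditional densities and the equal priors are assumptions chosen purely for illustration) integrates the smaller of the two weighted densities, the region where even the optimal decision rule is forced to err:

```python
import numpy as np
from scipy.stats import norm

# Two overlapping class-conditional densities, assumed Gaussian, equal priors.
xs = np.linspace(-10.0, 12.0, 200_001)
p0 = norm.pdf(xs, loc=0.0, scale=1.0)  # p(x | class 0)
p1 = norm.pdf(xs, loc=2.0, scale=1.0)  # p(x | class 1)

# The Bayes error is the integral of the smaller of the two weighted densities:
# no decision rule can do better than this floor.
dx = xs[1] - xs[0]
bayes_error = np.sum(np.minimum(0.5 * p0, 0.5 * p1)) * dx
print(f"Bayes error rate: {bayes_error:.4f}")  # about 0.1587 for these Gaussians
```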

Picture the theoretical data population for which we want to develop a statistical learning model, and its relationship to the training and testing data distribution spaces.

Essentially, the training data that we provide to a machine learning algorithm is drawn from an incomplete segment of the theoretical distribution space. When the time comes to evaluate the model, in the lab or in the wild, the test set may contain a segment of data whose properties are not captured in the training data distribution; we call this segment the adversarial space.

Attackers can exploit these pockets of adversarial space, between the data manifold fitted by a statistical learning model and the theoretical distribution space, to trick machine learning algorithms.
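As a rough sketch of what such an exploit can look like (again on synthetic data with a plain logistic-regression model; the starting point and step size are arbitrary choices), an attacker who knows the model’s weights can nudge a correctly classified point against the gradient direction until the predicted label flips, in the spirit of fast-gradient-sign attacks:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(3, 1, (500, 2))])
y = np.array([0] * 500 + [1] * 500)
clf = LogisticRegression().fit(X, y)

x = np.array([3.0, 3.0])          # a point the model confidently labels class 1
w = clf.coef_[0]                  # the learned weight vector
step = -0.5 * np.sign(w)          # FGSM-style step against the gradient

# Walk the point through the adversarial space until its label flips.
x_adv = x.copy()
while clf.predict([x_adv])[0] == 1:
    x_adv += step

print("original point:", x, "-> class", clf.predict([x])[0])
print("adversarial point:", np.round(x_adv, 2), "-> class", clf.predict([x_adv])[0])
```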

Machine learning practitioners and system designers expect training and testing data to be drawn from the same distribution space, and further assume that the trained model covers all characteristics of the theoretical distribution. Blind spots in machine learning algorithms arise from this gap between expectation and reality.

Statistical learning models derive their information from the data fed into them, and vulnerabilities in these systems naturally arise from gaps in that data. As practitioners, it is important that we ensure the training data is as close as possible to the real distribution.
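One minimal way to act on that advice in practice (a sketch using SciPy’s two-sample Kolmogorov–Smirnov test; the feature arrays and the 0.01 threshold are placeholder assumptions) is to compare each feature’s training distribution against live production data continuously and flag drift before it becomes a blind spot:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(2)
train_feature = rng.normal(0.0, 1.0, 5_000)  # a feature as seen at training time
live_feature = rng.normal(0.6, 1.0, 5_000)   # the same feature in production

# A two-sample KS test asks whether both samples plausibly share a distribution.
stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:  # placeholder threshold; tune per feature and traffic volume
    print(f"drift detected (KS={stat:.3f}, p={p_value:.1e}); consider retraining")
else:
    print("no significant drift between training and live data")
```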

At the same time, we need to engage continuously in proactive security defence and stay aware of the different attack vectors, so that we can design algorithms and systems that are more resistant to attack. I hope you liked this article on the use of Machine Learning in system security. Feel free to ask your valuable questions in the comments section below.
