Surprise Test for Deep Learning Model

Neel K.
Published in Analytics Vidhya · 5 min read · Mar 31, 2020

Machine learning models, like other software, are vulnerable to incorrect outputs and adversarial attacks. It is necessary to check their correctness and robustness through various tests before deployment.

Picture from JetRuby.com

Which kinds of tests?

Some common testing approaches are A/B testing, an audience-specific test; the smoke test (build test), commonly known as a day-zero test; and the sanity test (release test), which is conducted before delivering software for acceptance testing such as alpha, beta, and gamma tests. Here is a brief categorization of these approaches.

Black-Box Testing:

Traditional approaches like black-box testing are common for testing software functions/blocks. In black-box testing we treat the system as an opaque box: we apply several inputs and check whether the box responds the way it was designed to. This technique is also known as input-output testing, behavioral testing, and specification-based testing, mainly because it checks the functionality of blocks. A few common methods of this approach are equivalence partitioning, boundary value analysis, decision tables, and state transition testing.
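
As a concrete illustration, here is a minimal boundary value analysis sketch in Python; the validate_age function and its 18-65 range are made up for the example, not taken from any real system.

```python
# Minimal boundary value analysis sketch. The system under test (validate_age)
# and its accepted range of 18..65 are hypothetical, chosen only to illustrate
# testing the values just below, on, and just above each boundary.

def validate_age(age: int) -> bool:
    """Hypothetical function under test: accepts ages between 18 and 65."""
    return 18 <= age <= 65

def test_boundary_values():
    assert validate_age(17) is False  # just below the lower boundary
    assert validate_age(18) is True   # on the lower boundary
    assert validate_age(19) is True   # just above the lower boundary
    assert validate_age(64) is True   # just below the upper boundary
    assert validate_age(65) is True   # on the upper boundary
    assert validate_age(66) is False  # just above the upper boundary

test_boundary_values()
```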

White-Box Testing:

In contrast to black-box testing, here the box is transparent: we observe each tiny element and how it responds to input. It is also known as structural testing, glass-box testing, clear-box testing, and open-box testing. It mainly checks internal code and security. Some popular approaches are path coverage, statement coverage, and branch coverage.
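
For instance, branch coverage asks whether every branch of every conditional has been executed by at least one test. A tiny sketch (with a made-up classify_temperature function) shows why a single test input is not enough:

```python
# Tiny branch coverage illustration with a made-up function. A single test input
# can only execute one of the two branches; both tests together give full branch
# coverage (which a tool such as coverage.py can report).

def classify_temperature(celsius: float) -> str:
    if celsius >= 30.0:
        return "hot"          # branch 1
    return "comfortable"      # branch 2

def test_both_branches():
    assert classify_temperature(35.0) == "hot"          # exercises branch 1
    assert classify_temperature(20.0) == "comfortable"  # exercises branch 2

test_both_branches()
```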

So, if we already have these techniques, why do we need something special?

Traditional techniques do not work well on deep learning systems, mainly because these systems have very complex structures in which every neuron relates to thousands of other neurons, making their behavior difficult to visualize and reason about. The two most common assumptions used by previous state-of-the-art techniques for verifying DL systems are:

1. Metamorphic Testing: Similar inputs should produce similar outputs. Example: DeepTest (see the sketch after this list).

2. Neuron Coverage: More diverse inputs lead to more effective tests. Example: DeepXplore.
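
To make the first assumption concrete, here is a minimal metamorphic-test sketch in Python. The predict interface and the brightness shift are assumptions for illustration, not the actual DeepTest implementation.

```python
import numpy as np

def metamorphic_brightness_check(model, image: np.ndarray, delta: float = 0.05) -> bool:
    """Check a simple metamorphic relation: a slightly brightened image (a
    'similar input') should keep the same predicted class ('similar output').
    `model` is assumed to expose predict(batch) -> class probabilities."""
    brightened = np.clip(image + delta, 0.0, 1.0)                    # follow-up input
    original_label = int(np.argmax(model.predict(image[None, ...])))
    followup_label = int(np.argmax(model.predict(brightened[None, ...])))
    return original_label == followup_label                          # relation holds?
```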

In neuron coverage [Pei et al., 2019] we take the ratio of neurons whose activation exceeds a certain threshold for a given set of inputs. This does not fully address the first assumption, so another method, DeepGauge [Lei Ma et al.], proposes k-multisection neuron coverage. In this method, the range of activation values each neuron shows on the training data is divided into k buckets, and coverage counts the buckets actually hit by the test inputs' activations. Inputs that push a neuron's activation outside the training range are harder to account for.
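
Here is a rough numpy sketch of k-multisection neuron coverage under this description; it assumes the per-neuron activation matrices have already been extracted from the model, which is not how DeepGauge itself is packaged.

```python
import numpy as np

def k_multisection_coverage(train_acts: np.ndarray,
                            test_acts: np.ndarray,
                            k: int = 10) -> float:
    """Rough k-multisection neuron coverage sketch.

    train_acts, test_acts: (num_inputs, num_neurons) activation values,
    assumed precomputed. Each neuron's [min, max] training range is split
    into k equal buckets; coverage is the fraction of neuron-buckets hit
    by at least one test input.
    """
    low = train_acts.min(axis=0)                  # per-neuron training minimum
    high = train_acts.max(axis=0)                 # per-neuron training maximum
    num_neurons = train_acts.shape[1]
    covered = np.zeros((num_neurons, k), dtype=bool)

    for x in test_acts:                           # one activation vector per test input
        in_range = (x >= low) & (x <= high)       # out-of-range activations are ignored
        scaled = (x - low) / np.where(high > low, high - low, 1.0)
        buckets = np.clip((scaled * k).astype(int), 0, k - 1)
        covered[np.arange(num_neurons)[in_range], buckets[in_range]] = True

    return covered.sum() / (num_neurons * k)
```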

Surprise Adequacy for Deep Learning Systems (SADL)

SADL, proposed by Jinhan Kim et al., measures how surprising an input is to a DL system relative to the inputs seen during training. From this it derives a fine-grained test adequacy metric in two ways: one estimates the likelihood of an input under a probability density obtained by kernel density estimation over the training activations, and the other measures distances between activation traces. If this is difficult to follow, the breakdown below should help.

1. Surprise Adequacy (SA)

Let's say x is an input from a set X. After the activations are applied, the activation trace of x, written α(x), is the vector of activation values of the selected neurons, and the set of traces for X is AT(X) = {α(x) | x ∈ X}. Activation traces are used to measure the relative novelty of a given input with respect to the inputs used for training. Likelihood-based SA (LSA) uses Kernel Density Estimation (KDE) over these traces to obtain the surprise of each new input with respect to the estimated density.

Only neurons from selected layers whose activation variance exceeds a threshold are used for the density estimate f̂, and the surprise of an input x is computed as LSA(x) = −log(f̂(x)).
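
A rough sketch of this idea with scipy's Gaussian KDE is below; the variance filtering and bandwidth handling are simplified compared to the paper, and the activation traces are assumed to be precomputed.

```python
import numpy as np
from scipy.stats import gaussian_kde

def likelihood_surprise(train_traces: np.ndarray,
                        new_trace: np.ndarray,
                        var_threshold: float = 1e-5) -> float:
    """Simplified LSA sketch.

    train_traces: (num_train_inputs, num_neurons) activation traces from training.
    new_trace:    (num_neurons,) activation trace of the input under test.
    Low-variance neurons are dropped, a KDE is fit on the remaining dimensions,
    and the surprise is LSA(x) = -log(f_hat(x)).
    """
    keep = train_traces.var(axis=0) > var_threshold       # variance filtering
    kde = gaussian_kde(train_traces[:, keep].T)           # density over kept neurons
    density = kde(new_trace[keep].reshape(-1, 1))[0]      # f_hat(x)
    return float(-np.log(density + 1e-30))                # small epsilon avoids log(0)
```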

Secondly, Distance-based Surprise Adequacy (DSA) measures surprise with Euclidean distances between activation traces: roughly, the distance from the new input's trace to the nearest training trace of the same class, normalized by the distance from that trace to the nearest training trace of a different class. Because it exploits class boundaries, DSA is more effective than LSA for classification tasks.

However, it is not suitable for problems with no class boundaries (e.g., predicting the steering angle in self-driving).
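
A small sketch of DSA under this description, assuming the training traces, their labels, and the new input's trace and predicted class are already available:

```python
import numpy as np

def distance_surprise(train_traces: np.ndarray,
                      train_labels: np.ndarray,
                      new_trace: np.ndarray,
                      predicted_class: int) -> float:
    """Simplified DSA sketch for classification.

    dist_a: distance from new_trace to the nearest training trace of the
    predicted class. dist_b: distance from that nearest same-class trace to
    the nearest training trace of any other class. DSA = dist_a / dist_b;
    larger values mean the input lies closer to a class boundary.
    """
    same = train_traces[train_labels == predicted_class]
    other = train_traces[train_labels != predicted_class]

    dists_same = np.linalg.norm(same - new_trace, axis=1)
    nearest_same = same[np.argmin(dists_same)]
    dist_a = float(dists_same.min())

    dist_b = float(np.linalg.norm(other - nearest_same, axis=1).min())
    return dist_a / dist_b
```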

2. Surprise Coverage (SC)

Surprise Coverage measures how well a set of inputs covers the range of Surprise Adequacy values: the SA range up to a predefined upper bound is divided into segments (buckets), and coverage is the fraction of buckets hit by at least one input. Discretizing LSA and DSA in this way gives Likelihood-based Surprise Coverage (LSC) and Distance-based Surprise Coverage (DSC).
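
A small sketch of the bucketing idea (the upper bound and the number of buckets are free parameters, chosen arbitrarily here):

```python
import numpy as np

def surprise_coverage(sa_values, upper_bound: float, num_buckets: int = 1000) -> float:
    """Surprise Coverage sketch: split [0, upper_bound] into equal buckets and
    report the fraction of buckets hit by at least one input's SA value."""
    sa = np.asarray(sa_values, dtype=float)
    sa = sa[(sa >= 0) & (sa <= upper_bound)]       # SA values beyond the bound are ignored
    buckets = np.clip((sa / upper_bound * num_buckets).astype(int), 0, num_buckets - 1)
    return len(np.unique(buckets)) / num_buckets
```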

Results:

Results presented in the paper cover four datasets (MNIST, CIFAR-10, Dave-2 & Chauffeur) and five different adversarial attack methods. They show that SADL captures the relative surprise of inputs: inputs with higher SA are harder to classify correctly.

Figures from the research paper (Kim et al.)

Is it an important segment of the market?

According to the World Quality Report 2019–20:

  1. 56% of software development firms lack an appropriate testing environment.
  2. 19% of software houses use automated testing frameworks.
  3. 27% of developers lack both a test environment and test data.
  4. 12% of software engineers expect to have dedicated test environments for AI algorithms by next year.

Here is a chart showing how the market is changing over time with the adoption of newer, more efficient and time-saving approaches such as cloud testing platforms, DevOps, and Agile management.

Graph Courtesy: https://www.sogeti.com/explore/reports/world-quality-report-2019/

References:

1. Jinhan Kim, Robert Feldt, Shin Yoo. "Guiding Deep Learning System Testing Using Surprise Adequacy." ICSE 2019.
2. Kexin Pei, Yinzhi Cao, Junfeng Yang, Suman Jana. "DeepXplore: Automated Whitebox Testing of Deep Learning Systems."
3. Lei Ma et al. "DeepGauge: Multi-Granularity Testing Criteria for Deep Learning Systems." ASE 2018.
4. World Quality Report 2019–20: https://www.sogeti.com/explore/reports/world-quality-report-2019/

Thank you for reading.
