10 ML Gotchas for AI/ML Security!

A top-10 list of mistakes that folks practicing AI/ML Security make, each explained, with best practices for avoiding or correcting them.

Javier Echauz
AI/ML at Symantec
15 min read · Mar 28, 2019


Image credit: Own work

In this post we will examine 10 of the most common misunderstandings about AI/ML, particularly as they impact cybersecurity practices. Save yourself 3.5 hours of online reading: the AI/ML myths described in the blogosphere tend to fall into 6 clusters (the outer ring in the illustration above). I will intersperse those with 4 of the many technical, research-oriented gotchas I’ve encountered, drawing from 23 years of experience building detectors for the implantable medical device and cybersecurity industries (the inner ring).

1. AI/ML Semantics Gotchas

Much ado has been made about trying to differentiate AI from machine learning, ML from neural networks, NNs from deep learning, and so forth. Then there is the question of whether AI itself is just automation, or a magic blackbox, or imitation of the human brain, or autonomous agency, etc. Without resorting to yet another Venn diagram, I prefer to think of ML as the “practical subset of AI that works.” Broadly speaking, ML synthesizes a mapping from inputs to outputs by being shown examples (patterns and possibly targets) and getting “experience” from that, as opposed to having the mapping explicitly programmed line by line in computer code. As overheard at the most recent RSA conference,

“AI is written in PowerPoint; ML is written in Python.”

If a little more pedantry is required, then simply think of set-theoretic symbols: AI ⊃ ML ⊃ DL, and AI\ML = ES where the latter refers to expert systems (rule-based symbolic reasoning) that dominated the early days of computational intelligence.
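
To make the “synthesized from examples vs. explicitly programmed” distinction concrete, here is a minimal sketch (toy, hypothetical features and labels; assuming scikit-learn) contrasting a hand-written expert-system rule with a mapping learned from labeled examples:

```python
# Toy contrast: an explicitly programmed rule vs. a mapping learned from examples.
# Hypothetical features: [file_size_kb, num_suspicious_api_calls]; 0 = benign, 1 = malicious.
from sklearn.linear_model import LogisticRegression

def expert_system(x):
    # "ES": a hand-written, explicitly programmed decision rule.
    return 1 if x[1] > 3 else 0

# "ML": the input-to-output mapping is synthesized from (pattern, target) examples.
X = [[120, 0], [450, 1], [80, 5], [300, 7], [60, 0], [90, 6]]
y = [0, 0, 1, 1, 0, 1]

model = LogisticRegression().fit(X, y)
print(expert_system([200, 4]), model.predict([[200, 4]])[0])
```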

2. Big Data Gotchas

A classic misconception is that big data (more of it and of any kind) is always better for AI/ML models. One area of concern is that unqualified use of big data can amplify implicit biases [Parker-Wood]. In general, we might attach double the gravitas to big models, yet ironically obtain “doubly wrong” conclusions from them. Remember Google Flu? Years ago, it was incorrectly predicting more than double the proportion of flu cases compared to CDC. Big exploratory analysis was misinterpreted as being predictive. It is useful to see any data analysis as falling into one of 6 categories [Leek & Peng]: (1) descriptive (e.g., the Census), (2) exploratory, (3) inferential (will population pattern hold beyond the dataset?), (4) predictive (inferential at individual level; typical ML), (5) causal (on-average effect; e.g., smoking → cancer), and (6) mechanistic (deterministic effect). Then per Leek & Peng, the most common error in data analysis is mistaking the category of the question. So in security as in science, we might see correlation confused with causation, uncorrected multi-comparison fishing expeditions, overfitting, and analyses with “n=1 cases.” Big data has brought us a world of amazing analytics, but that doesn’t mean it can defy principled statistics and reason.

A global intelligence network of big data does confer technical and marketing advantages to security vendors — the more trillions of data points (and their interrelationships), the more we can “know.” Since algorithms are increasingly open-sourced, it has become more about who has “more” and “better” data. But some competitors may suffer from data “endowment effects.” Having more data at the expense of quality doesn’t mean better models — the good ol’ garbage-in-garbage-out adage still applies. A more fundamental limitation is that the threat landscape is ever changing, so if we shorten the time window of observation to counteract or “freeze” its time-varying nature, we end up with smaller, not bigger data.

While big data would seem to make Bayesian reasoning moot (no need to estimate a revised posterior when the prior is already all data ever to be seen), we can’t really aggregate all data in the face of time-varying rules of the game. Thus, reasoning with small data will remain crucially important. In the medium term, lots of data may not be needed thanks to transfer learning and weakly supervised active learning [Berg]. In the long term, machine intelligence will benefit most from small-data friendly forms of learning including one/few-shot, learning-to-learn, and causality [Pearl].

3. Adversarial ML Gotchas

You must have heard by now that ML models are vulnerable to adversarial inputs that induce seemingly “absurd” errors (to humans anyway). As automated ML classifiers spread across domains including cybersecurity, these purposeful errors could grow into a serious problem. Despite numerous studies over the past five years, the field of adversarial ML is still considered “proto-alchemy,” with no unbroken defenses demonstrated to date [Evans keynote; Carlini & Wagner]. The defense side of adversarial ML has tried to answer a blend of 2 questions: (a) How can we robustify a model (make it harder for an attacker to fool)? This led to “adversarial training, defensive distillation, feature squeezing, and architecture modification”; and (b) What about adversarial inputs is measurably different from regular ones? This led to “input validators” and “adversarial detectors.” Through omissions in the current discourse, these methods have created the illusion that we could one day pre-bake a solution at training time that would protect our model against one-or-few-off adversarial inputs at deployment time. Our investigations at Symantec strongly suggest the latter goal is unrealizable.
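
For readers who prefer code to prose, here is a minimal sketch (toy linear model, random data, assuming PyTorch) of the fast gradient sign method, one member of the attack family that “adversarial training” folds back into the training set; it is an illustration only, not any production detector:

```python
# Minimal FGSM sketch: nudge an input in the direction that increases the loss.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(10, 2)               # stand-in for a trained classifier
x = torch.randn(1, 10, requires_grad=True)   # a "regular" input sample
y = torch.tensor([1])                        # its true label (say, malicious)

loss = F.cross_entropy(model(x), y)
loss.backward()                              # gradient of the loss w.r.t. the input

eps = 0.25                                   # attacker's perturbation budget
x_adv = x + eps * x.grad.sign()              # FGSM step

print("clean prediction:      ", model(x).argmax(dim=1).item())
print("adversarial prediction:", model(x_adv).argmax(dim=1).item())
# Adversarial training mixes samples like x_adv (with label y) back into
# training, one step toward the Bayesian-optimality goal discussed below.
```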

As a former academic, I tend to obsess about pedagogy and like to indulge in creating scientific animations that help illustrate concepts without graphs or formulas. In the two videos below (~90s each), you’ll “wear goggles” that let you see what the ML model sees, and thereby understand what happens to our detector on its regular, intended happy path vs. in the classic adversarial detection experiment. The main takeaway is that what’s really being detected from the appearance of an input sample alone is the model’s difficulty in handling the input sample (i.e., its own uncertainty), not the intentionality of an adversarial actor. We’re better off striving to get our technologies ever closer to Bayesian optimality [cf. Gal & Smith], and adversarial training is one step in that direction provided we don’t fall into the trap of “over-robustifying” for cases that don’t really match the threat environment.

Scientific animation “Detection of regular samples.” Credit: Own work.
Scientific animation “The problem with detection of adversarial samples.” Credit: Own work.

4. Ad Hoc Ensembling Gotchas

As described in another post, ensembling can offer powerful ways to combine models in order to improve malware detection [Kenemer]. Problems arise, however, whenever ensembling is done in an ad hoc rather than principled manner. A common pattern seen in cybersecurity and other industries is one where a research team develops a “textbook” model, correct under ideal conditions, and then during deployment the model doesn’t work as expected because real life doesn’t match the assumptions, data are messier and non-stationary, etc. At that point, manually added rules to patch up the deficiencies start creeping in. One example is the use of Boolean ensembling, where the binary decisions of M separate detectors are combined by logical disjunction (OR = “any” rule). A rationale for this is to make sure true positives (TPs) aren’t lost from one training period to the next, while any false positives (FPs) introduced could still be handled by whitelisting. However, it can be shown that this OR rule has a TPr better than the best individual detector, but simultaneously an FPr worse than the worst individual detector, and this worsens as the size of the ensemble grows. Thus, the whitelists and ad hockery required to compensate for this behavior can grow bigger than the underlying decision models themselves!
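
As a quick numerical sketch of that behavior (hypothetical per-detector rates, and an independence assumption that real detectors rarely satisfy), the OR rule’s combined rates follow directly from the complement of a product:

```python
# OR ("any") rule over M detectors, assuming independent errors for illustration.
import numpy as np

tpr = np.array([0.90, 0.85, 0.80])   # hypothetical per-detector true positive rates
fpr = np.array([0.01, 0.02, 0.03])   # hypothetical per-detector false positive rates

# The OR ensemble fires if any detector fires: P(any) = 1 - prod(1 - p_i)
tpr_or = 1 - np.prod(1 - tpr)        # better than the best individual TPr
fpr_or = 1 - np.prod(1 - fpr)        # worse than the worst individual FPr

print(f"OR-rule TPr = {tpr_or:.3f} (best single: {tpr.max():.2f})")
print(f"OR-rule FPr = {fpr_or:.4f} (worst single: {fpr.max():.2f})")
# Each added detector pushes FPr further up, which is exactly what the
# ever-growing whitelists end up compensating for.
```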

It is known that taking the convex hull of individual ROC operating points yields a better classifier; however, that is still rarely the optimal solution. In seminal work, Barreno et al. show that the optimal Neyman-Pearson solution that can be extracted from M binary detectors is a computationally hard (impractical) piecewise linear function along likelihood-ratio intervals. A more realistic practice is to use linear weighted voting combinations (e.g., linear or logistic regression). Even though combining binary decisions may still be suboptimal, combining the soft (not binary) probabilistic outputs of the models tends to get pretty close to the best achievable ROC. Research at Symantec revealed that Boolean ROC operating points actually live in a “canopy” formed by all 2^(2^M) possible Boolean rules, whose “Markowitz-efficient” frontier corresponds to Barreno’s optimum. The figure below shows the subset of this canopy corresponding to the OR rule for 3 base detectors. The logistic ensembler (cyan curve) surpasses the frontier and almost reaches the best possible solution (gold curve).

ROC “canopy” of Boolean OR rule and a better linear ensembler. Image credit: Own work.
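
To make the soft-output combination above concrete, here is a hedged sketch (synthetic data and generic scikit-learn base models standing in for real detectors) of logistic stacking, contrasted against the single operating point the OR rule leaves you with:

```python
# Principled ensembling sketch: logistic stacking of soft scores vs. Boolean OR.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=6000, n_features=20, random_state=0)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_cal, X_te, y_cal, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

base = [RandomForestClassifier(random_state=0).fit(X_tr, y_tr),
        GaussianNB().fit(X_tr, y_tr),
        LogisticRegression(max_iter=1000).fit(X_tr, y_tr)]

def soft_scores(X_in):
    # The M base detectors' probabilistic outputs become the combiner's features.
    return np.column_stack([m.predict_proba(X_in)[:, 1] for m in base])

stacker = LogisticRegression().fit(soft_scores(X_cal), y_cal)  # linear weighted voting

Z_te = soft_scores(X_te)
print("stacker AUC:", roc_auc_score(y_te, stacker.predict_proba(Z_te)[:, 1]))

# The OR rule collapses everything into one (typically FPr-heavy) operating point:
or_dec = (Z_te >= 0.5).any(axis=1)
print("OR-rule FPr:", or_dec[y_te == 0].mean(), " TPr:", or_dec[y_te == 1].mean())
```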

Unfortunately, a problem remains: combining expert models that start from the same input features and end up with correlated errors tends to yield no incremental benefit. But at least by using a principled version of ensembling you won’t be eliminating the possibility of finding a better combination from the outset.

5. Dystopian Views Gotchas

“Success in creating effective AI could be the biggest event in the history of our civilization. Or the worst.”

— Stephen Hawking

If the latter doomsday case is in store then, a bit ironically, we may not be around to see it, as world-renowned Prof. Hawking also warned we have only about 100 years left to leave Earth. These profoundly sobering predictions were conditional on humans not learning, or not being able, to control climate, over-population, disease, and, in the case of AI, its malicious use in autonomous weaponry and the oppression of the masses [Kharpal].

Other dystopian views revolve around whether AI will replace people by taking over their jobs, or replace not people but merely their mundane tasks. The answer is probably neither. AI/ML isn’t just about replacing mundane tasks, but also about augmenting human capabilities and imagination [Mahadevan].

Image credit: Scott Adams

Further, as in past technology-driven economic shifts, new skillset requirements, transient displacement in an adapting job market, and suitable retraining programs will continue to emerge. In the early days of AI, an idea was floating around that we’d all one day be [insert your preferred leisurely gerund here] while the machines worked for us. Despite incredible progress in automation, what I’ve experienced has been the opposite — ever-increasing workloads.

So the above are wake-up calls both for those who welcome AI/ML as strictly for-good and for those who instill fear about it. AI/ML is a dual-use technology (for good and for bad): it will primarily help us protect customers in security, but it will also, secondarily, increase the attack surface as malicious actors attempt to use data poisoning, model stealing, and adversarial ML against us [Gardner, ITU keynote]. We should continue to focus on the bright side without being oblivious to the dark side. A good way to think about this is English comedian Eric Idle’s response when asked if he was worried about AI:

“I’m worried about Artificial Stupidity” — Eric Idle

6. State-of-the-Art Gotchas

Every AI/ML security project must ensure that the latest state-of-the-art (SoTA) model is implemented, right? Not so fast! At Symantec, we have seen the perils both of ignoring/postponing SoTA and of too eagerly incorporating it. Soon after the Center for Advanced Machine Learning (CAML) was established in 2014, it became clear that some systems in production had been reinventing ML wheels, yet were suboptimal. Dramatic improvements affecting over 100 million endpoints were made by largely returning to ML “textbook” principles, applicable to static and behavioral malware detection.

On the other hand, we have seen that going to the other extreme of eager SoTA adoption results in folks being constantly distracted by new shiny objects, copy-pasting GitHub code with no more than a passing understanding of its theoretical underpinnings, and accumulating technical [Sculley et al.] and research debt. Whereas open-sourced research and MOOCs promise to democratize AI/ML, others opine that AI/ML security would be better left to experts. For example, unlike open-source code to detect a cat in a video, a security log file is far less obvious to identify as suspicious, so the model can’t simply be “set and forget” [Malanov]. Furthermore, SoTA models are usually overfit [Rasmus]. I call this “ML one-trick ponies,” where the hello-world example works but the variants you’re actually interested in don’t.

Part of the SoTA dilemma may stem from the difference in workflows between academia and industry. As cogently described by Rothe, the academic workflow tends to reward beating a flexible performance metric, whereas the industrial one starts from fixed requirements and works backward to a solution. The best practice seems to be to strike a balance between the publicly visible state of the art and custom in-house developments.

7. Precision-Recall vs ROC Gotchas

Every time I receive a link about someone swearing that precision-recall (PR) curves are superior to ROC curves, it’s “OK, another one good for a chuckle.” So let’s put it to rest, shall we: for a given dataset, there is a one-to-one correspondence between PR and ROC spaces, such that each one contains the other’s same confusion matrices [Davis & Goadrich]. The reason some folks are attracted to PR is that in problems with highly skewed class imbalance it might amplify aspects that seem less obvious in ROC. The caveat they don’t mention is that precision (PPV; the P in PR) is hopelessly tied to whatever proportion of negative vs positive class samples happened to be in effect during the experiment (which may differ from the proportion the finally deployed classifier encounters in real life). This is in contrast with Sensitivity (TPr) and Specificity (1 minus FPr), which are independent of class priors by definition. To see the relation between Sens = P(1̂|1) and positive predictive value PPV = P(1|1̂), use Bayes rule or algebra to obtain PPV = Sens*P(1)/P(1̂), where P(1̂) = Sens*P(1) + FPr*P(0), and therein lies the issue (the P(1) proportionality). There is no such issue with recall (= Sens; the R in PR).
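
A few lines of arithmetic (hypothetical Sens and FPr values) make the prevalence dependence obvious: hold the ROC operating point fixed and watch PPV collapse as P(1) shrinks:

```python
# Precision (PPV) vs. class prevalence at a fixed ROC operating point.
sens, fpr = 0.95, 0.01                      # hypothetical detector operating point

for p1 in (0.5, 0.1, 0.001):                # prevalence P(1) of the malicious class
    p_hat1 = sens * p1 + fpr * (1 - p1)     # P(predicted 1), by total probability
    ppv = sens * p1 / p_hat1                # Bayes: PPV = Sens*P(1)/P(predicted 1)
    print(f"P(1) = {p1:<6} ->  PPV = {ppv:.3f}   (Sens and FPr unchanged)")
```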

In otherwise sensible security literature you might find downright false statements such as “ROC curves are misleading in an imbalanced dataset.” They’re only misleading during a MISreading of the curves! What the cited references are really talking about is that the area under the curve (AUC; not the curves in and of themselves) can be misleading when ROC curves cross each other and we have other costs in mind, which is true. But in malware detection, where very low FPr is absolutely required, we know better than to rely on the whole AUC (if anything, a truncated version would be used, as sketched below). ROC curves are agnostic to class imbalance [Fawcett], and that’s why at Symantec we tend to prefer them over PR curves. Love the class-prevalence agnosticism and let the user or application decide on a detection threshold that makes sense to them, rather than implicitly baking one into a P(1) that may or may not hold.
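
A minimal sketch of the “truncated” idea, with synthetic scores and assuming scikit-learn’s roc_auc_score with its max_fpr argument (standardized partial AUC):

```python
# Whole-curve AUC vs. AUC restricted to the low-FPr region that matters in AV.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = np.r_[np.zeros(100_000), np.ones(1_000)]               # heavy class imbalance
scores = np.r_[rng.normal(0.0, 1.0, 100_000), rng.normal(2.5, 1.0, 1_000)]

print("full AUC                :", roc_auc_score(y_true, scores))
print("partial AUC (FPr<=0.1%) :", roc_auc_score(y_true, scores, max_fpr=0.001))
```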

8. Covariate Shift Gotchas

From time to time in ML security, when a testing dataset doesn’t match the statistics of a training dataset, people may be quick to point to covariate shift as the culprit. But chances are that’s not the case! Think of the joint distribution p(.) as a function providing the statistical “glue” between a bunch of variables, describing how frequently or rarely you’d find any particular configuration of co-occurring values. Let A(x,y) be the joint of features x and labels y during training (the past), and let B(x,y) be the joint during testing (the future/deployment/production). The problem of learning a classifier from A(x,y) when B(x,y) will be different is called domain adaptation. A joint can change for many reasons, to wit:

(1) The whole problem is time-varying, p(x(t),y(t)) (or, tougher still, with separate time indices for x and y). To go beyond frequent classifier retraining, we’d need to learn dynamics, e.g., via reinforcement learning or prognostics.

(2) Features are changing p(x(t),y), e.g., from feature selection or from cybersecurity multi-label classifiers used as features.

(3) Class labels are changing p(x,y(t)), as in cybersecurity reputation systems.

(4) Class label distribution P(y) = Pr[y] (math purists: ignore notation abuses; blog is a trade-off), i.e., good-vs-bad proportion, differs but class-conditionals stay put: A(x|y) = B(x|y). This is called class imbalance. From the product rule, p(x,y) = p(x|y)P(y), so B(x,y) = A(x|y)B(y) = A(x|y)A(y)B(y)/A(y) = A(x,y)B(y)/A(y). Thus, to match future B(x,y) from our current A(x,y) (represented empirically by the training set), we can rescale by the label-dependent ratio B(y)/A(y). E.g., during training do instance-weighting or good ol’ oversampling/under-sampling according to that ratio (= a constant B(1)/A(1) for bad samples and another = B(0)/A(0) for good samples). Equivalently, simply slide the detection threshold to a desired FPr when the model was trained balanced — a preferred best practice at Symantec.

(5) Feature distribution P(x) differs but class-posteriors stay put: A(y|x) = B(y|x). This is called covariate shift [Shimodaira]. Folks think of this as input features having a different mean and variance in train vs test, but it’s more nuanced than that: the P(y|x) relation has to be preserved. From the product rule, p(x,y) = P(y|x)P(x), so B(x,y) = A(y|x)B(x) = A(y|x)A(x)B(x)/A(x) = A(x,y)B(x)/A(x). Thus, to match future B(x,y) from our current A(x,y), we can reshape by the input-dependent ratio B(x)/A(x) and attempt instance weighting (see the sketch after this list, which covers this case and case (4) above). However, it can be difficult to estimate this ratio. Since AI/ML models tend to be big and complex, covariate shift where the relation P(y|x) remains intact is a much less likely explanation than (1) through (3) above.
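
Here is a minimal instance-weighting sketch for cases (4) and (5), with synthetic data standing in for A(x,y) and B(x); the covariate-shift ratio B(x)/A(x) is approximated with a “domain classifier,” one common but imperfect estimator, which is exactly the difficulty noted above:

```python
# Instance weighting for class-prior shift (4) and covariate shift (5).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_A, y_A = make_classification(n_samples=5000, n_features=10, weights=[0.5, 0.5],
                               random_state=0)       # training joint A(x, y)
X_B, _ = make_classification(n_samples=5000, n_features=10, weights=[0.99, 0.01],
                             random_state=1)         # future/unlabeled inputs from B

# (4) Class-prior shift: reweight each training instance by B(y)/A(y).
A_y = np.bincount(y_A) / len(y_A)
B_y = np.array([0.99, 0.01])                         # assumed deployment proportions
w_prior = B_y[y_A] / A_y[y_A]

# (5) Covariate shift: approximate B(x)/A(x) with a domain classifier trained to
# separate training-domain inputs (label 0) from deployment-domain inputs (label 1).
X_dom = np.vstack([X_A, X_B])
d = np.r_[np.zeros(len(X_A)), np.ones(len(X_B))]
dom = LogisticRegression(max_iter=1000).fit(X_dom, d)
p_B = dom.predict_proba(X_A)[:, 1]
w_cov = p_B / (1 - p_B)                              # proportional to B(x)/A(x)

clf = LogisticRegression(max_iter=1000).fit(X_A, y_A, sample_weight=w_prior)
# (or sample_weight=w_cov; or train balanced and simply slide the threshold,
#  as preferred in case (4) above)
```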

9. “Next-Gen” Cybersecurity Gotchas

The last few years witnessed a rise in guerrilla tactics among so-called “next-generation” startups driving a wedge between the old AV establishment and a new industry category that purportedly “invented” machine learning for cybersecurity. Some aggressive marketing included declaring that signatures are dead, and detecting bogus repacked files as malware while turning off competitors’ protections [Gallagher]. Bolder claims included being able to protect against 0-day malware without updates for 6 months, and that old AV doesn’t use ML. In a now-deleted post, it was declared that AV-TEST could confirm the general marketing claims of next-gen vendors. But buyer, beware of misleading words. First, signatures aren’t really going away soon, as an estimated 30% of all protection, including that of the next-gens themselves, is still just that (and why wouldn’t any AV company quickly screen against recurring threats it knows about?). Second, in the alleged AV-TEST there was no measure, or even mention, of FPs! It also described how they had to introduce a “new” test by freezing the product and testing 7 days later on 0-days. It’s called retrospective testing and it’s older than Auld Lang Syne!

Part of the confusion created by these claims arises from the fact that there’s actually a continuum: from signature/fingerprint-style detection, to generic/heuristic definitions that cover hundreds of thousands of variants (whether we had seen them before or not) and start to look like decision-tree rules, to a more complete abstraction of rules such as deep neural networks and higher AI. ML use in the AV industry began the moment rule bases couldn’t be written by hand anymore due to the exploding threat landscape, and that was some 15 years ago [Malanov].

It is useful to recall the 9 principles of testing from the Anti-Malware Testing Standards Organization. Tests should be: (1) not harmful, (2) unbiased, (3) open and transparent, (4) balanced regarding TPs and FPs, (5) validated regarding labels, (6) consistent regarding consumer vs enterprise, (7) conclusive based on evidence, (8) statistically valid, and (9) responsive to questions via correspondents. The solution to bad AV-TEST debuts for any aspiring cybersecurity company should not be to inject confusion, blame the test organizations, and violate all 9 principles above.

10. Not-There-Yet Gotchas

Our final source of nonstop mythology around AI/ML is especially common in security lore: demanding that a trained model test well on 0-day samples because, after all, we don’t care as much about malicious samples already in the training set (those are already “known” to be bad). There is a double error going on here. First, there is inherent value in a model that covers tens of millions of “known” samples in a compressed way, since the equivalent signature-based whitelist + blacklist would be bigger than the model itself.

Second and more importantly, asking for accurate 0-day coverage is like moving from classic statistical ML to the realm of “magic extrapolation.” The former teaches us to train on a distribution and hope to deploy on samples drawn from that same (or close-enough) distribution. If the distribution has continuous variables, repetitions are extremely unlikely and in that sense new samples are always 0-days. But if the distribution is discrete and happens to have {banana, orange, apple} in its domain, then those tokens will typically recur in the future because that is what the distribution is. The issue arises when we train on {banana, orange, apple} and then expect the model to “know” the right answer when a {mango} comes along. Let’s face it, current ML security doesn’t yet reason; it magnificently interpolates and sometimes “luckily extrapolates,” but that’s all predicated on a 0-day not being too far off the learned similarities captured in the model.
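
A toy version of the fruit example (hypothetical labels, assuming scikit-learn): an unseen token falls outside the learned one-hot support, so the model’s answer is driven by its intercept/prior rather than by any “knowledge” about mangoes:

```python
# Train on {banana, orange, apple}; then ask about {mango}.
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X_tok = [["banana"], ["orange"], ["apple"], ["banana"], ["orange"], ["apple"]]
y = [0, 0, 1, 0, 0, 1]                            # pretend "apple" is the bad one

enc = OneHotEncoder(handle_unknown="ignore")      # unseen tokens encode to all zeros
Z = enc.fit_transform(X_tok)
clf = LogisticRegression().fit(Z, y)

z_mango = enc.transform([["mango"]])              # outside the training distribution
print(clf.predict_proba(z_mango))                 # essentially the fitted prior,
                                                  # not reasoning about mango
```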

In reference to deep learning, Judea Pearl described this in his interview about The Book of Why as amounting to just “curve fitting.” Technologies that “aren’t there yet” in security include cloudless client-side learning, replacements for the industry-standard multi-layered approach to security [Malanov], reinforcement learning, and causality modeling. When we have those, we’ll be in a better position to ask for magic extrapolation.

In closing, I hope to have helped you see the 6 broad categories of misunderstandings surrounding AI/ML security, but also challenged you with 4 out of many technical, and at times controversial, sources of confusion. Some of the latter ideas, including the limits of hardening against adversarial ML, are part of ongoing cutting-edge research at Symantec.
