NIPS Day 2: Is “Alchemy” the New (Batch) Norm?

Building “Digital Medicines” with ML + Genomics

At the beginning of his 9am talk, Brendan Frey told us a story: of a time when he and his wife were informed that their unborn child might have a serious genetic defect, but that the evidence wasn’t strong enough to be sure. It was that experience, which Frey describes as frustrating and enormously emotionally wrenching, that motivated him to bring his years of training in ML — he studied under Geoff Hinton at the University of Toronto — to bear on enhancing the state of human knowledge about our own genomes.

Frey went on to describe a broad category of models: ones that use raw genetic data to predict higher-level biological behavior (for example, protein binding), and then use that trained model to evaluate hypotheses at scale. This contrasts meaningfully with a more theory-driven view, in which scientists develop concrete, mechanistically clear rules of behavior, and those rules then become the standard for evaluating the quality of potential hypotheses. His dream is that models like these will radically speed up the process of wet lab testing, by giving scientists a well-prioritized list of genetic sequences that might, for example, trigger gene regulation control logic that counteracts the effect of a genetic defect. He illustrated the procedure with the wildly effective “digital drug” Spinraza, which introduces a reverse complement molecule that binds to a region of the intron implicated in Spinal Muscular Atrophy. When he trained a model on this problem, and naively gave it a range of hundreds of potential sequences, the sequence corresponding to the Spinraza drug was ranked third on that list. While I haven’t independently verified that the training procedure Frey followed makes sense, if we assume it does, this is a powerful argument that models like these might have a role to play in prioritizing what gets tested.
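To make the train-then-rank procedure concrete, here’s a minimal sketch. Everything in it is a stand-in: `score_sequence` plays the role of a trained predictive model (Frey’s actual models are, of course, far more sophisticated), and the candidates are random 20-mers rather than real intronic sequences.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_sequence(seq):
    """Hypothetical stand-in for a trained model mapping a genetic
    sequence to a predicted score (higher = more promising).
    Toy scoring rule (GC content), just so the sketch runs end to end."""
    return (seq.count("G") + seq.count("C")) / len(seq)

# A few hundred candidate sequences (random 20-mers here).
candidates = ["".join(rng.choice(list("ACGT"), size=20)) for _ in range(300)]

# Rank all candidates by predicted score, best first -- the
# "well-prioritized list" a wet lab would then work through.
ranked = sorted(candidates, key=score_sequence, reverse=True)
print(ranked[:3])  # the top 3 candidates to test first
```

The point of the pattern is that the expensive resource (wet lab time) is spent only on the head of the ranked list, while the model does the cheap filtering over the full candidate pool.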

An attendee in the audience brought up a fairly predictable critique of this approach: methods like these can only discover correlational effects, not causal ones. Frey’s response was essentially a paraphrase of the immortal xkcd view on the subject: “Correlation doesn’t imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing ‘look over there’ ”. And I think the broader point is a good one. Obviously, we need to be appropriately humble, and not start shooting people up with molecules that have only been proven in a simulated, in silico setting. But we also shouldn’t fetter ourselves, and stare at the floor, and say that, “well, because we can’t solve this problem completely, it’s not worth attempting”.

The drama of the day: accusations of “alchemy”

The first half of Ali Rahimi’s talk was a straightforward technical presentation of his 2007 paper, Random Features for Large-Scale Kernel Machines, which won the “Test of Time” award for still being relevant 10 years later. I admit that I haven’t had a chance to read the paper yet, but it seems like an interesting approach that uses randomization to make kernel machines (see: SVMs, where we use kernel functions in place of ordinary inner products to implicitly project our data into a higher-dimensional space) more computationally tractable.
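Not having read the paper yet, I won’t vouch for the details, but the core trick as I understand it can be sketched in a few lines: draw random frequencies, and build a feature map whose dot products approximate a Gaussian (RBF) kernel, so a linear method on the random features stands in for an expensive kernel machine. Function names here are my own.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(x, y, sigma=1.0):
    # Exact Gaussian (RBF) kernel value between two vectors.
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def random_fourier_features(X, n_features=2000, sigma=1.0):
    """Map rows of X to a random feature space z(x) such that
    z(x) . z(y) approximates the RBF kernel k(x, y)."""
    d = X.shape[1]
    W = rng.normal(0.0, 1.0 / sigma, size=(d, n_features))  # random frequencies
    b = rng.uniform(0.0, 2 * np.pi, size=n_features)        # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

X = rng.normal(size=(2, 5))
Z = random_fourier_features(X)
approx = Z[0] @ Z[1]          # dot product in the random feature space
exact = rbf_kernel(X[0], X[1])
print(exact, approx)          # the two values should be close
```

The payoff is computational: a linear model on `Z` costs time linear in the number of points, instead of the quadratic-or-worse cost of working with a full kernel matrix.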

But it was the second half of the talk that got everyone’s attention. It pivoted off his mention that, back in 2007, he’d been scared to put the paper up for consideration while there was still some part of its functional mechanism that he and his co-author didn’t understand (though they eventually did, a few years later), for fear of the “NIPS Rigor Police,” his joking name for the attendees who would ruthlessly attack any paper that relied on fuzziness, or on mechanisms that weren’t understood, to make its point.

Rahimi went on to criticize the current machine learning community by comparing it to the practice of alchemy: dramatic in its ambitions, successful in some of its practical goals (glassmaking, metallurgy), but without solid theoretical foundations grounding those ambitions. His main concern was that, without clear mechanistic rationales for why our techniques work, the ML community will stay susceptible to small variations in the process of “throwing a brittle optimization algorithm at a loss space you don’t understand”. Another example he gave was Batch Norm, a core part of training modern ML models.

Here is what we know about Batch Norm: it works because it reduces internal covariate shift. Why does reducing internal covariate shift speed up training? Wouldn’t you like to see evidence that that’s what’s happening? Wouldn’t you like a theorem? Wouldn’t you like to know what internal covariate shift is? Wouldn’t you like a definition of it?
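Part of what makes Rahimi’s riff land is that the *mechanics* of Batch Norm are trivially simple — it’s only the *why it helps* that’s murky. A minimal training-time sketch (inference-time running statistics and backprop omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-time batch normalization for a (batch, features) array:
    normalize each feature to zero mean / unit variance over the batch,
    then apply a learned scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 8))  # a raw activation batch
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(6))  # ~0 for every feature
print(y.std(axis=0).round(3))   # ~1 for every feature
```

That’s the whole forward pass. Everything Rahimi is asking for — a definition of internal covariate shift, evidence that these five lines reduce it, a theorem connecting that reduction to faster training — sits outside the code.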

In a lot of ways, the narrative thrust of this talk was opposite to the one that came before it. Frey’s talk (implicitly) pushed for finding solutions that could be useful, even if they weren’t scientifically principled (though, to be fair to Frey, he said he was interested in more interpretable models). Rahimi, by contrast, warned against simply chasing good results without a good foundation. I think there’s honestly value in both of these viewpoints: we in the ML community don’t want to be charlatans, but we also don’t want to be hermits. Having people in the community willing to push in both directions strikes me as healthy.

The Trouble With Bias

The last invited talk of the day was Kate Crawford’s, on the problem of fairness and bias in our models. I’m honestly a little disappointed by this one, since I’d hoped for more technical insight coming out of the talk. All told, it was pretty much a less rigorous rehash of the excellent Fairness workshop from the prior day. The main idea introduced by Crawford’s talk was the distinction between allocative and representational harm, where the first is “harm that derives from depriving groups of useful resources, because those resources are gated by models”, and the second is “the way that the outputs of models contribute to continued negative representations of minority groups”. This is a reasonable thing to consider, because, sure, it can be painful to people to be represented in offensive ways (see below). But being aware of that doesn’t make us any more capable of solving the problem.

An oft-cited example of machine bias, where an African American woman is categorized as a “gorilla” by a Google image classification model

To get up on my own soapbox for a moment, I think the way we talk about machine “bias” and fairness isn’t particularly useful. Whether we’re talking about word embeddings that assign gendered weighting to words related to domesticity, or image labeling algorithms like the one above, I think the better way to frame the problem is that we’d like our models to be a better representation of ourselves. It is simply a fact that the samples of, for example, textual data that exist in our world will embed the gendered realities and assumptions of the people who speak and write that language. And we may (nobly! correctly!) want to remove those gendered assumptions from the tools we create, to achieve a normative goal. But that’s fundamentally not a case of the model doing something “wrong”; it’s more a case of humans doing something wrong by default, and of us wishing our models were the version of ourselves that we’d prefer to be. Similarly with the labeling problem in the image above: nothing about the model’s construction makes it more likely to make offensive errors; it just has no way of knowing that humans would find those errors offensive. And there’s no real way one could expect it to know that, without being explicitly told. Fairness and bias, in my view, are more about deciding what standards we want to hold our tools to than about how to technically make our tools “correct”, since normative goals can’t really be considered “correct” or “incorrect”.

Some other interesting ideas

  • Multi-speaker embeddings: When doing text-to-speech in multiple voices, add a speaker embedding on top of WaveNet, to be able to smoothly interpolate between types of voices (like male to female)
  • Capsule Networks: A proposed way to train image recognition models that respects the relative position of parts within a whole (e.g. uses the information that eyes and mouths belong to the same face). You’ve probably heard of this because it is a big project of Geoff Hinton’s
  • A Unified Approach to Interpreting Model Predictions: Suggested a new kernel for LIME, a way of creating explanations for black box models, to make it more consistent with human intuitions of credit assignment across variables
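The Unified Approach paper derives its own specific kernel, which I won’t try to reproduce here; but as a rough illustration of the LIME-style recipe it builds on — perturb the input, weight samples by proximity, fit a weighted linear surrogate, read off the coefficients as attributions — here’s a minimal sketch. The function names and the toy black box are my own.

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box(X):
    # Stand-in "black box" model; the explainer is assumed not to see this.
    return np.sin(X[:, 0]) + 2.0 * X[:, 1]

def explain_locally(f, x0, n_samples=500, scale=0.1):
    """LIME-style explanation: sample perturbations near x0, weight them
    by proximity, and fit a weighted linear surrogate whose coefficients
    act as local feature attributions."""
    X = x0 + rng.normal(scale=scale, size=(n_samples, x0.size))
    y = f(X)
    # Proximity kernel: nearby perturbations count more.
    w = np.exp(-np.sum((X - x0) ** 2, axis=1) / (2 * scale ** 2))
    # Weighted least squares via sqrt-weight rescaling.
    A = np.hstack([X - x0, np.ones((n_samples, 1))]) * np.sqrt(w)[:, None]
    coefs, *_ = np.linalg.lstsq(A, y * np.sqrt(w), rcond=None)
    return coefs[:-1]  # per-feature local attributions

x0 = np.array([0.0, 0.0])
attr = explain_locally(black_box, x0)
print(attr)  # roughly [1.0, 2.0]: the gradient of the black box at x0
```

The paper’s contribution, as I understand it, is choosing the proximity weighting so the resulting attributions satisfy Shapley-value consistency properties, rather than the ad hoc Gaussian kernel used above.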

New Reading List Entries

Picture of the Day

The early conquerors of the conference center’s limited outlets, holding onto their prize

Quote of the Day: “Have you ever spent weeks working on a model architecture, only to have it fail to converge? I know, it feels bad. But, I’m here to tell you, it’s not you. I don’t think it’s your fault. I think it’s gradient descent’s fault.”

Tweet of the Day

Mood of the Day: “So, uh, how well do *you* understand how Batch Norm works?”