Summary: Interpreting Neural Networks With Nearest Neighbors (EMNLP-ws 2018)
[1809.02847] Interpreting Neural Networks With Nearest Neighbors
Abstract: Local model interpretation methods explain individual predictions by assigning an importance value to each…
They discuss several limitations of saliency-based interpretations. In particular, a neural network's confidence can be unreasonably high even when the input is void of any predictive information. Therefore, when features are removed with a method like leave-one-out, the change in confidence may not reflect whether the truly important input features have been removed. Consequently, interpretation methods that rely on confidence can fail because of issues in the underlying model.
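As a concrete illustration, here is a minimal leave-one-out sketch in PyTorch. The `model` interface (token ids in, logits out) and all names are hypothetical placeholders, not from the paper:

```python
import torch
import torch.nn.functional as F

def leave_one_out_importance(model, token_ids, target_class):
    """Score each token by the drop in softmax confidence when it is removed.

    `model` is assumed to map a (1, seq_len) tensor of token ids to class
    logits; this interface is an illustrative assumption.
    """
    model.eval()
    with torch.no_grad():
        base_conf = F.softmax(model(token_ids.unsqueeze(0)), dim=-1)[0, target_class]
        scores = []
        for i in range(token_ids.size(0)):
            # Remove token i by deleting it from the sequence.
            reduced = torch.cat([token_ids[:i], token_ids[i + 1:]])
            conf = F.softmax(model(reduced.unsqueeze(0)), dim=-1)[0, target_class]
            # A large confidence drop marks token i as "important".
            scores.append((base_conf - conf).item())
    return scores
```

Because each score is just a confidence delta, it inherits any miscalibration in the model: the network can remain highly confident on inputs stripped of predictive information, making genuinely important tokens look unimportant.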
They address this by changing the test-time behavior of neural networks using Deep k-Nearest Neighbors (DkNN), which provides a more robust uncertainty metric, conformity, without harming classification accuracy. They then use conformity in place of confidence to generate feature importance values.
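A simplified sketch of the conformity computation, assuming per-layer training representations have already been extracted; the full DkNN of Papernot and McDaniel additionally calibrates the neighbor count into an empirical p-value over a held-out calibration set, which is omitted here:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def conformity(layer_reps, train_layer_reps, train_labels, label, k=25):
    """Simplified DkNN conformity: across layers, the fraction of the k
    nearest training neighbors whose label matches `label`.

    layer_reps:       list of 1-D arrays, one test representation per layer
    train_layer_reps: list of 2-D arrays (n_train, dim), one per layer
    train_labels:     1-D array of training labels
    """
    agree, total = 0, 0
    for test_rep, train_reps in zip(layer_reps, train_layer_reps):
        nn = NearestNeighbors(n_neighbors=k).fit(train_reps)
        _, idx = nn.kneighbors(test_rep.reshape(1, -1))
        agree += int((train_labels[idx[0]] == label).sum())
        total += k
    return agree / total
```

With this in hand, the leave-one-out loop above scores each token by the drop in conformity rather than the drop in softmax confidence.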
They find the resulting interpretations align better with human perception than baseline methods: leave-one-out and gradient-based feature attribution. They also use their interpretation method to analyze model predictions on annotation artifacts in the SNLI dataset.
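For comparison, the gradient baseline mentioned above can be sketched as follows; the embedding-level interface (a model that consumes a sequence of embeddings directly) is an assumption for illustration:

```python
import torch

def gradient_saliency(model, embeddings, target_class):
    """Baseline gradient attribution: per-token L2 norm of the gradient of
    the target-class logit w.r.t. the (seq_len, dim) embedding matrix.
    """
    embeddings = embeddings.clone().detach().requires_grad_(True)
    logit = model(embeddings.unsqueeze(0))[0, target_class]
    logit.backward()
    # One importance score per token: gradient magnitude at its embedding.
    return embeddings.grad.norm(dim=-1).tolist()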