Memorization and Deep Neural Networks

Svitlana Glibova
Published in Analytics Vidhya · 6 min read · Mar 29, 2021

In the field of data science, and specifically in machine learning, it’s not uncommon to wonder “what happens in the hidden layers?” when thinking about deep neural networks. These networks can appear to be “black box” algorithms because much of their internal computation is not currently well understood, and that lack of understanding can be costly in time, hardware, and money, especially in the realm of “Big Data.” In “A Closer Look at Memorization in Deep Networks,” Arpit et al. examine the differences in modeling real and random data with DNNs in order to gain some understanding of how these models detect patterns and what factors influence model performance.

Before we proceed, I’ve compiled a short (and non-exhaustive) list of frequently used terms relevant to researching and understanding deep neural networks:

Regularization — an important concept in many mathematical contexts, regularization means applying a ‘penalty’ to a function to control excessive fluctuation (see the short sketch after this list of terms).

Memorization — essentially overfitting: the model has been fit so tightly to the data it learns from that it cannot generalize to unseen data. Memorization is more likely to occur in the deeper hidden layers of a DNN.

Capacity — “the range of the types of functions the model can approximate.” A high-capacity model is highly complex and can be prone to overfitting, which is related to its memorization of ‘seen’ data. A low-capacity model may have trouble fitting very complex data but is not necessarily inferior, since it is more computationally efficient. The key here is balance and understanding the dataset!

Effective capacity — the capacity a model can actually realize given a specific training algorithm and a specific set of data.

Adversarial example — a data point that can be “confidently misclassified,” or a point that can be mistaken to belong in one group when it, in fact, belongs to another.
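Before moving on: to make the ‘penalty’ idea in the regularization definition concrete, here is a minimal NumPy sketch of an L2 penalty added to an ordinary mean-squared-error loss. The function names, the toy data, and the penalty weight lam are illustrative choices, not anything taken from the paper.

```python
import numpy as np

def mse_loss(w, X, y):
    """Plain mean-squared-error loss for a linear model y ~ X @ w."""
    return np.mean((X @ w - y) ** 2)

def regularized_loss(w, X, y, lam=0.1):
    """Same loss plus an L2 penalty: large weights are 'fined',
    which discourages wildly fluctuating fits."""
    return mse_loss(w, X, y) + lam * np.sum(w ** 2)

# Toy usage: the penalty grows with the size of the weights,
# so the optimizer is pushed toward smoother, simpler solutions.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 3)), rng.normal(size=50)
w = rng.normal(size=3)
print(mse_loss(w, X, y), regularized_loss(w, X, y))
```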

Now, on to take a closer look at deep neural nets!

Data

While this may seem intuitive, one of the biggest takeaways from the research detailed in this paper is that a model’s ability to generalize is largely impacted by the data itself: “Training data itself plays an important role in determining the degree of memorization.” DNNs are able to fit purely random information, which raises the question of whether the same thing happens with real data. In this study, the researchers analyze the performance of several models on real data (MNIST and CIFAR-10) and on real data with varying degrees of randomly generated noise (between 20% and 80% of the real dataset).
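As a rough sketch of what ‘adding noise’ can look like in practice, here is one way to corrupt a chosen fraction of a label vector with random labels. The helper name corrupt_labels, the 40% example, and the dataset sizes are placeholders rather than the authors’ actual experimental code.

```python
import numpy as np

def corrupt_labels(y, noise_fraction, num_classes, seed=0):
    """Replace a random `noise_fraction` of the labels with labels drawn
    uniformly at random, leaving the rest of the dataset untouched."""
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    n_noisy = int(noise_fraction * len(y))
    idx = rng.choice(len(y), size=n_noisy, replace=False)
    y_noisy[idx] = rng.integers(0, num_classes, size=n_noisy)
    return y_noisy

# e.g. flip 40% of CIFAR-10-style labels (10 classes) to random ones
y = np.random.randint(0, 10, size=50_000)
y_noisy_40 = corrupt_labels(y, noise_fraction=0.4, num_classes=10)
```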

Two primary metrics were used to analyze model performance: loss sensitivity and critical sample ratio (CSR). The effect of model capacity on validation performance is also examined, as are several regularization techniques.

Loss Sensitivity

Loss sensitivity can be described in technical terms as the magnitude of the gradient of the loss with respect to a data point x. In other words, how much does a change in data point x change the loss function? The aim of gradient descent is to minimize the loss function, that is, to find the place on the loss surface where its change is effectively zero. High loss sensitivity means that the loss is strongly affected by a change in x, or that as x changes, the steps taken to minimize the loss function are large. The results found here indicate that only portions of the real data had high loss sensitivity, while noisy data was extremely sensitive to changes in x. What this tells us is that the more noise a dataset contains, the more likely the model is to take large steps to minimize loss. Because real data more often has distinguishable patterns, and those patterns call for smaller and smaller steps, the model is likely not memorizing it the way it does random data (or at least not in the same way).
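One simple way to get a feel for this quantity is to compute the gradient of the loss with respect to the input and look at its average magnitude. The PyTorch sketch below does exactly that with a stand-in linear classifier; the helper name loss_sensitivity and the averaging of absolute gradient values are one reasonable proxy, not the paper’s exact estimator.

```python
import torch
import torch.nn.functional as F

def loss_sensitivity(model, x, y):
    """Average magnitude of the gradient of the loss with respect to the
    input x: a large value means a small change in x moves the loss a lot."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad_x, = torch.autograd.grad(loss, x)
    return grad_x.abs().mean().item()

# Toy usage with a stand-in model (any classifier mapping x -> logits works)
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
x, y = torch.randn(8, 1, 28, 28), torch.randint(0, 10, (8,))
print(loss_sensitivity(model, x, y))
```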

Critical Sample Ratio

This study defines a critical sample as a data point in one group that sits where there may be crossover from another group — i.e., the boundary separating this point’s group from the other group is much harder to determine, and thus the differences are much harder for a model to generalize.

To identify a critical sample, a box of radius r is used to probe the density of decision boundaries — imagine the box drawn around a specific data point belonging to group A. That data point is a critical sample if a point assigned to a different group can be found within the box. This measures the complexity of the data, or how closely related different groups are, and from that, how easily points might be mistaken as belonging to the wrong group. The nearby point of a different group is the adversarial example.
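A crude version of this check can be sketched by randomly probing the box of radius r around a point and seeing whether the model ever assigns a different class inside it. The paper finds such points with a more careful gradient-based search; random probing, and the helper name is_critical_sample, are simplifications for readability.

```python
import torch

def is_critical_sample(model, x, r=0.1, n_trials=200, seed=0):
    """Probe the L-infinity box of radius r around x: if any perturbed point
    is assigned a different class than x itself, x counts as a critical
    sample and the perturbed point plays the role of the adversarial example."""
    torch.manual_seed(seed)
    with torch.no_grad():
        base_class = model(x.unsqueeze(0)).argmax(dim=1)
        for _ in range(n_trials):
            delta = (torch.rand_like(x) * 2 - 1) * r   # uniform in [-r, r]
            perturbed_class = model((x + delta).unsqueeze(0)).argmax(dim=1)
            if perturbed_class != base_class:
                return True   # an adversarial neighbor exists inside the box
    return False
```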

“Adversarial examples originally referred to imperceptibly perturbed datapoints that are confidently misclassified. (Miyato et al., 2015) define virtual adversarial examples via changes in the predictive distribution instead” (Arpit et al., 2017) — this extends the definition of CSRs to unlabeled data, or data whose groups are not defined prior to modeling.

The more critical samples there are, the more complex the hypothesis space is and the higher the CSR, which is the ratio of critical samples to the entire dataset. The research shows that critical sample ratios increase for models with higher levels of noise — that these models are learning more complex hypotheses to fit the noisy data, and that validation accuracy (accuracy when predicting unseen data) is generally lower for increasing levels of noise. Following this logic, these models are sensitive to random data because they must memorize specific points rather than generalize across them, given how easily nearby points can be misclassified.
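Given a check like the hypothetical is_critical_sample sketched above, the ratio itself is just a count over the dataset; here `dataset` is assumed to be an iterable of (x, y) pairs with a known length.

```python
def critical_sample_ratio(model, dataset, r=0.1):
    """CSR = number of critical samples / size of the dataset,
    reusing the is_critical_sample sketch from the previous section."""
    n_critical = sum(is_critical_sample(model, x, r=r) for x, _ in dataset)
    return n_critical / len(dataset)
```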

Capacity

Higher-capacity models seem capable of fitting noise without compromising their ability to learn the real patterns of the dataset, but that extra capacity is less necessary when the data is cleaner and has had outliers and noise removed. This underlines the importance of having some understanding of the data, as well as using appropriate cleaning and preprocessing methods, to minimize computational cost and reduce unnecessary complexity.

Adding capacity produces a bigger reduction in ‘time to convergence’ (how long it takes the model to reach a point where further training shows no improvement) for data that contains more noise. This follows from the loss sensitivity to noisy data: there is more to ‘learn’ from random information, so the model has to rely on memorization strategies, using extra depth or nodes to pick up on noise and taking larger strides to minimize the loss function. The graph below demonstrates that the time it takes to converge is significantly higher for data with increasing proportions of noise, as well as overall for larger datasets.
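To make the capacity axis concrete, here is a minimal sketch of how one might sweep the number of units per layer. The MLP below is an illustrative stand-in rather than the architecture used in the paper; the idea is simply that each width could be trained on clean versus noisy labels and its time to convergence compared.

```python
import torch.nn as nn

def make_mlp(n_hidden_units, n_layers=2, n_in=28 * 28, n_classes=10):
    """A simple MLP whose capacity is controlled by the width of its
    hidden layers."""
    layers, width_in = [nn.Flatten()], n_in
    for _ in range(n_layers):
        layers += [nn.Linear(width_in, n_hidden_units), nn.ReLU()]
        width_in = n_hidden_units
    layers.append(nn.Linear(width_in, n_classes))
    return nn.Sequential(*layers)

# Sweep capacity: train each of these on clean vs. noisy labels and record
# how many epochs it takes before further training stops helping.
models = {units: make_mlp(units) for units in (16, 64, 256, 1024)}
```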

As the proportion of noise increases, an increased number of units per layer has a greater impact on convergence of noisy data than on clean data. Notice the difference between the yellow and blue plot lines.

Regularization

Regularization is a concept introduced early in many data science pursuits — that is, a penalty for wildly fluctuating models is introduced to reduce said fluctuation. It is an important technique for reducing memorization, as described in the graph below.

The x-axis is a measure of model performance on random data, whereas the y-axis is performance on real data.

Low performance on random data combined with high performance on real data is an indicator of reduced memorization, since the model is not simply fitting the random noise — dropout regularization seems to be the most effective here, followed by adversarial training, or training the model on ‘deceptive input.’
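As a hedged illustration of the two regularizers called out above, the sketch below adds dropout to a small classifier and generates a one-step adversarial perturbation in the style of FGSM for adversarial training. The architecture, the dropout rate, and the eps step size are illustrative choices; the paper’s exact training setup differs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Dropout as a regularizer: randomly zeroing activations during training
# makes it harder for the network to memorize individual noisy examples.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 256), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(256, 10),
)

def adversarial_example(model, x, y, eps=0.1):
    """One-step 'deceptive input': nudge x in the direction that most
    increases the loss, then include the result in training as well."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad_x, = torch.autograd.grad(loss, x)
    return (x + eps * grad_x.sign()).detach()
```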

Summary

Understanding and extending the basics of gradient descent and regularization will help data scientists describe the more ambiguous behaviors of deep neural networks. Model architecture, optimization, and the data itself all play particularly important roles in unpacking some of these behaviors, and having tighter control over complex models can reduce computational costs and give scientists a clearer roadmap for modeling large datasets.

References:
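
“A Closer Look at Memorization in Deep Networks,” Arpit et al., 2017. https://arxiv.org/abs/1706.05394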

“Deep Nets Don’t Learn via Memorization,” Krueger et al., 2017. https://openreview.net/pdf?id=rJv6ZgHYg

“On the Geometry of Generalization and Memorization in Deep Neural Networks,” Stephenson et al., 2021. https://openreview.net/forum?id=V8jrrnwGbuc

Cornell CS4787, Lecture 13 notes: http://www.cs.cornell.edu/courses/cs4787/2019sp/notes/lecture13.pdf
