Demystifying Neural Networks: Noise Contrastive Estimation (NCE)

Dagang Wei
4 min read · Jan 25, 2024


Image generated by the author

This article is part of the series Demystifying Neural Networks.

Machine learning models often deal with the task of assigning probabilities to different outcomes or classes. This is crucial for classification problems where the model must predict the correct label for a given input. The softmax function is a popular tool for this, but it comes with computational limitations. Noise Contrastive Estimation (NCE) offers an alternative approach that can be more efficient in certain situations. Let’s dive into it.

Softmax: A Primer

Softmax is a function that takes a vector of real numbers (often representing unnormalized scores or ‘logits’) and transforms them into a probability distribution. In multiclass classification, the output of a neural network is typically passed through a softmax layer. The probabilities computed by softmax represent the model’s confidence in assigning the input to each of the possible classes.

To illustrate, imagine a fruit classifier with three classes: apple, banana, and orange. If the model produces logits [2.0, 1.0, 0.5], the softmax would calculate probabilities as follows:

Probability(apple)  ~= 0.63
Probability(banana) ~= 0.23
Probability(orange) ~= 0.14
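These numbers are easy to check with PyTorch's built-in softmax; here is a quick sanity check using the logits above:

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5])  # unnormalized scores for apple, banana, orange
probs = F.softmax(logits, dim=0)        # exponentiate each logit and normalize so they sum to 1
print(probs)  # tensor([0.6285, 0.2312, 0.1402])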

The Problem with Softmax

While widely used, softmax has a significant computational bottleneck. Calculating the probabilities involves a normalization step where we sum the exponentials of all the logits. This becomes expensive when the number of classes is very large (e.g., in natural language processing tasks with extensive vocabularies).
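To make the bottleneck concrete, here is a minimal sketch (the vocabulary size of 100,000 is an illustrative assumption, not from the article): even if we only need the probability of a single target class, the softmax denominator still requires exponentiating and summing over every class.

import torch

vocab_size = 100_000               # illustrative number of classes, e.g. a large vocabulary
logits = torch.randn(vocab_size)   # one unnormalized score per class

# The denominator touches every class, even though we usually only care
# about the probability of the single correct class. (In practice a
# log-sum-exp trick is used for numerical stability, but the cost is
# still proportional to vocab_size.)
denominator = torch.exp(logits).sum()
p_target = torch.exp(logits[0]) / denominator  # probability of an arbitrary target class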

Noise Contrastive Estimation (NCE)

NCE offers a clever way to sidestep the expensive normalization step of softmax. The core idea is to train a model to distinguish between “real” data samples and artificially generated “noise” samples.

How NCE Works

  • Data and Noise: We have a dataset of real data samples. We create noise samples, typically by corrupting real examples or using a known noise distribution.
  • Discriminative Model: We train a model, often a neural network, to act as a discriminator. The model’s goal is to output a high score for real data samples and a low score for noise samples.
  • Learning by Comparison: During training, we feed the model pairs of data and noise. The model is optimized to maximize the difference in scores between the real and noise samples, as sketched right after this list. Intuitively, it learns to assign higher probabilities to the real data distribution and lower probabilities to the noise.
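Putting these pieces together, here is a minimal sketch of an NCE-style objective in PyTorch. The names (score_fn, real_batch, noise_batch) are illustrative, and the correction term involving the noise distribution's density that full NCE uses is omitted; this is the simplified real-vs-noise logistic loss that the example further below also uses.

import torch
import torch.nn.functional as F

def nce_style_loss(score_fn, real_batch, noise_batch):
    """Binary logistic loss that pushes real scores up and noise scores down."""
    real_scores = score_fn(real_batch)    # unnormalized scores; should be high for real data
    noise_scores = score_fn(noise_batch)  # should be low for noise samples

    real_loss = F.binary_cross_entropy_with_logits(
        real_scores, torch.ones_like(real_scores))     # label 1 = "real"
    noise_loss = F.binary_cross_entropy_with_logits(
        noise_scores, torch.zeros_like(noise_scores))  # label 0 = "noise"
    return real_loss + noise_loss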

NCE vs. Softmax

  • Computational Efficiency: NCE avoids the expensive normalization over all classes, making it suitable for large-scale problems.
  • Approximation: NCE provides an approximation to the true probabilities, while softmax calculates them directly. The quality of this approximation can impact model performance.
  • Focus on Discrimination: NCE concentrates on learning the boundary between real and noise, rather than explicitly modeling the probabilities of each class.

Example

Here’s a toy example of NCE-style training with PyTorch: a small discriminator learns to separate real samples from noise samples. The code is available in this colab notebook.

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt

# -- Data Generation --
def generate_data(num_samples, noise_std=0.05):
    np.random.seed(123)
    centers = [(0, 0), (2, 2)]  # Centers of the two data clusters
    data, labels = [], []
    for i in range(num_samples // 2):
        center_idx = np.random.randint(0, 2)  # Randomly select a cluster center
        center = centers[center_idx]

        minor_noise = noise_std * np.random.randn(2)
        major_noise = 10 * noise_std * np.random.randn(2)

        # Real sample: close to the cluster center
        x, y = center + minor_noise
        data.append([x, y])
        labels.append(center_idx)

        # Noise sample: far from the cluster center
        noise_x, noise_y = center + major_noise
        data.append([noise_x, noise_y])
        labels.append(-1)  # -1 means noise
    return torch.tensor(data, dtype=torch.float32), torch.tensor(labels)


# -- Discriminative Model --
class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(2, 32)   # Input layer to hidden layer
        self.fc2 = nn.Linear(32, 32)  # Hidden layer to hidden layer
        self.fc3 = nn.Linear(32, 1)   # Hidden layer to output layer
        self.sigmoid = nn.Sigmoid()   # Sigmoid for binary classification output

    def forward(self, x):
        x = torch.relu(self.fc1(x))  # Apply ReLU activation
        x = torch.relu(self.fc2(x))  # Apply ReLU activation
        x = self.fc3(x)
        return self.sigmoid(x)  # Output between 0 and 1 for probability-like score


# -- Generate data --
num_samples = 1000
num_train = 800
num_test = num_samples - num_train
data, labels = generate_data(num_samples)
train_data, train_labels = data[:num_train], labels[:num_train]
test_data, test_labels = data[num_train:], labels[num_train:]

# -- Training --
model = Discriminator()
loss_fn = nn.BCELoss()  # The model already applies a sigmoid, so use plain BCE
optimizer = optim.Adam(model.parameters())

for epoch in range(100):
    print('epoch:', epoch)
    for x, label in zip(train_data, train_labels):
        output = model(x)
        if label == -1:  # noise: push the score toward 0
            loss = loss_fn(output, torch.zeros_like(output))
        else:  # real: push the score toward 1
            loss = loss_fn(output, torch.ones_like(output))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()  # Update model parameters

# -- Evaluation --
correct = 0
total = 0
with torch.no_grad():
    for x, label in zip(test_data, test_labels):
        output = model(x)
        predicted = (output > 0.5).item()  # Predict "real" if the score exceeds 0.5
        total += 1
        if label == -1:
            if not predicted:
                correct += 1
        else:
            if predicted:
                correct += 1

print(f"Accuracy: {correct / total * 100:.2f}%")

# -- Visualization --
plt.figure(figsize=(6, 6))
plt.scatter(data[:, 0], data[:, 1], c=labels)
plt.title("Generated Data with NCE Decision Boundary")

# Approximate decision boundary (could be improved)
x_vals = np.linspace(-2, 4, 100)
y_vals = np.linspace(-2, 4, 100)
xv, yv = np.meshgrid(x_vals, y_vals)
grid_input = torch.tensor(np.stack([xv.ravel(), yv.ravel()], axis=1), dtype=torch.float32)
grid_output = model(grid_input).view(xv.shape).detach().numpy()
plt.contour(xv, yv, grid_output, levels=[0.5], colors='k')

plt.show()

Conclusion

NCE offers an efficient and scalable alternative to the softmax function for large-scale machine learning tasks. By reframing probability estimation as a discrimination problem between real data and noise, it sidesteps the expensive normalization step and makes models tractable that would otherwise be impractical to train. As machine learning continues to advance, techniques like NCE remain invaluable for pushing the boundaries of what our models can achieve.
