Building a Deep Learning Model for Voice Classification using PyTorch

Niraporn Kov
3 min read · Mar 8, 2024


In recent years, the proliferation of deepfake technologies has raised concerns about the authenticity of audio and video content on the internet. Voice classification, the task of distinguishing between real and manipulated voices, plays a crucial role in combating misinformation and ensuring the integrity of multimedia content.

In this article, we’ll explore how to build a deep learning model for voice classification using PyTorch, a popular open-source machine learning framework. We’ll walk through the process step by step, from data collection and preprocessing to model training and evaluation.

Problem Statement

The task of voice classification involves identifying whether an audio recording contains a real or a manipulated voice. With the rise of deepfake technologies, detecting manipulated voices has become increasingly challenging. By leveraging deep learning techniques, however, we can develop robust models capable of accurately distinguishing between real and fake voices.

Data Collection and Preprocessing

Before we can train our model, we need a dataset of audio recordings labeled as real or fake. For this demonstration, we'll use a dataset consisting of audio files from two classes: real and fake. We'll wrap the recordings in a custom PyTorch Dataset that loads each file, converts it to mono, resamples it to a fixed rate, and trims or pads it to a fixed duration.

import torch
import torchaudio
from torch.utils.data import Dataset

SAMPLE_RATE = 16000  # target sample rate in Hz (assumed value, not specified in the original)
DURATION = 5         # fixed clip length in seconds (assumed value, not specified in the original)

class AudioDataset(Dataset):
    def __init__(self, file_paths, labels):
        self.file_paths = file_paths
        self.labels = labels

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        audio_path = self.file_paths[idx]
        waveform, sample_rate = torchaudio.load(audio_path)
        # Convert stereo to mono if necessary
        if waveform.size(0) > 1:
            waveform = torch.mean(waveform, dim=0, keepdim=True)
        # Resample if necessary
        if sample_rate != SAMPLE_RATE:
            waveform = torchaudio.transforms.Resample(sample_rate, SAMPLE_RATE)(waveform)
        # Trim or pad audio to a fixed duration
        if waveform.size(1) > SAMPLE_RATE * DURATION:
            waveform = waveform[:, :SAMPLE_RATE * DURATION]
        else:
            waveform = torch.nn.functional.pad(waveform, (0, SAMPLE_RATE * DURATION - waveform.size(1)))
        return waveform, self.labels[idx]
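
The dataset and the split below expect file_paths and labels lists. A minimal sketch of how they might be assembled, assuming a hypothetical layout with one directory per class (data/real and data/fake are placeholder paths, not from the original):

import os

# Placeholder directory layout; adjust to wherever your audio files live.
file_paths, labels = [], []
for label, class_dir in enumerate(["data/real", "data/fake"]):
    for fname in sorted(os.listdir(class_dir)):
        if fname.endswith(".wav"):
            file_paths.append(os.path.join(class_dir, fname))
            labels.append(label)  # 0 = real, 1 = fake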

Dataset Splitting

We’ll split our dataset into training and testing sets using the train_test_split function from scikit-learn.

from sklearn.model_selection import train_test_split

train_files, test_files, train_labels, test_labels = train_test_split(
    file_paths, labels, test_size=0.2, random_state=42
)

Model Architecture

For our model architecture, we'll use a simple convolutional neural network (CNN): two convolutional layers, each followed by max pooling, and two fully connected layers.

import torch
import torch.nn as nn

class AudioClassifier(nn.Module):
    def __init__(self):
        super(AudioClassifier, self).__init__()
        self.conv1 = nn.Conv1d(1, 32, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv1d(32, 64, kernel_size=3, stride=1, padding=1)
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)
        # Two stride-2 poolings divide the sequence length by 4,
        # hence the input size of the first fully connected layer.
        self.fc1 = nn.Linear(64 * (SAMPLE_RATE * DURATION // 4), 128)
        self.fc2 = nn.Linear(128, 2)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = self.pool(torch.relu(self.conv2(x)))
        x = torch.flatten(x, 1)
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x
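
A quick shape check helps catch dimension mistakes before training. This is a minimal sketch that feeds random noise in place of real audio, using the SAMPLE_RATE and DURATION values defined earlier:

model = AudioClassifier()
dummy = torch.randn(4, 1, SAMPLE_RATE * DURATION)  # (batch, channels, samples)
print(model(dummy).shape)  # expected: torch.Size([4, 2]), one logit per class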

Training the Model

We’ll train the model using the training data and monitor its performance over multiple epochs.
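
The loop below relies on a few pieces of setup that the snippets so far leave implicit: a compute device, the number of epochs, and DataLoaders built from the train/test splits. A minimal sketch (the batch size and epoch count here are illustrative choices, not values from the original):

import torch
from torch.utils.data import DataLoader

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
NUM_EPOCHS = 10   # illustrative choice
BATCH_SIZE = 32   # illustrative choice

train_loader = DataLoader(AudioDataset(train_files, train_labels), batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(AudioDataset(test_files, test_labels), batch_size=BATCH_SIZE)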

import torch.optim as optim

# Initialize model, loss function, and optimizer
model = AudioClassifier().to(DEVICE)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())

# Training loop
for epoch in range(NUM_EPOCHS):
    model.train()
    for i, (inputs, labels) in enumerate(train_loader):
        inputs, labels = inputs.to(DEVICE), labels.to(DEVICE)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        # Log the running loss every 10 steps
        if (i + 1) % 10 == 0:
            print(f'Epoch [{epoch + 1}/{NUM_EPOCHS}], Step [{i + 1}/{len(train_loader)}], Loss: {loss.item():.4f}')

Evaluation and Testing

After training, we’ll evaluate the model’s performance on the test set to assess its accuracy and effectiveness in classifying real and fake voices.

# Evaluate model on test set
model.eval()
correct, total = 0, 0
with torch.no_grad():
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(DEVICE), labels.to(DEVICE)
        outputs = model(inputs)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
print(f'Accuracy on test set: {(correct / total) * 100:.2f}%')

The accuracy printed here gives a first measure of how effectively the model separates real and fake voices. Further experimentation and refinement will likely be necessary to improve its performance in real-world scenarios.
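
Accuracy alone can be misleading when the classes are imbalanced, which is common in deepfake datasets. One option, sketched here assuming scikit-learn is available, is to collect the predictions and report per-class precision and recall:

from sklearn.metrics import classification_report

all_preds, all_labels = [], []
with torch.no_grad():
    for inputs, labels in test_loader:
        outputs = model(inputs.to(DEVICE))
        all_preds.extend(outputs.argmax(dim=1).cpu().tolist())
        all_labels.extend(labels.tolist())

print(classification_report(all_labels, all_preds, target_names=['real', 'fake']))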

Conclusion

In this article, we’ve demonstrated how to build a deep learning model for voice classification using PyTorch. By leveraging deep learning techniques and suitable datasets, we can develop robust models capable of distinguishing between real and manipulated voices, thereby contributing to the fight against misinformation and ensuring the integrity of multimedia content.
