Can KANs (Kolmogorov-Arnold Networks) replace MLPs?

Aman Rangapur · Published in ANT BRaiN · 4 min read · May 18, 2024

Every so often, a machine learning paper emerges that significantly shapes the direction of research, as the Transformers and ChatGPT papers did. The Kolmogorov-Arnold Networks (KAN) paper, however, may not fall into this category.

Introduction

Choosing the right neural network architecture is crucial in machine learning. While MLPs (Multi-Layer Perceptrons) have been the backbone of many models thanks to their simplicity and adaptability, KANs (Kolmogorov-Arnold Networks) introduce intriguing new features. This article explores whether KANs can truly replace MLPs or whether they simply add another layer of complexity to existing frameworks.

MLPs are celebrated for their simplicity and flexibility. They function like a chameleon, capable of adapting to various structures and sizes. By simply increasing the size of the linear layers, without any expand-and-shift machinery, we can increase their modeling power, albeit at a higher parameter cost. This flexibility is a significant advantage of MLPs over KANs.

KANs bring a new dimension with their learnable activation functions on the edges of the network, which add complexity much like icing on a cake. These networks allow the intermediate activations themselves to be learned and updated, promising a more nuanced approach to modeling. However, the fundamental structure and effectiveness of MLPs remain hard to beat.

One of the intriguing aspects of KAN is the potential for updating intermediate activation layers. However, we can rewrite a KAN network into an ordinary MLP with the same number of parameters, albeit with a slightly atypical structure.

Technical Breakdown

KAN employs an activation function on its edges, often using B-splines. For simplicity, consider a piece-wise linear function:

def f(x):
    return -2*x if x < 0 else (-0.5*x if x < 1 else 2*x - 2.5)

We can rewrite the above function using multiple ReLUs and linear functions:

import torch

def g(x):
    return 1.5*torch.relu(x) + 2.5*torch.relu(x - 1) - 2*x

Output: the graphs of both functions are identical.

Graphs of f(x) (written without ReLU) and g(x) (written with ReLU); the two curves coincide.
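As a quick sanity check (not in the original post), we can verify numerically that f and g agree, assuming the two definitions above:

import torch

# Compare f and g on a grid of points covering all three segments
xs = torch.linspace(-2.0, 3.0, steps=501)
f_vals = torch.tensor([f(x.item()) for x in xs])  # f works on Python scalars
g_vals = g(xs)                                    # g uses torch ops, so it is element-wise

print(torch.allclose(f_vals, g_vals, atol=1e-5))  # True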

Consider a scenario where you have n input neurons, m output neurons, and a piecewise linear function with k segments. This setup requires n*m*k parameters (k parameters for each edge, and a total of n*m edges). For example, with n = 3, m = 5 and k = 3 (the values used in the code below), that is 45 parameters.

Now consider a single KAN edge. To emulate it, we replicate the input k times, shift each copy by a constant, run each copy (except the first one) through a ReLU, and then combine the copies with a linear layer. Graphically it looks like this (C are constants, W are weights):
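In code, a minimal sketch of a single emulated edge might look like this (the function name, shift constants and weights below are illustrative choices, not values from the KAN paper):

import torch

def kan_edge(x, shifts, weights):
    # Replicate the input k times and shift each copy by a constant
    copies = x.unsqueeze(-1) + shifts                                          # (..., k)
    # Keep the first copy linear, run the rest through ReLU
    copies = torch.cat([copies[..., :1], torch.relu(copies[..., 1:])], dim=-1)
    # The weighted sum plays the role of the per-edge linear layer
    return (copies * weights).sum(dim=-1)

# Reproduces g(x) from above with k = 3, C = [0, 0, -1] and W = [-2, 1.5, 2.5]
x = torch.linspace(-2.0, 3.0, steps=5)
print(kan_edge(x, torch.tensor([0.0, 0.0, -1.0]), torch.tensor([-2.0, 1.5, 2.5])))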

We can now apply this process to each edge. However, it’s important to note that if the grid for the piecewise linear function remains consistent throughout, we can share the intermediate ReLU outputs and simply adjust the weights on top of them, like this:

import torch
import torch.nn as nn

# Define constants
grid_size = 3
input_dim = 3
output_dim = 5
batch_dim = 20

# Generate random input
input_tensor = torch.randn(batch_dim, input_dim)

# Define linear layer
linear_layer = nn.Linear(input_dim * grid_size, output_dim)

# Repeat input tensor
repeated_tensor = input_tensor.unsqueeze(1).repeat(1, grid_size, 1)

# Define shifts
shift_values = torch.linspace(-1, 1, grid_size).reshape(1, grid_size, 1)

# Apply shifts
shifted_tensor = repeated_tensor + shift_values

# Keep the first copy linear, apply ReLU to the remaining copies, then flatten
intermediate_tensor = torch.cat([shifted_tensor[:, :1, :], torch.relu(shifted_tensor[:, 1:, :])], dim=1).flatten(1)

# Apply linear layer
output_tensor = linear_layer(intermediate_tensor)

Output: a tensor of shape (batch_dim, output_dim), i.e. torch.Size([20, 5]).

Now our layer looks like this:
Expand + shift + ReLU
Linear
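Packaged as a module, this two-step layer might look like the following sketch (the class name and the shift grid are my own choices, mirroring the snippet above rather than the reference KAN implementation):

import torch
import torch.nn as nn

class PiecewiseLinearLayer(nn.Module):
    # One KAN-style layer written as expand + shift + ReLU followed by a linear layer
    def __init__(self, input_dim, output_dim, grid_size):
        super().__init__()
        self.register_buffer("shifts", torch.linspace(-1, 1, grid_size).reshape(1, grid_size, 1))
        self.linear = nn.Linear(input_dim * grid_size, output_dim)

    def forward(self, x):                                    # x: (batch, input_dim)
        expanded = x.unsqueeze(1) + self.shifts              # (batch, grid_size, input_dim)
        expanded = torch.cat([expanded[:, :1, :], torch.relu(expanded[:, 1:, :])], dim=1)
        return self.linear(expanded.flatten(1))              # (batch, output_dim)

layer = PiecewiseLinearLayer(input_dim=3, output_dim=5, grid_size=3)
print(layer(torch.randn(20, 3)).shape)                       # torch.Size([20, 5])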

Consider three such layers stacked one after another:
Expand + shift + ReLU (Layer 1 starts here)
Linear
Expand + shift + ReLU (Layer 2 starts here)
Linear
Expand + shift + ReLU (Layer 3 starts here)
Linear

Ignoring the initial input expansion, we can rearrange:
Linear (Layer 1 starts here)
Expand + shift + ReLU
Linear (Layer 2 starts here)
Expand + shift + ReLU
Linear

And a block of the form:
Linear
Expand + shift + ReLU
is just an ordinary MLP layer with a slightly unusual (piecewise linear) activation.
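To make the regrouped view concrete, here is a minimal sketch (again with names of my own choosing) that treats the expand + shift + ReLU step as an activation placed after ordinary linear layers:

import torch
import torch.nn as nn

class ShiftedReLU(nn.Module):
    # The expand + shift + ReLU step viewed as an activation:
    # it maps (batch, dim) to (batch, dim * grid_size)
    def __init__(self, grid_size):
        super().__init__()
        self.register_buffer("shifts", torch.linspace(-1, 1, grid_size).reshape(1, grid_size, 1))

    def forward(self, x):
        expanded = x.unsqueeze(1) + self.shifts
        expanded = torch.cat([expanded[:, :1, :], torch.relu(expanded[:, 1:, :])], dim=1)
        return expanded.flatten(1)

# The regrouped stack: Linear, activation, Linear, activation, final Linear
grid_size, width = 3, 16
mlp_like = nn.Sequential(
    nn.Linear(3, width), ShiftedReLU(grid_size),
    nn.Linear(width * grid_size, width), ShiftedReLU(grid_size),
    nn.Linear(width * grid_size, 5),
)
print(mlp_like(torch.randn(20, 3)).shape)                    # torch.Size([20, 5])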

The magic trick, then, is this rearrangement of layers: with the right regrouping, a KAN layer turns into a typical MLP layer. It's like solving a Rubik's cube, where the right moves lead to the desired outcome.

While KAN brings some interesting concepts to the table, it doesn’t necessarily replace the good old MLP. It’s like adding a new ingredient to a classic recipe. The new flavor might be exciting, but it doesn’t necessarily make the dish better. The MLP, with its simplicity, flexibility, and proven effectiveness, will continue to hold its ground in the ever-evolving landscape of machine learning.
