KAN: Kolmogorov–Arnold Networks

Rabi Kumar Singh
6 min read · May 4, 2024

Figure: MLP vs KAN

The key points of the paper are:

  • KANs vs MLPs: Kolmogorov-Arnold Networks (KANs) are introduced as alternatives to Multi-Layer Perceptrons (MLPs), featuring learnable activation functions on edges instead of fixed ones on nodes.
  • No Linear Weights: KANs eliminate linear weights, replacing them with univariate functions parameterized as splines, enhancing accuracy and interpretability.
  • Performance: KANs outperform MLPs with smaller sizes in data fitting and PDE solving, and exhibit faster neural scaling laws.
  • Interactivity & Visualization: KANs offer intuitive visualization and user interaction, aiding in the discovery of mathematical and physical laws. They are presented as valuable tools for scientific collaboration.

Introduction

The paper presents a detailed comparison between Multi-Layer Perceptrons (MLPs) and Kolmogorov-Arnold Networks (KANs), proposing KANs as a promising alternative to MLPs. Here are the key takeaways:

  • MLP Limitations: MLPs are widely used for approximating nonlinear functions but have drawbacks, such as being less interpretable and consuming many parameters.
  • KAN Introduction: KANs are inspired by the Kolmogorov-Arnold representation theorem and feature learnable activation functions on edges, with no linear weight matrices.
  • KAN Advantages: KANs can outperform MLPs in accuracy and parameter efficiency, especially in tasks like PDE solving, due to their unique structure.
  • KAN Structure: Unlike MLPs, KANs replace each weight parameter with a learnable 1D function parametrized as a spline, and their nodes simply sum incoming signals without non-linearities.
  • Generalization of Representation: KANs generalize the original Kolmogorov-Arnold representation to allow for arbitrary widths and depths, enhancing their potential as foundational models for AI and science.
  • Combination of Techniques: They combine splines and MLPs (Multi-Layer Perceptrons), utilizing the strengths of both to improve accuracy and interpretability.
  • Overcoming Dimensionality: While splines are limited by the curse of dimensionality, KANs leverage the feature learning of MLPs to better handle high-dimensional data.
  • Dual Optimization: KANs are designed to optimize both the compositional structure and univariate functions, leading to more accurate learning of complex functions.

Kolmogorov–Arnold Networks (KAN)

The paper introduces Kolmogorov–Arnold Networks (KANs), a new type of neural network inspired by the Kolmogorov-Arnold representation theorem, as an alternative to Multi-Layer Perceptrons (MLPs). Here are the key takeaways:

  • Inspiration: KANs are inspired by the Kolmogorov-Arnold representation theorem, which suggests a different approach to neural network design compared to MLPs.
  • Expressive Power: The paper provides theoretical guarantees for the expressive power of KANs and discusses their neural scaling laws.
  • Grid Extension: A grid extension technique is proposed to enhance the accuracy of KANs, letting a trained KAN be refined to a finer spline grid without retraining from scratch (a small sketch of the idea follows this list).
  • Interpretability: The authors suggest simplification techniques to make KANs more interpretable, facilitating understanding and interaction with the model.
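
The grid-extension idea can be illustrated with ordinary B-splines: a spline learned on a coarse knot grid is re-fit, by least squares, onto a finer grid, so training can continue at higher resolution without starting over. The sketch below only illustrates that mechanism; the knot counts, domain, and use of SciPy are arbitrary choices, not the paper's implementation.

```python
# A rough sketch of grid extension: re-fit a spline learned on a coarse knot
# grid onto a finer grid by least squares. Knot counts and domain are arbitrary.
import numpy as np
from scipy.interpolate import BSpline, make_lsq_spline

k = 3                                    # cubic splines
coarse = np.linspace(-1, 1, 6)           # coarse knot grid
fine = np.linspace(-1, 1, 21)            # finer knot grid

# Repeat boundary knots k extra times, as B-spline bases require.
t_coarse = np.r_[[coarse[0]] * k, coarse, [coarse[-1]] * k]
t_fine = np.r_[[fine[0]] * k, fine, [fine[-1]] * k]

# Stand-in for a "trained" coarse spline: random coefficients.
c_coarse = np.random.randn(len(t_coarse) - k - 1)
coarse_spline = BSpline(t_coarse, c_coarse, k)

# Sample the coarse spline densely and least-squares fit the fine-grid spline.
x = np.linspace(-1, 1, 500)
fine_spline = make_lsq_spline(x, coarse_spline(x), t_fine, k)

# The coarse spline lives inside the finer spline space, so the gap is tiny.
print(np.max(np.abs(fine_spline(x) - coarse_spline(x))))
```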

Kolmogorov-Arnold Representation theorem

  • Theorem Overview: It states that any multivariate continuous function on a bounded domain can be represented as a composition of continuous univariate functions and addition (written out explicitly after this list).
  • Machine Learning Impact: Initially deemed impractical for machine learning due to non-smooth and complex 1D functions, the theorem is now seen with optimism for its potential in learning high-dimensional functions.
  • Generalization Approach: The authors propose generalizing the theorem beyond its original two-layer structure to accommodate arbitrary widths and depths, which could facilitate smoother representations.
  • Philosophical Perspective: Emphasizing typical rather than worst-case scenarios, the authors suggest that most scientific and daily functions are smooth and structured, making them suitable for machine learning applications.
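
In its standard form, the theorem says that a continuous ( f : [0,1]^n \to \mathbb{R} ) can be written using ( 2n + 1 ) outer functions ( \Phi_q ) and ( n(2n+1) ) inner functions ( \phi_{q,p} ), all continuous and univariate:

f(x_1, \dots, x_n) = \sum_{q=1}^{2n+1} \Phi_q \left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)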

KAN architecture

This section covers the architecture of Kolmogorov-Arnold Networks (KANs), which are designed for supervised learning: approximating a function ( f ) from input-output pairs ( \{x_i, y_i\} ). Here are the key takeaways:

  • B-Spline Parametrization: KANs use B-spline curves to parametrize the univariate functions, giving each one learnable coefficients (a minimal code sketch combining the points below follows this list).
  • Network Structure: They are structured as two-layer neural networks with activation functions on edges, and nodes perform simple summation.
  • Width and Depth: The basic two-layer KAN is too simple to approximate most functions well, prompting a generalization to wider and deeper networks.
  • Theoretical Foundation: KANs are inspired by the Kolmogorov-Arnold representation theorem, but a generalized version for deeper networks is not yet established.
  • KAN Layer Definition: A KAN layer is defined as a matrix of 1D functions, each with trainable parameters, allowing for the creation of deeper networks by stacking more layers.
  • Notation and Structure: The shape of a KAN is represented by an integer array indicating the number of nodes in each layer. Activation functions connect nodes between layers, and the output of each node is the sum of its incoming activations.
  • Matrix Representation: The activations between layers are represented in matrix form, where each element corresponds to a 1D function connecting nodes across layers.
  • Network Composition: A general KAN network is composed of multiple layers, and the output is obtained by applying the function matrices of all layers to the input vector.
  • KAN vs MLP: KANs integrate linear transformations and nonlinearities into a single function matrix (Φ), unlike MLPs, which treat them separately. KANs are trained using backpropagation and can be represented more abstractly and intuitively than MLPs.
  • Implementation Tricks: To optimize KANs, several techniques are used:
  • Residual Activation Functions: Incorporating a basis function ( b(x) ) alongside a spline function, with the activation function ( \phi(x) ) being their sum.
  • Initialization Scales: Activation functions are initialized with the spline part close to zero and weights initialized using Xavier initialization.
  • Spline Grid Updates: Spline grids are updated dynamically during training to accommodate activation values outside the predefined region.
  • Parameter Efficiency: Although KANs appear to require more parameters than MLPs, they often need a smaller network size, leading to better generalization and interpretability.
  • Scaling Laws: The paper compares scaling exponents from various theories, highlighting that KANs have a scaling exponent of ( k + 1 ), which is favorable compared to other models.
  • Learnable Activation Functions: Unlike MLPs, which have fixed activation functions on nodes, KANs feature learnable activation functions on the edges. This means that every weight parameter in a KAN is replaced by a univariate function, typically parametrized as a spline.
  • No Linear Weights: KANs do not use linear weight matrices. Instead, each connection’s weight is a learnable 1D function, allowing for a more flexible and potentially more powerful model.
  • Summation Nodes: The nodes in KANs simply sum the incoming signals without applying any non-linearities, which is a departure from the typical neuron model in MLPs.
  • Interpretability and Visualization: Due to their structure, KANs can be more interpretable than MLPs. They allow for intuitive visualization and easier interaction with human users.
  • Applications: KANs have been shown to outperform MLPs in terms of accuracy and interpretability, particularly in tasks like data fitting and solving partial differential equations.
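
To make the pieces above concrete, here is a minimal sketch of a single KAN layer in PyTorch. It is an illustration based on the description above, not the authors' pykan implementation: the class name KANLayer, the Cox-de Boor basis routine, and the hyperparameter defaults (a 5-interval grid, cubic splines, inputs assumed to lie roughly in [-1, 1]) are choices made for the sketch, and the dynamic spline-grid updates are left out.

```python
# A minimal sketch of one KAN layer (illustrative; not the authors' pykan code).
# Each edge (input i -> output j) carries phi(x) = w_b * silu(x) + w_s * spline(x),
# splines are B-splines on a fixed knot grid, spline coefficients start near zero,
# and output nodes simply sum their incoming edge activations.
import torch
import torch.nn as nn
import torch.nn.functional as F


def b_spline_basis(x, grid, k):
    """Cox-de Boor recursion. x: (batch, in_dim), grid: (G,) -> (batch, in_dim, G-k-1)."""
    x = x.unsqueeze(-1)
    # Degree-0 bases: indicators of the knot intervals.
    bases = ((x >= grid[:-1]) & (x < grid[1:])).to(x.dtype)
    for d in range(1, k + 1):
        left = (x - grid[:-(d + 1)]) / (grid[d:-1] - grid[:-(d + 1)])
        right = (grid[d + 1:] - x) / (grid[d + 1:] - grid[1:-d])
        bases = left * bases[..., :-1] + right * bases[..., 1:]
    return bases


class KANLayer(nn.Module):
    def __init__(self, in_dim, out_dim, grid_size=5, k=3, x_range=(-1.0, 1.0)):
        super().__init__()
        self.k = k
        # Uniform knot grid, extended by k knots on each side.
        h = (x_range[1] - x_range[0]) / grid_size
        grid = torch.arange(-k, grid_size + k + 1, dtype=torch.float32) * h + x_range[0]
        self.register_buffer("grid", grid)
        n_basis = grid.numel() - k - 1          # = grid_size + k basis functions
        # Learnable spline coefficients for every edge, initialized near zero.
        self.spline_coef = nn.Parameter(0.01 * torch.randn(in_dim, out_dim, n_basis))
        # Scales for the residual basis b(x) = silu(x) and for the spline part.
        self.w_b = nn.Parameter(torch.empty(in_dim, out_dim))
        self.w_s = nn.Parameter(torch.ones(in_dim, out_dim))
        nn.init.xavier_uniform_(self.w_b)       # Xavier init for the residual weights

    def forward(self, x):                       # x: (batch, in_dim)
        basis = b_spline_basis(x, self.grid, self.k)                    # (batch, in, n_basis)
        spline = torch.einsum("bin,ion->bio", basis, self.spline_coef)  # (batch, in, out)
        phi = self.w_b * F.silu(x).unsqueeze(-1) + self.w_s * spline    # edge activations
        return phi.sum(dim=1)                   # nodes only sum, no extra nonlinearity
```

Stacking such layers, for example nn.Sequential(KANLayer(2, 5), KANLayer(5, 1)) for a [2, 5, 1] network, gives a deeper KAN: every nonlinearity lives on the edges, and the nodes only sum.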

KAN vs MLP

The architecture of Kolmogorov–Arnold Networks (KANs) and Multi-Layer Perceptrons (MLPs) differ significantly in several key aspects:

  1. Activation Functions:
  • MLPs: They have fixed activation functions on nodes (neurons), such as ReLU or sigmoid functions.
  • KANs: Instead of fixed activation functions on nodes, KANs have learnable activation functions on edges (weights), typically parametrized as splines.

2. Weight Parameters:

  • MLPs: Use linear weight matrices where inputs are multiplied by weights.
  • KANs: Have no linear weights at all. Every weight parameter in a KAN is replaced by a univariate function.

3. Node Operations:

  • MLPs: Neurons sum the weighted inputs and then apply a non-linear activation function.
  • KANs: Nodes simply sum incoming signals without applying any non-linearities (see the composition formulas after this item).
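
Put side by side in the paper's notation, with ( \Phi_\ell ) the matrix of learnable 1D functions in layer ( \ell ), ( W_\ell ) an MLP weight matrix, and ( \sigma ) a fixed nonlinearity, the two forward passes compose as

KAN(x) = (\Phi_{L-1} \circ \Phi_{L-2} \circ \cdots \circ \Phi_0)(x)

MLP(x) = (W_{L-1} \circ \sigma \circ W_{L-2} \circ \sigma \circ \cdots \circ W_0)(x)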

4. Interpretability and Visualization:

  • MLPs: Typically less interpretable without post-hoc analysis tools.
  • KANs: Offer intuitive visualization and easier interaction with human users, making them more interpretable.

5. Performance:

  • MLPs: While they are powerful due to the universal approximation theorem, they can be less efficient in certain tasks.
  • KANs: Smaller KANs can achieve comparable or better accuracy than much larger MLPs in tasks like data fitting and solving partial differential equations. KANs also possess faster neural scaling laws than MLPs (see the scaling relation below).
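
Concretely, the paper's scaling claim is that with ( N ) parameters and splines of order ( k ) (cubic, ( k = 3 ), in its experiments), KAN test loss falls roughly as

\ell \propto N^{-(k+1)} = N^{-4}

while the MLP baselines it reports scale noticeably more slowly.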

6. Applications:

  • MLPs: Are foundational building blocks of today’s deep learning models, used for approximating nonlinear functions.
  • KANs: Can be useful collaborators helping scientists (re)discover mathematical and physical laws due to their structure and interpretability.

Connect with me here

LinkedIn, Kaggle, GitHub, HuggingFace
