Neural networks are well known to be over-parameterized and can often easily fit data to near-zero training loss while still generalizing decently on the test set. Although all these parameters are initialized at random, the optimization process consistently leads to similarly good outcomes. This is true even when the number of model parameters exceeds the number of training data points.
The neural tangent kernel (NTK) (Jacot et al. 2018) is a kernel that describes the evolution of neural networks during training via gradient descent. It leads to great insights into why neural networks with enough width can consistently converge to a global minimum when trained to minimize an empirical loss. In this post, we will do a deep dive into the motivation and definition of NTK, as well as the proof that the training dynamics of infinite-width networks are deterministic across different random initializations, by characterizing the NTK in that setting.
Different from my previous posts, this one mainly focuses on a small number of core papers, less on the breadth of the literature review in the field. There are many interesting works after NTK, with modification or expansion of the theory for understanding the learning dynamics of NNs, but they won't be covered here. The goal is to show all the math behind NTK in a clear and easy-to-follow format, so the post is quite math-intensive. If you notice any mistakes, please let me know and I will be happy to correct them quickly. Thanks in advance!
Basics
This section reviews several basic concepts that are core to understanding the neural tangent kernel. Feel free to skip it.
Vector-to-vector Derivative
Given an input vector $\mathbf{x} \in \mathbb{R}^n$ and a function $f: \mathbb{R}^n \to \mathbb{R}^m$, we write the output as $\mathbf{y} = f(\mathbf{x}) \in \mathbb{R}^m$.
Throughout the post, I use integer subscript(s) to refer to a single entry out of a vector or matrix value; e.g., $y_i$ is the $i$-th entry of $\mathbf{y}$ and $W_{ij}$ is the entry of a matrix $W$ at row $i$ and column $j$.
The gradient of a vector with respect to a vector is the Jacobian matrix $\frac{\partial \mathbf{y}}{\partial \mathbf{x}} \in \mathbb{R}^{m \times n}$, whose entry at location $(i, j)$ is $\big[\frac{\partial \mathbf{y}}{\partial \mathbf{x}}\big]_{ij} = \frac{\partial y_i}{\partial x_j}$.
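To make the definition concrete, here is a minimal numerical sketch that approximates a Jacobian with finite differences; the function `f` and the step size `eps` below are arbitrary illustrative choices, not anything from the papers discussed later.

```python
import numpy as np

def f(x):
    # An arbitrary illustrative function f: R^3 -> R^2.
    return np.array([x[0] * x[1], np.sin(x[2])])

def jacobian(f, x, eps=1e-6):
    # Finite-difference approximation of the m x n Jacobian dy/dx.
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        x_pert = x.copy()
        x_pert[j] += eps
        J[:, j] = (f(x_pert) - fx) / eps
    return J

x = np.array([1.0, 2.0, 3.0])
print(jacobian(f, x))  # approximately [[2, 1, 0], [0, 0, cos(3)]]
```

An autodiff library would give the same matrix exactly; the finite-difference version just keeps the example dependency-free.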
Differential Equations
Differential equations describe the relationship between one or multiple functions and their derivatives. There are two main types of differential equations.
- (1) ODE (ordinary differential equation) contains only an unknown function of a single independent variable and its derivatives. ODEs are the main form of differential equations used in this post. A general form of an ODE looks like $F\big(t, y(t), y'(t), \dots, y^{(n)}(t)\big) = 0$ for an unknown function $y(t)$.
- (2) PDE (partial differential equation) contains unknown multivariable functions and their partial derivatives.
Let's review the simplest case of a differential equation and its solution. Separation of variables (the Fourier method) can be used when all the terms containing one variable can be moved to one side, while all the other terms are moved to the other side. For example:
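take the exponential-growth ODE $\frac{dy}{dt} = k y$ with constant $k$ and initial value $y(0) = y_0$ (a standard textbook instance, chosen here only for concreteness):

```latex
% Separation of variables: move all y terms to the left, all t terms to the right.
\frac{dy}{dt} = k y
\;\Longrightarrow\; \frac{dy}{y} = k\,dt
\;\Longrightarrow\; \int \frac{dy}{y} = \int k\,dt
\;\Longrightarrow\; \ln |y| = k t + C
\;\Longrightarrow\; y(t) = y_0 e^{k t}
```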
Central Limit Theorem
Given a collection of i.i.d. random variables $x_1, \dots, x_N$ with mean $\mu$ and finite variance $\sigma^2$, the central limit theorem (CLT) states that the rescaled sample mean converges in distribution to a Gaussian as the sample size grows: $\sqrt{N}\big(\frac{1}{N}\sum_{i=1}^N x_i - \mu\big) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$ as $N \to \infty$.
The CLT can also apply to multidimensional random vectors, and then instead of a single scalar variance $\sigma^2$, the limiting distribution is characterized by a covariance matrix.
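A quick simulation illustrates the statement; the uniform distribution, sample size, and number of trials below are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 1000, 10000

# Draw i.i.d. samples from a (non-Gaussian) uniform distribution on [0, 1]:
# mean mu = 0.5, variance sigma^2 = 1/12.
samples = rng.uniform(0.0, 1.0, size=(trials, N))
scaled_means = np.sqrt(N) * (samples.mean(axis=1) - 0.5)

# The rescaled sample means should be approximately N(0, 1/12).
print(scaled_means.mean(), scaled_means.var())  # ~0.0 and ~0.0833
```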
Taylor Expansion
The Taylor expansion expresses a function as an infinite sum of components, each represented in terms of this function's derivatives. The Taylor expansion of a function $f(x)$ around the point $x = a$ is $f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!} (x - a)^n$.
The first-order Taylor expansion is often used as a linear approximation of the function value: $f(x) \approx f(a) + f'(a)(x - a)$.
Kernel & Kernel Methods
A kernel is essentially a similarity function between two data points, $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$.
Depending on the problem structure, some kernels can be decomposed into two feature maps, one corresponding to each data point, and the kernel value is an inner product of these two feature vectors: $K(\mathbf{x}, \mathbf{x}') = \langle \varphi(\mathbf{x}), \varphi(\mathbf{x}') \rangle$.
Kernel methods are a type of non-parametric, instance-based machine learning algorithm. Assuming we know the labels of all the training samples, the prediction for a new input is a kernel-weighted combination of the training labels, where the weights measure the similarity between the new input and each training sample.
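Below is a minimal sketch of this idea using a Nadaraya-Watson style kernel-weighted average; the RBF kernel, the 1-D toy data, and the function names are assumptions made only for illustration.

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0):
    # Gaussian (RBF) similarity between two scalar inputs.
    return np.exp(-(x1 - x2) ** 2 / (2 * length_scale ** 2))

def kernel_regression(x_train, y_train, x_query):
    # Predict a kernel-weighted average of the training labels.
    weights = np.array([rbf_kernel(x, x_query) for x in x_train])
    return weights @ y_train / weights.sum()

x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = np.sin(x_train)
print(kernel_regression(x_train, y_train, 1.5))  # smoothed estimate near sin(1.5)
```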
Gaussian Processes
A Gaussian process (GP) is a non-parametric method that models a multivariate Gaussian probability distribution over a collection of random variables. A GP assumes a prior over functions and then updates the posterior over functions based on what data points are observed.
Given a collection of data points, a GP assumes that their function values follow a joint Gaussian distribution whose covariance matrix is defined by a kernel evaluated on pairs of inputs; the prediction for a new point is obtained by conditioning this joint Gaussian on the observed values.
Check this post for a high-quality and highly visual tutorial on what Gaussian processes are.
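As a small self-contained sketch of the update from prior to posterior, the snippet below conditions a zero-mean GP with an RBF kernel on a few observed points; the kernel choice, toy data, and jitter constant are illustrative assumptions.

```python
import numpy as np

def rbf(xa, xb, length_scale=1.0):
    # Covariance matrix K[i, j] = exp(-(xa_i - xb_j)^2 / (2 l^2)).
    d = xa[:, None] - xb[None, :]
    return np.exp(-d ** 2 / (2 * length_scale ** 2))

# Observed points and query points.
x_obs = np.array([-2.0, 0.0, 1.5])
y_obs = np.sin(x_obs)
x_new = np.linspace(-3, 3, 50)

# GP posterior: condition the joint Gaussian on the observations.
K = rbf(x_obs, x_obs) + 1e-8 * np.eye(len(x_obs))  # jitter for numerical stability
K_s = rbf(x_new, x_obs)
K_ss = rbf(x_new, x_new)
K_inv = np.linalg.inv(K)

posterior_mean = K_s @ K_inv @ y_obs
posterior_cov = K_ss - K_s @ K_inv @ K_s.T
print(posterior_mean[:5])
print(np.diag(posterior_cov)[:5])
```

The posterior mean interpolates the observations and the posterior variance shrinks near them, which is the GP version of "updating the belief over functions after seeing data."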
Notation
Let us consider a fully-connected neural network with parameters $\theta$, mapping an input $\mathbf{x}$ to an output $f(\mathbf{x}; \theta)$; the network has $L$ layers and the $l$-th hidden layer has width $n_l$.
The training dataset contains $N$ samples, $\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^N$.
Now let's look into the forward pass computation in every layer in detail. For each layer, the pre-activation is an affine transformation of the previous layer's output, followed by a pointwise nonlinearity in every hidden layer.
Note that the NTK parameterization applies a rescaling weight of $1/\sqrt{n_{l-1}}$ (the inverse square root of the layer's input width) to the pre-activation of layer $l$, so that the activations remain of order one as the widths grow.
All the network parameters are initialized as i.i.d. Gaussians $\mathcal{N}(0, 1)$ in the following analysis.
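The following is a minimal sketch of one possible forward pass under this parameterization: every weight and bias is drawn i.i.d. from $\mathcal{N}(0, 1)$ and each pre-activation is rescaled by one over the square root of the fan-in. The layer widths, the ReLU nonlinearity, and the helper names are illustrative assumptions, not the exact setup of the original paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(layer_sizes):
    # All weights and biases are i.i.d. standard Gaussians, N(0, 1).
    return [(rng.standard_normal((n_out, n_in)), rng.standard_normal(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(params, x):
    # NTK parameterization: each pre-activation is rescaled by 1 / sqrt(fan_in).
    h = x
    for i, (W, b) in enumerate(params):
        a = W @ h / np.sqrt(W.shape[1]) + b
        h = np.maximum(a, 0.0) if i < len(params) - 1 else a  # ReLU hidden layers, linear output
    return h

params = init_params([3, 1024, 1024, 1])  # widths are arbitrary for illustration
print(forward(params, rng.standard_normal(3)))
```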
Neural Tangent Kernel
Neural tangent kernel (NTK) (Jacot et al. 2018) is an important concept for understanding neural network training via gradient descent. At its core, it explains how updating the model parameters on one data sample affects the predictions for other samples.
Let's start with the intuition behind NTK, step by step.
The empirical loss function $\mathcal{L}: \mathbb{R}^P \to \mathbb{R}_+$ that we minimize during training averages a per-sample cost $\ell$ over the training data,

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^N \ell\big(f(\mathbf{x}^{(i)}; \theta), y^{(i)}\big)$$

and according to the chain rule, the gradient of the loss is:

$$\nabla_\theta \mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^N \nabla_\theta f(\mathbf{x}^{(i)}; \theta)^\top \, \nabla_f \ell\big(f(\mathbf{x}^{(i)}; \theta), y^{(i)}\big)$$

When tracking how the network parameter $\theta$ evolves in time under gradient descent with an infinitesimally small learning rate, the updates can be described by gradient flow:

$$\frac{d\theta}{dt} = -\nabla_\theta \mathcal{L}(\theta)$$

Again, by the chain rule, the network output evolves according to the derivative:

$$\frac{d f(\mathbf{x}; \theta)}{dt} = \nabla_\theta f(\mathbf{x}; \theta) \frac{d\theta}{dt} = -\frac{1}{N} \sum_{i=1}^N \underbrace{\nabla_\theta f(\mathbf{x}; \theta)\, \nabla_\theta f(\mathbf{x}^{(i)}; \theta)^\top}_{\text{Neural tangent kernel}} \nabla_f \ell\big(f(\mathbf{x}^{(i)}; \theta), y^{(i)}\big)$$

Here we find the Neural Tangent Kernel (NTK), the factor marked in the above formula,

$$K(\mathbf{x}, \mathbf{x}'; \theta) = \nabla_\theta f(\mathbf{x}; \theta)\, \nabla_\theta f(\mathbf{x}'; \theta)^\top \in \mathbb{R}^{n_L \times n_L}$$

where each entry in the output matrix at location $(m, n)$, $1 \leq m, n \leq n_L$, is

$$K_{m,n}(\mathbf{x}, \mathbf{x}'; \theta) = \sum_{p=1}^{P} \frac{\partial f_m(\mathbf{x}; \theta)}{\partial \theta_p} \frac{\partial f_n(\mathbf{x}'; \theta)}{\partial \theta_p}$$

The "feature map" form of one input $\mathbf{x}$ is $\varphi(\mathbf{x}) = \nabla_\theta f(\mathbf{x}; \theta)$, so the kernel is an inner product of the gradients of the network output with respect to the parameters, $K(\mathbf{x}, \mathbf{x}'; \theta) = \langle \varphi(\mathbf{x}), \varphi(\mathbf{x}') \rangle$.
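To connect the definition to something computable, here is a sketch of the empirical (finite-width) NTK for a scalar-output MLP using `jax` automatic differentiation: the kernel value is the inner product of the two parameter gradients. The architecture, erf nonlinearity, and widths are placeholder choices.

```python
import jax
import jax.numpy as jnp

def init_params(key, sizes):
    # i.i.d. N(0, 1) parameters, as in the NTK parameterization.
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        key, k_w, k_b = jax.random.split(key, 3)
        params.append((jax.random.normal(k_w, (n_out, n_in)),
                       jax.random.normal(k_b, (n_out,))))
    return params

def f(params, x):
    # Scalar-output MLP with 1/sqrt(width) rescaling and erf nonlinearity.
    h = x
    for W, b in params[:-1]:
        h = jax.scipy.special.erf(W @ h / jnp.sqrt(W.shape[1]) + b)
    W, b = params[-1]
    return (W @ h / jnp.sqrt(W.shape[1]) + b)[0]

def empirical_ntk(params, x1, x2):
    # K(x1, x2) = <df/dtheta(x1), df/dtheta(x2)>, summed over all parameters.
    g1 = jax.grad(f)(params, x1)
    g2 = jax.grad(f)(params, x2)
    leaves1, _ = jax.tree_util.tree_flatten(g1)
    leaves2, _ = jax.tree_util.tree_flatten(g2)
    return sum(jnp.vdot(a, b) for a, b in zip(leaves1, leaves2))

key = jax.random.PRNGKey(0)
params = init_params(key, [2, 512, 512, 1])
x1, x2 = jnp.array([1.0, 0.5]), jnp.array([-0.3, 2.0])
print(empirical_ntk(params, x1, x2))
```

For a multi-dimensional output one would use the full Jacobians instead of `jax.grad`, giving the $n_L \times n_L$ matrix described above.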
Infinite Width Networks
To understand why the effect of one gradient descent update is so similar across different initializations of the network parameters, several pioneering theoretical works start with infinite-width networks. We will look into a detailed proof, using the NTK, of how it guarantees that infinite-width networks can converge to a global minimum when trained to minimize an empirical loss.
Connection with Gaussian Processes
Deep neural networks have a deep connection with Gaussian processes (Neal 1994). The output functions of an $L$-layer network, with parameters initialized i.i.d. as above, converge to i.i.d. centered Gaussian processes as the layer widths go to infinity.
Lee & Bahri et al. (2018) showed a proof by mathematical induction:
(1) Let's start with the base case of a network with a single hidden layer.
Since the weights and biases are initialized i.i.d., and each output unit is a sum of contributions from infinitely many i.i.d. hidden units, all the output dimensions of this network converge to i.i.d. centered Gaussians by the central limit theorem as the hidden width goes to infinity; in other words, the single-hidden-layer network defines a Gaussian process in the infinite-width limit.
(2) Using induction, we first assume the proposition is true for a network of $L-1$ layers.
Then we need to prove the proposition is also true for the corresponding network of $L$ layers.
We can infer that the expectation of the sum of contributions of the previous hidden layers is zero:
Since
When
The Gaussian process derived in the above process is referred to as the Neural Network Gaussian Process (NNGP) (Lee & Bahri et al. 2018).
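A rough empirical sanity check of the NNGP view (under the i.i.d. $\mathcal{N}(0,1)$ initialization and $1/\sqrt{\text{width}}$ scaling assumed above): across many independently initialized wide networks, the output at a fixed input should look like a centered Gaussian. The width, input dimension, and ReLU nonlinearity below are arbitrary choices, and this only checks a single marginal rather than the full GP claim.

```python
import numpy as np

rng = np.random.default_rng(0)
width, n_draws = 2048, 1000
x = rng.standard_normal(10)  # a fixed input; dimension chosen arbitrarily

outputs = []
for _ in range(n_draws):
    # One-hidden-layer network under NTK parameterization, fresh i.i.d. N(0, 1) parameters.
    W0 = rng.standard_normal((width, x.size))
    b0 = rng.standard_normal(width)
    W1 = rng.standard_normal((1, width))
    b1 = rng.standard_normal(1)
    h = np.maximum(W0 @ x / np.sqrt(x.size) + b0, 0.0)
    outputs.append((W1 @ h / np.sqrt(width) + b1)[0])

outputs = np.array(outputs)
# Across random initializations, the output should be approximately a centered Gaussian.
print(outputs.mean(), outputs.std())
```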
Deterministic Neural Tangent Kernel
Finally, we are now prepared to look into the most critical proposition from the NTK paper:
When the width of the hidden layers goes to infinity (with an infinitesimally small learning rate), the NTK converges to a limit that is:
- (1) deterministic at initialization, meaning that the kernel does not depend on the random initialization values and is determined only by the model architecture; and
- (2) stays constant during training.
The proof depends on mathematical induction as well:
(1) First of all, we always have
(2) Now when
Note that
Next let's check the case
The output function of this
And we know its derivative with respect to different sets of parameters; let us denote
where
The NTK for this
where each individual entry at location
When
and the red section has the limit:
Later, Arora et al. (2019) provided a proof under a weaker limit that does not require every hidden layer to be infinitely wide, but only requires the minimum width to be sufficiently large.
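As a quick numerical illustration of the "deterministic at initialization" property, the sketch below computes one NTK entry of a one-hidden-layer ReLU network with analytic gradients and measures how much it fluctuates across random initializations as the width grows; all the concrete choices (ReLU, the widths, 20 seeds) are assumptions for the demo.

```python
import numpy as np

def ntk_entry(x1, x2, width, rng):
    # Empirical NTK of f(x) = a . relu(W x / sqrt(d) + b) / sqrt(m) + c
    # under the NTK parameterization, computed with analytic gradients.
    d, m = x1.size, width
    W = rng.standard_normal((m, d))
    b = rng.standard_normal(m)
    a = rng.standard_normal(m)  # the output bias c contributes a constant 1 to the kernel

    def grads(x):
        u = W @ x / np.sqrt(d) + b
        act, dact = np.maximum(u, 0.0), (u > 0).astype(float)
        g_a = act / np.sqrt(m)                 # df/da
        g_b = a * dact / np.sqrt(m)            # df/db
        g_W = np.outer(g_b, x / np.sqrt(d))    # df/dW
        return g_a, g_b, g_W

    g1, g2 = grads(x1), grads(x2)
    return sum(np.vdot(p, q) for p, q in zip(g1, g2)) + 1.0  # +1 from the output bias

rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal(5), rng.standard_normal(5)
for width in (64, 1024, 16384):
    vals = [ntk_entry(x1, x2, width, np.random.default_rng(seed)) for seed in range(20)]
    print(width, np.std(vals))  # the spread across initializations shrinks as width grows
```

The standard deviation across seeds should shrink as the width grows, matching the intuition that the kernel becomes a deterministic function of the architecture alone.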
Linearized Models
From the previous section, according to the chain rule, we know that the gradient update on the output of an infinite-width network is as follows; for brevity, we omit the inputs in the following analysis:
To track the evolution of $\theta$ over time, we can take a first-order Taylor expansion of the network output around the parameters at initialization $\theta(0)$:
Such a formulation is commonly referred to as the linearized model, given that the output is linear in $\theta$ once $\theta(0)$ is fixed and the higher-order terms are dropped.
Eventually we get the same learning dynamics, which implies that a neural network with infinite width can be considerably simplified: it is governed by the above linearized model (Lee & Xiao, et al. 2019).
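Here is a small sketch of the linearized model in code, using `jax.jvp` to form the first-order Taylor expansion of the network output around the initial parameters $\theta(0)$; the tiny two-layer architecture and the perturbation size are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def f(theta, x):
    # A tiny two-layer network; the architecture is only for illustration.
    W1, W2 = theta
    return jnp.tanh(W1 @ x) @ W2

def f_lin(theta, theta0, x):
    # First-order Taylor expansion of f around theta0:
    # f_lin(x; theta) = f(x; theta0) + grad_theta f(x; theta0) . (theta - theta0)
    delta = jax.tree_util.tree_map(lambda a, b: a - b, theta, theta0)
    y0, jvp_out = jax.jvp(lambda t: f(t, x), (theta0,), (delta,))
    return y0 + jvp_out

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
theta0 = (jax.random.normal(k1, (128, 4)), jax.random.normal(k2, (128,)))
theta = jax.tree_util.tree_map(lambda a: a + 1e-3 * jnp.ones_like(a), theta0)
x = jnp.ones(4)

print(f(theta, x), f_lin(theta, theta0, x))  # close when theta stays near theta0
```

For a sufficiently wide network, training the true model and training this linearized model give nearly the same trajectory, which is the statement of Lee & Xiao, et al. (2019).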
In the simple case where the empirical loss is an MSE loss, $\mathcal{L}(\theta) = \frac{1}{2}\|f(\mathcal{X}; \theta) - \mathcal{Y}\|_2^2$, the gradient-flow dynamics of the outputs on the training inputs $\mathcal{X}$ become a linear ODE that can be solved in closed form.
When $t \to \infty$ and the kernel is positive definite, the outputs on the training data converge to the labels $\mathcal{Y}$ and the training loss is driven to zero.
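For reference, a standard way to write these dynamics (with the learning-rate and $1/N$ constants absorbed for brevity, and $\bar{K}$ denoting the constant infinite-width kernel evaluated on the training inputs $\mathcal{X}$) is the linear ODE and its closed-form solution below.

```latex
% Gradient flow on the MSE loss with a constant kernel \bar{K}:
\frac{d f(\mathcal{X}; \theta(t))}{dt}
  = -\bar{K}\,\big(f(\mathcal{X}; \theta(t)) - \mathcal{Y}\big)
\quad\Longrightarrow\quad
f(\mathcal{X}; \theta(t))
  = \mathcal{Y} + e^{-\bar{K} t}\big(f(\mathcal{X}; \theta(0)) - \mathcal{Y}\big)
```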
Lazy Training
People observe that when a neural network is heavily over-parameterized, the model is able to learn quickly, with the training loss converging to zero, while the network parameters hardly change; lazy training refers to this phenomenon. In other words, when the loss
Let
Still following the first-order Taylor expansion, we can track the change in the differential of
Let
Chizat et al. (2019) showed the proof for a two-layer neural network that
Citation
Cited as:
Weng, Lilian. (Sep 2022). Some math behind neural tangent kernel. Lil'Log. https://lilianweng.github.io/posts/2022-09-08-ntk/.
Or
@article{weng2022ntk,
title = "Some Math behind Neural Tangent Kernel",
author = "Weng, Lilian",
journal = "Lil'Log",
year = "2022",
month = "Sep",
url = "https://lilianweng.github.io/posts/2022-09-08-ntk/"
}
References
[1] Jacot et al. "Neural Tangent Kernel: Convergence and Generalization in Neural Networks." NeurIPS 2018.
[2] Radford M. Neal. "Priors for Infinite Networks." Bayesian Learning for Neural Networks. Springer, New York, NY, 1996. 29-53.
[3] Lee & Bahri et al. "Deep Neural Networks as Gaussian Processes." ICLR 2018.
[4] Chizat et al. "On Lazy Training in Differentiable Programming." NeurIPS 2019.
[5] Lee & Xiao, et al. "Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent." NeurIPS 2019.
[6] Arora, et al. "On Exact Computation with an Infinitely Wide Neural Net." NeurIPS 2019.
[7] (YouTube video) "Neural Tangent Kernel: Convergence and Generalization in Neural Networks" by Arthur Jacot, Nov 2018.
[8] (YouTube video) "Lecture 7 - Deep Learning Foundations: Neural Tangent Kernels" by Soheil Feizi, Sep 2020.
[9] "Understanding the Neural Tangent Kernel." Rajat's Blog.
[10] "Neural Tangent Kernel." Applied Probability Notes, Mar 2021.
[11] "Some Intuition on the Neural Tangent Kernel." inFERENCe, Nov 2020.