Gradients, broadcasting and backpropagation

Author: Krasin Georgiev

Published: November 23, 2022

Let’s have \(y = w \cdot x + b\). Application of the rules of differentiation is simple, e.g. 

\[\frac{\mathrm{d}y}{\mathrm{d}x} = w , \quad \frac{\mathrm{d}y}{\mathrm{d}w} = x, \quad \frac{\mathrm{d}y}{\mathrm{d}b} = 1\]

The change in \(y\) is proportional to the change in \(x\). The bigger \(w\) is, the bigger the change in \(y\) for the same change in \(x\).

Now let \(L = f(y) = f(y(x))\). Application of the chain rule is simple, e.g.

\[\frac{\mathrm{d}L}{\mathrm{d}x} = \frac{\mathrm{d}f(y)}{\mathrm{d}y} \cdot \frac{\mathrm{d}y}{\mathrm{d}x} = \frac{\mathrm{d}L}{\mathrm{d}y} \cdot w\] \[\frac{\mathrm{d}L}{\mathrm{d}w} = \frac{\mathrm{d}f(y)}{\mathrm{d}y} \cdot \frac{\mathrm{d}y}{\mathrm{d}w} = \frac{\mathrm{d}L}{\mathrm{d}y} \cdot x\] \[\frac{\mathrm{d}L}{\mathrm{d}b} = \frac{\mathrm{d}f(y)}{\mathrm{d}y} \cdot \frac{\mathrm{d}y}{\mathrm{d}b} = \frac{\mathrm{d}L}{\mathrm{d}y} \cdot 1\]
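
As a quick numerical illustration (a standalone sketch; the outer function \(f(y) = y^2\) is chosen purely as an example), the chain-rule result matches a finite-difference estimate:

# Scalar chain-rule check with f(y) = y**2 as an example outer function
w0, x0, b0 = 3.0, 2.0, 1.0
y0 = w0 * x0 + b0                 # y = 7
dL_dy = 2 * y0                    # dL/dy = 2*y for f(y) = y**2
dL_dx = dL_dy * w0                # chain rule: dL/dx = dL/dy * dy/dx = 42

# Finite-difference estimate for comparison
eps = 1e-6
L = lambda x: (w0 * x + b0) ** 2
dL_dx_numeric = (L(x0 + eps) - L(x0 - eps)) / (2 * eps)
dL_dx, dL_dx_numeric              # both approximately 42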

The multidimensional case is not so simple. Functions with multiple inputs and multiple outputs have multiple partial derivatives that need to be arranged and stored properly. Applying this to batches of data complicates the picture even more.

Derivative of a transformation

The derivative of a function (transformation \(\psi\)) with multiple inputs \(\mathbf{x} \in \mathbb{R}^M\) and multiple outputs \(\mathbf{y} \in \mathbb{R}^H\) is a matrix containing the partial derivatives of each output with respect to each input (the so-called Jacobian of the transformation, \(\frac{\partial{\mathbf{y}}}{\partial{\mathbf{x}}} \in \mathbb{R}^{H \times M}\)). For example, if \(M=4\) and \(H=2\) we can write:

\(\mathbf{y} = \psi(\mathbf{x})\),

\(\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4\end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}\)

\(\frac{\partial{\mathbf{y}}}{\partial{\mathbf{x}}} = \begin{bmatrix} \frac{\partial{y_1}}{\partial{x_1}} & \frac{\partial{y_1}}{\partial{x_2}} & \frac{\partial{y_1}}{\partial{x_3}} & \frac{\partial{y_1}}{\partial{x_4}}\\ \frac{\partial{y_2}}{\partial{x_1}} & \frac{\partial{y_2}}{\partial{x_2}} & \frac{\partial{y_2}}{\partial{x_3}} & \frac{\partial{y_2}}{\partial{x_4}} \end{bmatrix}\)
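
As a sanity check of the shape (a standalone sketch; psi below is just an arbitrary example transformation from \(\mathbb{R}^4\) to \(\mathbb{R}^2\)), torch.autograd.functional.jacobian returns exactly this \(H \times M\) matrix:

import torch

def psi(x):
    # Arbitrary example transformation R^4 -> R^2
    return torch.stack([x[0] + 2 * x[1], x[2] * x[3]])

x0 = torch.tensor([1., 2., 3., 4.])
J = torch.autograd.functional.jacobian(psi, x0)
J.shape  # torch.Size([2, 4]): H rows, M columns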

The derivative of \(\mathbf{y = W \cdot x}\) with respect to \(\mathbf{x}\) is \(\mathbf{W}\), i.e. \(\frac{\partial{\mathbf{y}}}{\partial{\mathbf{x}}} = \mathbf{W}\).

We should note that in neural networks the input and output features are arranged as row vectors.

The derivative of \(\mathbf{y = x \cdot W}\) with respect to \(\mathbf{x}\) is \(\mathbf{W^T}\), i.e. 

\[\frac{\partial{\mathbf{y}}}{\partial{\mathbf{x}}} = \mathbf{W}^T\]
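
A small sketch to confirm this with autograd: for a row vector \(x\), the Jacobian of y = x @ W with respect to x comes out as W transposed (the sizes below are arbitrary):

import torch

M, H = 4, 2
W = torch.randn(M, H)
x0 = torch.randn(M)

J = torch.autograd.functional.jacobian(lambda x: x @ W, x0)
torch.allclose(J, W.T)  # True: dy/dx has shape (H, M) and equals W^T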

The chain rule

Let’s have \(y = x \cdot W + b\) and \(L = f(y)\).

[Figure: simple NN]

[Figure: matrix view of y = x·W]

The chain rule involves propagating the gradient of the loss layer after layer, backward towards the inputs and the parameters of interest. In our demonstration case, in order to calculate the gradient of \(L\) with respect to the input \(x\), we need the gradient of \(L\) with respect to the output \(y\).

The shape of each gradient is the same as the shape of the corresponding variable (parameter).

A gradient is attached to each variable and parameter of the model, i.e.

\(y.g = \frac{\partial{L}}{\partial{y}}\)

\(x.g = \frac{\partial{L}}{\partial{x}} = \frac{\partial{L}}{\partial{y}} \cdot \frac{\partial{y}}{\partial{x}} = y.g \cdot \frac{\partial{y}}{\partial{x}} = y.g \cdot W^T\)

\(b.g = \frac{\partial{L}}{\partial{b}} = \frac{\partial{L}}{\partial{y}} \cdot \frac{\partial{y}}{\partial{b}} = y.g \cdot \frac{\partial{y}}{\partial{b}} = y.g\)

\(W.g = \frac{\partial{L}}{\partial{W}} = \left(\left(\frac{\partial{L}}{\partial{y}}\right)^T \cdot \frac{\partial{y}}{\partial{W}}\right)^T = (y.g^T \cdot x)^T = x^T \cdot y.g\)

[Figures: x.g calculation; W.g calculation]

Notes:

  • \(x\), \(y\) and \(b\) are row vectors;
  • \(W\) is a weight matrix with \(m\) rows and \(h\) columns;
  • \(x\) contains the \(m\) input features;
  • \(b\) is a bias with \(h\) elements;
  • \(y\) has \(h\) features (or nodes);
  • \(x\) and \(y\) represent input and output features (variables, nodes in the NN). Adding an extra dimension (multiple rows) represents multiple data samples: the inputs and outputs become matrices \(X\) and \(Y\) whose last dimension holds the features (\(x\) and \(y\) of the corresponding data point), as made concrete in the sketch below.
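
To make the shapes concrete, here is a standalone sketch with arbitrary sizes \(N=100\), \(m=4\), \(h=2\). Note that with a batch of samples the bias gradient sums \(y.g\) over the batch dimension:

import torch

N, m, h = 100, 4, 2          # samples, input features, output features
x = torch.randn(N, m)
W = torch.randn(m, h)
b = torch.randn(h)
y_g = torch.randn(N, h)      # stand-in for dL/dy

x_g = y_g @ W.T              # shape (N, m), same as x
W_g = x.T @ y_g              # shape (m, h), same as W
b_g = y_g.sum(0)             # shape (h,),  same as b
x_g.shape, W_g.shape, b_g.shape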

Implementation and testing with code

The above equations are implemented in the function lin_grad and tested in the code below:

def lin_grad(x, w, b, y):
    # Backward pass of the linear layer y = x @ w + b
    # y.g (dL/dy) should already be available!
    b.g = y.g.sum(dim=0)
    w.g = x.T @ y.g
    x.g = y.g @ w.T
import torch
from torch import tensor
from matplotlib import pyplot as plt
%matplotlib inline
# Generate test input and output data
N = 100 # Number of samples
M = 4 # Number of input features
H = 2 # Number of outputs
x = torch.rand((N, M))*10 - 5 # Input
k1, k2, k3, k4, k6, k7 = 1, 1.5, 2, 2.5, 3.0, -0.5
W_true = torch.tensor([[k1, k1],
                  [k2, k2],
                  [k3, k6],
                  [k4, k7]])
b_true = torch.tensor([0, -1])
y = x @ W_true + b_true # Output
W_true, y[:7]
(tensor([[ 1.0000,  1.0000],
         [ 1.5000,  1.5000],
         [ 2.0000,  3.0000],
         [ 2.5000, -0.5000]]),
 tensor([[-11.5291, -12.7022],
         [ -3.9763, -22.8349],
         [  2.2930,   7.2452],
         [ -4.9708,   6.3649],
         [-11.7199, -24.9706],
         [ 14.8387,   5.4946],
         [ -9.7971,  -0.4925]]))
plt.scatter(x[:,2], y[:,1]);
plt.xlabel('Input, dimension (feature) No3');
plt.ylabel('Output, dim. No2');

# Generate random weights
w = torch.randn(M,H)
b = torch.zeros(H)
w, b
(tensor([[-0.0383, -0.8173],
         [ 0.8458, -2.0662],
         [ 0.1153,  0.0775],
         [ 1.0845, -0.1016]]),
 tensor([0., 0.]))

In order to test the function lin_grad we need to calculate the gradient \(\partial{L}/\partial{y}\) and save it in y_pred.g (the gradient with respect to the prediction):

  • define the model and calculate the prediction
  • define and calculate the loss \(L\) as a simple mean squared error mse(y_pred, y_targ)
  • define and run mse_grad(y_pred, y_targ)
  • run lin_grad(x, w, b, y_pred)
def lin(x, w, b): return x @ w + b
y_pred = lin(x, w, b)
y_pred[:5]
tensor([[ 0.0472, -0.5698],
        [ 1.4009,  5.7816],
        [-3.4003,  1.4367],
        [-0.9559, -5.0894],
        [-0.9125,  9.5171]])
def mse(y_pred, y_targ): return (y_pred-y_targ).pow(2).mean()
loss = mse(y_pred, y)
loss
tensor(181.8108)
def mse_grad(y_pred, y_targ): y_pred.g = 2 * (y_pred - y_targ) / y_targ.shape[0] / y_targ.shape[1]
mse_grad(y_pred, y)

Finally, test that all dimensions in lin_grad match:

lin_grad(x, w, b, y_pred)

Next, test whether the loss improves:

w -= 0.1 * w.g
b -= 0.1 * b.g
y_pred = lin(x, w, b)
loss = mse(y_pred, y)
loss
tensor(9.2100)
for i in range(20):
    y_pred = lin(x, w, b)
    loss = mse(y_pred, y)
    mse_grad(y_pred, y)
    lin_grad(x, w, b, y_pred)
    w -= 0.01 * w.g
    b -= 0.01 * b.g
    print(loss, end=', ')
print('End!\n')
print('Weights and biases of the network and for the dataset:')
w, b, W_true, b_true
tensor(9.2100), tensor(7.3055), tensor(5.8198), tensor(4.6593), tensor(3.7515), tensor(3.0402), tensor(2.4818), tensor(2.0425), tensor(1.6961), tensor(1.4221), tensor(1.2048), tensor(1.0318), tensor(0.8936), tensor(0.7827), tensor(0.6933), tensor(0.6208), tensor(0.5616), tensor(0.5130), tensor(0.4729), tensor(0.4394), End!

Weights and biases of the network and for the dataset:
(tensor([[ 1.0095,  0.9741],
         [ 1.5336,  1.5728],
         [ 1.9631,  3.0287],
         [ 2.3814, -0.5464]]),
 tensor([ 0.0677, -0.2069]),
 tensor([[ 1.0000,  1.0000],
         [ 1.5000,  1.5000],
         [ 2.0000,  3.0000],
         [ 2.5000, -0.5000]]),
 tensor([ 0, -1]))

PyTorch autograd and backpropagation

We will use PyTorch's automatic gradient calculation to check our algorithms. This involves the built-in method .backward() and the attribute .grad.

In order to apply PyTorch backpropagation and autograd we need to define a forward function that relates the inputs with the loss:

def forward(x, y):
    y_pred = lin(x, w, b)
    loss =  mse(y_pred, y)
    return loss
loss = forward(x, y)
loss
tensor(0.4113)

But this is not enough:

    loss.backward()

–> RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

It looks like the inputs and model parameters need a slot for storing the gradient, which has to be requested explicitly (see requires_grad_ below).

# Update gradients as above - manual backward - for comparison

# forward
y_pred = lin(x, w, b)
loss =  mse(y_pred, y)

# backward
mse_grad(y_pred, y)
lin_grad(x, w, b, y_pred)

# Good to know: the parameters are not updated in the backward pass!
# Require gradient to keep them with the data

for element in [x, w, b]:
    element.requires_grad_(True)
loss = forward(x, y)
loss.backward()
loss
tensor(0.4113, grad_fn=<MeanBackward0>)
from fastcore.test import test_close
test_close(x.g, x.grad)
test_close(w.g, w.grad)
test_close(b.g, b.grad)

What is next?

The next step is to try to create a “proper” non-linear neural network.

The structure of a fully connected neural network with a single hidden layer could be represented as follows:

[Figure: NN with one hidden layer]

A few more gradients need to be defined and calculated.
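
For example, here is a sketch of the kind of extra pieces that will be needed, assuming a ReLU activation for the hidden layer (the function names are illustrative, not part of the code above):

def relu(x): return x.clamp_min(0.)

def relu_grad(inp, out):
    # dL/dinp is dL/dout where the input was positive, and 0 elsewhere
    inp.g = (inp > 0).float() * out.g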