class Lin():
    def __init__(self, w, b):
        self.w = w
        self.b = b
    def __call__(self, x):
        y = x @ self.w + self.b
        self.x = x
        self.y = y
        return self.y
    def backward(self):
        self.b.g = self.y.g.sum(dim=0)
        self.w.g = self.x.T @ self.y.g
        self.x.g = self.y.g @ self.w.T

How to build a NN model? When you drive a car you don't see all the car and engine parameters. Even when an engineer designs a car, he doesn't know the details of every component (mechanical, electrical, electro-mechanical or electronic). This is done to manage complexity. The approach is especially important for software design, where the number of components is much greater and the design flexibility is close to infinite.
The code was adapted from the fastai course 2022, part 2, while studying the notebooks 03_backprop.ipynb and 04_minibatch_training.ipynb.
Layers
Some of the most widely used layers in a NN are Linear, ReLU and Mse.
The weights of the layer are initialized when the object is created and are defined in the __init__ method. The relationship for calculating the output from the input is defined in the __call__ method (the "forward" pass). The functionality for calculating the gradients is provided in the backward method. For example:
class ReLU():
    def __call__(self, x):
        self.x = x
        self.y = self.relu(x)
        return self.y
    def relu(self, x): return x.clamp_min(0.)
    def backward(self): self.x.g = self.y.g * (self.x > 0)

class Mse():
    def __call__(self, pred, y):
        self.pred = pred
        self.targ = y
        return (pred-y).pow(2).mean()
    def backward(self):
        self.pred.g = 2 * (self.pred - self.targ) / self.targ.shape[0] / self.targ.shape[1]

The following code demonstrates the creation and use of a linear layer object. Look for:
Initialization:
lin = Lin(w, b)
Forward application:
y = lin(x)
Backpropagation:
lin.backward()
import torch
from fastcore.test import test_close

# Generate random weights for the model
M = 4 # Number of input features
H = 2 # Number of outputs
w = torch.randn(M,H)
b = torch.zeros(H)
# Create linear layer object
lin = Lin(w, b)
# Simulate inputs
N = 3 # Number of samples
x = torch.rand((N, M))*10 - 5 # just random numbers as example
# Calculate the output of the layer
y = lin(x)
# Show inputs and outputs
print('x = \n', x, '\ny = \n', y)

x =
tensor([[ 3.6329, -1.5028, 1.5645, -2.5170],
[ 4.4755, 3.9404, -3.6032, -0.0517],
[ 1.0234, -3.7332, 3.1473, 0.6829]])
y =
tensor([[ 0.6308, -2.3627],
[ 4.0016, -2.5262],
[-1.7378, 1.8884]])
# Provide output gradients (usually based on the Loss: y.g = dL/dy)
y.g = torch.rand(y.shape) # output gradients are needed in order for backpropagation to work
# Calculate (backpropagate) gradients
lin.backward()
# Show gradients
print('x.g = \n', x.g, '\nw.g = \n', w.g, '\nb.g = \n', b)

x.g =
tensor([[0.2319, 0.9382, 0.8166, 0.7303],
[0.2856, 0.8208, 0.5457, 0.4417],
[0.2376, 0.8698, 0.7110, 0.6232]])
w.g =
tensor([[ 8.1594, 5.1629],
[-0.9825, -1.6451],
[ 0.8388, 1.4778],
[-1.6999, -1.4052]])
b.g =
tensor([0., 0.])
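Before moving on, the hand-derived gradients can be checked against PyTorch's autograd. The snippet below is an optional sketch added for illustration (it is not in the original notebooks); w_chk, b_chk, x_chk and y_chk are throwaway names, and the check reuses w, b, x and y.g from the cells above.
# Sanity check: redo the forward pass on gradient-tracking copies and let autograd
# compute the same gradients that Lin.backward derived by hand
w_chk = w.clone().requires_grad_(True)
b_chk = b.clone().requires_grad_(True)
x_chk = x.clone().requires_grad_(True)
y_chk = x_chk @ w_chk + b_chk      # same computation as Lin.__call__
y_chk.backward(gradient=y.g)       # feed the same upstream gradient dL/dy
test_close(w_chk.grad, w.g)        # w.g = x.T @ y.g
test_close(b_chk.grad, b.g)        # b.g = y.g.sum(dim=0)
test_close(x_chk.grad, x.g)        # x.g = y.g @ w.T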
relu = ReLU()
y_relu = relu(x)
x, y_relu

(tensor([[ 3.6329, -1.5028, 1.5645, -2.5170],
[ 4.4755, 3.9404, -3.6032, -0.0517],
[ 1.0234, -3.7332, 3.1473, 0.6829]]),
tensor([[3.6329, 0.0000, 1.5645, 0.0000],
[4.4755, 3.9404, 0.0000, 0.0000],
[1.0234, 0.0000, 3.1473, 0.6829]]))
y_relu.g = torch.rand(y_relu.shape)
relu.backward()
x_relu_g = x.g
Models
A model is an arrangement of one or more layers. It contains all the weights and relationships that allow an input to be transformed into an output, and the loss and the derivative of the loss to be back-propagated to the model weights and inputs. Each layer can be considered a simple model, and each model has the same general methods as a layer.
A distinction between a model and a layer could be drawn by restricting the explicit output of the model to a scalar, the loss. But the loss is quite often calculated separately over the outputs of the model, in which case models are just more complex layers.
The simplest arrangement is a sequence of layers where the output of each layer (except the last one) is input to the next layer.
# A model with just Linear, ReLU and Loss layers.
# The output of the model is the loss.
# In addition, the output of the last layer (usually the loss is not counted as a layer)
# is saved as the model attribute self.y
class Model():
    def __init__(self, w, b):
        self.layers = [Lin(w, b), ReLU()]
        self.loss = Mse()
    def __call__(self, x, targ):
        for l in self.layers:
            x = l(x)
        self.y = x
        return self.loss(x, targ)
    def backward(self):
        self.loss.backward()
        for l in reversed(self.layers):
            l.backward()

model = Model(w, b)

# Simulate target values (needed to calculate the loss)
k1, k2, k3, k4, k6, k7 = 1, 1.5, 2, 2.5, 3.0, -0.5
W_true = torch.tensor([[k1, k1],
[k2, k2],
[k3, k6],
[k4, k7]])
b_true = torch.tensor([0, -1])
y_target = x @ W_true + b_true # Output

loss = model(x, y_target)
loss

tensor(10.5302)
out = model.y
out

tensor([[0.6308, 0.0000],
[4.0016, 0.0000],
[0.0000, 1.8884]])
model.backward()

# Show gradients
print('x.g = \n', x.g, '\nw.g = \n', w.g, '\nb.g = \n', b)

x.g =
tensor([[ 0.3201, 0.5563, 0.1117, -0.0025],
[ 0.1260, 0.2190, 0.0440, -0.0010],
[ 0.0929, -0.2466, -0.5286, -0.5591]])
w.g =
tensor([[ 4.3441, -0.5581],
[ 0.0389, 2.0357],
[ 0.1176, -1.7162],
[-2.0432, -0.3724]])
b.g =
tensor([0., 0.])
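With the forward and backward passes in place, the weights can already be trained by plain gradient descent. The loop below is an optional sketch (not from the original notebooks): w_t, b_t, model_t, lr and n_epochs are names and values picked here for illustration, and copies of w and b are trained so the tensors used in the rest of the post stay unchanged.
# Minimal gradient-descent loop driven by the manual forward/backward passes
w_t, b_t = w.clone(), b.clone()
model_t = Model(w_t, b_t)
lr, n_epochs = 0.01, 50
for epoch in range(n_epochs):
    l = model_t(x, y_target)   # forward pass: returns the MSE loss
    model_t.backward()         # backward pass: fills w_t.g and b_t.g
    w_t -= lr * w_t.g          # gradient-descent update of the weights
    b_t -= lr * b_t.g          # gradient-descent update of the bias
    if epoch % 10 == 0: print(epoch, l.item())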
The Module class and class inheritance
The above classes can be based on a more general class, so that more information can be hidden, e.g. saving parameter gradients, saving inputs and outputs, etc. Only the initialization and the functions needed for the forward and backward passes have to be redefined. All we have is modules and submodules.
class Module():
    def __call__(self, *x):
        self.x = x
        self.y = self.forward(*x)
        return self.y
    def backward(self):
        self.bwd(self.y, *self.x)
    def forward(self):
        raise Exception('Not implemented')
    def bwd(self):
        raise Exception('Not implemented')

class Lin(Module):
    def __init__(self, w, b):
        self.w = w
        self.b = b
    def forward(self, x):
        y = x @ self.w + self.b
        return y
    def bwd(self, y, x):
        self.b.g = y.g.sum(dim=0)
        self.w.g = x.T @ y.g
        x.g = y.g @ self.w.T

# Create linear layer object
lin = Lin(w, b)
# Calculate the output of the layer
y2 = lin(x)
# Test results
test_close(y, y2)

lin.y.g = torch.rand(lin.y.shape)
lin.backward()

test_close(x.g, lin.x[0].g)
test_close(w.g, lin.w.g)
test_close(b.g, lin.b.g)

class ReLU(Module):
    def forward(self, x): return x.clamp_min(0.)
    def bwd(self, y, x): x.g = y.g * (x > 0)

relu = ReLU()
test_close(relu(x), y_relu)

relu.y.g = y_relu.g
relu.backward()
test_close(relu.x[0].g, x_relu_g)

class Mse(Module):
    def forward(self, pred, y): return (pred-y).pow(2).mean()
    def bwd(self, out, pred, targ):
        pred.g = 2 * (pred - targ) / targ.shape[0] / targ.shape[1]

model = Model(w, b)
test_close(model(x, y_target), loss)

model.backward()

# Show gradients
print('x.g = \n', x.g, '\nw.g = \n', w.g, '\nb.g = \n', b)

x.g =
tensor([[ 0.3201, 0.5563, 0.1117, -0.0025],
[ 0.1260, 0.2190, 0.0440, -0.0010],
[ 0.0929, -0.2466, -0.5286, -0.5591]])
w.g =
tensor([[ 4.3441, -0.5581],
[ 0.0389, 2.0357],
[ 0.1176, -1.7162],
[-2.0432, -0.3724]])
b.g =
tensor([0., 0.])
PyTorch layers and Module class
The above Module will be replaced by the standard PyTorch class nn.Module. The autograd and backpropagation features of PyTorch remove the need for defining the bwd or backward methods. The Lin layer will be redefined to inherit from nn.Module. The ReLU() and Mse() layers will be replaced by nn.ReLU() and nn.MSELoss() respectively.
from torch import nn

class Lin(nn.Module):
    def __init__(self, w, b):
        super().__init__()
        self.w = w.requires_grad_(True)
        self.b = b.requires_grad_(True)
    def forward(self, x):
        y = x @ self.w + self.b
        return y

# Create linear layer object
lin = Lin(w, b)
# Calculate the output of the layer
y2 = lin(x)
# Test results
test_close(y, y2)

y2.grad = torch.rand(y.shape) # output gradients are needed in order for backpropagation to work
# Calculate (backpropagate) gradients
# y2.backward()
# RuntimeError: grad can be implicitly created only for scalar outputs
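The error above is PyTorch's way of saying that, for a non-scalar output, the upstream gradient dL/dy has to be passed to backward() explicitly. The cell below is a sketch added for illustration (not in the original notebooks); it reuses lin, y2, w and b from above and clears the .grad attributes afterwards so the following cells behave exactly as shown. In the rest of the post the model returns a scalar loss instead, so a plain loss.backward() works.
# Backpropagate through the non-scalar output by supplying dL/dy explicitly
y2.backward(gradient=torch.rand(y2.shape))
print(w.grad.shape, b.grad.shape)  # the leaf tensors w and b now hold gradients in .grad
w.grad = None                      # clear them again, since gradients accumulate
b.grad = None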
class Model(nn.Module):
    def __init__(self, w, b):
        super().__init__()
        self.layers = [Lin(w, b), nn.ReLU()]
        self.loss = nn.MSELoss()
    def forward(self, x, targ):
        for l in self.layers:
            x = l(x)
        self.y = x
        return self.loss(x, targ)

model = Model(w, b)
loss = model(x, y_target)
loss

tensor(10.5302, grad_fn=<MseLossBackward0>)
loss.backward()

test_close(w.grad, w.g)
test_close(b.grad, b.g)

x.grad
# None because `x.requires_grad_(True)` was never called

x.requires_grad_(True)

tensor([[ 3.6329, -1.5028, 1.5645, -2.5170],
        [ 4.4755, 3.9404, -3.6032, -0.0517],
        [ 1.0234, -3.7332, 3.1473, 0.6829]], requires_grad=True)
# gradients accumulate, so zero them before the next backward pass
w.grad.zero_()
b.grad.zero_()

tensor([0., 0.])
loss = model(x, y_target)
loss.backward()

x.grad

tensor([[ 0.3201, 0.5563, 0.1117, -0.0025],
[ 0.1260, 0.2190, 0.0440, -0.0010],
[ 0.0929, -0.2466, -0.5286, -0.5591]])
test_close(x.grad, x.g)
test_close(w.grad, w.g)
Some more details in PyTorch
# notice that the layers of the model are not properly registered (self.layers is a plain
# Python list), so nn.Module does not see them as children or parameters
model

Model(
(loss): MSELoss()
)
model.layers[1](x)

tensor([[3.6329, 0.0000, 1.5645, 0.0000],
[4.4755, 3.9404, 0.0000, 0.0000],
[1.0234, 0.0000, 3.1473, 0.6829]], grad_fn=<ReluBackward0>)
list(model.named_children())

[('loss', MSELoss())]
list(model.parameters())

[]
for p in model.parameters(): print(p.shape)

class Model(nn.Module):
    def __init__(self, w, b):
        super().__init__()
        layers = [Lin(w, b), nn.ReLU()]
        self.layers = nn.ModuleList(layers)
        self.loss = nn.MSELoss()
    def forward(self, x, targ):
        for l in self.layers:
            x = l(x)
        self.y = x
        return self.loss(x, targ)

model = Model(w, b)
list(model.named_children())

[('layers',
ModuleList(
(0): Lin()
(1): ReLU()
)),
('loss', MSELoss())]
list(model.parameters())

[]
model.layers[0].w

tensor([[ 0.3975, -0.1704],
[ 0.6908, 0.4523],
[ 0.1387, 0.9694],
[-0.0032, 1.0253]], requires_grad=True)
See 04_minibatch_training.ipynb to learn how to use parameters, set attributes and register modules (sections Using parameters and optim).
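As a quick preview (a sketch of one possible approach, not the notebook's code): wrapping the tensors in nn.Parameter inside __init__ is enough for nn.Module to register them. LinP is a hypothetical name used only here.
# Preview: weights wrapped in nn.Parameter are registered automatically
class LinP(nn.Module):
    def __init__(self, w, b):
        super().__init__()
        self.w = nn.Parameter(w.detach().clone())  # nn.Parameter is tracked by nn.Module
        self.b = nn.Parameter(b.detach().clone())
    def forward(self, x):
        return x @ self.w + self.b

linp = LinP(w, b)
[p.shape for p in linp.parameters()]  # -> [torch.Size([4, 2]), torch.Size([2])]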
Designing models and layers with PyTorch and nn.Sequential()
model = nn.Sequential(nn.Linear(M,H), nn.ReLU(), nn.Linear(H,10))
model

Sequential(
(0): Linear(in_features=4, out_features=2, bias=True)
(1): ReLU()
(2): Linear(in_features=2, out_features=10, bias=True)
)
layers = [nn.Linear(M,H), nn.ReLU(), nn.Linear(H,10)]
model = nn.Sequential(*layers)
model

Sequential(
(0): Linear(in_features=4, out_features=2, bias=True)
(1): ReLU()
(2): Linear(in_features=2, out_features=10, bias=True)
)
for p in model.parameters(): print(p.shape)

torch.Size([2, 4])
torch.Size([2])
torch.Size([10, 2])
torch.Size([10])
list(model.parameters())

[Parameter containing:
tensor([[-0.2547, 0.3678, -0.3120, -0.4402],
[-0.0581, 0.0541, -0.0433, 0.2772]], requires_grad=True),
Parameter containing:
tensor([ 0.2642, -0.2418], requires_grad=True),
Parameter containing:
tensor([[ 0.1416, 0.5259],
[ 0.1009, -0.4274],
[-0.2409, 0.4319],
[-0.2980, 0.1193],
[ 0.3526, 0.3844],
[-0.3655, -0.0480],
[ 0.2450, -0.4946],
[-0.3897, 0.2618],
[-0.1028, 0.5648],
[ 0.5452, -0.5314]], requires_grad=True),
Parameter containing:
tensor([ 0.3573, 0.3235, -0.6281, -0.4056, -0.6562, -0.1154, 0.0053, -0.3871,
0.4785, -0.0720], requires_grad=True)]
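To close the loop, here is a short forward/backward sketch with the nn.Sequential model defined above; targ is a made-up random target tensor, used only to demonstrate the API.
preds = model(x)              # x has shape (N, M) = (3, 4), so preds has shape (3, 10)
targ = torch.randn(N, 10)     # illustrative targets matching the output shape
out_loss = nn.MSELoss()(preds, targ)
out_loss.backward()           # fills p.grad for every parameter of the Sequential model
print([p.grad.shape for p in model.parameters()])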
# Example of a custom layer
# Global Average Pooling Layer (Adaptive Average Pooling Layer)
class GlobalAvgPooling(nn.Module):
    def forward(self, x):
        return x.mean((-2, -1))
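A quick usage sketch with an arbitrary example shape: the layer averages over the last two (spatial) dimensions, turning an NCHW activation map into one value per channel.
gap = GlobalAvgPooling()
acts = torch.randn(8, 16, 7, 7)  # batch of 8, 16 channels, 7x7 feature maps (example shape)
print(gap(acts).shape)           # torch.Size([8, 16])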