3.9. Implementation of Multilayer Perceptron from Scratch
Now that we have learned how multilayer perceptrons (MLPs) work in theory, let's implement them. First, we import the required packages and modules.
In [1]:
import sys
sys.path.insert(0, '..')
%matplotlib inline
import d2l
from mxnet import nd
from mxnet.gluon import loss as gloss
We continue to use the Fashion-MNIST dataset and will use the multilayer perceptron for image classification.
In [2]:
batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
3.9.1. Initialize Model Parameters
We know that the dataset contains 10 classes and that the images are of \(28 \times 28 = 784\) pixel resolution. Thus the number of inputs is 784 and the number of outputs is 10. Moreover, we use an MLP with one hidden layer and we set the number of hidden units to 256, but we could have picked some other value for this hyperparameter, too. Typically one uses powers of 2 since things align more nicely in memory.
In [3]:
num_inputs, num_outputs, num_hiddens = 784, 10, 256
W1 = nd.random.normal(scale=0.01, shape=(num_inputs, num_hiddens))
b1 = nd.zeros(num_hiddens)
W2 = nd.random.normal(scale=0.01, shape=(num_hiddens, num_outputs))
b2 = nd.zeros(num_outputs)
params = [W1, b1, W2, b2]
for param in params:
    param.attach_grad()
3.9.2. Activation Function
Here, we use the maximum function to implement the ReLU activation ourselves, instead of invoking the built-in ReLU operator directly.
In [4]:
def relu(X):
    return nd.maximum(X, 0)
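As a quick sanity check (an illustrative snippet with a made-up toy input, not part of the original notebook), negative entries are clamped to zero while nonnegative entries pass through unchanged:
X_toy = nd.array([[-2.0, -0.5, 0.0], [0.5, 1.0, 3.0]])  # toy input, purely illustrative
relu(X_toy)  # expected: [[0. 0. 0.] [0.5 1. 3.]]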
3.9.3. The Model
As in softmax regression, we use reshape to change each original image into a vector of length num_inputs. We then implement the MLP just as discussed previously.
In [5]:
def net(X):
    X = X.reshape((-1, num_inputs))
    H = relu(nd.dot(X, W1) + b1)
    return nd.dot(H, W2) + b2
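To verify that the shapes line up, we can push a dummy mini-batch through the network (a small sketch; the random input below is purely illustrative):
X_dummy = nd.random.normal(shape=(2, 1, 28, 28))  # a fake mini-batch of 2 "images", illustrative only
net(X_dummy).shape  # expected: (2, 10), i.e. 10 class scores per example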
3.9.4. The Loss Function
For better numerical stability, we use Gluon's built-in function that combines the softmax calculation and the cross-entropy loss calculation. We discussed the intricacies of this in the previous section. Using it here simply avoids a lot of detailed and error-prone code (the interested reader is welcome to look at the source code for more details, which is useful when implementing other related functions).
In [6]:
loss = gloss.SoftmaxCrossEntropyLoss()
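To see why the fused version matters (a minimal illustration, not the book's own code), note that exponentiating even moderately large logits already overflows single precision, so computing the softmax explicitly and then taking its logarithm can fail, whereas the combined loss works in log space:
logits = nd.array([100.0, 0.0])  # moderately large score, illustrative
nd.exp(logits)                   # overflows to [inf, 1] in float32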
3.9.5. Training
The steps for training the multilayer perceptron are no different from those for softmax regression. In the d2l package, we directly call the train_ch3 function, whose implementation was introduced earlier. We set the number of epochs to 10 and the learning rate to 0.5.
In [7]:
num_epochs, lr = 10, 0.5
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size,
              params, lr)
epoch 1, loss 0.7906, train acc 0.704, test acc 0.789
epoch 2, loss 0.4930, train acc 0.817, test acc 0.832
epoch 3, loss 0.4354, train acc 0.839, test acc 0.858
epoch 4, loss 0.3988, train acc 0.852, test acc 0.853
epoch 5, loss 0.3713, train acc 0.863, test acc 0.866
epoch 6, loss 0.3570, train acc 0.868, test acc 0.876
epoch 7, loss 0.3377, train acc 0.876, test acc 0.879
epoch 8, loss 0.3271, train acc 0.879, test acc 0.880
epoch 9, loss 0.3174, train acc 0.882, test acc 0.878
epoch 10, loss 0.3073, train acc 0.886, test acc 0.876
To see how well we did, let's apply the model to some test data. If you're interested, compare the result to that of the corresponding linear model.
In [8]:
for X, y in test_iter:
    break
true_labels = d2l.get_fashion_mnist_labels(y.asnumpy())
pred_labels = d2l.get_fashion_mnist_labels(net(X).argmax(axis=1).asnumpy())
titles = [truelabel + '\n' + predlabel
          for truelabel, predlabel in zip(true_labels, pred_labels)]
d2l.show_fashion_mnist(X[0:9], titles[0:9])
This looks slightly better than before, a clear sign that we’re on to something good here.
3.9.6. Summary
We saw that implementing a simple MLP is quite easy when done manually. That said, for a large number of layers this can get quite cumbersome (e.g., naming and keeping track of the model parameters).
3.9.7. Problems
- Change the value of the hyperparameter num_hiddens and see how it affects the results.
- Try adding a new hidden layer to see how it affects the results.
- How does changing the learning rate change the results?
- What is the best result you can get by optimizing over all the parameters (learning rate, number of iterations, number of hidden layers, number of hidden units per layer)?