
Error computation in a multi-layer perceptron

Problem Detail: 

I was reading about the Multi-Layer Perceptron (MLP) and how we can learn patterns with it. The algorithm was stated as:

  • Initialize all weights to small values.
  • Compute the activation of each neuron using the sigmoid function.
  • Compute the error at the output layer using $\delta_{ok} = (t_{k} - y_{k})y_{k}(1-y_{k})$.
  • Compute the error in the hidden layer(s) using $\delta_{hj} = a_{j}(1 - a_{j})\sum_{k}w_{jk}\delta_{ok}$.
  • Update the output-layer weights using $w_{jk} := w_{jk} + \eta\delta_{ok}a_{j}^{hidden}$
  • and the hidden-layer weights using $v_{ij} := v_{ij} + \eta\delta_{hj}x_{i}$.

where $a_{j}$ is the activation of hidden neuron $j$, $t_{k}$ is the target output, $y_{k}$ is the actual output, and $w_{jk}$ is the weight of the connection between neurons $j$ and $k$.
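The steps above can be sketched in NumPy for a single training example (a minimal illustration with one hidden layer, sigmoid activations, and made-up layer sizes; all variable names and the toy input/target are my own, not from the question):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative shapes: 2 inputs, 3 hidden neurons, 1 output.
rng = np.random.default_rng(0)
v = rng.uniform(-0.1, 0.1, size=(2, 3))   # input -> hidden weights v_ij
w = rng.uniform(-0.1, 0.1, size=(3, 1))   # hidden -> output weights w_jk
eta = 1.0                                 # learning rate

def train_step(x, t):
    global v, w
    # Forward pass: activations via the sigmoid.
    a = sigmoid(x @ v)                    # hidden activations a_j
    y = sigmoid(a @ w)                    # outputs y_k
    # Output-layer error: delta_ok = (t_k - y_k) y_k (1 - y_k)
    delta_o = (t - y) * y * (1 - y)
    # Hidden-layer error: delta_hj = a_j (1 - a_j) * sum_k w_jk delta_ok
    delta_h = a * (1 - a) * (delta_o @ w.T)
    # Weight updates exactly as in the listed algorithm.
    w += eta * np.outer(a, delta_o)
    v += eta * np.outer(x, delta_h)
    return y

# Toy example: repeatedly fit a single input/target pair.
x = np.array([0.0, 1.0])
t = np.array([1.0])
for _ in range(5000):
    y = train_step(x, t)
```

After enough repetitions on this single pair, the output approaches the target; of course, a real training loop would iterate over a whole dataset.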

My question is: how do we get that $\delta_{ok}$? And where does $\delta_{hj}$ come from? How do we know these are the errors? Where does the chain rule from differential calculus play a role here?

Asked By : Sigma
Answered By : Martin Thoma

How do we get that $\delta_{ok}$?

You calculate the gradient of the network. Have a look at "Tom Mitchell: Machine Learning" if you want to see it in detail. In short, your weight update rule is

$$w \gets w + \Delta w$$ with the $j$-th component of the weight vector update being $$\Delta w^{(j)} = - \eta \frac{\partial E}{\partial w^{(j)}}$$ where $\eta \in \mathbb{R}_+$ is the learning rate and $E$ is the error of your network.
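As a toy illustration of this update rule on a single weight (an example of my own, with the made-up error function $E(w) = (w - 2)^2$, whose gradient is $2(w - 2)$):

```python
# Gradient descent: w <- w + Δw with Δw = -η ∂E/∂w.
eta = 0.1
w = 0.0
for _ in range(100):
    grad = 2 * (w - 2)   # ∂E/∂w for E(w) = (w - 2)^2
    w += -eta * grad     # Δw = -η ∂E/∂w
# w converges toward the minimizer w = 2.
```

Each step moves $w$ against the gradient, so $E$ decreases; the same idea is applied to every weight of the network at once.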

$\delta_{ok}$ is just $\frac{\partial E}{\partial w^{(o,k)}}$. So I guess $o$ is an output neuron and $k$ a neuron of the last hidden layer.

Where does the chain rule play a role?

The chain rule is applied to compute the gradient. This is where the name "backpropagation" comes from: you first calculate the gradient of the last layer's weights, then of the weights of the layer before, and so on. This is done by applying the chain rule to the error of the network.
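Concretely, assuming the common squared-error loss (my assumption; the question does not name a loss) $E = \frac{1}{2}\sum_k (t_k - y_k)^2$ with $y_k = \sigma(z_k)$ and $z_k = \sum_j w_{jk} a_j$, the chain rule gives

$$\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial y_k} \cdot \frac{\partial y_k}{\partial z_k} \cdot \frac{\partial z_k}{\partial w_{jk}} = -(t_k - y_k) \cdot y_k(1 - y_k) \cdot a_j = -\delta_{ok}\, a_j$$

so $\Delta w_{jk} = -\eta \frac{\partial E}{\partial w_{jk}} = \eta\, \delta_{ok}\, a_j$, which is exactly the output-layer update rule stated in the question. The hidden-layer delta $\delta_{hj}$ comes from applying the chain rule one layer further back, summing over all output neurons $k$ that $a_j$ feeds into.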


The initial weights must not only be small, they also have to differ across the weights of one layer. Typically one chooses (pseudo)random weights.

See: Glorot, Bengio: Understanding the difficulty of training deep feedforward neural networks
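A sketch of such an initialization, using the uniform range proposed in that paper (the function name and layer sizes here are my own):

```python
import numpy as np

# Glorot/Xavier-style uniform initialization: draw each weight from
# U(-r, r) with r = sqrt(6 / (fan_in + fan_out)), so every weight in
# the layer starts at a different (pseudo)random small value.
def glorot_uniform(fan_in, fan_out, rng=None):
    rng = rng or np.random.default_rng()
    r = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-r, r, size=(fan_in, fan_out))

W = glorot_uniform(4, 3)   # e.g. a layer with 4 inputs and 3 outputs
```

If all weights of a layer started equal, every neuron in it would receive identical gradients and stay identical forever; random initialization breaks this symmetry.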

Best Answer from StackOverflow



