Backpropagation is used in training a [[neural network]]. Using the [[chain rule]], the gradient of the loss is propagated backward through the network, and [[gradient descent]] uses those gradients to update the weights at each layer. A [[computation graph]] is often used to organize the calculations.

## simple backpropagation example

Consider a simple neural net arranged as a linear chain, with two hidden layers of one neuron each.

$$
\begin{array}{cccc}
 & \bigcirc & \bigcirc & \bigcirc \\
 & \vert & \vert & \vert \\
\bigcirc & \bigcirc & \bigcirc & \bigcirc \\
x & & & y \\
\end{array}
$$

Note that this network has three weights $W^1$, $W^2$, $W^3$ and three biases $b^1$, $b^2$, and $b^3$. Suppose that each hidden and output neuron is equipped with a sigmoid activation function given by $\sigma(z^L) = \frac{1}{1 + e^{-z^L}}$.

The loss function is given by

$$
\ell(y, a^4) = \frac{1}{2}(y - a^4)^2
$$

where $a^4$ is the value of the activation at the output neuron and $y \in \{0,1\}$ is the true label associated with the training example.

Recall that the activity $z^L$ is simply $z^L = W^{L-1} * a^{L-1} + b^{L-1}$, where $a^1 = x$.

### feed forward

Let's find the pre-activation values $z^L$ (activities) and the post-activation values $a^L$ (activations) of the two hidden neurons and the output neuron, starting with the second layer $L=2$ and one training example $(x, y) = (0.5, 1)$. Suppose each of the weights is initialized to $W^k = 1.0$ and each bias is initialized to $b^k = -0.5$.

---

$$
\begin{array}{cccc}
 & \bigcirc & \bigcirc & \bigcirc \\
 & \vert & \vert & \vert \\
\bigcirc & \circledcirc & \bigcirc & \bigcirc \\
x & & & y \\
\end{array}
$$

The activity $z^2$ of the second layer (the first hidden neuron) with $x = 0.5$ is given by

$$
z^2 = W^1 * x + b^1 = 1 * (0.5) + (-0.5) = 0
$$

Given a sigmoid activation function $\sigma$, the activation $a^2$ is

$$
a^2 = \sigma(0) = \frac{1}{1 + e^{-0}} = 0.5
$$

---

The activity and activation at the next node are then

$$
\begin{array}{cccc}
 & \bigcirc & \bigcirc & \bigcirc \\
 & \vert & \vert & \vert \\
\bigcirc & \bigcirc & \circledcirc & \bigcirc \\
x & & & y \\
\end{array}
$$

$$
a^3 = \sigma(W^2 * a^2 + b^2) = \sigma(1 * 0.5 + (-0.5)) = \sigma(0) = 0.5
$$

---

The pattern repeats for $a^4$, giving the prediction $\hat y = 0.5$.

$$
\begin{array}{cccc}
 & \bigcirc & \bigcirc & \bigcirc \\
 & \vert & \vert & \vert \\
\bigcirc & \bigcirc & \bigcirc & \circledcirc \\
x & & & y \\
\end{array}
$$

However, we know from the training data that $y$ should be $1$.

### backpropagation

Let's use backpropagation to update the weights and biases. From above, we have $(x, y) = (0.5, 1)$, $W^k = 1.0$, $b^k = -0.5$, activities $z^2 = z^3 = z^4 = 0$, and activations $a^2 = a^3 = a^4 = 0.5$.
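Before computing the gradients by hand, we can verify these forward-pass numbers in code. Here is a minimal Python sketch (the helper and variable names are my own, not part of the derivation above):

```python
import math

def sigmoid(z):
    """Sigmoid activation: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# One training example and the initial parameters from above.
x, y = 0.5, 1.0
W = [1.0, 1.0, 1.0]      # W^1, W^2, W^3
b = [-0.5, -0.5, -0.5]   # b^1, b^2, b^3

# Feed forward through the chain of single-neuron layers.
a = x                    # a^1 is just the input x
activities, activations = [], []
for Wk, bk in zip(W, b):
    z = Wk * a + bk      # activity z^L
    a = sigmoid(z)       # activation a^L
    activities.append(z)
    activations.append(a)

print(activities)   # [0.0, 0.0, 0.0]  -> z^2, z^3, z^4
print(activations)  # [0.5, 0.5, 0.5]  -> a^2, a^3, a^4
```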
### calculating the weight updates

Starting from the right side, we can find the change in the loss function $\ell$ with respect to a change in $W^3$.

$$
\frac{\partial \ell}{\partial W^3} = \frac{\partial \ell}{\partial a^4} * \frac{\partial a^4}{\partial z^4} * \frac{\partial z^4}{\partial W^3}
$$

Breaking this down by component, we have

$\frac{\partial \ell}{\partial a^4} = a^4 - y$

$\frac{\partial a^4}{\partial z^4} = \sigma'(z^4) = \sigma(z^4)(1 - \sigma(z^4))$

$\frac{\partial z^4}{\partial W^3} = a^3$

Plugging in the values, we calculate

$$
\begin{align}
\frac{\partial \ell}{\partial W^3} &= (a^4 - y) * \sigma'(z^4) * a^3 \\
&= (0.5 - 1) * \sigma(0) * (1 - \sigma(0)) * 0.5 \\
&= -0.5 * 0.5 * 0.5 * 0.5 \\
&= -0.0625
\end{align}
$$

Using the chain rule to continue through the network for $W^2$, we have

$$
\frac{\partial \ell}{\partial W^2} = \frac{\partial \ell}{\partial a^4} * \frac{\partial a^4}{\partial z^4} * \frac{\partial z^4}{\partial a^3} * \frac{\partial a^3}{\partial z^3} * \frac{\partial z^3}{\partial W^2}
$$

We have already calculated some of these terms, but we still need to find

$\frac{\partial z^4}{\partial a^3} = W^3$

$\frac{\partial a^3}{\partial z^3} = \sigma'(z^3) = \sigma(z^3)(1 - \sigma(z^3))$

$\frac{\partial z^3}{\partial W^2} = a^2$

Plugging those values in, we get

$$
\begin{align}
\frac{\partial \ell}{\partial W^2} &= (a^4 - y) * \sigma'(z^4) * W^3 * \sigma'(z^3) * a^2 \\
&= -0.5 * 0.25 * 1 * 0.25 * 0.5 \\
&\approx -0.016
\end{align}
$$

The pattern continues such that

$$
\begin{align}
\frac{\partial \ell}{\partial W^1} &= (a^4 - y) * \sigma'(z^4) * W^3 * \sigma'(z^3) * W^2 * \sigma'(z^2) * x \\
&\approx -0.004
\end{align}
$$

### updating the weights

Collecting the gradients $\left[\frac{\partial \ell}{\partial W^1}, \frac{\partial \ell}{\partial W^2}, \frac{\partial \ell}{\partial W^3}\right] \approx [-0.004, -0.016, -0.0625]$, we adjust each weight in the direction opposite its gradient using the gradient descent rule $W^k := W^k - \alpha \frac{\partial \ell}{\partial W^k}$, where $\alpha$ is the learning rate. With $\alpha = 0.1$, we can update the weights as follows

$$
\begin{align}
W^1 &= 1.0 - 0.1 * (-0.004) \approx 1.0004 \\
W^2 &= 1.0 - 0.1 * (-0.016) \approx 1.0016 \\
W^3 &= 1.0 - 0.1 * (-0.0625) = 1.00625
\end{align}
$$

Each weight increases slightly, which pushes the prediction toward the target $y = 1$.

### calculating the bias updates

The bias updates are exactly like the weight updates, except there is no need to scale by the incoming activation, since the partial derivative of $W * a + b$ with respect to $b$ is just $1$. In other words, each $\frac{\partial \ell}{\partial b^k}$ is the same chain of terms as $\frac{\partial \ell}{\partial W^k}$ with the final activation factor dropped. Using the values we've already calculated, we get

$$
\begin{align}
\frac{\partial \ell}{\partial b^3} &= (a^4 - y) \cdot \sigma'(z^4) = -0.5 \cdot 0.25 = -0.125 \\
\frac{\partial \ell}{\partial b^2} &= -0.5 \cdot 0.25 \cdot 1 \cdot 0.25 = -0.03125 \\
\frac{\partial \ell}{\partial b^1} &= -0.5 \cdot 0.25 \cdot 1 \cdot 0.25 \cdot 1 \cdot 0.25 = -0.0078125
\end{align}
$$

### applying the bias updates

We update the biases using the same gradient descent rule:

$$
b^k \gets b^k - \alpha \cdot \frac{\partial \ell}{\partial b^k}
$$

With learning rate $\alpha = 0.1$, we get:

$$
\begin{align}
b^1 &= -0.5 - 0.1 \cdot (-0.0078125) = -0.49921875 \\
b^2 &= -0.5 - 0.1 \cdot (-0.03125) = -0.496875 \\
b^3 &= -0.5 - 0.1 \cdot (-0.125) = -0.4875
\end{align}
$$
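To double-check the hand calculations, here is a minimal Python sketch of one full backward pass and parameter update for this chain of single-neuron layers (the structure and names are my own illustration, not a general-purpose implementation):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y = 0.5, 1.0
W = [1.0, 1.0, 1.0]      # W^1, W^2, W^3
b = [-0.5, -0.5, -0.5]   # b^1, b^2, b^3
alpha = 0.1              # learning rate

# Forward pass, keeping every activation: a[0] = a^1 = x, then a^2, a^3, a^4.
a = [x]
for Wk, bk in zip(W, b):
    a.append(sigmoid(Wk * a[-1] + bk))

# Backward pass: delta holds d(loss)/d(activity) for the current layer.
delta = (a[-1] - y) * a[-1] * (1 - a[-1])   # (a^4 - y) * sigma'(z^4)
dW, db = [0.0] * 3, [0.0] * 3
for k in reversed(range(3)):                # weights W^3, W^2, W^1
    dW[k] = delta * a[k]                    # scale by the incoming activation
    db[k] = delta                           # bias gradient needs no scaling
    if k > 0:
        delta *= W[k] * a[k] * (1 - a[k])   # propagate through the weight and the previous sigmoid

# Gradient descent update.
W = [Wk - alpha * g for Wk, g in zip(W, dW)]
b = [bk - alpha * g for bk, g in zip(b, db)]

print(dW)  # [-0.00390625, -0.015625, -0.0625]
print(db)  # [-0.0078125, -0.03125, -0.125]
print(W)   # roughly [1.0004, 1.0016, 1.00625]
print(b)   # roughly [-0.4992, -0.4969, -0.4875]
```

The printed gradients and updated parameters match the values derived by hand above.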
### next steps

In this toy example, we might run the training example through another **epoch** to update the weights again and help the model "memorize" this one data point. Depending on the learning rate, it might take 50 - 100 epochs for the weights to stabilize. In real life, we'll have many training examples, so we would feed all of those in and update the weights using the average gradient across all examples. Again, we would repeat multiple epochs to improve the model. (A minimal sketch of such an epoch loop appears at the end of this note.)

We would likely also increase the complexity of the model's architecture, adding more layers and more artificial neurons per layer. We could also experiment with different activation functions and with different loss functions, such as cross-entropy loss.

> [!Tip]- Additional Resources
> - [Artem Kirsanov | The Most Important Algorithm in Machine Learning (YT)](https://youtu.be/SmZmBKc7Lrs?si=uWZOq_X37JoTkEGM)
> - [3Blue1Brown | ]()
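As a rough illustration of the epoch loop described in the next steps, here is a small self-contained sketch that repeats the same forward/backward/update step on the single training example; the helper function, epoch count, and print cadence are illustrative choices, not prescriptions from the sources above:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(W, b, x, y, alpha):
    """One forward pass, backward pass, and in-place gradient descent update."""
    a = [x]
    for Wk, bk in zip(W, b):
        a.append(sigmoid(Wk * a[-1] + bk))
    delta = (a[-1] - y) * a[-1] * (1 - a[-1])
    for k in reversed(range(len(W))):
        grad_W, grad_b = delta * a[k], delta
        if k > 0:
            delta *= W[k] * a[k] * (1 - a[k])   # propagate before touching W[k]
        W[k] -= alpha * grad_W
        b[k] -= alpha * grad_b
    return a[-1]                                # prediction before this update

W, b = [1.0, 1.0, 1.0], [-0.5, -0.5, -0.5]
x, y, alpha = 0.5, 1.0, 0.1

for epoch in range(1, 101):
    y_hat = train_step(W, b, x, y, alpha)
    if epoch % 25 == 0:
        print(epoch, round(y_hat, 4))   # the prediction creeps from 0.5 toward y = 1
```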