Lesson 36 • Physics 356

Backpropagation & Training Neural Networks

⏱ ~50 min read

Learning Objectives

Notation used in this lesson

\(\delta^{[\ell]}\): Error signal at layer \(\ell\), the gradient of \(L\) w.r.t. the pre-activation \(z^{[\ell]}\)
\(\frac{\partial L}{\partial \mathbf{W}^{[\ell]}}\): Gradient of the loss w.r.t. the weight matrix at layer \(\ell\)
\(\phi'\): Derivative of the activation function
\(\odot\): Element-wise (Hadamard) product
\(\alpha\): Learning rate (controls the step size of each parameter update)
\(\mathbf{v}\): Velocity vector in momentum-based optimization
\(B\): Mini-batch size (number of training samples per gradient update)
epoch: One complete pass through the entire training dataset

🌍 Background & Motivation

Closing the loop

In Lesson 30 we introduced gradient descent: given the gradient of a loss function, we update parameters by stepping opposite the gradient. In Lesson 35 we built neural networks with multiple layers, each with its own weights \(\mathbf{W}^{[\ell]}\) and biases \(\mathbf{b}^{[\ell]}\). When we initialize a network, we often assign random weights and biases to each layer, which produce a poor output according to our loss function. The missing piece that we will address in this lesson is: how do we efficiently compute the gradient of the loss function with respect to every weight and bias in every layer, and how do we update those parameters to obtain better values that produce a lower loss?

Backpropagation is the answer. It is an efficient application of the chain rule from calculus, applied layer-by-layer from the output back to the input. Backpropagation is not a learning algorithm itself — it is a gradient computation algorithm. Gradient descent then uses those gradients to update parameters.

The full training loop for a neural network: (1) Forward pass → (2) Compute loss → (3) Backward pass (backpropagation) → (4) Gradient descent update → repeat until convergence.
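The four-step loop can be sketched on the simplest possible model, a single weight \(w\) fitting \(\hat{y} = wx\) with squared-error loss. This is an illustrative Python/NumPy sketch (the lesson's own examples use MATLAB); all names here (`w`, `x`, `y`, `alpha`) are made up for the demonstration.

```python
import numpy as np

# The four-step training loop on a one-parameter model: y_hat = w * x
# with squared-error loss L = 0.5 * (y - y_hat)^2.
x, y = 2.0, 6.0          # one training sample and its target
w, alpha = 0.5, 0.05     # initial weight, learning rate

losses = []
for step in range(20):
    y_hat = w * x                      # (1) forward pass
    loss = 0.5 * (y - y_hat) ** 2      # (2) compute loss
    grad = -(y - y_hat) * x            # (3) backward pass: dL/dw via chain rule
    w = w - alpha * grad               # (4) gradient descent update
    losses.append(loss)
```

Each iteration multiplies the remaining error by a constant factor less than one, so the loss shrinks toward zero and \(w\) approaches the exact solution \(w = y/x = 3\).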

Historical note & the AI boom

The backpropagation algorithm was popularized by Rumelhart, Hinton, and Williams in their landmark 1986 Nature paper, which demonstrated that multi-layer networks could be trained by propagating error signals backward through the network. Before this, there was no practical way to compute gradients for hidden layers efficiently, so deep networks could not be trained in practice.

Despite the 1986 breakthrough, true deep learning remained dormant for two decades due to computational and data limitations. The AI boom of the 2010s was ignited by three converging factors: massively parallel GPU hardware that made large-scale training feasible, internet-scale labeled datasets such as ImageNet, and practical algorithmic refinements such as ReLU activations and better weight initialization.

Key insight: Backpropagation did not change between 1986 and 2012. What changed was the hardware and data that unlocked its full potential.

🧠 The Chain Rule & Computational Graphs

Chain rule review

If our loss function \(L\) depends on an intermediate quantity \(z\), which in turn depends on a weight \(w\), the chain rule gives:

\[ \frac{dL}{dw} = \frac{dL}{dz} \cdot \frac{dz}{dw} \]

For a chain of functions \(L = f_n(f_{n-1}(\cdots f_1(w) \cdots))\):

\[ \frac{dL}{dw} = \frac{\partial f_n}{\partial f_{n-1}} \cdot \frac{\partial f_{n-1}}{\partial f_{n-2}} \cdots \frac{\partial f_1}{\partial w} \]

Backpropagation applies this rule systematically, layer by layer from output to input, accumulating gradient contributions at each step.
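The chain-of-functions formula is easy to verify numerically. The sketch below (illustrative Python; the functions `f1`, `f2`, `f3` are arbitrary choices, not from the lesson) multiplies the local derivatives and compares the product against a central finite difference.

```python
import numpy as np

# L = f3(f2(f1(w))): the chain rule says dL/dw is the product of the
# local derivatives, evaluated at the intermediate values.
f1, df1 = lambda w: w**2,      lambda w: 2*w
f2, df2 = lambda u: np.sin(u), lambda u: np.cos(u)
f3, df3 = lambda v: 3*v + 1,   lambda v: 3.0

def L(w):
    return f3(f2(f1(w)))

w = 0.7
# dL/dw = f3'(f2(f1(w))) * f2'(f1(w)) * f1'(w)
analytic = df3(f2(f1(w))) * df2(f1(w)) * df1(w)

eps = 1e-6
numeric = (L(w + eps) - L(w - eps)) / (2 * eps)   # central difference
```

Gradient checks like this one are the standard way to debug a hand-written backward pass.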

Computational graphs

A computational graph represents a mathematical expression as a directed graph: each node is an operation, each edge carries a value forward and a gradient backward.

Consider the simplest possible network: a single neuron with two inputs, no bias, ReLU activation, and L2 (MSE) loss:

x₁ · w₁ → a₁ ╲
        [Σ] → z → [ReLU] → ŷ → [½(y−ŷ)²] → L
x₂ · w₂ → a₂ ╱

The forward pass evaluates the network left-to-right for a given input, computing and storing all intermediate values, ending with the loss. The backward pass then propagates the derivative of the loss with respect to each node, \(\frac{\partial L}{\partial \text{node}}\), right-to-left using the chain rule. This is the backpropagation algorithm: it yields the gradient of the loss with respect to every parameter in the network. In other words, it tells us how much each parameter contributed to the loss, which is exactly what gradient descent needs to update the parameters.

Simple numerical walkthrough

Let's trace both passes through the single-neuron graph with concrete numbers.

Given:

\[ x_1 = 2,\quad x_2 = 1,\quad w_1 = 0.3,\quad w_2 = 0.1,\quad y = 2\;(\text{target}) \]

Forward pass:

\[ a_1 = w_1 x_1 = 0.6,\qquad a_2 = w_2 x_2 = 0.1 \] \[ z = a_1 + a_2 = 0.7,\qquad \hat{y} = \text{ReLU}(0.7) = 0.7 \] \[ L = \tfrac{1}{2}(y - \hat{y})^2 = \tfrac{1}{2}(2 - 0.7)^2 = \tfrac{1}{2}(1.69) = 0.845 \]

Backward pass — apply chain rule right-to-left:

\[ \frac{\partial L}{\partial \hat{y}} = -(y - \hat{y}) = -(2 - 0.7) = -1.3 \] \[ \frac{\partial \hat{y}}{\partial z} = 1\quad\text{(ReLU derivative, since } z=0.7 > 0\text{)} \] \[ \frac{\partial L}{\partial z} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} = -1.3 \times 1 = -1.3 \] \[ \frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial z}\cdot\frac{\partial z}{\partial w_1} = -1.3\times x_1 = -1.3\times 2 = -2.6 \] \[ \frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial z}\cdot\frac{\partial z}{\partial w_2} = -1.3\times x_2 = -1.3\times 1 = -1.3 \]

Gradient descent update with \(\alpha = 0.1\):

\[ w_1^{\text{new}} = 0.3 - 0.1(-2.6) = 0.56,\qquad w_2^{\text{new}} = 0.1 - 0.1(-1.3) = 0.23 \]

Verify loss decreased:

\[ z^{\text{new}} = 0.56(2) + 0.23(1) = 1.35,\quad \hat{y}^{\text{new}} = 1.35,\quad L^{\text{new}} = \tfrac{1}{2}(2-1.35)^2 = 0.211 \]
It works! One gradient descent step reduced the loss from \(0.845\) to \(0.211\). The negative gradients indicated we should increase both weights to bring \(\hat{y}\) closer to \(y=2\), which is exactly what happened.
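The whole walkthrough fits in a few lines of code. This is an illustrative Python sketch (the lesson's tooling is MATLAB); the same numbers fall out of the same chain-rule steps.

```python
# Single neuron, two inputs, no bias, ReLU activation, L2 loss.
x1, x2, w1, w2, y = 2.0, 1.0, 0.3, 0.1, 2.0
alpha = 0.1

# Forward pass
z = w1 * x1 + w2 * x2             # 0.7
y_hat = max(z, 0.0)               # ReLU
L = 0.5 * (y - y_hat) ** 2        # 0.845

# Backward pass: chain rule, right to left
dL_dyhat = -(y - y_hat)           # -1.3
dyhat_dz = 1.0 if z > 0 else 0.0  # ReLU derivative (gate)
dL_dz = dL_dyhat * dyhat_dz       # -1.3
dL_dw1 = dL_dz * x1               # -2.6
dL_dw2 = dL_dz * x2               # -1.3

# Gradient descent step
w1 -= alpha * dL_dw1              # 0.56
w2 -= alpha * dL_dw2              # 0.23

# Verify the loss decreased
z_new = w1 * x1 + w2 * x2
L_new = 0.5 * (y - max(z_new, 0.0)) ** 2
```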

🔀 Backpropagation in a Multi-Layer Network

The single-neuron example above traced the chain rule through 4 operations. A multi-layer network is the same idea — just more links in the chain. The key advantage of backpropagation is that once you compute each error signal \(\delta^{[\ell]}\), you can reuse it to compute both the weight gradient and the error signal for the layer below.

Two-layer network with L2 loss

Consider: \(\mathbf{x} \xrightarrow{[\mathbf{W}^{[1]},\,\mathbf{b}^{[1]}]} \text{ReLU} \xrightarrow{[\mathbf{W}^{[2]},\,b^{[2]}]} \hat{y} \xrightarrow{} L\)

Forward pass:

\[ \mathbf{z}^{[1]} = \mathbf{W}^{[1]}\mathbf{x} + \mathbf{b}^{[1]},\qquad \mathbf{a}^{[1]} = \text{ReLU}\!\left(\mathbf{z}^{[1]}\right) \] \[ \hat{y} = z^{[2]} = \mathbf{W}^{[2]}\mathbf{a}^{[1]} + b^{[2]}\qquad\text{(linear output)} \] \[ L = \tfrac{1}{2}(y - \hat{y})^2\qquad\text{(L2 / MSE loss)} \]

Backward pass — deriving each error signal step by step:

\[ \delta^{[2]} = \frac{\partial L}{\partial z^{[2]}} = \frac{\partial L}{\partial \hat{y}}\cdot\frac{\partial \hat{y}}{\partial z^{[2]}} = -(y-\hat{y})\cdot 1 = \hat{y} - y \]
\[ \frac{\partial L}{\partial \mathbf{W}^{[2]}} = \delta^{[2]}\,(\mathbf{a}^{[1]})^\top,\qquad \frac{\partial L}{\partial b^{[2]}} = \delta^{[2]} \]
\[ \delta^{[1]} = \frac{\partial L}{\partial \mathbf{z}^{[1]}} = \underbrace{(\mathbf{W}^{[2]})^\top\delta^{[2]}}_{\text{backprop through }\mathbf{W}^{[2]}} \;\odot\; \underbrace{\mathbf{1}[\mathbf{z}^{[1]}>0]}_{\text{ReLU gate}} \]
\[ \frac{\partial L}{\partial \mathbf{W}^{[1]}} = \delta^{[1]}\,\mathbf{x}^\top,\qquad \frac{\partial L}{\partial \mathbf{b}^{[1]}} = \delta^{[1]} \]
Key pattern: Each layer's error signal \(\delta^{[\ell]}\) is computed from the layer above's signal \(\delta^{[\ell+1]}\) multiplied back through the weight matrix, then gated by the activation derivative. This is the "backpropagated error" interpretation — each layer receives its share of blame from the layer above.
Note on classification: For multi-class classification with softmax output and cross-entropy loss, the formula simplifies to \(\delta^{[L]} = \hat{\mathbf{p}} - \mathbf{y}\) (predicted probabilities minus one-hot target). The remainder of the backprop equations are identical. MATLAB's patternnet handles this automatically.
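The classification shortcut \(\delta^{[L]} = \hat{\mathbf{p}} - \mathbf{y}\) can be confirmed numerically. Below is an illustrative Python/NumPy sketch (random logits, a one-hot target) that compares the shortcut against finite differences of the cross-entropy loss.

```python
import numpy as np

# Check delta^[L] = softmax(z) - y for cross-entropy loss.
rng = np.random.default_rng(0)
z = rng.normal(size=5)               # pre-activations (logits)
y = np.zeros(5); y[2] = 1.0          # one-hot target, class 2

def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

def loss(z):
    return -np.sum(y * np.log(softmax(z)))

delta = softmax(z) - y               # the shortcut formula

eps = 1e-6
numeric = np.array([
    (loss(z + eps * np.eye(5)[i]) - loss(z - eps * np.eye(5)[i])) / (2 * eps)
    for i in range(5)
])
```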

The ReLU derivative

\(\text{ReLU}'(z) = \mathbf{1}[z>0]\) — it is 1 where the neuron was active during the forward pass, and 0 where it was not. The term \(\odot\,\mathbf{1}[\mathbf{z}^{[1]}>0]\) in \(\delta^{[1]}\) "gates" the gradient: only neurons that fired receive gradient signal. Neurons that did not fire have zero gradient — they receive no update this iteration.

Dying ReLU: If a neuron's pre-activation is always negative across all training samples, its gradient is always 0 — it never updates and effectively dies. Careful weight initialization (e.g., Xavier or He) and moderate learning rates prevent this.

The general backpropagation algorithm

For a network with \(L\) layers and a mini-batch of \(B\) examples:

  1. Forward pass: Compute and store \(\mathbf{z}^{[\ell]}\) and \(\mathbf{a}^{[\ell]}\) for all \(\ell=1,\ldots,L\). Storage is necessary — the backward pass reuses these values.
  2. Compute output error signal: \(\delta^{[L]} = \nabla_{z^{[L]}} L\) (depends on loss and output activation choice).
  3. Backpropagate from \(\ell=L-1\) down to \(1\): \(\;\delta^{[\ell]} = (\mathbf{W}^{[\ell+1]})^\top\delta^{[\ell+1]} \odot \phi'(\mathbf{z}^{[\ell]})\).
  4. Compute parameter gradients: \(\;\dfrac{\partial L}{\partial \mathbf{W}^{[\ell]}} = \dfrac{1}{B}\,\delta^{[\ell]}(\mathbf{A}^{[\ell-1]})^\top\), and \(\dfrac{\partial L}{\partial \mathbf{b}^{[\ell]}}\) as the mean of \(\delta^{[\ell]}\) over the batch.
  5. Update parameters with gradient descent: \(\;\mathbf{W}^{[\ell]} \leftarrow \mathbf{W}^{[\ell]} - \alpha\,\dfrac{\partial L}{\partial \mathbf{W}^{[\ell]}}\).
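The steps above can be traced in a short NumPy sketch (illustrative Python rather than the lesson's MATLAB). A two-layer ReLU network with a batch of \(B\) examples stands in for the general case; one entry of \(\partial L/\partial \mathbf{W}^{[1]}\) is checked against a finite difference. All array names and sizes are made up.

```python
import numpy as np

# Two-layer ReLU network, mini-batch of B examples in the columns of X.
rng = np.random.default_rng(1)
B, n_in, n_hid = 8, 4, 5
X = rng.normal(size=(n_in, B))
Y = rng.normal(size=(1, B))                     # regression targets
W1, b1 = rng.normal(size=(n_hid, n_in)) * 0.5, np.zeros((n_hid, 1))
W2, b2 = rng.normal(size=(1, n_hid)) * 0.5, np.zeros((1, 1))

def forward(W1):
    Z1 = W1 @ X + b1                            # step 1: store pre-activations
    A1 = np.maximum(Z1, 0)
    Z2 = W2 @ A1 + b2
    L = 0.5 * np.mean((Z2 - Y) ** 2)            # MSE averaged over the batch
    return Z1, A1, Z2, L

Z1, A1, Z2, L = forward(W1)
delta2 = (Z2 - Y) / B                           # step 2: output error signal
delta1 = (W2.T @ delta2) * (Z1 > 0)             # step 3: backprop + ReLU gate
dW1 = delta1 @ X.T                              # step 4: weight gradient
# step 5 would be: W1 -= alpha * dW1

# Finite-difference check on one entry of W1
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
W1m = W1.copy(); W1m[0, 0] -= eps
numeric = (forward(W1p)[3] - forward(W1m)[3]) / (2 * eps)
```

Note that the \(1/B\) averaging is folded into `delta2`, so the later matrix products need no extra scaling.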

🎲 Stochastic Gradient Descent (SGD)

What does "stochastic" mean?

In Lesson 30, gradient descent computed the exact gradient using all \(N\) training samples. In machine learning, \(N\) can be millions — computing the exact gradient over the entire dataset for every update is prohibitively expensive.

Stochastic means random. Rather than using all \(N\) samples, we randomly select a small subset (\(B\) samples, called a mini-batch) and compute an approximate gradient from that subset. We then take a step using this noisy approximation and move on to the next random mini-batch.

What is the "noise" and how does it help?

The noise is the difference between the true gradient (computed over all \(N\) samples) and the estimated gradient (computed over \(B\) samples). Since the mini-batch is a random sample, the estimated gradient is a random variable — it points roughly in the right direction but not exactly.
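A quick experiment makes the "random variable" claim concrete: averaged over many random mini-batches, the mini-batch gradient matches the full-batch gradient, but any single estimate is noisy. This is an illustrative Python/NumPy sketch on a linear least-squares problem; the data and sizes are made up.

```python
import numpy as np

# Mini-batch gradients are noisy but unbiased estimates of the full gradient.
rng = np.random.default_rng(2)
N, B = 1000, 32
x = rng.normal(size=N)
y = 3.0 * x + rng.normal(scale=0.1, size=N)     # noisy linear data
w = 1.0                                         # current parameter value

def grad(idx):
    r = w * x[idx] - y[idx]                     # residuals on chosen samples
    return np.mean(r * x[idx])                  # d/dw of 0.5 * mean(residual^2)

full = grad(np.arange(N))                                     # exact gradient
estimates = [grad(rng.choice(N, size=B, replace=False))       # noisy estimates
             for _ in range(2000)]
```

The spread of `estimates` around `full` is exactly the "noise" of SGD; shrinking \(B\) widens it, growing \(B\) narrows it.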

Surprisingly, this noise is often beneficial in three ways: each update is far cheaper, so training makes much faster progress per unit of computation; the noise can kick the parameters out of shallow local minima and past saddle points; and it acts as an implicit regularizer, often improving generalization.

The three variants

Variant | Gradient computed over | Updates per epoch | Noise level
GD (full batch) | All \(N\) training samples | 1 | None (exact gradient)
SGD (pure stochastic) | 1 random sample | \(N\) | Very high
Mini-batch SGD | \(B\) random samples (\(B \approx 32\text{–}256\)) | \(\lfloor N/B\rfloor\) | Moderate (best trade-off)
Terminology note: In practice, "SGD" usually refers to mini-batch SGD. Pure single-sample SGD is rarely used in modern deep learning. One epoch = one full pass through the training data = \(\lfloor N/B \rfloor\) mini-batch updates.
MATLAB's patternnet: By default it uses full-batch gradient descent with a second-order optimizer (scaled conjugate gradient), which processes all training samples each iteration. This is appropriate for our small datasets (hundreds of samples). For the neural network literature and large-scale deep learning, mini-batch SGD is standard.

⚙️ The Learning Rate

The learning rate \(\alpha\) controls the step size at each parameter update:

\[ \mathbf{W} \leftarrow \mathbf{W} - \alpha\,\frac{\partial L}{\partial \mathbf{W}} \]

This is exactly the gradient descent update rule from Lesson 30 (where we called it \(\gamma\)). The same trade-offs apply — but they are amplified in neural networks because the loss surface is high-dimensional, non-convex, and full of saddle points and narrow valleys.

Four regimes

[Plot: training loss vs. epoch for four learning rates: too small (α ~ 10⁻³), good (α ~ 5×10⁻²), too large (α ~ 0.5), diverges (α ~ 2).]
Figure 1. Effect of learning rate on training loss. A good learning rate gives a smooth, rapid decrease (green). Too small converges but extremely slowly (blue). Too large causes oscillation (orange) or divergence (red).
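The four regimes are easy to reproduce on the 1-D quadratic \(L(w) = \tfrac{1}{2}cw^2\): each gradient descent step multiplies \(w\) by \((1 - \alpha c)\), so the loss diverges once \(\alpha > 2/c\). An illustrative Python sketch (the specific α values are chosen for the demonstration, not taken from Figure 1):

```python
# Gradient descent on L(w) = 0.5 * c * w^2; the update is w *= (1 - alpha*c).
def final_loss(alpha, steps=50, c=1.0, w0=2.0):
    w = w0
    for _ in range(steps):
        w -= alpha * c * w            # gradient of 0.5*c*w^2 is c*w
    return 0.5 * c * w ** 2

slow    = final_loss(0.001)   # too small: barely moves in 50 steps
good    = final_loss(0.5)     # good: rapid convergence
osc     = final_loss(1.9)     # too large: oscillates, converges slowly
diverge = final_loss(2.5)     # diverges: |1 - alpha*c| > 1
```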

Momentum

Standard gradient descent treats each update independently. Gradient descent with momentum accumulates a moving average of past gradients — called the velocity — and adds it to the current update. This damps oscillations (the updates are smoother) and builds speed along consistent gradient directions (like a ball rolling downhill and gaining inertia):

\[ \mathbf{v} \leftarrow \mu\,\mathbf{v} + \alpha\,\nabla_{\mathbf{W}} L, \qquad \mathbf{W} \leftarrow \mathbf{W} - \mathbf{v} \]

where \(\mathbf{v}\) is the velocity vector (same dimensions as \(\mathbf{W}\)), initialized to zero and updated each iteration. The momentum coefficient \(\mu \in [0,1)\), typically set to \(0.9\), controls how much of the previous velocity is retained. A higher \(\mu\) gives more "inertia." In MATLAB, traingdm implements gradient descent with momentum.
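The two-line momentum update from the equations above, traced on the 1-D quadratic \(L(w) = \tfrac{1}{2}w^2\) (an illustrative Python sketch; the values of \(\alpha\) and \(\mu\) are typical choices, not prescribed by the lesson):

```python
# Gradient descent with momentum on L(w) = 0.5 * w^2.
w, v = 2.0, 0.0                  # parameter and velocity (v starts at zero)
alpha, mu = 0.1, 0.9             # learning rate and momentum coefficient
history = []
for _ in range(100):
    g = w                        # gradient of 0.5*w^2
    v = mu * v + alpha * g       # velocity accumulates past gradients
    w = w - v                    # parameter update uses the velocity
    history.append(w)
```

With \(\mu = 0.9\) the iterate overshoots the minimum and oscillates (the "inertia"), but the envelope of the oscillation decays steadily toward zero.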

Adaptive learning rates

A fixed learning rate requires careful tuning: what works for early training (when gradients are large) may be too large or too small later. Adaptive optimizers automatically adjust the learning rate for each parameter based on its gradient history:

Optimizer | Key idea | MATLAB
GD with momentum | Accumulate gradient history as velocity; damp oscillations | traingdm
RMSprop | Divide the learning rate by a running average of squared gradients; parameters with large gradients get smaller steps | trainNetwork only
Adam | Combines momentum (first moment) and RMSprop (second moment); a widely used default for deep learning | trainNetwork with trainingOptions('adam')
SCG (default in patternnet) | Second-order method; uses curvature information to select the step size automatically; no learning rate to tune | trainscg (default)
Practical guideline: For small datasets (hundreds of samples), trainscg (MATLAB's default) is often the best choice because it selects its step size automatically using second-order curvature information. For large datasets or deep architectures, Adam is the most common choice. Manual learning-rate tuning with traingdm is useful for understanding the effect of \(\alpha\), which is the focus of HW 36.
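For reference, the core Adam update (Kingma & Ba, 2015) is only a few lines. Below is an illustrative Python sketch on the 1-D quadratic \(L(w) = \tfrac{1}{2}w^2\); the hyperparameter values are the paper's defaults except for the learning rate, which is chosen for this toy problem.

```python
import numpy as np

# Bare-bones Adam on L(w) = 0.5 * w^2.
w = 2.0
m, v = 0.0, 0.0                              # first and second moment estimates
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 201):                      # t starts at 1 for bias correction
    g = w                                    # gradient of 0.5*w^2
    m = beta1 * m + (1 - beta1) * g          # momentum-like first moment
    v = beta2 * v + (1 - beta2) * g ** 2     # RMSprop-like second moment
    m_hat = m / (1 - beta1 ** t)             # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    w -= alpha * m_hat / (np.sqrt(v_hat) + eps)
```

Because the step is normalized by \(\sqrt{\hat{v}}\), its size is roughly \(\alpha\) regardless of the raw gradient magnitude; this is what makes Adam robust to poorly scaled parameters.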

📈 Training & Validation Curves

Monitoring the loss during training is how you diagnose whether your network is learning well. The key principle: split your data into training and validation sets, then track the loss on both throughout training. MATLAB's patternnet does this automatically.

How patternnet splits data

By default, patternnet randomly divides the input data into three subsets before training: 70% for training (used to compute gradients and update weights), 15% for validation (used to monitor generalization and trigger early stopping), and 15% for testing (held out entirely until after training).

Click Performance in the training GUI to see the loss on all three sets vs. epoch.

Four diagnostic patterns

[Figure: four panels of training/validation loss curves: underfitting (both high and flat), good generalization (both low, tracking closely), overfitting onset (validation rises after the best epoch), severe overfitting (training low, validation stays high).]
Figure 2. Four diagnostic patterns in the performance plot. Underfitting: model too simple. Good generalization: both curves low and close. Overfitting onset: validation rises after a minimum — stop here. Severe overfitting: training near zero but validation stays high throughout.

Interpreting the patterns

Panel | Pattern | Diagnosis | Remedy
1 | Both curves high and flat | Underfitting: model too simple or learning rate too small | Add neurons/layers; increase learning rate; train longer
2 | Both curves low and tracking closely | Good generalization: model fits well | No action needed
3 | Val. loss decreases then rises (U-shape) | Overfitting onset: model has reached its generalization limit | Use early stopping; restore weights at the validation minimum
4 | Training near zero, validation stays high throughout | Severe overfitting: large persistent gap; model memorizes training data | Reduce model size; add more data; add regularization

Early stopping

Early stopping halts training when the validation loss has not improved for a specified number of consecutive epochs, then rolls back the weights to the best validation epoch. In MATLAB:

net.trainParam.max_fail = 6;   % stop after 6 consecutive epochs without
                               % improvement in validation loss (default)
Why early stopping is powerful: It is automatic (no extra hyperparameters to tune), free (no extra computation), and directly prevents the overfitting pattern shown in Figure 2. When max_fail triggers, MATLAB restores the weights from the best validation epoch — the dashed red line in Panel 3.
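The patience logic behind max_fail is simple enough to sketch directly. This illustrative Python snippet uses a made-up U-shaped validation-loss sequence; "restoring the weights" is reduced to remembering the best epoch.

```python
# Early stopping with patience, mirroring MATLAB's max_fail: stop after
# `patience` consecutive epochs with no new best validation loss, then
# restore the weights saved at the best epoch.
val_losses = [1.0, 0.7, 0.5, 0.42, 0.40, 0.41, 0.43, 0.45, 0.48, 0.52, 0.57]

patience = 3
best_loss, best_epoch, fails = float("inf"), -1, 0
stopped_at = None
for epoch, vl in enumerate(val_losses):
    if vl < best_loss:
        best_loss, best_epoch, fails = vl, epoch, 0   # new best: save weights
    else:
        fails += 1                                    # one more epoch w/o improvement
        if fails >= patience:
            stopped_at = epoch                        # halt training here
            break
```

Training halts three epochs past the validation minimum, and the weights from the minimum (epoch 4 in this made-up sequence) are the ones kept.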

Recommended Videos

3Blue1Brown's two-part backpropagation series. Watch Ch. 3 first for intuition, then Ch. 4 for the full calculus derivation:

💻 Worked Example — Handwritten Digit Recognition

We train a shallow network on MATLAB's built-in DigitDataset (10,000 handwritten digit images, classes 0–9, 28×28 pixels). Rather than patternnet, we use the Deep Learning Toolbox layer API directly — the same approach you will use in HW 36 and beyond, giving you full control over architecture, activation functions, and optimizer settings.

Three parameters at the top of the script control the experiment: n_per_class (dataset size), n_hidden (network capacity), and learn_rate (optimizer step size). The defaults are deliberately chosen to produce poor results. Try adjusting them to see how each one affects the Training Progress curves before tackling HW 36.

%% Lesson 36 — Handwritten Digit Classification (Deep Learning Toolbox)
% Dataset: MATLAB's built-in DigitDataset (10,000 images, 10 classes, 28x28)
% Network: 784 → n_hidden (ReLU + Dropout) → 10 (softmax)
% HW 36: change n_per_class and n_hidden to explore underfitting/overfitting.

%% ── Parameters to experiment with ──────────────────────────────────────────
n_per_class = 100;    % images loaded per digit class  (max 1000; try 50–500)
n_hidden    = 10;     % neurons in the hidden layer     (try 10, 64, 256, 512)
learn_rate  = 0.05;   % Adam learning rate (default 0.001; high → overfits)
%% ─────────────────────────────────────────────────────────────────────────

%% 1. Load dataset
digitPath = fullfile(matlabroot,'toolbox','nnet','nndemos', ...
                     'nndatasets','DigitDataset');
imds = imageDatastore(digitPath, ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
rng(356);
imds_sub = splitEachLabel(imds, n_per_class, 'randomize');
fprintf('Loaded %d images (%d per class)\n', numel(imds_sub.Files), n_per_class);

%% 2. Split into training (80%) and test (20%)
[imds_train, imds_test] = splitEachLabel(imds_sub, 0.8, 'randomize');

%% 3. Flatten images: each 28x28 → 784-element row vector
%  featureInputLayer expects N×features (rows = observations)
imgs_train = readall(imds_train);
v_train    = cellfun(@(I) double(I(:))/255, imgs_train, 'UniformOutput', false);
X_train    = [v_train{:}]';          % (0.8*n_per_class*10) × 784

imgs_test  = readall(imds_test);
v_test     = cellfun(@(I) double(I(:))/255, imgs_test,  'UniformOutput', false);
X_test     = [v_test{:}]';           % (0.2*n_per_class*10) × 784

Y_train = imds_train.Labels;         % categorical
Y_test  = imds_test.Labels;          % categorical

%% 4. Define the network architecture (layer-by-layer)
layers = [
    featureInputLayer(784,        'Name', 'input')
    fullyConnectedLayer(n_hidden, 'Name', 'fc1')
    reluLayer(                    'Name', 'relu1')
    dropoutLayer(0.3,             'Name', 'drop1')   % randomly zero 30% of
    fullyConnectedLayer(10,       'Name', 'fc2')     %   activations each batch
    softmaxLayer(                 'Name', 'softmax')
    classificationLayer(          'Name', 'output')
];

%% 5. Set training options
opts = trainingOptions('adam', ...           % adaptive learning rate
    'MaxEpochs',           50, ...
    'MiniBatchSize',       64, ...
    'InitialLearnRate',    learn_rate, ...
    'ValidationData',      {X_test, Y_test}, ...
    'ValidationFrequency', 5, ...
    'ValidationPatience',  10, ...           % early stopping
    'Plots',               'training-progress', ...
    'Verbose',             false);

%% 6. Train
net = trainNetwork(X_train, Y_train, layers, opts);
% → Watch the Training Progress window.
% → With these defaults (lr=0.05, n_hidden=10, n_per_class=100) you should
%   see overfitting: training accuracy climbs while validation lags or drops.
%   Lower learn_rate to 0.001 and raise n_per_class / n_hidden to improve.

%% 7. Evaluate on test set
Y_pred = classify(net, X_test);
acc    = 100 * mean(Y_pred == Y_test);
fprintf('n_per_class=%d  n_hidden=%d  lr=%.4f  Test acc: %.1f%%\n', ...
    n_per_class, n_hidden, learn_rate, acc);

%% 8. Visualize 25 random test examples in a 5×5 grid
%  Note: test images are stored in class order, so taking the first 25
%  would show mostly "0"s.  Use randperm to get a representative mix.
idx = randperm(numel(imgs_test), 25);
figure('Name', 'Test Set Predictions', 'NumberTitle', 'off');
for k = 1:25
    subplot(5,5,k);
    imshow(imgs_test{idx(k)});
    pred_str = char(Y_pred(idx(k)));
    true_str = char(Y_test(idx(k)));
    if strcmp(pred_str, true_str)
        title(sprintf('Pred: %s', pred_str), 'Color', 'g', 'FontSize', 11);
    else
        title(sprintf('Pred: %s (T:%s)', pred_str, true_str), ...
              'Color', 'r', 'FontSize', 11);
    end
end
sgtitle('25 Random Test Predictions  (green = correct  |  red = wrong)');
What does dropoutLayer(0.3) do? During each training mini-batch, dropout randomly sets 30% of the activations in that layer to zero — a different random subset each time. This forces the network to learn redundant, distributed representations rather than relying on a small number of neurons, which reduces overfitting. At test time dropout is automatically disabled and all activations are used (scaled to account for the dropped fraction during training). Dropout is one of the most widely used regularization techniques in deep learning.
Why flatten instead of using raw images? A fully-connected layer treats each pixel as an independent feature — it does not exploit the spatial structure of the image. Convolutional Neural Networks (Lesson 37) solve this by applying shared filters that detect local patterns regardless of position.

📝 Summary

Concept | Key idea
Backpropagation | Chain rule applied layer-by-layer backward; computes \(\partial L/\partial \mathbf{W}^{[\ell]}\) for every layer efficiently
L2 output delta | \(\delta^{[L]} = \hat{y} - y\) for linear output + MSE loss
Hidden delta | \(\delta^{[\ell]} = (\mathbf{W}^{[\ell+1]})^\top\delta^{[\ell+1]} \odot \phi'(\mathbf{z}^{[\ell]})\)
Mini-batch SGD | Use \(B\) random samples per update; noise improves speed, helps escape local minima, acts as regularization
Learning rate \(\alpha\) | Too small → slow; too large → oscillates or diverges; adaptive methods (Adam, SCG) remove the need to tune
Momentum | Velocity \(\mathbf{v}\) accumulates past gradients; damps oscillations and builds speed in consistent directions
Training & validation curves | Underfitting: both high; good fit: both low and close; overfitting: val. rises while train. falls
Early stopping | Halt when val. loss stops improving; restore best weights; prevents overfitting automatically
📋 HW 36 — Training Parameter Exploration (DigitDataset)

📚 References

  1. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, Ch. 6.5 (Backprop) & Ch. 8 (Optimization). MIT Press.
  2. 3Blue1Brown. (2017). What is backpropagation really doing? [Video]. YouTube. youtube.com
  3. 3Blue1Brown. (2017). Backpropagation calculus [Video]. YouTube. youtube.com
  4. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.
  5. Prechelt, L. (1998). Early stopping — but when? In Neural Networks: Tricks of the Trade, Springer, pp. 55–69.
  6. Kingma, D. P. & Ba, J. (2015). Adam: A method for stochastic optimization. ICLR 2015.