deeplearning.ai: Improving Deep Neural Networks
my notes on part 2 of the deeplearning.ai course
Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
Course Resources
 Discussion forum
 YouTube playlist for Course 1
 course reviews: 1
Week 1: Practical aspects of Deep Learning
Train / Dev / Test sets
 training is a highly iterative process as you test ideas and parameters
 even highly skilled practicioners find it impossible to predict the best parameters for a problem, so going through
 split data into train/dev/test sets
 training set  the actual dataset used to train the model
 cross validation or dev set  used to evaluate the model
 test set  only used once the model is finally trained to get a unbiased estimate of the models performance
 depending on the dataset size, use different splits, eg:
 100 to 1M ==> 60/20/20
 1M to INF ==> 98/1/1 or 99.5/0.25/0.25
 you can mismatch train/test distribution, so make sure they come from the same distribution
Bias / Variance
 high bias  model is under fitting, or not even fitting the training set
 high variance  model is over fitting the training and dev set
 you can have both high bias AND high variance
 balance the bias and variance, and also consider the underlying problems tractibility, as are we aiming for 99% accuracy or 70%? Compare with other models and human performance to get a sense of a baseline error
Basic recipe for machine learning
 for high bias errors: bigger NN,more layers, different model, try different activations
 high variance: more data, regularize appropirately, try a different model
 training a bigger NN almost never hurts
Regularization
 vectors often have many dimensions, so model sizes get BIG. regulariation helps to select features and reduce dimensions, as well as reduce overfitting
 L1 regularization (or Lasso) adds a penalty equal to the sum of absoulte coefficients
 L2 regularization (or Ridge) adds a penalty equal to the sum of squared coefficients
L2regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the square values of the weights in the cost function you drive all the weights to smaller values. It becomes too costly for the cost to have large weights! This leads to a smoother model in which the output changes more slowly as the input changes.
 as a rule of thumb, L2 almost always works better than L1 .
Dropout regularization
 for each iteration, use a probablity to determine whether to “drop” a neuron. So each iteration through the NN drops a different, random set of neurons.
 each layer in the NN can have a different dropout probability, the downside is that we have more hyperparameters to tweak
 in computer vision, we almost always use dropout becuase we are almost always overfitting in vision problems
 a downside of dropout is that the cost function is no longer well defined, so turn off dropout to see that the loss is dropping, which implies that our model is working, then turn on dropout for better training
 remove dropout at test time
 dropout intuitions:
 can’t rely on on any given neuron, so have to spread out weights
 can work similar to L2 regularization
 helps prevent overfitting
Data augmentation
 inexpensive way to get more data  e.g with images flip, distory, crop, rotate etc to get more images

this helps as a regularization technique
Early stopping
 stop training the NN when the dev set and training set error start diverging  get the huperparameters from the lowest dev and training set cost.
 this prevents the NN from overfitting on the training set but stops gradient descent early
 Andrew NG generally prefers using L2 regularization instead of early stopping as that
Deep NN suffer from vanishing and exploding gradients
 this happens when gradients become very small or big
 in a deep NN if activations are linear, than the activations and derivates will increase exponentially with layers
Weight Initialization for Deep Networks
 this is a paritial solution to vanishing/exploding gradients
 initialize the weights with a variance equal to 1/n  where n is the number of input features  sure weights are not too small or not to large
Numerical approximation of gradients
Gradient Checking

check our gradient computation functions  this helps debug backprop
Week 2:Optimizatoni algorithims
Minibatch gradient descent
 training big data is slow  breaking it up into smaller batches speeds things up
 vectorization allows us to put our entire data set of m examples into a huge matrix and process it all in one go
 but this way we have to process the entire set before our gradient descent can make a small step in the right direction
 training on mini batches allows gradient descent to work much faster, as in one iteration over the dataset we would have taken m / batch_size gradient descent steps
 makes the leaning more ‘noisy’ since each minibatch is new data (compared to going over the entire dataset)
 a typical minibatch size is 64, 128, 256, 512
 should be a power of 2 as thats how computer memory is setup

make sure the batch fits inside cpu/gpu memory
Exponentially weighted averages
 also called exponentially weighted moving averages in statistics
 this decreases the weight of older data exponentially, enhancing the effect of more recent numbers which makes it easy to spot new trends. Of course the parameters can be tweaked, like changing beta depending on how many data points to average
(1 / (1  beta))
.  key component of several optimization algos
V(t) = beta * v(t1) + (1beta) * theta(t)
bias correction in exponentially weighted averages:
 the moving avg in the beginning is low since there isn’t past data to refer from and it starts from zero.
 so add a bias term, dividing the above equation by
(1  beta^t)
:
v(t) = (beta * v(t1) + (1beta) * theta(t)) / (1  beta^t)
Gradient descent with momentum
 modify gradient descent so on each iteration compute the exponential weighted averages of the gradients and update weights
 this takes us faster to the min point, and dampens out oscillations
 most common value of beta is
0.9
. generally we don’t bother with bias correction since we do so many iterations  this explains why momentum works:
With Stochastic Gradient Descent we don’t compute the exact derivate of our loss function. Instead, we’re estimating it on a small batch. Which means we’re not always going in the optimal direction, because our derivatives are ‘noisy’. Just like in my graphs above. So, exponentially weighed averages can provide us a better estimate which is closer to the actual derivate than our noisy calculations.
 also see Nesterov Momentum
RMSprop or Root mean square prop
 we want learning to go fast horizontally and slower vertically  so we divide updates in the vertical direction by a large number and updates in the horizontal direction by a much smaller number
 this dampens oscillations, so we can use a faster learning rate
 first proposed in Hinton’s coursera course
Adam optimization algorithim
 mashes together momemtum and RMSprop, works very well
 parameters:
 learning rate alpha
 beta1: moving avg or momemtum parameter, same as 0.9 above
 beta2: RMSprop,
0.999
works well  epsilon
10^8
 less important, can leave it at default  generally leave values at default, just try out different learning rates
Learning rate decay
 slowly decrease learning rate over epochs
 this makes intuitive sense as in the beginning bigger steps are ok and as the NN starts converging, we need smaller steps
 other methods: exponential decay, discrete steps, etc
 manual decay  watch the model as it trains, pause and manually change the learning rate  works if running a small number of models which take a long time to train
 this is lower down on the list of things to try when tuning
The problem of local optima
 people used to worry a lot about NN getting in local optima, but in multidimensional space most points of zero gradient are saddle points so its very unlikely to get stuck in a local optima
 a lot of our intuitions about low dimensional spaces don’t transfer over the high dimensional space practically all NN’s use  i.e if we have 20K parameters, the NN is operating in a 20K dimensional space
 but plateaus can slow down learning, so techniques like momentum, Adam help here
Week 3: Hyperparameter tuning, Batch Normalization and Programming Frameworks
Hyperparameter tuning
The Tuning Process
 there are tons of hyperparameters to tune, but some are more important than others
 alpha or the learning rate is the most important parameter most the time
 then the second most important:
 momentum (0.9 being a good default,
 minibatch size
 hidden units
 third in important:
 num of layers learning rate decay
 when using Adam, you pretty much never have to tune beta1, beta2 and epsilon
 of course, this depends on the NN, dataset etc
 don’t use a grid of one val vs the other  this works when you have a small number of hyperparameters, but in Deep Learning choose hyperparameter combinations at random
 use coarse to fine sampling  find the range where a parameter is working, then do finer grained sampling to get the best val
Using an appropriate scale to pick hyperparameters
 pick the appropriate scale for hyperparameters, generally better to use log scale rather than linear
Hyperparameters tuning in practice: Pandas vs. Caviar
 intuitions about hyperparameters often don’t transfer to other domains e.g logistcs, nlpo, vision, speech will all have different best parameters
 two major ways to find the best parameters:
 babysit one model  as its training, tweak the parameters and see how its doing on metrics. This is helpful when we don’t have enough computation capcity (panda approach)
 train many models in parallel  run many models in parallel, with different parameters (caviar approach)
Batch Norm
Normalizing activations in a network
 instead of just normalizing the inputs to a NN, we also normalize the outputs of each layer of a NN  this is called batch norm.
 using the standard
(intermediate val  mean) / (std dev + epsilon)
. Epsilon is needed for numerical stability if variance is zero  batch norm sets the hidden layer to have mean 0 and variance 1, but sometimes we might want it to have a different distribution, so we can add a parameter beta to the hidden units to change the shape of the distribution. (like we might want a larger variance to take advantage of the nonlinearity of the sigmoid function).
 batch norm makes the NN more robust and speeds up learning
 while batch norm can be applied before or after the activation function, in practice it is generally applied before the activation funciton.
Fitting Batch Normalization into a neural network
 Deep learning frameworks have batch norm built in, like tensorflow, keras etc.
 implementing this ourself is straightforward, as we add in a norm step just before applying the activation function at each layer in the NN.
 In each minibatch, for each hidden layer compute the mean and the variance in that minibatch, normalize, then apply the activation function as before.
Why does Batch normalization work?
Three main reaons:
 1: normalizing input features speeds up learning, so one intuition is that this is doing a similar thing for each hidden layer
 2: makes weights in deeper layers more robust to changes in weights in earlier layers
 for example, we train a network on black cats, and we try to classify a coloured cat. Every input is shifted, but the decision boundaries haven’t changed. This has a fancy name, coviariate shift.
 allows each layer to learn more independently
 3: regularization of hidden units
 adds some noise to each minibatch, as each one is regularized on its own mean/variance, which has a slight regularization effect (bigger batches reduce noise/regularization)
 but don’t rely on batch norm for regularization, as its just a unintended side effect, use other techniques like L2 or dropout
Batch normalization at test time
 we often predict one example at a time at test time, so the idea of a mean/variance doesn’t apply  we no longer have a minibatch to process
 we calculate the exponentially weighted average across all the minibatches and use that to get a mean/variance to apply at test time on the one test example
Multiclass Classification
Softmax regresssion
 is generalization of logistic regression which can predict multiple classes rather than just a binary.
Training a Softmax classifier
 a hard max would look at a vector and put 1 for the max and zero for everything else
 soft max takes in a vector and puts in a probability for each value such that they all sum up to 1
 loss function is trying to make the probablity of the “right” class as high as possible
Intro to programming frameworks
Deep learning frameworks
 implementing a basic NN libary is a great learning framework
 as we implemnt complex/large models its not practical implement everything from scratch  there are many good frameworks to choose from
 the DL frameworks not only speed up coding but implement many optimizations
 choose one by looking at programning ease, running speed and how open it is
 my own research has led my to tensorflow/keras or pytorch
 covers the very basics of tensorflow, though the tensorflow guide is better.
 we implement forward prop, tf automatically does the backprop
 a tf program looks like:
 make tensors
 write operations to those tensors
 initialize tensors
 create and run a Session
 tf.placeholder is a variable to which we assign a value later
Yuanqing Lin interview
 heads China’s National Deep Learning Research Lab
 building a really large deep learning platform