- Parametrized Models

- Symbols – similar to Factor Graphs

- Bubbles

- Black = observed variables

- Blue = computed variable

- Round blue shape

- Direction == the direction in which the function is easy to compute

- Deterministic functions

- Red square

- Cost function

- single scalar output

- Loss Function
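
- A standard way to write it (assumed notation, not spelled out in the notes): the loss averages a per-sample cost over the training set,

```latex
L(S, w) = \frac{1}{P} \sum_{p=1}^{P} C\big(f(x^{(p)}; w),\, y^{(p)}\big)
```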

- Minimization by gradient-based methods

- Can easily find the gradient of a function

- function is differentiable

- almost everywhere

- should be continuous

- can have kinks

- Gradient Descent
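
- A minimal sketch of the update, w ← w − η·∇L(w); the toy loss and learning rate are placeholders for illustration:

```python
import torch

# Toy gradient descent loop: w <- w - eta * dL/dw
w = torch.randn(10, requires_grad=True)   # parameters
eta = 0.1                                  # learning rate (placeholder value)

for step in range(100):
    loss = (w ** 2).sum()                  # stand-in for a real cost function
    loss.backward()                        # compute dL/dw
    with torch.no_grad():
        w -= eta * w.grad                  # gradient step
        w.grad.zero_()                     # reset gradient for the next step
```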

- There are algorithms that aren't gradient based

- e.g. staircase-type (piecewise constant) objectives, where the gradient is zero almost everywhere

- don't know a function / can't get a gradient

- zeroth-order methods / gradient-free methods

- whole family of these methods

- used in reinforcement learning

- where the cost isn't differentiable

- (cost becomes a black box)

- can apply gradient estimation

- very inefficient for high dimensions with a huge space to search
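
- A sketch of finite-difference gradient estimation on a black-box cost; it needs one extra cost evaluation per dimension, which is why it becomes impractical in high dimensions (the toy cost below is hypothetical):

```python
import numpy as np

def estimate_gradient(cost, w, eps=1e-4):
    """Zeroth-order gradient estimate of a black-box cost via finite differences."""
    grad = np.zeros_like(w)
    base = cost(w)
    for i in range(len(w)):                # one extra evaluation per dimension
        w_pert = w.copy()
        w_pert[i] += eps
        grad[i] = (cost(w_pert) - base) / eps
    return grad

cost = lambda w: np.sum((w - 1.0) ** 2)    # toy black-box cost
print(estimate_gradient(cost, np.zeros(5)))
```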

- Can use a critic method (Actor-Critic / A2C / etc.)

- By training a "C" module that
*is* differentiable to estimate the cost function

- Reward is negative of a cost

- For batches, a rule of thumb is to use roughly the number of categories (or 2×) as the batch size

- Neural Nets

- Backprop

- Pytorch

- from torch import nn

- make a class for the net (subclassing nn.Module)

- Linear layers
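
- A minimal sketch of such a class (layer sizes are placeholders):

```python
import torch
from torch import nn

class Net(nn.Module):
    """Tiny fully-connected net: Linear -> ReLU -> Linear."""
    def __init__(self, in_dim=2, hidden_dim=100, out_dim=2):   # made-up sizes
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, out_dim)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)                  # raw logits, no softmax here

model = Net()
print(model)
```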

- Chain rule for vector functions

- Jacobian Matrix

- The computation graph can be turned into a second graph that computes and backpropagates the gradients

- Can be very complex if the architecture is data dependent
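
- A small illustration with a toy composition z = g(f(x)): calling `.backward()` walks the recorded graph in reverse and multiplies the local Jacobians, i.e. the chain rule:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = torch.sin(x)           # f
z = (y ** 2).sum()         # g, scalar output

z.backward()               # dz/dx = (dy/dx)^T (dz/dy), applied node by node
print(x.grad)              # equals 2*sin(x)*cos(x) = sin(2x)
```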

- Modules used in neural nets

- used because they're optimized

- Linear: Y = W.X

- ReLU: y = ReLU(x)

- Duplicate: y1 = x ; y2 = x

- Used when a wire splits into two

- Add: y = x1 + x2

- Max: y = max(x1, x2)

- LogSoftMax: y_{i} = x_{i} - log(sum_{j} e^{x_j})

- Softmax
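
- A quick sketch of how these modules look as PyTorch ops, including a check of the LogSoftMax identity above (input values are arbitrary):

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])

# LogSoftMax: y_i = x_i - log(sum_j e^{x_j})
manual  = x - torch.logsumexp(x, dim=0)
builtin = torch.log_softmax(x, dim=0)
print(torch.allclose(manual, builtin))     # True

relu = torch.relu(x)                       # ReLU
y1, y2 = x, x                              # Duplicate: split wire; gradients add on backward
s = x + x                                  # Add
m = torch.maximum(x, -x)                   # Max (elementwise)
```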

- Sigmoid with targets at its asymptotes (0/1) doesn't work very well for classification

- the sigmoid's gradient at its extremes is very small because the curve is flat there

- this leads to the saturation problem

- Solutions

- Set targets in between instead of 1/0 (e.g. 0.8 and 0.2)

- Or take the log of it

- Taking the log of the sigmoid

- S - log(1 + e^{S})

- for large S, log(1 + e^{S}) ≈ S

- for small S, the expression is dominated by the log term

- doesn't saturate! – no vanishing gradients
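
- A tiny check of the no-saturation claim (not from the notes): at a strongly negative pre-activation the plain sigmoid's gradient vanishes, while the log-sigmoid's gradient stays close to 1:

```python
import torch
import torch.nn.functional as F

s = torch.tensor(-10.0, requires_grad=True)

torch.sigmoid(s).backward()
print(s.grad)                        # ~4.5e-05: the gradient has vanished

s.grad = None
F.logsigmoid(s).backward()           # log sigmoid(S) = S - log(1 + e^S)
print(s.grad)                        # ~1.0: no vanishing gradient
```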

- Tricks

- Use ReLU non-linearities – works well for many layers (scale-equivariant: ReLU(a·x) = a·ReLU(x) for a > 0)

- Cross-entropy loss – log softmax is a simpler special case

- Stochastic gradient on minibatches

- Shuffle the training samples

- Otherwise the last layer just learns the current type of input

- Normalize inputs to 0 mean and unit variance

- Can use it on RGB channels as well

- the channels have very different means
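
- A sketch of per-channel normalization with torchvision; the mean/std values below are the commonly used ImageNet statistics, included only as an example:

```python
from torchvision import transforms

# Normalize each RGB channel toward zero mean / unit variance.
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # example ImageNet stats
                         std=[0.229, 0.224, 0.225]),
])
```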

- Schedule a decrease of the learning rate

- Dropout regularization

- L2 -> weight decay at every update

- L = C() + α * R(w); R(w) = ||w||^{2}

- Leads to shrinking the weights at every iteration

- L1 -> R(w) = sum_{i} |w_{i}|

- "lasso"

- least absolute shrinkage and selection operator
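
- A hedged sketch tying the last few tricks together (learning-rate schedule, dropout, L2 via weight decay, explicit L1 penalty); all hyperparameter values are placeholders:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(10, 100), nn.ReLU(),
                      nn.Dropout(p=0.5),                  # dropout regularization
                      nn.Linear(100, 2))

# L2 regularization is usually applied as weight decay inside the optimizer.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
# Decrease the learning rate on a schedule (call scheduler.step() once per epoch).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

def l1_penalty(model, alpha=1e-5):
    # L1 ("lasso") term: alpha * sum_i |w_i|, added to the loss by hand
    return alpha * sum(p.abs().sum() for p in model.parameters())
```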

- Additional references

- Efficient Backprop

- Neural Networks: Tricks of the Trade

- Any directed acyclic graph is ok for backprop

- Lab

- Neural networks are rotations (linear transformations) and squashing (non-linearities)

- Draw inputs at the bottom

- Having a high-dimensional intermediate representation is very helpful; alternatively, simply add more hidden layers

- Because the number of connections grows significantly

- Logit output of final layer

- loss is cross-entropy / negative log-likelihood
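
- A small sketch of that pairing: in PyTorch, `nn.CrossEntropyLoss` applies log-softmax to the raw logits and then the negative log-likelihood (shapes below are made up):

```python
import torch
from torch import nn

criterion = nn.CrossEntropyLoss()          # = LogSoftmax + NLLLoss
logits = torch.randn(8, 2)                 # raw final-layer outputs, batch of 8
targets = torch.randint(0, 2, (8,))        # class indices
print(criterion(logits, targets).item())
```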

- Choice of activation function is very important

- Train a bunch of networks with different initial values – the variance of their predictions gives a sense of the uncertainty
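
- A hedged sketch of that idea: train several copies of the same architecture from different random initializations and use the spread of their predictions as a rough uncertainty estimate (`make_model` and `train` are hypothetical stand-ins for the model class and training loop):

```python
import torch

def prediction_spread(make_model, train, x, n_models=5):
    preds = []
    for seed in range(n_models):
        torch.manual_seed(seed)            # different initial weights per copy
        model = make_model()
        train(model)                       # hypothetical training loop
        with torch.no_grad():
            preds.append(model(x))
    preds = torch.stack(preds)
    return preds.mean(dim=0), preds.var(dim=0)   # variance ~ predictive uncertainty
```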