TensorFlow and deep learning without a PhD by Martin Görner

My notes from Martin Görner’s YouTube talk at Devoxx about neural networks and TensorFlow.

The 3-hour course (video + slides) offers developers a quick introduction to deep-learning fundamentals, with some TensorFlow thrown into the bargain.

Part 1 - slides

Softmax - a good activation function for multi-class logistic regression

Y = softmax(X * W + b)
b: biases; W: weights; X: images, one flattened image per line; softmax is applied line by line; Y: predictions
Hidden layers: ReLU usually outperforms sigmoid, whose gradient vanishes when it saturates
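
To make the formula above concrete, here is a minimal pure-Python sketch of a single softmax layer. The helper names (softmax, predict) and the toy shapes (2 pixels, 3 classes) are my own assumptions for illustration, not code from the talk; real code would use TensorFlow tensors rather than lists.

```python
import math

def softmax(row):
    """Turn one row of scores into probabilities that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def predict(X, W, b):
    """Y = softmax(X * W + b), with softmax applied line by line."""
    Y = []
    for x in X:  # one flattened image per row of X
        scores = [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
                  for j in range(len(b))]
        Y.append(softmax(scores))
    return Y

# Hypothetical toy data: 2 "images" of 2 pixels each, 3 classes.
X = [[1.0, 0.0], [0.0, 1.0]]
W = [[2.0, 0.0, -1.0], [0.0, 2.0, -1.0]]
b = [0.0, 0.0, 0.0]
Y = predict(X, W, b)
```

Each row of Y is a probability distribution over the classes, and the largest entry is the predicted class.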
Loss function

For classification problems, cross-entropy works a bit better than a squared-error (Euclidean distance) loss
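
A small sketch of how the two losses behave, assuming one-hot labels and predictions that already come out of a softmax. The function names and the example probabilities are hypothetical, picked only to show the difference.

```python
import math

def cross_entropy(predicted, one_hot):
    """-sum(label * log(prediction)) for one example."""
    eps = 1e-12  # avoid log(0)
    return -sum(t * math.log(p + eps) for p, t in zip(predicted, one_hot))

def squared_error(predicted, one_hot):
    """Sum of squared differences, shown only for comparison."""
    return sum((p - t) ** 2 for p, t in zip(predicted, one_hot))

label = [0.0, 1.0, 0.0]
confident_right = [0.05, 0.90, 0.05]
confident_wrong = [0.90, 0.05, 0.05]

ce_right = cross_entropy(confident_right, label)
ce_wrong = cross_entropy(confident_wrong, label)
se_wrong = squared_error(confident_wrong, label)
```

The squared error of a probability vector against a one-hot label can never exceed 2, while cross-entropy keeps growing as the probability given to the correct class approaches zero, so confident mistakes are punished harder.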
Optimization

Gradient descent works better on mini-batches than on single examples: a gradient averaged over a batch points more reliably towards lower loss
If the accuracy curve is noisy, jumping by 1%, the learning rate is too high
Start with a larger learning rate, e.g. 0.003, and decay it later to 0.0001
Epoch: one full pass over your data (all batches seen once)
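
The points above can be sketched as a training-loop skeleton with a decaying learning rate. The exponential shape of the decay and the decay_speed value are assumptions on my part; only the 0.003 and 0.0001 endpoints come from the notes. The gradient update itself is left out.

```python
import math

def learning_rate(step, max_lr=0.003, min_lr=0.0001, decay_speed=2000.0):
    """Decay exponentially from max_lr towards min_lr as training proceeds."""
    return min_lr + (max_lr - min_lr) * math.exp(-step / decay_speed)

def batches(data, batch_size):
    """Yield consecutive mini-batches; one full pass is one epoch."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

data = list(range(1000))  # stand-in for 1000 training examples
step = 0
for epoch in range(2):
    for batch in batches(data, 100):
        lr = learning_rate(step)
        # a real loop would compute the batch gradient and update the weights here
        step += 1
```

With 1000 examples and batches of 100, one epoch is 10 steps, so the two epochs above run 20 steps in total.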
Overfitting

Overfitting happens when the model has too much freedom: with too many weights and biases, it can store the training data in some form instead of generalizing
The model then works great on training data but fails miserably once it faces test data
If the cross-entropy loss curve on test data looks strange and slowly starts increasing again, the model is probably overfitting
A good remedy for overfitting is regularization, for example dropout
Dropout randomly removes a fraction of the neurons at each training iteration, keeping each one with probability pKeep, e.g. pKeep = 0.75
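
A minimal sketch of dropout with pKeep = 0.75, assuming the common "inverted" variant that rescales the surviving activations by 1/pKeep. The function signature and the seeded random generator are my own choices for the illustration.

```python
import random

def dropout(activations, pkeep=0.75, training=True, rng=random):
    """Inverted dropout: keep each neuron with probability pkeep."""
    if not training:
        return list(activations)  # use every neuron at test time
    out = []
    for a in activations:
        if rng.random() < pkeep:
            out.append(a / pkeep)  # rescale so the expected sum is unchanged
        else:
            out.append(0.0)        # this neuron is dropped for this iteration
    return out

rng = random.Random(0)  # fixed seed so the sketch is reproducible
layer = [1.0] * 10000
dropped = dropout(layer, pkeep=0.75, rng=rng)
kept_fraction = sum(1 for a in dropped if a != 0.0) / len(dropped)
```

Roughly three quarters of the neurons survive each call, and a different random subset is dropped every time, so the network cannot rely on any single neuron.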
CNN Convolutional Neural Networks

Good for 2D data such as images
In the previous example, we flattened each image into a 1D vector of pixels, losing shape information
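
To illustrate why the 2D layout matters, here is a small pure-Python convolution over a toy image. The 4x4 image and the edge-detecting kernel are hypothetical examples; a real CNN learns its kernels and uses library ops rather than nested loops.

```python
def conv2d(image, kernel):
    """Valid 2D convolution (strictly cross-correlation, as in most DL libraries)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

# Hypothetical 4x4 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1]] * 4
edge_kernel = [[-1, 1], [-1, 1]]  # responds to left-to-right brightness jumps
feature_map = conv2d(image, edge_kernel)

# The flattened view used by a dense layer: neighbouring rows are no longer adjacent.
flattened = [p for row in image for p in row]
```

The feature map lights up only in the column where the edge sits, which is the kind of spatial pattern a flattened vector makes hard to see.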

Part 2 - slides

Batch normalization

The intention behind batch normalization is to make network training faster and more stable
The idea is to normalize the inputs of each layer over the mini-batch so that they have a mean of zero and a standard deviation of one
Batch normalization is applied before the activation function
When you use batch normalization, the layer's bias is no longer needed, because batch normalization adds its own learnable offset
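
A rough sketch of the training-time computation for a single neuron, assuming the usual learnable scale (gamma) and offset (beta). The names and the toy batch of four values are mine, and the running averages used at test time are left out.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one neuron's pre-activations over a mini-batch."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    # gamma (scale) and beta (offset) are learned; beta plays the role of the bias
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

def relu(values):
    """ReLU activation applied element by element."""
    return [max(0.0, v) for v in values]

# Hypothetical pre-activations (X * W, no bias) of one neuron for a batch of 4.
logits = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(logits)
activated = relu(normalized)  # batch norm first, activation second

# Adding a constant bias changes nothing: the mean subtraction removes it.
with_bias = batch_norm([v + 3.0 for v in logits])
```

Because subtracting the batch mean cancels any constant added to the pre-activations, a separate bias term has no effect and the offset beta takes over its job.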
RNN Recurrent Neural Network

Good for sequences, for example predicting the next word in a sentence
Unrolled through time, RNNs are always very deep: each step of the sequence adds one more layer
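
The unrolling can be sketched with a scalar RNN cell applied step by step. The cell below is a deliberately minimal tanh cell with hypothetical hand-picked weights; it is not the architecture from the talk.

```python
import math

def rnn_step(x, h, w_x, w_h, b):
    """One step of a minimal scalar RNN cell: new state from input and old state."""
    return math.tanh(w_x * x + w_h * h + b)

def run_rnn(sequence, w_x=0.5, w_h=0.9, b=0.0):
    """Unroll the same cell over the sequence; the state threads through every step."""
    h = 0.0
    states = []
    for x in sequence:  # each step behaves like one more layer of depth
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

# A single input at the first step, then silence.
states = run_rnn([1.0, 0.0, 0.0, 0.0, 0.0])
```

The same weights are reused at every step, and the trace of the first input fades a little with each step, which hints at why plain RNNs struggle to remember things for long.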
LSTM Long Short-Term Memory networks

Designed to track long-term dependencies in sequences
GRU Gated Recurrent Unit networks

GRUs are an improved version of the standard recurrent neural network
The special thing about them is that they can be trained to keep information from long ago without it washing out over time, and to remove information that is irrelevant to the prediction
The GRU controls the flow of information like the LSTM, but without a separate memory cell. It exposes its full hidden state without any output gate
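
A scalar sketch of one GRU step, to show how the gates decide what to keep. The weight names in the dictionary are hypothetical, and the blending convention used here, (1 - z) * old + z * candidate, is one of the two conventions found in the literature; some libraries swap the roles of z and 1 - z.

```python
import math

def sigmoid(v):
    """Squash a value into (0, 1) so it can act as a gate."""
    return 1.0 / (1.0 + math.exp(-v))

def gru_step(x, h, p):
    """One scalar GRU step; p is a dict of hypothetical weights."""
    z = sigmoid(p["wz"] * x + p["uz"] * h + p["bz"])  # update gate: how much to rewrite
    r = sigmoid(p["wr"] * x + p["ur"] * h + p["br"])  # reset gate: how much past to use
    candidate = math.tanh(p["wh"] * x + p["uh"] * (r * h) + p["bh"])
    # Blend old state and candidate; the result is exposed directly as the output.
    return (1.0 - z) * h + z * candidate

# With a strongly negative update-gate bias, z is near 0 and the old state is kept.
keep = dict(wz=0.0, uz=0.0, bz=-20.0, wr=1.0, ur=1.0, br=0.0, wh=1.0, uh=1.0, bh=0.0)
h = 0.8
for x in [0.3, -0.5, 0.9]:
    h = gru_step(x, h, keep)

# Flipping the bias makes z near 1, so the state is overwritten by the candidate.
overwritten = gru_step(0.0, 0.8, dict(keep, bz=20.0))
```

With the update gate closed, the state survives three steps almost untouched, which is the "keep information from long ago" behaviour; there is no separate memory cell and no output gate, unlike in an LSTM.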