My notes from Martin Görner's YouTube talk at Devoxx about neural networks and TensorFlow.
The 3-hour course (video + slides) offers developers a quick introduction to deep learning fundamentals, with some TensorFlow thrown into the bargain. More info
Part 1  slides

Softmax: a good activation function for multi-class logistic regression
Y = softmax(X * W + b)
b: biases; W: weights; X: images flattened into rows of a matrix; softmax applied line by line; Y: predictions
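The formula above can be sketched in NumPy (shapes are assumptions, matching the MNIST-style setup of 28x28 images and 10 classes):

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability, then normalize
    # the exponentials so each row (one image) sums to 1.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

X = np.random.rand(100, 784)   # 100 flattened 28x28 images, one per row
W = np.zeros((784, 10))        # weights (untrained here)
b = np.zeros(10)               # biases
Y = softmax(X @ W + b)         # predictions, softmax applied line by line
```

Each row of Y is a probability distribution over the 10 classes.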
Hidden layers: ReLU outperforms sigmoid
Activation functions  external link 
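For reference, the two activation functions compared above, as a minimal sketch: sigmoid saturates (gradient near zero) for large |x|, while ReLU keeps a constant gradient of 1 for any positive input, which is one reason it trains faster.

```python
import numpy as np

def sigmoid(x):
    # Squashes any input into (0, 1); flat (vanishing gradient) at the extremes.
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Zero for negative inputs, identity for positive inputs.
    return np.maximum(0.0, x)
```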
Loss function
For classification problems, cross-entropy works a bit better than squared-error loss
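A minimal sketch of cross-entropy over a batch of one-hot labels (the epsilon guard is my addition to avoid log(0)):

```python
import numpy as np

def cross_entropy(Y_pred, Y_true):
    # Mean over the batch of -sum(y_true * log(y_pred)).
    return -np.mean(np.sum(Y_true * np.log(Y_pred + 1e-12), axis=1))

Y_true = np.array([[0., 1.]])           # one-hot label: class 1
confident = np.array([[0.1, 0.9]])      # correct, confident prediction
wrong = np.array([[0.9, 0.1]])          # wrong, confident prediction
```

A confident correct prediction gives a much lower loss than a confident wrong one, which is what drives learning.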

Optimization
Gradient descent performs better with mini-batches: the averaged gradient points more reliably towards a lower loss
If the accuracy curve is noisy, jumping around by 1%, the learning rate is too high
Start with a larger learning rate first, e.g. 0.003, and decrease it later to 0.0001
Epoch: one pass through all your data (all batches)
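The decay from 0.003 to 0.0001 can be sketched as an exponential schedule (the 0.003 and 0.0001 endpoints come from the talk; `decay_steps` is an assumed hyperparameter):

```python
import math

def decayed_lr(step, lr_max=0.003, lr_min=0.0001, decay_steps=2000):
    # Exponentially decay from lr_max at step 0 toward lr_min as
    # training progresses.
    return lr_min + (lr_max - lr_min) * math.exp(-step / decay_steps)
```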
Overfitting
Overfitting happens when the model has too much freedom: with too many weights and biases, it can store the training data in some form
The model then works great on training data but fails miserably once it faces test data
If the test cross-entropy curve looks strange and slowly starts increasing while the training loss keeps falling, that is a sign of overfitting
A good solution for overfitting is regularization, e.g. dropout
Dropout randomly zeroes out a fraction of the neurons during training; each neuron is kept with probability pKeep (e.g. pKeep = 0.75)
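A minimal sketch of "inverted" dropout, the variant that rescales surviving activations at training time so nothing changes at test time (the rescaling detail is an assumption about the implementation, not stated in the talk):

```python
import numpy as np

def dropout(activations, pkeep=0.75, training=True):
    # During training, each neuron is kept with probability pkeep and
    # zeroed otherwise; survivors are scaled by 1/pkeep so the expected
    # activation stays the same. At test time the layer does nothing.
    if not training:
        return activations
    mask = np.random.rand(*activations.shape) < pkeep
    return activations * mask / pkeep

a = np.ones(1000)
out = dropout(a, pkeep=0.75)
```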
CNN Convolutional Neural Networks
Good for 2D representations such as images
In the previous example, we flattened the image pixels into a 1D vector, losing shape information
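The core CNN operation can be sketched as a valid-mode 2D convolution over the raw pixel grid (technically cross-correlation, which is what CNN layers compute); the edge-detector kernel below is a hypothetical example:

```python
import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over the image; each output value is the dot
    # product of the kernel with one image patch, so 2D neighborhood
    # structure is preserved instead of being flattened away.
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

img = np.array([[0., 0., 1., 1.],
                [0., 0., 1., 1.],
                [0., 0., 1., 1.]])
edge = conv2d(img, np.array([[-1., 1.]]))   # responds at the vertical edge
```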
Part 2  slides

Batch normalization
The intention behind batch normalization is to speed up and stabilize network training
The idea is to normalize the inputs of each layer so that they have a mean of zero and a standard deviation of one, computed over the batch
Batch normalization happens before the activation function
When you use batch normalization, the bias is no longer needed: the learned shift takes its role
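A minimal sketch of the training-time computation (gamma and beta are the learned scale and shift; in a real layer they are trained, and running statistics are kept for inference):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the batch to mean 0 / variance 1,
    # then apply the learned scale (gamma) and shift (beta).
    # beta plays the role of the bias, which is why a separate bias
    # term becomes redundant.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 4) * 5 + 3         # batch of 32, 4 features
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```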
RNN Recurrent Neural Network
Good for sequences, for example predicting the next word
Unrolled over time, RNNs are effectively very deep networks
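A minimal sketch of one recurrent step makes the depth point concrete: the same cell is applied once per timestep, so a 10-step sequence behaves like a 10-layer network (sizes here are hypothetical):

```python
import numpy as np

def rnn_step(x_t, h_prev, Wx, Wh, b):
    # New state mixes the current input with the previous state;
    # applying this repeatedly is the "unrolled" deep network.
    return np.tanh(x_t @ Wx + h_prev @ Wh + b)

Wx = np.random.randn(3, 5) * 0.1   # input dim 3 -> state dim 5
Wh = np.random.randn(5, 5) * 0.1   # state -> state
b = np.zeros(5)
h = np.zeros(5)
for x_t in np.random.randn(10, 3):  # a sequence of 10 inputs
    h = rnn_step(x_t, h, Wx, Wh, b)
```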
LSTM Long Short Term Memory networks
Good at tracking long-term dependencies

GRU Gated Recurrent Unit networks
GRUs are an improved version of the standard recurrent unit
The special thing about them is that they can be trained to keep information from long ago, without washing it out through time, and to discard information that is irrelevant to the prediction
The GRU controls the flow of information with gates, like the LSTM, but without a separate memory cell: it just exposes the full hidden state, without any output-gate control
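The gating described above can be sketched with one common GRU formulation (matrix names and sizes are hypothetical, not from the talk):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W):
    # W is a dict of parameter matrices (input->hidden 'W*', hidden->hidden 'U*').
    z = sigmoid(x_t @ W['Wz'] + h_prev @ W['Uz'])        # update gate
    r = sigmoid(x_t @ W['Wr'] + h_prev @ W['Ur'])        # reset gate
    h_tilde = np.tanh(x_t @ W['Wh'] + (r * h_prev) @ W['Uh'])  # candidate state
    # z decides how much old state to keep vs. how much candidate to
    # take in; no separate memory cell, the hidden state is exposed directly.
    return (1 - z) * h_prev + z * h_tilde

W = {name: np.random.randn(3 if name.startswith('W') else 4, 4) * 0.1
     for name in ['Wz', 'Wr', 'Wh', 'Uz', 'Ur', 'Uh']}
h = gru_step(np.random.randn(3), np.zeros(4), W)   # input dim 3, state dim 4
```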