Reference 1
Perceptron, 1958.
Activation functions: introduce non-linearity.
Two inputs, one output.
Multiple outputs.
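A minimal sketch of this in tf.keras (assuming TensorFlow 2.x; the sigmoid activation and the toy input values are my own choices for illustration):

    import tensorflow as tf

    # One perceptron: 2 inputs -> weighted sum + bias -> non-linear activation -> 1 output
    perceptron = tf.keras.layers.Dense(units=1, activation='sigmoid')

    x = tf.constant([[2.0, -1.0]])    # a single example with two input features
    y = perceptron(x)                 # builds W, b on first call, then computes sigmoid(xW + b)
    print(y.shape)                    # (1, 1): one output

    # Multi-output version: a dense layer with 3 units gives 3 outputs from the same 2 inputs
    multi_out = tf.keras.layers.Dense(units=3, activation='sigmoid')
    print(multi_out(x).shape)         # (1, 3)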
Initialize weights –> wrong predictions –> big loss.
Multiple data points –> total (empirical) loss over all predictions.
Binary cross-entropy loss: for outputs that should be 0 or 1 (classification).
Mean squared error loss: for outputs that are continuous real numbers (regression).
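A sketch of both losses with tf.keras (the label and prediction values are made-up toy numbers):

    import tensorflow as tf

    # Binary cross-entropy: labels are 0/1, predictions are probabilities
    bce = tf.keras.losses.BinaryCrossentropy()
    print(bce([1.0, 0.0], [0.9, 0.2]).numpy())   # small loss: predictions agree with labels

    # Mean squared error: targets and predictions are continuous real values
    mse = tf.keras.losses.MeanSquaredError()
    print(mse([2.5, 0.0], [2.0, 0.5]).numpy())   # mean of squared differences = 0.25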
Find the weights $W$ that achieve the lowest loss.
Gradient Descent:
==> The key: compute the gradient ==> backpropagation.
Given a loss and a weight, how do we know which way to move the weight to reach the lowest point of the loss function?
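A minimal gradient-descent loop with tf.GradientTape; the quadratic toy loss and the 0.1 learning rate are assumptions for illustration:

    import tensorflow as tf

    w = tf.Variable(5.0)                   # initial weight
    lr = 0.1                               # fixed learning rate

    for step in range(50):
        with tf.GradientTape() as tape:
            loss = (w - 2.0) ** 2          # toy loss J(w), minimized at w = 2
        grad = tape.gradient(loss, w)      # dJ/dw, computed by backpropagation
        w.assign_sub(lr * grad)            # step in the direction opposite the gradient

    print(w.numpy())                       # close to 2.0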
E.g. a learning model with one input, one hidden unit, one output:
input x (layer 0) – w1 (layer 1) – w2 (layer 2) – output $\hat{y}$, loss J(W)
How does a small change in one weight (e.g. $w_2$) affect the final loss J(W)?
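The chain rule answers this for the one-hidden-unit model above (writing $z_1$ for the hidden activation and $\hat{y}$ for the output; these symbols are my own labels):

$$\frac{\partial J(W)}{\partial w_2} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_2}, \qquad \frac{\partial J(W)}{\partial w_1} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}$$

Backpropagation repeats this expansion from the output layer back toward the input layer.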
Dec 2017 paper: "Visualizing the Loss Landscape of Neural Nets" (Li et al.).
Fixed learning rate: the same step size for every update.
Adaptive learning rate: the size of each weight update adapts, depending on how large the gradient is, how fast learning is happening, the size of particular weights, etc.
Algorithms for this span 1952 ~ 2014 (from SGD to Adam); see the sketch below.
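Those algorithms are available as tf.keras optimizers; the learning rates below are illustrative, not values from the source:

    import tensorflow as tf

    sgd     = tf.keras.optimizers.SGD(learning_rate=1e-2)       # plain fixed-rate gradient descent
    adagrad = tf.keras.optimizers.Adagrad(learning_rate=1e-2)   # per-parameter adaptive rates
    rmsprop = tf.keras.optimizers.RMSprop(learning_rate=1e-3)
    adam    = tf.keras.optimizers.Adam(learning_rate=1e-3)      # Adam, Kingma & Ba 2014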
To compute the gradient: use a single data point vs. all the points vs. a subset of points: mini-batch.
Mini-batch: much quicker than the full dataset, and a smoother gradient estimate than a single point.
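A sketch of splitting data into mini-batches with tf.data (the batch size 32 and the random placeholder arrays are assumptions):

    import numpy as np
    import tensorflow as tf

    x_train = np.random.rand(1000, 2).astype('float32')   # placeholder dataset
    y_train = np.random.rand(1000, 1).astype('float32')

    dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
               .shuffle(1000)
               .batch(32))                    # the gradient is estimated on 32 points at a time

    for x_batch, y_batch in dataset.take(1):
        print(x_batch.shape, y_batch.shape)   # (32, 2) (32, 1)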
Regularization: constrain the model to reduce overfitting (e.g. dropout, L2 penalties, early stopping).
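A sketch of two of those techniques, dropout and an L2 weight penalty; the rate 0.5 and coefficient 1e-4 are arbitrary illustrative values:

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu',
                              kernel_regularizer=tf.keras.regularizers.l2(1e-4)),  # L2 penalty on weights
        tf.keras.layers.Dropout(0.5),   # randomly zero 50% of activations during training
        tf.keras.layers.Dense(1),
    ])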
Deep Sequence Modeling
Example: a moving ball, where does it go next? We need a sequence of its positions over time.
A sequence modeling problem: Predict the Next Word.
“This morning I took my cat for a ____”. (Walk)
Fixed window:
-> count words in the entire sequence (bag of words)
==> counts do not preserve order information (see the toy example after this list)
-> a big fixed window:
feed it into a standard feed-forward network
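A tiny illustration of why counts lose order (the sentence pair is a common toy example):

    from collections import Counter

    a = "the food was good not bad at all".split()
    b = "the food was bad not good at all".split()

    print(Counter(a) == Counter(b))   # True: identical word counts, opposite meanings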
Recurrent Neural Network: a feedback loop; it can be viewed as multiple copies of the same sub-network connected in sequence.
Inside, the RNN has a recurrent cell
that is fed the current input as well as its own previous output (the hidden state).
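A minimal sketch of that recurrent cell with explicit weights (the shapes, tanh update, and random weights are illustrative assumptions):

    import tensorflow as tf

    hidden_dim, input_dim = 4, 3
    W_xh = tf.random.normal([input_dim, hidden_dim])    # input -> hidden
    W_hh = tf.random.normal([hidden_dim, hidden_dim])   # previous hidden -> hidden (the feedback loop)
    W_hy = tf.random.normal([hidden_dim, 1])            # hidden -> output

    h = tf.zeros([1, hidden_dim])                       # initial hidden state
    sequence = tf.random.normal([5, 1, input_dim])      # 5 time steps of input

    for x_t in sequence:
        h = tf.tanh(tf.matmul(x_t, W_xh) + tf.matmul(h, W_hh))   # same weights reused at every step
        y_t = tf.matmul(h, W_hy)                                  # output at this time step

tf.keras.layers.SimpleRNN implements essentially this update with learned weights.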
TODO
Vision
For computers, images are just numbers.
(LLM: For humans, images are ???.)
Tasks in computer vision: regression, classification.
Feature detection.
Dense (fully connected) neural network.
Convolution:
Input image –> convolution (filters, feature maps) –> max pooling –> fully connected layer as output.
Three main parts:
Convolution operation: apply filters to generate feature maps; one filter produces one feature map.
e.g. tf.keras.layers.Conv2D(filters=d, kernel_size=(h, w), strides=s)
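Putting convolution, pooling, and the fully connected output together into a small classifier sketch; the filter counts, kernel sizes, input shape, and 10-class softmax output are assumptions:

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(filters=32, kernel_size=(3, 3), strides=1,
                               activation='relu', input_shape=(28, 28, 1)),  # filters -> feature maps
        tf.keras.layers.MaxPool2D(pool_size=(2, 2)),                         # downsample each feature map
        tf.keras.layers.Conv2D(filters=64, kernel_size=(3, 3), activation='relu'),
        tf.keras.layers.MaxPool2D(pool_size=(2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation='softmax'),                     # fully connected output layer
    ])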
The Myth of the Cave: find the hidden (latent) variables even when only the observables are given; finding hidden causes/reasons/laws.
Autoencoders: encoder/decoder.
Input object –> latent code –> reconstructed object.
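A minimal autoencoder sketch; the 784-dimensional input, 32-dimensional latent code, and MSE reconstruction loss are assumptions:

    import tensorflow as tf

    latent_dim = 32

    encoder = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
        tf.keras.layers.Dense(latent_dim),                 # the hidden (latent) variables
    ])
    decoder = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(latent_dim,)),
        tf.keras.layers.Dense(784, activation='sigmoid'),  # reconstructed object
    ])

    autoencoder = tf.keras.Sequential([encoder, decoder])
    autoencoder.compile(optimizer='adam', loss='mse')      # reconstruction loss: input vs. output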
Learning in a dynamic environment.
Supervised vs. unsupervised vs. reinforcement learning.
TODO….
nitty-gritty: the true facts of a matter; the essence.
If you could revise the fundamental principles of computer system design to improve security... what would you change?