A blog on my understanding of LSTM, GRU, and Attention (s0, h1, h2, h3)
LSTM (Long Short Term Memory) Networks
LSTMs are a special type of RNN used for learning long-term dependencies. Plain RNNs struggle to remember information over long spans, so LSTMs add a forget gate (a sigmoid layer) that decides which information in the context (cell state) to keep and which to discard. The sigmoid outputs values between the two extremes, 0 and 1, and the cell state is multiplied by these values: a value near 0 means forget that piece of information, and a value near 1 means remember it.
Step 1
The first step decides what information is going to be thrown away from the cell state. As mentioned above, the forget layer (a sigmoid layer / sigmoid function) is used for this. It looks at h_t-1 and x_t and outputs a value between 0 and 1 for each entry of the cell state: 0 means forget and 1 means retain.
f_t = sigmoid(W_f * [h_t-1, x_t] + b_f)
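To make this concrete, here is a minimal NumPy sketch of the forget gate; the shapes and the random weights W_f, b_f are made up purely for illustration, not taken from a trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)

W_f = rng.standard_normal((hidden_size, hidden_size + input_size))
b_f = np.zeros(hidden_size)

h_prev = rng.standard_normal(hidden_size)   # h_{t-1}
x_t = rng.standard_normal(input_size)       # current input x_t

# f_t = sigmoid(W_f * [h_{t-1}, x_t] + b_f); every entry lies in (0, 1)
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
print(f_t)
```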
Step 2
This step decides what new information is going to be stored in the cell state (context). First, a sigmoid layer (the input gate layer) decides which values to update; then a tanh layer creates a vector of new candidate values Ĉ_t that could be added to the state. The tanh layer squashes its output to the range -1 to 1.
i_t = sigmoid(W_i * [h_t-1, x_t] + b_i)
Ĉ_t = tanh(W_c * [h_t-1, x_t] + b_c)
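The same kind of sketch for the input gate and the candidate values, again with made-up shapes and random stand-ins for W_i, W_c, b_i, b_c.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3
rng = np.random.default_rng(1)

W_i = rng.standard_normal((hidden_size, hidden_size + input_size))
W_c = rng.standard_normal((hidden_size, hidden_size + input_size))
b_i = np.zeros(hidden_size)
b_c = np.zeros(hidden_size)

h_prev = rng.standard_normal(hidden_size)
x_t = rng.standard_normal(input_size)
concat = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]

i_t = sigmoid(W_i @ concat + b_i)        # which entries to update, in (0, 1)
c_tilde = np.tanh(W_c @ concat + b_c)    # candidate values Ĉ_t, in (-1, 1)
print(i_t, c_tilde)
```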
Step 3
The old cell state (context) C_t-1 is updated into the new cell state C_t. We multiply C_t-1 by f_t, forgetting the things we decided to forget earlier, and then add the product of the outputs from the previous step (the input gate layer and the tanh layer).
C_t = f_t * C_t-1 + i_t * Ĉ_t
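The update itself is just elementwise arithmetic; a tiny sketch with made-up gate outputs shows the combination.

```python
import numpy as np

f_t = np.array([0.9, 0.1, 0.5, 0.7])        # forget gate output
i_t = np.array([0.2, 0.8, 0.4, 0.3])        # input gate output
c_tilde = np.array([0.5, -0.6, 0.1, 0.9])   # candidate values Ĉ_t
c_prev = np.array([1.0, -1.0, 0.3, 0.0])    # old cell state C_{t-1}

# C_t = f_t * C_{t-1} + i_t * Ĉ_t  (elementwise)
c_t = f_t * c_prev + i_t * c_tilde
print(c_t)
```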
Step 4
In this final step, we decide what we are going to output. First, a sigmoid layer decides which parts of the cell state we are going to output. Then we pass the cell state through tanh, squashing the values to between -1 and 1, and multiply the result by the output of the sigmoid gate.
o_t = sigmoid(W_o * [h_t-1, x_t] + b_o)
h_t = o_t * tanh(C_t)
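Putting the four steps together: below is a minimal sketch of one LSTM cell step in NumPy. The helper name lstm_step and the random weights are illustrative only; a real model would learn W_f, W_i, W_c, W_o and the biases from data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    concat = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ concat + b_f)        # Step 1: forget gate
    i_t = sigmoid(W_i @ concat + b_i)        # Step 2: input gate
    c_tilde = np.tanh(W_c @ concat + b_c)    #         candidate values
    c_t = f_t * c_prev + i_t * c_tilde       # Step 3: new cell state
    o_t = sigmoid(W_o @ concat + b_o)        # Step 4: output gate
    h_t = o_t * np.tanh(c_t)                 #         new hidden state
    return h_t, c_t

hidden_size, input_size = 4, 3
rng = np.random.default_rng(2)
shape = (hidden_size, hidden_size + input_size)
W_f, W_i, W_c, W_o = (rng.standard_normal(shape) for _ in range(4))
b_f = b_i = b_c = b_o = np.zeros(hidden_size)

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.standard_normal((5, input_size)):   # a toy sequence of 5 steps
    h, c = lstm_step(x_t, h, c, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o)
print(h)
```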
GRU ( Gated Recurrent Unit)
- The GRU is a variant of the LSTM.
- The forget and input gates are combined into a single gate (the update gate).
- The cell state and the hidden state are merged.
- Since the GRU has one gate fewer than the LSTM, it trains a bit faster.
z_t = sigmoid(W_z * [h_t-1, x_t])
r_t = sigmoid(W_r * [h_t-1, x_t])
ĥ_t = tanh(W * [r_t * h_t-1, x_t])
h_t = (1 - z_t) * h_t-1 + z_t * ĥ_t
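For completeness, a corresponding sketch of one GRU step, following the equations as written (so no bias terms). Again, the helper name gru_step and the random weights are purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W):
    concat = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ concat)                                 # update gate
    r_t = sigmoid(W_r @ concat)                                 # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))  # candidate ĥ_t
    return (1 - z_t) * h_prev + z_t * h_tilde                   # new hidden state

hidden_size, input_size = 4, 3
rng = np.random.default_rng(3)
shape = (hidden_size, hidden_size + input_size)
W_z, W_r, W = (rng.standard_normal(shape) for _ in range(3))

h = np.zeros(hidden_size)
for x_t in rng.standard_normal((5, input_size)):   # a toy sequence of 5 steps
    h = gru_step(x_t, h, W_z, W_r, W)
print(h)
```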
Attention([s0, h1, h2, h3])
The basic idea here is to pay attention to only some of the input words: each time the model predicts an output word, it uses only the parts of the input where the most relevant information is concentrated, instead of the entire sentence. The context vector is computed as a weighted sum of the annotations generated by the encoder.
In the image above, s0 (the decoder state) and h1, h2, h3 (the encoder annotations) are what the attention layer looks at to decide those weights.
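As a rough sketch of that idea: score the decoder state s0 against each annotation h1, h2, h3, turn the scores into weights with a softmax, and take the weighted sum as the context vector. A plain dot-product score is assumed here for simplicity; the scoring function in the original attention mechanism is a small feed-forward network, but the weighting and summing work the same way.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(4)
s0 = rng.standard_normal(4)        # decoder state s0
H = rng.standard_normal((3, 4))    # annotations h1, h2, h3 (one per row)

scores = H @ s0                    # how relevant each h_i is to s0
alpha = softmax(scores)            # attention weights, sum to 1
context = alpha @ H                # context vector: weighted sum of annotations
print(alpha, context)
```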