In a recurrent neural network, we predict the output using both the current input and the previous hidden state, which holds information about what the network has seen so far. In a feedforward network, by contrast, we predict the output using only the current input.
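The difference can be sketched with a single scalar unit. This is a minimal illustration, not a full network: the function names and the weight values (`W_IN`, `W_REC`, `B`) are assumptions chosen only to make the point visible.

```python
import math

def feedforward_step(x, w_in, b):
    """Feedforward unit: the output depends on the current input only."""
    return math.tanh(w_in * x + b)

def rnn_step(x, h_prev, w_in, w_rec, b):
    """Recurrent unit: the output also depends on the previous hidden state."""
    return math.tanh(w_in * x + w_rec * h_prev + b)

# Hypothetical scalar weights, chosen only for illustration.
W_IN, W_REC, B = 0.5, 0.8, 0.0

def run_rnn(inputs, h0=0.0):
    """Run the recurrent unit over a sequence and return every hidden state."""
    h, states = h0, []
    for x in inputs:
        h = rnn_step(x, h, W_IN, W_REC, B)
        states.append(h)
    return states

# The same current input (1.0) arriving after two different histories.
h_after_zeros = run_rnn([0.0, 0.0, 1.0])[-1]
h_after_ones = run_rnn([1.0, 1.0, 1.0])[-1]

# The feedforward unit sees only the current input, so history cannot matter.
ff_out = feedforward_step(1.0, W_IN, B)
```

Here `h_after_ones` differs from `h_after_zeros` even though the last input is identical, because the hidden state carries the earlier inputs forward. The feedforward output `ff_out` would be the same for either history.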
Suppose we initialize the weights of the network randomly with small values. During backpropagation through time, we multiply the gradient by the derivative of the hidden layer's activation and by the recurrent weights at every step while moving backward.
Both the derivative and the weights are small numbers (the derivative of tanh is at most 1, and that of sigmoid is at most 0.25). Multiplying two numbers smaller than 1 gives an even smaller number, so repeating this multiplication at every step shrinks the gradient exponentially with the number of time steps until it is vanishingly small. This is called the vanishing gradient problem.
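The repeated multiplication can be made concrete with a scalar tanh RNN. The sketch below is an illustration under assumed values: `gradient_through_time` is a hypothetical helper, and the weights 0.5 and the 50-step input are arbitrary. It tracks how the derivative of the hidden state at step t with respect to the initial hidden state shrinks as t grows.

```python
import math

def gradient_through_time(inputs, w_in, w_rec):
    """Return d h_t / d h_0 for t = 1..T in a scalar tanh RNN.

    By the chain rule, each step contributes the factor
    w_rec * (1 - h_t**2): the recurrent weight times the tanh derivative.
    """
    h, grad, grads = 0.0, 1.0, []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        grad *= w_rec * (1.0 - h * h)
        grads.append(grad)
    return grads

# Assumed small weights and a constant input over 50 time steps.
grads = gradient_through_time([1.0] * 50, w_in=0.5, w_rec=0.5)

for t in (1, 10, 25, 50):
    print(f"step {t:2d}: d h_t / d h_0 = {grads[t - 1]:.3e}")
```

Every factor is below 0.5 in magnitude here, so the gradient after 50 steps is smaller than 0.5 to the power 50, which is far too small to drive a useful weight update for the earliest time steps.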
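We can mitigate the vanishing gradient problem by using the ReLU activation function instead of tanh or sigmoid, since the derivative of ReLU is 1 for positive inputs and therefore does not shrink the gradient. We can also alleviate the vanishing gradient problem by using a variant of the RNN called the long short-term memory (LSTM) network.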
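To see why the choice of activation matters, the following sketch compares the product of activation derivatives over many steps. It is a simplified illustration: the pre-activation value 2.0 and the 30 steps are assumptions, and the weights are left out so that only the activation's contribution is visible.

```python
import math

def sigmoid_grad(z):
    """Derivative of the sigmoid; never larger than 0.25."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def tanh_grad(z):
    """Derivative of tanh; never larger than 1, and small away from 0."""
    return 1.0 - math.tanh(z) ** 2

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

def product_over_steps(grad_fn, z, steps):
    """Multiply the same activation derivative over a number of time steps."""
    result = 1.0
    for _ in range(steps):
        result *= grad_fn(z)
    return result

# Assumed positive pre-activation, held constant for 30 steps.
Z, STEPS = 2.0, 30
sigmoid_product = product_over_steps(sigmoid_grad, Z, STEPS)
tanh_product = product_over_steps(tanh_grad, Z, STEPS)
relu_product = product_over_steps(relu_grad, Z, STEPS)
```

With a positive pre-activation, `relu_product` stays at 1 while `sigmoid_product` and `tanh_product` collapse toward zero. Note that ReLU passes no gradient at all for non-positive inputs, so it reduces the problem rather than removing it in every case.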
The sequence-to-sequence (seq2seq) model is a many-to-many RNN architecture. It is widely used in applications where we need to map an arbitrary-length input sequence to an arbitrary-length output sequence. Examples include music generation and chatbots.
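The data flow of such a model can be sketched as an encoder that folds the input sequence into a single context value and a decoder that unrolls from that context. This is only a toy illustration with untrained scalar weights. All names and values below are assumptions, and a practical decoder would normally stop on an end-of-sequence token rather than after a fixed `num_steps`.

```python
import math

# Hypothetical, untrained scalar weights; a real model learns these.
ENC_W_IN, ENC_W_REC = 0.5, 0.8
DEC_W_IN, DEC_W_REC, DEC_W_OUT = 0.3, 0.9, 1.5

def encode(inputs):
    """Fold the whole input sequence into one context value."""
    h = 0.0
    for x in inputs:
        h = math.tanh(ENC_W_IN * x + ENC_W_REC * h)
    return h

def decode(context, num_steps, start_token=0.0):
    """Unroll the decoder from the context, feeding each output back in."""
    h, y, outputs = context, start_token, []
    for _ in range(num_steps):
        h = math.tanh(DEC_W_IN * y + DEC_W_REC * h)
        y = DEC_W_OUT * h
        outputs.append(y)
    return outputs

source = [0.2, -0.1, 0.7, 0.4, 0.9]           # input sequence of length 5
target = decode(encode(source), num_steps=3)  # output sequence of length 3
```

The input and output lengths are independent: the encoder accepts a sequence of any length, and the decoder can generate as many steps as required from the same context.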