Let’s dive in. A standard dense layer assumes no temporal order. It doesn't know that the word following "I ate" is likely food-related, or that yesterday's stock price influences today's. RNNs solve this with a hidden state — a vector that gets passed from one time step to the next. The Simple RNN (Vanilla RNN) The simplest form has a loop. At each time step t , it takes the current input x_t and the previous hidden state h_t-1 , and produces a new hidden state h_t .
Vanilla RNNs suffer from the vanishing/exploding gradient problem — they can't learn long-range dependencies (e.g., information from 50 steps ago). This is where LSTM and GRU come in. LSTM (Long Short-Term Memory) LSTMs introduce a cell state (a conveyor belt of information) and three gates: forget, input, and output. These gates learn what to remember, what to write, and what to output.
Recurrent Neural Networks (RNNs) are the powerhouse behind most modern breakthroughs in sequence data—think speech recognition, machine translation, time series forecasting, and even music generation. While standard neural networks treat each input as independent, RNNs have a "memory" that captures information from previous steps. Let’s dive in
h_t = T.tanh(T.dot(x_t, W_xh) + T.dot(h_prev, W_hh) + b_h)
They can remember information for hundreds of steps, making them ideal for text generation, speech recognition, and complex time series. GRU (Gated Recurrent Unit) GRUs are a simpler, faster alternative to LSTMs. They merge the forget and input gates into a single "update gate" and combine the cell state with the hidden state. GRUs perform similarly to LSTMs on many tasks but with fewer parameters. RNNs solve this with a hidden state —
from keras.models import Sequential from keras.layers import LSTM, GRU, SimpleRNN, Dense, Embedding from keras.preprocessing import sequence max_features = 20000 maxlen = 100 # truncate reviews to 100 words batch_size = 32 Build model model = Sequential() model.add(Embedding(max_features, 128, input_length=maxlen)) model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2)) # or GRU(128) model.add(Dense(1, activation='sigmoid')) Compile (Theano backend) model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) Train model.fit(x_train, y_train, batch_size=batch_size, epochs=5, validation_data=(x_val, y_val))
h_t = tanh(W_x * x_t + W_h * h_t-1 + b)
In Python (with Theano-style tensors), a naive implementation looks like: