Gated recurrent unit (GRU) was introduced by Cho et al. in 2014 to solve the vanishing gradient problem faced by standard recurrent neural networks (RNN). GRU shares many properties with long short-term memory (LSTM); both algorithms use a gating mechanism to control the memorization process.
Interestingly, GRU is less complex than LSTM and is significantly faster to compute. In this guide you will be using the Bitcoin Historical Dataset, tracing trends for 60 days to predict the price on the 61st day. If you don't already have a basic knowledge of LSTM, I would recommend reading Understanding LSTM to get a brief idea about the model.
What makes GRU special and more effective than traditional RNN?
GRU uses gating and a hidden state to control the flow of information. To address the problem that comes up in a standard RNN, GRU uses two gates: the update gate and the reset gate.
You can think of these gates as vectors whose entries lie between 0 and 1 and that are used to form convex combinations. These combinations decide which hidden-state information should be updated (passed along) and when the hidden state should be reset. In this way, the network learns to skip irrelevant temporary observations.
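As a quick numeric illustration (the values here are arbitrary): if a gate entry is 0.9 and the two values it blends are 4.0 and 8.0, the convex combination is

0.9 × 4.0 + (1 − 0.9) × 8.0 = 3.6 + 0.8 = 4.4,

which always lies between the two inputs; the gate simply decides how much of each one survives.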
LSTM consists of three gates: the input gate, the forget gate, and the output gate. Unlike LSTM, GRU does not have an output gate and combines the input and the forget gate into a single update gate.
Let's learn more about the update and reset gates.
The update gate (z_t) is responsible for determining how much of the previous information (from prior time steps) needs to be passed along to the next state. It is an important unit. The schema below shows the arrangement of the update gate.
Here, x_t is the input vector fed into the network unit. It is multiplied by its weight matrix W_z. The t-1 in h_{t-1} signifies that it holds the information from the previous unit, and it is multiplied by its own weight matrix. Next, the two products are added together and passed through the sigmoid activation function, which squashes the result to values between 0 and 1.
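Putting this into a single formula, with W_z and U_z denoting the input and recurrent weight matrices (bias terms omitted for brevity), the update gate is commonly written as:

z_t = σ(W_z x_t + U_z h_{t-1})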
The reset gate (r_t) is used by the model to decide how much of the past information to forget. Its formula is the same as that of the update gate; the difference lies in the weights and in how the gate is used, which is discussed in the following section. The schema below represents the reset gate.
There are two inputs, x_t and h_{t-1}. They are multiplied by their weights, added point-wise, and passed through the sigmoid function.
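In formula form (again with the bias omitted), the reset gate differs from the update gate only in its weights:

r_t = σ(W_r x_t + U_r h_{t-1})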
The reset gate controls how much relevant information from the past time step flows into the new memory content. First, the input vector and the previous hidden state are multiplied by their weights. Then an element-wise (Hadamard) product is taken between the reset gate and the weighted previous hidden state. The two terms are summed, a non-linear activation function (tanh) is applied to the result, and this produces h'_t.
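Written out, with W_h and U_h as the candidate-state weights and ⊙ denoting the Hadamard product, this step can be expressed as:

h'_t = tanh(W_h x_t + r_t ⊙ (U_h h_{t-1}))

(Some formulations apply the reset gate to h_{t-1} before the multiplication by U_h; both variants appear in practice.)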
Consider a scenario in which a customer reviews a resort: "It was late at night when I reached here." After a couple of lines, the review ends with, "I enjoyed the stay as the room was comfortable. The staff was friendly." To determine the customer's satisfaction level, you only need the last two lines of the review. The model will scan the whole review to the end and assign a reset gate vector value close to 0.
That means it will neglect the past lines and focus only on the last sentences.
Refer to the illustration below.
This is the last step. For the final memory at the current time step, the network needs to calculate h_t. Here, the update gate plays a vital role: this vector holds the information for the current unit and passes it down the network. It determines what to collect from the current memory content (h'_t) and from the previous time step (h_{t-1}). An element-wise (Hadamard) product is taken between the update gate and h_{t-1}, and it is summed with the Hadamard product of (1 - z_t) and h'_t.
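In other words, the final hidden state is the convex combination mentioned earlier:

h_t = z_t ⊙ h_{t-1} + (1 - z_t) ⊙ h'_t

(Some references swap the roles of z_t and (1 - z_t); the convention here follows the description above.)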
Revisiting the example of the resort review: this time the relevant information for the prediction is mentioned at the beginning of the text. The model would set the update gate vector value close to 1, so at the current time step (1 - z_t) will be close to 0 and the model will ignore most of the last part of the review. Refer to the image below.
Following through, you can see that z_t is used to calculate (1 - z_t), which is combined with h'_t to produce part of the result. A Hadamard product is carried out between h_{t-1} and z_t, and the output of that product is fed into a point-wise addition with the (1 - z_t) ⊙ h'_t term to produce the final result in the hidden state.
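To make the flow of information concrete, here is a minimal NumPy sketch of a single GRU step that follows the equations above. The weight matrices are random, hypothetical stand-ins for what a trained layer would learn, and biases are omitted for brevity; the Keras GRU layer used below handles all of this internally (its exact gate conventions may differ slightly).

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W_h, U_h):
    # Update gate: how much of the previous state to keep
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev)
    # Reset gate: how much of the previous state to expose to the candidate
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)
    # Candidate memory content h'_t
    h_cand = np.tanh(W_h @ x_t + r_t * (U_h @ h_prev))
    # Final hidden state: convex combination of previous state and candidate
    return z_t * h_prev + (1 - z_t) * h_cand

# Toy usage: 1 input feature, hidden size 4, random (untrained) weights
rng = np.random.default_rng(0)
hidden, features = 4, 1
W_z, W_r, W_h = (rng.standard_normal((hidden, features)) for _ in range(3))
U_z, U_r, U_h = (rng.standard_normal((hidden, hidden)) for _ in range(3))
h = np.zeros(hidden)
for x in [0.5, 0.1, -0.3]:  # a tiny input sequence
    h = gru_step(np.array([x]), h, W_z, U_z, W_r, U_r, W_h, U_h)
print(h)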
Note: Refer to the code for importing the required libraries and for data pre-processing from this guide before building the GRU model.
From the Keras layers API, import the GRU layer class, the regularization layer Dropout, and the core layer Dense.
In the first layer, which has 50 units, return_sequences is set to True so that it returns the full sequence of 50-dimensional vectors. In the next layer, return_sequences is set to False, so it outputs a single vector of dimension 100.
from sklearn.metrics import mean_absolute_error
from keras.models import Sequential
from keras.layers import Dense, GRU, Dropout

model = Sequential()
# First GRU layer: 50 units, return the full sequence for the next GRU layer
model.add(GRU(50, return_sequences=True, input_shape=(x_train.shape[1], 1)))
model.add(Dropout(0.2))
# Second GRU layer: 100 units, return only the last output vector
model.add(GRU(100, return_sequences=False))
model.add(Dropout(0.2))
# Single linear output: the predicted closing price
model.add(Dense(1, activation="linear"))

model.compile(loss="mean_squared_error", optimizer="rmsprop")

model.fit(x_train, y_train, epochs=50, batch_size=64)
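If you want to verify the effect of return_sequences, printing the model summary should show the first GRU layer emitting a 3D sequence of shape (None, 60, 50) (assuming the 60-day window used in this guide) and the second GRU layer a single 100-dimensional vector:

model.summary()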
Now the test data is prepared for prediction.
# Take the last `timestep` training values plus the test period, then scale
inputs = data[len(data) - len(close_test) - timestep:]
inputs = inputs.values.reshape(-1, 1)
inputs = scaler.transform(inputs)
# Build sliding windows of `timestep` observations for the test set
x_test = []
for i in range(timestep, inputs.shape[0]):
    x_test.append(inputs[i - timestep:i, 0])
x_test = np.array(x_test)
# Reshape to (samples, timesteps, features) as expected by the GRU layer
x_test = x_test.reshape(x_test.shape[0], x_test.shape[1], 1)
Let's apply the model to the test data. Time to predict.
# Predict on the test windows and map the output back to the original price scale
predicted_data = model.predict(x_test)
predicted_data = scaler.inverse_transform(predicted_data)
# Reshape the true closing prices for plotting
data_test = np.array(close_test)
data_test = data_test.reshape(len(data_test), 1)
# Plot the true and predicted closing prices
plt.figure(figsize=(8, 4), dpi=80, facecolor='w', edgecolor='k')
plt.plot(data_test, color="r", label="true-result")
plt.plot(predicted_data, color="g", label="predicted-result")
plt.legend()
plt.title("GRU")
plt.xlabel("Time (60 days)")
plt.ylabel("Close Values")
plt.grid(True)
plt.show()
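Since mean_absolute_error was imported at the start of the model-building code, you can also put a number on the gap between the two curves; a minimal check on the original price scale:

mae = mean_absolute_error(data_test, predicted_data)
print("Test MAE:", round(float(mae), 2))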
To sum up, GRU outperformed the traditional RNN. Compared with LSTM, GRU uses fewer tensor operations and therefore takes less time to train, yet the results of the two are often almost the same. There isn't a clear answer as to which variant performs better; usually, you can try both algorithms and see which one works better for your data.
This guide was a brief walkthrough of GRU and the gating mechanism it uses to filter and store information. The model doesn't let information fade away; it keeps the relevant information and passes it down to the next time step, which helps it avoid the problem of vanishing gradients. LSTM and GRU are state-of-the-art models. If trained carefully, they perform exceptionally well in complex scenarios like speech recognition and synthesis and natural language processing.
For further study, I recommend reading this paper, which clearly explains the distinction between GRU and LSTM.