the hidden layer feeds its state back into itself. That is, if we have 4 recurrent nodes in the hidden layer, each has 4 input connections (one from each of the other nodes plus one from itself) that carry the hidden state of the previous time-step. That gives us a (state_size x state_size) matrix of connections whose weights we need to learn, alongside the hidden state itself (one value per node), which we need to store between time-steps.
However, the state of the hidden layer is *also* determined by the inputs. This means we need one more connection into each node (for a single input value per time-step), or state_size more connections in total. This is where we get the ((state_size + 1) x state_size) matrix of weights: (state_size x state_size) of them correspond to the previous time-step of the hidden layer, and (1 x state_size) correspond to the input from our input sequence.
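
To make the shapes concrete, here is a minimal NumPy sketch of one recurrent step. It assumes a single scalar input per time-step, a tanh activation, and a bias vector `b`; none of these are specified above, and the names `W`, `b`, and `rnn_step` are purely illustrative. Putting the input weights in the first row of `W` is also an arbitrary choice.

```python
import numpy as np

state_size = 4  # number of recurrent nodes in the hidden layer

rng = np.random.default_rng(0)

# Combined weight matrix (assumed layout): row 0 holds the input weights,
# the remaining state_size rows hold the recurrent (hidden-to-hidden) weights.
W = rng.normal(size=(state_size + 1, state_size))
b = np.zeros(state_size)  # bias term, an assumption not mentioned in the text


def rnn_step(x_t, h_prev):
    """One time-step: join the scalar input with the previous hidden state."""
    combined = np.concatenate(([x_t], h_prev))  # shape: (state_size + 1,)
    return np.tanh(combined @ W + b)            # shape: (state_size,)


h = np.zeros(state_size)     # initial hidden state
for x_t in [1.0, 0.0, 1.0]:  # a toy input sequence
    h = rnn_step(x_t, h)

print(W.shape)  # (5, 4)
print(h.shape)  # (4,)
```

Slicing `W[:1]` and `W[1:]` recovers the (1 x state_size) input weights and the (state_size x state_size) recurrent weights separately, so the single combined matrix is just a convenient way to apply both in one multiplication.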