I’m not sure I understand your implementation of the “Separate Target Network” correctly. In the paper, the authors use the target network to compute the Q-target values that are then used for training the main network. This happens at every step, for a given mini-batch of samples. The target network itself is only updated every “U” steps, by copying the weights of the main network.
In your implementation, it seems like you do everything at once: all of the sampling, training, and target-update code sits in one block that is only executed every “U” steps (see the code after this line):
if total_steps % (update_freq) == 0:
Unless I missed something, no training happens at the other steps. All that happens at the other steps is that the experience_buffer gets filled. Is that really the intended behavior?
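For comparison, here is a minimal sketch of the loop structure I understand the paper to describe: a training step on a mini-batch at every environment step, with the target network synced only every U steps. All names here (DummyEnv, train_step, sync_target_network, UPDATE_FREQ, ...) are illustrative placeholders I made up, not taken from your code; the networks are replaced by call counters just to make the schedule visible.

```python
import random
from collections import deque

UPDATE_FREQ = 100   # "U": steps between target-network syncs
BATCH_SIZE = 32

class DummyEnv:
    """Stand-in environment that returns random transitions."""
    def reset(self):
        return 0
    def step(self, action):
        # (next_state, reward, done)
        return random.randint(0, 9), 1.0, random.random() < 0.05

train_calls = 0
sync_calls = 0

def train_step(batch):
    # In a real agent: compute Q-targets with the target network,
    # then fit the main network on them.
    global train_calls
    train_calls += 1

def sync_target_network():
    # In a real agent: copy the main network's weights into the
    # target network, e.g. target_net.set_weights(main_net.get_weights())
    global sync_calls
    sync_calls += 1

def run(num_steps):
    env = DummyEnv()
    buffer = deque(maxlen=10_000)
    state = env.reset()
    for total_steps in range(1, num_steps + 1):
        action = 0  # epsilon-greedy selection in a real agent
        next_state, reward, done = env.step(action)
        buffer.append((state, action, reward, next_state, done))
        state = env.reset() if done else next_state

        # Training happens at EVERY step once the buffer is warm ...
        if len(buffer) >= BATCH_SIZE:
            train_step(random.sample(buffer, BATCH_SIZE))

        # ... but the target network is synced only every U steps.
        if total_steps % UPDATE_FREQ == 0:
            sync_target_network()

run(1000)
```

After 1000 steps, train_step has run at (almost) every step, while sync_target_network has run only 1000 / U times; in your version, as far as I can tell, both would be gated behind the same `total_steps % (update_freq) == 0` check.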