Thank you for sharing this awesome post! I really enjoyed reading it and learned a ton. I wonder: what if we used a different decoder architecture that first attends over the encoder outputs to produce a sequence of attended context vectors, and then runs an RNN over that sequence? I think it's a similar idea, and it might be easier to implement. Please correct me if I'm wrong. Thanks!
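To make the idea concrete, here is a minimal numpy sketch of what I mean (all the names, shapes, and the dot-product scoring are my own assumptions, just for illustration): attention first produces one context vector per decoder step, and only then does a plain RNN run over those contexts.

```python
import numpy as np

rng = np.random.default_rng(0)

T_enc, T_dec, d = 5, 3, 4                   # encoder steps, decoder steps, hidden size
enc_outputs = rng.normal(size=(T_enc, d))   # encoder hidden states (hypothetical values)
dec_queries = rng.normal(size=(T_dec, d))   # e.g. embedded target tokens (hypothetical)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Step 1: attention first — each query attends over all encoder outputs,
# yielding a sequence of attended context vectors.
scores = dec_queries @ enc_outputs.T        # (T_dec, T_enc) dot-product scores
weights = softmax(scores, axis=-1)          # rows sum to 1
contexts = weights @ enc_outputs            # (T_dec, d) attended vectors

# Step 2: run a simple Elman-style RNN over the attended sequence.
W_x = rng.normal(size=(d, d)) * 0.1
W_h = rng.normal(size=(d, d)) * 0.1
h = np.zeros(d)
states = []
for c in contexts:
    h = np.tanh(c @ W_x + h @ W_h)          # recurrence over the contexts
    states.append(h)
states = np.stack(states)                   # (T_dec, d) final decoded states
print(states.shape)
```

The contrast with the usual attentional decoder is that here the attention weights don't depend on the RNN state at all, so the whole context sequence can be computed in one batched matrix product before the recurrence starts.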