Why do you think the CNN+LSTM performed so badly?

I think a network architecture of this type should ultimately perform best (or nearly best) too. However, getting it to train properly likely takes a lot more fine-tuning of parameters, and perhaps multiple GPUs to handle the memory required. It is definitely worth a lot more exploration to validate.

