Great article! It’s weirdly comforting to hear that my team weren’t the only ones who ran into these kinds of issues with AWS-hosted Elasticsearch.
The issues you describe were exactly why we decided to bite the bullet and migrate to a cluster we managed ourselves, just using stock ES 2.4 on EC2 boxes. The up-front investment in engineering and ops time was substantial, but it’s paid for itself in spades, not least because we can actually run script_score queries. Our median query time on a cluster with the same number and type of nodes has also dropped by literally half, a massive gain for our system, which has strict latency requirements and needs near-constant uptime.
Today, if you were to ask me whether it was worth it at all to use AWS-hosted Elasticsearch, I would have to say no. As someone who works at a startup and wears both a distributed-systems dev hat and a devops hat with some regularity, the only thing worse than an operational headache is an operational headache that you can’t debug, because you not only don’t own the system but are specifically, deliberately barred from gaining any insight into it. (Er. Right. /rant)