Using More Like This Query (MLT) — ElasticSearch

André Coelho
Jul 1, 2022

A More Like This (MLT) query is an ElasticSearch feature that finds documents similar to a given input. It selects the most representative terms from the input, builds a new query from them, and runs that query against the index.

First of all, to test the query we need to create a test index. We’ll use our movies index; you can access our insert script here.

Now with our index available, let’s get to work!

Let’s Start

More Like This has several parameters. Let’s talk a little about them.

Input parameters

  • like: The only required parameter of the MLT query. It follows a versatile syntax in which you can specify free-form text and/or one or more documents (see the examples below). The syntax for specifying documents is similar to the one used by the Multi GET API. When documents are specified, the text is fetched from the fields listed in fields unless overridden in each document request. The text is analyzed by the analyzer configured for each field, but this can also be overridden, using a syntax similar to the per_field_analyzer parameter of the Term Vectors API. Additionally, artificial documents are supported, so you can provide documents that are not necessarily present in the index.
  • unlike: The unlike parameter is used in conjunction with like in order not to select terms found in a chosen set of documents. In other words, we could ask for documents like: “Apple”, but unlike: “cake crumble tree”. The syntax is the same as like.
  • fields: A list of fields to fetch and analyze the text from. Defaults to the index.query.default_field index setting, which has a default value of *. The * value matches all fields eligible for term-level queries, excluding metadata fields.
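
To make the input parameters concrete, here is a minimal Python sketch that assembles an MLT request body combining free text, an artificial document, and an unlike clause. The helper name build_mlt_body is hypothetical, the sample texts are invented for illustration, and the idx_movies index and field names are borrowed from the examples later in this article.

```python
import json


def build_mlt_body(fields, like, unlike=None):
    """Assemble a search body containing a more_like_this query.

    Hypothetical helper for illustration; it only builds a dict and
    does not talk to a cluster.
    """
    mlt = {"fields": list(fields), "like": list(like)}
    if unlike:
        # "unlike" uses the same syntax as "like"
        mlt["unlike"] = list(unlike)
    return {"query": {"more_like_this": mlt}}


body = build_mlt_body(
    fields=["title", "description"],
    like=[
        # free-form text
        "heroes",
        # artificial document: supplied inline under "doc", not stored in the index
        {
            "_index": "idx_movies",
            "doc": {"description": "A team of heroes saves the galaxy."},
        },
    ],
    # terms found in this text should not be selected
    unlike=["robot"],
)

print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of a GET /idx_movies/_search request.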

Term selection parameters

  • max_query_terms: The maximum number of query terms that will be selected. Increasing this value gives greater accuracy at the expense of query execution speed. Defaults to 25.
  • min_term_freq: The minimum term frequency below which the terms will be ignored from the input document. Defaults to 2.
  • min_doc_freq: The minimum document frequency below which the terms will be ignored from the input document. Defaults to 5.
  • max_doc_freq: The maximum document frequency above which the terms will be ignored from the input document. This could be useful in order to ignore highly frequent words such as stop words. Defaults to unbounded (Integer.MAX_VALUE, which is 2³¹-1 or 2147483647).
  • min_word_length: The minimum word length below which the terms will be ignored. Defaults to 0.
  • max_word_length: The maximum word length above which the terms will be ignored. Defaults to unbounded (0).
  • stop_words: An array of stop words. Any word in this set is considered “uninteresting” and ignored. If the analyzer allows for stop words, you might want to tell MLT to explicitly ignore them, as for the purposes of document similarity it seems reasonable to assume that “a stop word is never interesting”.
  • analyzer: The analyzer that is used to analyze the free form text. Defaults to the analyzer associated with the first field in fields.
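
To see how these parameters interact, the following Python sketch filters and ranks the terms of an input text. It is a simplified illustration, not the actual Lucene implementation: the function select_terms, the document frequencies, and the idf formula are all assumptions made for the example. The defaults mirror the ones listed above.

```python
import math
from collections import Counter


def select_terms(tokens, doc_freq, num_docs, *, max_query_terms=25,
                 min_term_freq=2, min_doc_freq=5, max_doc_freq=2**31 - 1,
                 min_word_length=0, max_word_length=0, stop_words=()):
    """Simplified sketch of MLT term selection (hypothetical, not Lucene code)."""
    stop = set(stop_words)
    scored = []
    for term, tf in Counter(tokens).items():
        df = doc_freq.get(term, 0)
        if tf < min_term_freq:
            continue  # too rare in the input
        if df < min_doc_freq or df > max_doc_freq:
            continue  # too rare or too common in the index
        if len(term) < min_word_length:
            continue  # too short
        if max_word_length and len(term) > max_word_length:
            continue  # too long (0 means unbounded)
        if term in stop:
            continue  # explicitly ignored
        idf = math.log(1 + num_docs / (df + 1))  # assumed idf formula
        scored.append((tf * idf, term))
    # keep the highest-scoring terms, at most max_query_terms of them
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_query_terms]]


tokens = "the heroes save the galaxy and the heroes win".split()
# made-up document frequencies for an index of 1000 documents
doc_freq = {"the": 990, "heroes": 7, "save": 40, "galaxy": 12, "and": 950, "win": 3}

selected = select_terms(tokens, doc_freq, 1000, min_term_freq=1, max_doc_freq=500)
print(selected)  # ['heroes', 'galaxy', 'save']
```

In this toy run, "the" and "and" are dropped by max_doc_freq, "win" by min_doc_freq, and the remaining terms are ordered by their tf-idf score. With the default min_term_freq of 2, only "heroes" would survive, which is why short inputs usually need min_term_freq set to 1.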

Having presented the parameters, let’s start building our queries.

Our first example uses MLT with a text input and looks in the “title” and “description” fields. In the “like” parameter we pass the text “heroes” and receive the documents related to that term. We set min_term_freq to 1 because the term appears only once in the input, which is below the default of 2.

GET /idx_movies/_search
{
  "size": 5,
  "_source": ["title", "description"],
  "query": {
    "more_like_this": {
      "fields": [
        "title",
        "description"
      ],
      "like": [
        "heroes"
      ],
      "max_query_terms": 10,
      "min_term_freq": 1
    }
  }
}

Note that every returned description contains the term “heroes”. Because size is 5, only five of the seven matching documents are returned.

"hits" : {
  "total" : {
    "value" : 7,
    "relation" : "eq"
  },
  "max_score" : 4.906305,
  "hits" : [
    {
      "_index" : "idx_movies",
      "_type" : "_doc",
      "_id" : "CpwMOYEBBR941NShMW7p",
      "_score" : 4.906305,
      "_source" : {
        "description" : "Earth's mightiest heroes must come together and learn to fight as a team if they are to stop the mischievous Loki and his alien army from enslaving humanity.",
        "title" : "The Avengers"
      }
    },
    {
      "_index" : "idx_movies",
      "_type" : "_doc",
      "_id" : "GZwMOYEBBR941NShMW7p",
      "_score" : 4.835849,
      "_source" : {
        "description" : "As an Orc horde invades the planet Azeroth using a magic portal, a few human heroes and dissenting Orcs must attempt to stop the true evil behind this war.",
        "title" : "Warcraft"
      }
    },
    {
      "_index" : "idx_movies",
      "_type" : "_doc",
      "_id" : "fpwMOYEBBR941NShMW_q",
      "_score" : 4.7008367,
      "_source" : {
        "description" : "The special bond that develops between plus-sized inflatable robot Baymax, and prodigy Hiro Hamada, who team up with a group of friends to form a band of high-tech heroes.",
        "title" : "Big Hero 6"
      }
    },
    {
      "_index" : "idx_movies",
      "_type" : "_doc",
      "_id" : "HJwMOYEBBR941NShMW7p",
      "_score" : 4.3375387,
      "_source" : {
        "description" : "When Tony Stark and Bruce Banner try to jump-start a dormant peacekeeping program called Ultron, things go horribly wrong and it's up to Earth's mightiest heroes to stop the villainous Ultron from enacting his terrible plan.",
        "title" : "Avengers: Age of Ultron"
      }
    },
    {
      "_index" : "idx_movies",
      "_type" : "_doc",
      "_id" : "8JwMOYEBBR941NShMW3p",
      "_score" : 4.2823787,
      "_source" : {
        "description" : "Three decades after the defeat of the Galactic Empire, a new threat arises. The First Order attempts to rule the galaxy and only a ragtag group of heroes can stop them, along with the help of the Resistance.",
        "title" : "Star Wars: Episode VII - The Force Awakens"
      }
    }
  ]
}
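
If you consume a response like this from code, a small helper can reduce it to titles and scores. The sketch below is in Python and assumes the response has already been parsed into a dict; summarize_hits is a hypothetical name, and the sample is a trimmed copy of the first two hits shown above, wrapped in the top-level object that a real response would have.

```python
def summarize_hits(response):
    """Return (title, score) pairs from a parsed search response (hypothetical helper)."""
    return [
        (hit["_source"].get("title"), hit["_score"])
        for hit in response["hits"]["hits"]
    ]


# Trimmed copy of the response above: only two hits, only the title in _source
sample = {
    "hits": {
        "total": {"value": 7, "relation": "eq"},
        "max_score": 4.906305,
        "hits": [
            {
                "_index": "idx_movies",
                "_id": "CpwMOYEBBR941NShMW7p",
                "_score": 4.906305,
                "_source": {"title": "The Avengers"},
            },
            {
                "_index": "idx_movies",
                "_id": "GZwMOYEBBR941NShMW7p",
                "_score": 4.835849,
                "_source": {"title": "Warcraft"},
            },
        ],
    }
}

pairs = summarize_hits(sample)
for title, score in pairs:
    print(f"{score:.3f}  {title}")
```

Hits arrive already sorted by descending _score, so the helper keeps the order in which ElasticSearch returned them.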

In the next example we pass a document as a reference and get similar documents back. The interesting thing about this approach is that, in addition to the _id of the document, we can specify which index the reference document is fetched from; in our case it is “idx_movies” itself.

The _id used will be that of the movie “Divergent”, which is N5wMOYEBBR941NShMW7p.

GET /idx_movies/_search
{
  "_source": ["title"],
  "query": {
    "more_like_this": {
      "fields": [
        "title",
        "description",
        "actors"
      ],
      "like": [
        {
          "_index": "idx_movies",
          "_id": "N5wMOYEBBR941NShMW7p"
        }
      ],
      "max_query_terms": 10,
      "min_term_freq": 1
    }
  }
}

Note that the results are closely related to the movie “Divergent”: the similar documents retrieved are the other films in the “Divergent” series.

"hits" : {
  "total" : {
    "value" : 2,
    "relation" : "eq"
  },
  "max_score" : 26.424755,
  "hits" : [
    {
      "_index" : "idx_movies",
      "_type" : "_doc",
      "_id" : "IJwMOYEBBR941NShMW_q",
      "_score" : 26.424755,
      "_source" : {
        "title" : "Insurgent"
      }
    },
    {
      "_index" : "idx_movies",
      "_type" : "_doc",
      "_id" : "ipwMOYEBBR941NShMW7q",
      "_score" : 21.644402,
      "_source" : {
        "title" : "Allegiant"
      }
    }
  ]
}
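
The same document-based query can also be sent from a script instead of the Kibana console. Here is a minimal Python sketch using only the standard library. The base URL http://localhost:9200 and the absence of authentication are assumptions about a local test node, and build_search_request is a hypothetical helper; the query body is the one from the example above.

```python
import json
import urllib.request


def build_search_request(base_url, index, body):
    """Build (but do not send) a POST request for the _search endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


body = {
    "_source": ["title"],
    "query": {
        "more_like_this": {
            "fields": ["title", "description", "actors"],
            "like": [{"_index": "idx_movies", "_id": "N5wMOYEBBR941NShMW7p"}],
            "max_query_terms": 10,
            "min_term_freq": 1,
        }
    },
}

# Assumed address of an unsecured local node; adjust it for your cluster
request = build_search_request("http://localhost:9200", "idx_movies", body)
print(request.get_method(), request.full_url)

# To actually run the search against a reachable cluster:
# with urllib.request.urlopen(request) as resp:
#     result = json.load(resp)
```

The request is only built here, not sent, so the snippet runs without a cluster; uncomment the last lines once the node is reachable.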

Those were the MLT examples. I hope you liked them, and don’t forget to check the documentation for more details.

André Coelho

Developer of web and mobile systems. Enthusiast of automation and electronics, with music as a hobby.