Increase the power of your search engine with Textbox

Published in Machine Box, Jul 28, 2017

When you index your data for search, one of the things you have to bear in mind is how to optimise for relevance.

Let’s say we need to search across multiple news articles, and we only have the title and the content of each article. One example is the dataset about BBC Sports that we can find at http://mlg.ucd.ie/datasets/bbc.html

In the raw dataset, each article is a txt file where the first line is the title and the rest is the content. We can easily process an article to get a structure like this:

{
 id: "123",
 title: "Radcliffe will compete in London",
 content: "Paula Radcliffe will compete in the Flora London Marathon..."
}
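As a minimal sketch of that processing step, the Go snippet below splits a raw article into a title and a content field. The `Article` type and the `parseArticle` function are hypothetical names introduced here for illustration, and the id is assumed to be supplied by the caller (for example, derived from the file name).

```go
package main

import (
	"fmt"
	"strings"
)

// Article mirrors the structure shown above (hypothetical type name).
type Article struct {
	ID      string `json:"id"`
	Title   string `json:"title"`
	Content string `json:"content"`
}

// parseArticle treats the first line of text as the title and everything
// after it as the content. The id is passed in by the caller.
func parseArticle(id, text string) Article {
	// Normalise Windows line endings so the split behaves the same everywhere.
	text = strings.ReplaceAll(text, "\r\n", "\n")
	parts := strings.SplitN(strings.TrimSpace(text), "\n", 2)

	a := Article{ID: id, Title: strings.TrimSpace(parts[0])}
	if len(parts) == 2 {
		// Files with only a title leave Content empty instead of panicking.
		a.Content = strings.TrimSpace(parts[1])
	}
	return a
}

func main() {
	raw := "Radcliffe will compete in London\n\nPaula Radcliffe will compete in the Flora London Marathon..."
	a := parseArticle("123", raw)
	fmt.Printf("%s | %s\n", a.Title, a.Content)
}
```

Checking the length of the split result keeps a title-only file from causing an index-out-of-range panic.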

With this structure, if you put all the documents into Elasticsearch you get a very decent search engine.

The problem comes when you want to make your queries and visualizations more powerful and relevant. For example, if the user enters the query London, they are probably referring to the city of London, and when they enter Radcliffe, they are probably referring to a person.

How can I get all the documents from my search engine, filtered by a place or a person? With Elasticsearch you can use a terms query or a terms aggregation, but you need a field that holds the places or the people.
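To make the terms aggregation idea concrete, here is a rough sketch in Go that builds the JSON body for one. It assumes a field named `places` with a `keyword` sub-field already exists in the index. The function name `termsAggregation` and the aggregation name `top_places` are made up for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// termsAggregation builds a search body that returns no hits (size 0) and
// buckets the documents by the distinct values of the given field.
func termsAggregation(name, field string, size int) map[string]interface{} {
	return map[string]interface{}{
		"size": 0,
		"aggs": map[string]interface{}{
			name: map[string]interface{}{
				"terms": map[string]interface{}{
					"field": field, // assumed to be a keyword (not analysed) field
					"size":  size,  // number of buckets to return
				},
			},
		},
	}
}

func main() {
	body := termsAggregation("top_places", "places.keyword", 10)
	out, err := json.MarshalIndent(body, "", "  ")
	if err != nil {
		panic(err)
	}
	// The printed JSON can be sent as the body of a _search request.
	fmt.Println(string(out))
}
```

Without such a field there is nothing to bucket on, and that is the gap the rest of this article fills.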

Pre-process the articles with Textbox

We can pre-process all the articles with Textbox.

Textbox processes text and performs natural language processing, sentiment analysis, and entity and keyword extraction, allowing you to build tools that programmatically understand the content of text.

The idea is that, before inserting the documents into the search engine, we use Textbox to extract the most important keywords of each article, along with the people and the places it mentions.

We can write a little indexer.go script to do that, using the machinebox sdk-go:

// omitted error handling for display here
r, _ := os.Open(path) // path of the article.txt
defer r.Close()

b, _ := ioutil.ReadAll(r)
text := string(b)

resp, _ := textbox.Check(strings.NewReader(text))

keywords := []string{}
people := []string{}
places := []string{}

for _, k := range resp.Keywords {
	keywords = append(keywords, k.Keyword)
}

for _, s := range resp.Sentences {
	for _, ent := range s.Entities {
		if ent.Type == "person" {
			people = append(people, ent.Text)
		}
		if ent.Type == "place" {
			places = append(places, ent.Text)
		}
	}
}

// split to get the title and the content
split := strings.SplitN(text, "\n", 2)

body := map[string]interface{}{
	"title":    split[0],
	"content":  split[1],
	"keywords": keywords,
	"people":   people,
	"places":   places,
}

postES(es, index, path, body)
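The snippet above calls a postES helper that is not shown. One possible shape for it is sketched below, under several assumptions: `es` is taken to be the base URL of the Elasticsearch node, the third argument is used as the document id, and the document is written with `PUT <es>/<index>/_doc/<id>`. That is the endpoint form in recent Elasticsearch versions, and older versions use a custom type name in place of `_doc`. The `newIndexRequest` function is a hypothetical helper introduced only for this sketch.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"strings"
	"time"
)

// newIndexRequest builds a PUT request for <es>/<index>/_doc/<id> with the
// document encoded as JSON. The id is path-escaped because a file path used
// as an id may contain slashes.
func newIndexRequest(es, index, id string, body map[string]interface{}) (*http.Request, error) {
	payload, err := json.Marshal(body)
	if err != nil {
		return nil, err
	}
	endpoint := fmt.Sprintf("%s/%s/_doc/%s",
		strings.TrimRight(es, "/"), url.PathEscape(index), url.PathEscape(id))
	req, err := http.NewRequest(http.MethodPut, endpoint, bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

// postES sends the document and reports any non-2xx response as an error.
func postES(es, index, id string, body map[string]interface{}) error {
	req, err := newIndexRequest(es, index, id, body)
	if err != nil {
		return err
	}
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("elasticsearch returned %s", resp.Status)
	}
	return nil
}

func main() {
	doc := map[string]interface{}{
		"title":  "Radcliffe will compete in London",
		"places": []string{"London"},
		"people": []string{"Paula Radcliffe"},
	}
	// Assumes a local node; the address is an example, not from the article.
	if err := postES("http://localhost:9200", "news_textbox", "123", doc); err != nil {
		fmt.Println("indexing failed:", err)
	}
}
```

Building the request in its own function means the URL and headers can be checked without a running cluster.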

Once we pre-process the article we can insert a document like this one:

{
 id: "123",
 title: "Radcliffe will compete in London",
 content: "Paula Radcliffe will compete in the Flora London Marathon…",
 keywords: ["race director david bedford", "25th anniversary", "…"],
 places: ["London"],
 people: ["Paula Radcliffe"]
}

That allows us to query Elasticsearch by place or by person:

GET news_textbox/_search
{"query": {"term": {"places.keyword": "London"}}}

GET news_textbox/_search
{"query": {"term": {"people.keyword": "Paula Radcliffe"}}}
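The two term queries can also be combined with a full-text search. The sketch below builds a `bool` query in Go that matches text in the `content` field and restricts the results with exact `term` filters. The function names `termFilter` and `filteredQuery` are illustrative, and the field names follow the document structure shown earlier.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// termFilter builds a single exact-match clause on a keyword field.
func termFilter(field, value string) map[string]interface{} {
	return map[string]interface{}{
		"term": map[string]interface{}{field: value},
	}
}

// filteredQuery runs a full-text match on the content field and narrows
// the results with zero or more exact filters.
func filteredQuery(text string, filters ...map[string]interface{}) map[string]interface{} {
	boolQuery := map[string]interface{}{
		"must": map[string]interface{}{
			"match": map[string]interface{}{"content": text},
		},
	}
	// Only add the filter clause when there is something to filter on.
	if len(filters) > 0 {
		boolQuery["filter"] = filters
	}
	return map[string]interface{}{
		"query": map[string]interface{}{"bool": boolQuery},
	}
}

func main() {
	q := filteredQuery("marathon",
		termFilter("places.keyword", "London"),
		termFilter("people.keyword", "Paula Radcliffe"),
	)
	out, err := json.MarshalIndent(q, "", "  ")
	if err != nil {
		panic(err)
	}
	// The printed JSON can be used as the body of GET news_textbox/_search.
	fmt.Println(string(out))
}
```

Clauses under `filter` do not affect the relevance score, so the ranking still comes from the text match while the place and the person act as hard constraints.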

And power nice visualizations with Kibana, for example using the tag cloud: