Increase the power of your search engine with Textbox

David Hernandez
Machine Box
Jul 28, 2017

When you index your data for search, one of the things you have to bear in mind is how to optimise for relevance.

Let’s say we need to search across multiple news articles, and all we have is the title and the content of each article. An example is the BBC Sports dataset that we can find at http://mlg.ucd.ie/datasets/bbc.html

In the raw dataset, we get txt files where the first line is the title of the article and the rest is the content. We can easily process each article to get a structure like this:

{
  id: "123",
  title: "Radcliffe will compete in London",
  content: "Paula Radcliffe will compete in the Flora London Marathon..."
}
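As a minimal sketch of that processing step, the snippet below splits a raw article into title and content. The `Article` type and the `parseArticle` helper are illustrative names, not part of the dataset or of any library, and the id is simply passed in by the caller.

```go
package main

import (
	"fmt"
	"strings"
)

// Article mirrors the structure shown above. The type name and the
// parseArticle helper are illustrative, not part of any library.
type Article struct {
	ID      string
	Title   string
	Content string
}

// parseArticle treats the first line of raw as the title and everything
// after it as the content. A file with a single line yields empty content.
func parseArticle(id, raw string) Article {
	parts := strings.SplitN(raw, "\n", 2)
	a := Article{ID: id, Title: strings.TrimSpace(parts[0])}
	if len(parts) == 2 {
		a.Content = strings.TrimSpace(parts[1])
	}
	return a
}

func main() {
	raw := "Radcliffe will compete in London\n\nPaula Radcliffe will compete in the Flora London Marathon..."
	a := parseArticle("123", raw)
	fmt.Printf("%q\n%q\n", a.Title, a.Content)
}
```

Checking the length of the split result avoids a panic on files that contain only a title line.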

With this structure, if you put all the documents into Elasticsearch you get a very decent search engine.

The problem comes when you want to make your queries and visualizations more powerful and relevant. For example, if the user enters the query London they are probably referring to the city of London, and when they enter Radcliffe they are referring to a person.

How can I get all the documents from my search engine, filtered by a place or a person? With Elasticsearch you can do a terms query or a terms aggregation, but you need a field that holds the places or the people.

Pre-process the articles with Textbox

We can pre-process all the articles with Textbox.

Textbox processes text and performs natural language processing, sentiment analysis, and entity and keyword extraction, allowing you to build tools that programmatically understand the content of text.

The idea is that, before inserting the documents into the search engine, we use Textbox to extract the people, the places, and the most important keywords of each article.

We can write a little indexer.go script to do that, using the machinebox sdk-go:

// Error handling is omitted for display here.
// textbox is assumed to be a Textbox client from the machinebox sdk-go;
// es, index and postES are defined elsewhere in indexer.go.
r, _ := os.Open(path) // path of the article.txt
defer r.Close()
b, _ := ioutil.ReadAll(r)
text := string(b)

// Ask Textbox to analyse the whole article.
resp, _ := textbox.Check(strings.NewReader(text))

keywords := []string{}
people := []string{}
places := []string{}
for _, k := range resp.Keywords {
	keywords = append(keywords, k.Keyword)
}
// Entities are reported per sentence; keep only people and places.
for _, s := range resp.Sentences {
	for _, ent := range s.Entities {
		if ent.Type == "person" {
			people = append(people, ent.Text)
		}
		if ent.Type == "place" {
			places = append(places, ent.Text)
		}
	}
}

// Split to get the title (first line) and the content (the rest).
// This assumes the file has at least two lines.
split := strings.SplitN(text, "\n", 2)
body := map[string]interface{}{
	"title":    split[0],
	"content":  split[1],
	"keywords": keywords,
	"people":   people,
	"places":   places,
}
postES(es, index, path, body)
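One thing to watch for with the loops above: an entity mentioned in several sentences is appended once per mention. If you would rather store each person or place only once, a small helper can deduplicate them. This is a sketch under assumptions: `Entity` and `Sentence` are local stand-ins that mirror only the fields the indexer reads, not the SDK types, and `entitiesOfType` is a hypothetical helper name.

```go
package main

import "fmt"

// Entity and Sentence are local stand-ins that mirror only the fields the
// indexer reads from the Textbox response; they are not the SDK types.
type Entity struct {
	Text string
	Type string
}

type Sentence struct {
	Entities []Entity
}

// entitiesOfType returns the distinct entity texts of the given type,
// in order of first appearance.
func entitiesOfType(sentences []Sentence, typ string) []string {
	seen := make(map[string]bool)
	out := []string{}
	for _, s := range sentences {
		for _, ent := range s.Entities {
			if ent.Type != typ || seen[ent.Text] {
				continue
			}
			seen[ent.Text] = true
			out = append(out, ent.Text)
		}
	}
	return out
}

func main() {
	sentences := []Sentence{
		{Entities: []Entity{{Text: "Paula Radcliffe", Type: "person"}, {Text: "London", Type: "place"}}},
		{Entities: []Entity{{Text: "London", Type: "place"}}},
	}
	fmt.Println(entitiesOfType(sentences, "person"))
	fmt.Println(entitiesOfType(sentences, "place"))
}
```

Whether to deduplicate is a design choice: keeping repeats preserves how often an entity is mentioned, while distinct values keep the indexed document smaller.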

Once we pre-process the article we can insert a document like this one:

{
  id: "123",
  title: "Radcliffe will compete in London",
  content: "Paula Radcliffe will compete in the Flora London Marathon…",
  keywords: ["race director david bedford", "25th anniversary", "…"],
  places: ["London"],
  people: ["Paula Radcliffe"]
}
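The postES helper is not shown in this post. As a rough idea of what indexing one such document involves, the sketch below builds the HTTP request without sending it. The helper name `newIndexRequest`, the base URL `http://localhost:9200` and the `_doc` path segment are assumptions for illustration; `_doc` is the form used by recent Elasticsearch versions, while older versions expect an explicit type name in that position.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// newIndexRequest builds (but does not send) a PUT request that would store
// doc under the given id. The helper name, the base URL and the "_doc"
// path segment are assumptions for illustration.
func newIndexRequest(esURL, index, id string, doc map[string]interface{}) (*http.Request, error) {
	payload, err := json.Marshal(doc)
	if err != nil {
		return nil, err
	}
	target := fmt.Sprintf("%s/%s/_doc/%s", esURL, url.PathEscape(index), url.PathEscape(id))
	req, err := http.NewRequest(http.MethodPut, target, bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	doc := map[string]interface{}{
		"title":  "Radcliffe will compete in London",
		"places": []string{"London"},
		"people": []string{"Paula Radcliffe"},
	}
	req, err := newIndexRequest("http://localhost:9200", "news_textbox", "123", doc)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
	// To actually index the document, send it with http.DefaultClient.Do(req).
}
```

Building the request separately from sending it keeps the example runnable without a live Elasticsearch instance.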

That allows us to query Elasticsearch by place or by person:

GET news_textbox/_search
{"query": {"term": {"places.keyword": "London"}}}
GET news_textbox/_search
{"query": {"term": {"people.keyword": "Paula Radcliffe"}}}
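The same fields also work with the terms aggregation mentioned earlier, for example to list the most frequently mentioned places. Here is a minimal sketch that builds such a request body in Go; the helper name `termsAggBody` and the aggregation name `top_values` are illustrative choices, and the body would be sent to the same `news_textbox/_search` endpoint as the queries above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// termsAggBody returns a search body that asks for the most frequent values
// of field and no individual hits. The helper and the aggregation name
// "top_values" are illustrative choices.
func termsAggBody(field string, size int) ([]byte, error) {
	body := map[string]interface{}{
		"size": 0, // skip the hits, we only want the buckets
		"aggs": map[string]interface{}{
			"top_values": map[string]interface{}{
				"terms": map[string]interface{}{
					"field": field,
					"size":  size,
				},
			},
		},
	}
	return json.Marshal(body)
}

func main() {
	b, err := termsAggBody("places.keyword", 10)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

Swapping the field for `people.keyword` gives the most mentioned people instead.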

The extracted fields can also power nice visualizations in Kibana, for example a tag cloud.

You could also use them to build a dashboard showing which topics are trending.

Conclusion:

With some Go code and Textbox doing the NLP, you can power up your search results and visualizations in just a few lines of code.

You can experiment with this code and the visualizations; everything is open source on GitHub: https://github.com/machinebox/textbox_elastic_indexer

Check this other blog post about Textbox.

Try the Machine Box boxes to make tasks like NLP or face recognition easy.

@dahernan Machine Learning and Go. Making @machineboxio