Text search, challenges and solutions
This article is the first in a three-part series on text search. Here we go through the general problem of text search and the solutions that can be leveraged. Stay tuned for more.
You are not the only one searching
Do you sometimes wonder: where did I put it? Or when trying to remember something, do you have this thought: where did I see this? This is our life in the 21st century. Everything we know comes from the internet: what is the weather like — check online, what was the result of yesterday’s football game — check online, where to buy pink trousers — buy online. The internet is an information highway with almost no limits. How does this relate to your business?
Maybe you manage a portal and your customers have a problem: how to find information and buy your products. In other words, why should they choose you and not the guy across the street?
Maybe you manage knowledge in your company or your company’s goals are data-related and you would like to improve the efficiency of the workforce. Good customer service is one way to improve your business. It’s also connected with using text resources as information holders.
This series of articles is designed to help you understand why search is a non-trivial challenge and which solutions are available when you face it. There is a major player in this battle, Elastic. We are going to see how it has developed over the years, what solutions it now offers for this problem, what its ecosystem looks like, and how to manage it.
Taming your text is also very important because it is the first step in applying Machine Learning to text. Features such as content recommendations require correct preprocessing of the text. Before any text can be passed to a machine learning algorithm, it must be converted into a vector or matrix, and that conversion depends on correct text preprocessing techniques.
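To make the idea of converting text into a vector or matrix concrete, here is a minimal bag-of-words sketch that uses only the Python standard library. The function names (preprocess, build_vocabulary, vectorize) and the sample documents are illustrative assumptions, and real systems usually add steps such as stop-word removal or stemming.

```python
import re
from collections import Counter


def preprocess(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"\w+", text.lower())


def build_vocabulary(documents):
    """Map every distinct token to a fixed column index (alphabetical order)."""
    tokens = sorted({tok for doc in documents for tok in preprocess(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def vectorize(text, vocabulary):
    """Turn one text into a term-count vector over the vocabulary."""
    counts = Counter(preprocess(text))
    # Dictionaries keep insertion order, so columns follow the vocabulary order.
    return [counts.get(tok, 0) for tok in vocabulary]


# Hypothetical sample documents, one matrix row per document.
documents = ["Pink trousers for sale", "Buy pink trousers online"]
vocabulary = build_vocabulary(documents)
matrix = [vectorize(doc, vocabulary) for doc in documents]
```

Each row of matrix describes one document and each column one word, which is the shape most machine learning algorithms expect as input.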
This article focuses on the search challenge and describes the main competitors in the market.
Good searching is what matters
A good search strategy is essential whether you manage data for your in-house operations or want to provide a good search experience for customers looking to buy a pair of trousers. Here are some reasons why you should consider improving your text search.
According to some studies (McKinsey), an employee spends about 20% of their time searching for information. If we spend that much time searching, search clearly matters to our work.
Some companies have built their business on hosting large search solutions, and Google is the leader in this field. It started as a search engine, and nowadays it hosts a multi-functional cloud that ranks third among cloud service providers, behind Microsoft and Amazon.
More and more people use the internet and produce data as a result. At the beginning of 2021, there were 4.8 billion active internet users, according to Statista. In the same period, Internet Live Stats reported around 40,000 search queries and 6,000 tweets per second. All of this produces a huge amount of text that needs to be processed and made findable before it can be used.
Finding is hard
If you are like us, you probably provide solutions that should meet the needs of your customers. But perhaps the search results are of poor quality from time to time. Why is this happening? Because often:
- the text is located in different places of an ecosystem (e.g. ticketing system, knowledge sharing tools, raw files)
- information is written in different languages
- you are not sure how to form a query
- documents you search for come in different formats (PDF / Word) or are stored in different portals (Jira / ServiceNow / Google Drive)
Even if you are presented with the problem of searching for a text stored in one place, you may face some challenges:
- you need to deal with jargon, a language specific to certain groups
- synonyms and different grammatical forms of a word look different but have almost the same meaning
- the text is encoded in various ways (e.g. UTF-8, Windows-1250)
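Two of the challenges above, mixed encodings and synonyms, can be sketched with the standard library alone. The snippet below is a simplified illustration, not a production approach: the candidate encoding list and the SYNONYMS table are assumptions made up for this example.

```python
# Hypothetical synonym table: maps variants to one canonical search term.
SYNONYMS = {"pants": "trousers", "trouser": "trousers"}


def decode_bytes(raw, encodings=("utf-8", "cp1250")):
    """Try each candidate encoding in order and return the first clean decode.

    UTF-8 goes first because it rejects most byte sequences that are not
    valid UTF-8, while Windows-1250 (cp1250) accepts almost any byte.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


def normalize_terms(text):
    """Lowercase the text and replace known synonyms with a canonical term."""
    return [SYNONYMS.get(token, token) for token in text.lower().split()]


# The same Polish word stored by two systems with different encodings.
from_legacy_system = "zażółć".encode("cp1250")
from_modern_system = "zażółć".encode("utf-8")
```

With these helpers, decode_bytes(from_legacy_system) and decode_bytes(from_modern_system) yield the same string, and normalize_terms("Pink pants") matches documents that mention trousers. Guessing encodings by trial is fragile, so prefer metadata from the source system whenever it is available.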
As engineers working with text, we should find a way to ensure a good search experience. Sometimes the choice can be simple and we can have a quick win by providing just a simple search solution, such as a web input field with a search button.
Of course, search is not only about the user interface, but the interface matters. To back this up, think about how Google started. One of the reasons behind the great success of this company is that it gives its customers a simple GUI. Behind it, tons of backend services work together to provide relevant information.
When building a solution that should be a bit “intelligent”, there are various problems you should be aware of. NLP tools available on the market, such as NLTK, spaCy, and Gensim, help to solve them. When building a text analysis system, keep the following in mind:
- we need to understand grammar well to identify words that matter
- we should be able to find boundaries of words and phrases
- in some languages, simple tasks can be challenging (e.g. detecting word boundaries in Chinese)
Unless we correctly identify the tokens, we are not able to design a trustworthy Machine Learning solution, in line with the well-known Garbage In, Garbage Out principle.
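The word-boundary problem from the list above can be illustrated with a naive tokenizer. This is a sketch built on a regular expression chosen for this example, not the method used by NLTK or spaCy. It works reasonably for English, but it returns a whole Chinese sentence as a single token because Chinese does not separate words with spaces.

```python
import re

# Assumption for this sketch: a word is a run of word characters,
# optionally followed by an apostrophe and more word characters ("isn't").
WORD_PATTERN = re.compile(r"\w+(?:'\w+)?")


def tokenize(text):
    """Split text into word tokens using a whitespace/punctuation heuristic."""
    return WORD_PATTERN.findall(text)


english_tokens = tokenize("Finding is hard, isn't it?")
# Chinese has no spaces between words, so the heuristic finds no boundaries.
chinese_tokens = tokenize("我喜欢搜索")
```

Here english_tokens holds five separate words, while chinese_tokens holds one unsplit string. That is why languages like Chinese need dedicated segmentation tools.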
BYO — Bring your own software
In the world of IT solutions, various companies try to tackle IT problems with their own custom proprietary solutions. For some, this produces good results. For others, it becomes a stumbling block.
After all, this is how some big companies started their business: e.g. Apache Spark began as a custom solution for an area where Hadoop lacked support. But facing the facts, we more often see the other side of the coin. Custom software often ends up performing badly, being ill-architected, and being hard to maintain.
If you want to provide one tool that holds all the knowledge from your team, you could first think about a database with a dedicated GUI. If you aim for a simple solution with advanced text search, it may be your sweet spot. Don’t forget to think beforehand about how to integrate different sources into one database.
To start, you need a client or a connector that reads data from a source and writes it to the single source of truth, the database. Be aware of the limitations, e.g. connectors require updates when the API of an ingested system changes.
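A connector of this kind can be sketched in a few lines. The example below is an assumption-heavy illustration: fetch_tickets is a hypothetical stand-in for a real ticketing system API, the table layout is invented for this sketch, and SQLite is used only because it ships with Python.

```python
import sqlite3


def fetch_tickets():
    """Hypothetical stand-in for a call to a ticketing system's API."""
    return [
        {"id": "T-1", "title": "Login fails", "body": "User cannot log in."},
        {"id": "T-2", "title": "Slow search", "body": "Search takes 10 seconds."},
    ]


def create_store(connection):
    """Create the single table that acts as the source of truth."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "source TEXT, external_id TEXT, title TEXT, body TEXT, "
        "PRIMARY KEY (source, external_id))"
    )


def ingest(connection, source, records):
    """Write records to the store, overwriting earlier copies of the same document."""
    connection.executemany(
        "INSERT OR REPLACE INTO documents VALUES (?, ?, ?, ?)",
        [(source, r["id"], r["title"], r["body"]) for r in records],
    )
    connection.commit()


connection = sqlite3.connect(":memory:")
create_store(connection)
ingest(connection, "ticketing", fetch_tickets())
ingest(connection, "ticketing", fetch_tickets())  # a re-run creates no duplicates
count = connection.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
```

Keying each row by source and external identifier keeps repeated runs idempotent, so count stays at 2 however often the connector runs. If the source API changes its field names, only fetch_tickets needs an update.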
Which database should you choose to store textual data? Choosing a good data system is beyond the scope of this series; however, one of the first decisions is between SQL and NoSQL. Both solutions have their pros and cons.
The graphical user interface is the next consideration. Some decide to write their own software on top of popular frontend frameworks, e.g. React or Angular. A must-have when presenting a set of textual documents is a good search experience. The as-you-type search mechanism is a good example.
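On the backend side, as-you-type search can be approximated with a sorted list and binary search. The PrefixIndex class below is a hypothetical sketch with invented sample phrases. Dedicated search engines use more advanced structures and also handle typos, which this example does not.

```python
import bisect


class PrefixIndex:
    """Keeps phrases sorted so that all matches for a prefix are adjacent."""

    def __init__(self, phrases):
        self._phrases = sorted(phrase.lower() for phrase in phrases)

    def suggest(self, prefix, limit=5):
        """Return up to `limit` phrases that start with the typed prefix."""
        prefix = prefix.lower()
        # Binary search finds the first phrase that is >= the prefix.
        start = bisect.bisect_left(self._phrases, prefix)
        matches = []
        for phrase in self._phrases[start:]:
            if not phrase.startswith(prefix) or len(matches) == limit:
                break
            matches.append(phrase)
        return matches


index = PrefixIndex(
    ["pink trousers", "pink shirt", "blue trousers", "pinstripe suit"]
)
suggestions_for_pin = index.suggest("pin")
suggestions_for_pink = index.suggest("pink")
```

Each keystroke triggers one suggest call, so the list of suggestions narrows as the user keeps typing, here from three phrases for "pin" to two for "pink".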
To implement a backend, we can choose from different libraries for text search and analysis. Even looking only at the Python ecosystem, you may feel prepared to tame the text. For instance, issues like tokenization can easily be solved with the NLTK library.
Elasticsearch is the most popular search engine. It all started with good food: Shay Banon’s project for storing and searching recipes, Compass, which dates back to 2004. Its successor became known as Elasticsearch. It has long been the default go-to in search solutions, as its core was (and still is) Lucene, a Java library for full-text search. It brought some good stuff to the table, as it served as JSON document storage exposed as a REST service. It was easy to start with, and many good clients were written for it, e.g. elasticsearch-py (https://elasticsearch-py.readthedocs.io/). What is more, the source code is publicly available, so you can clearly see and even improve the way it works.
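To show what “JSON document storage exposed as a REST service” means in practice, the sketch below builds two HTTP requests with the standard library instead of a dedicated client. The base URL (a local, unsecured test node on port 9200), the index name recipes, and the sample document are assumptions. The requests are only constructed here, not sent.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumption: a local, unsecured test node


def index_request(index, doc_id, document):
    """Build a PUT /<index>/_doc/<id> request that stores one JSON document."""
    return Request(
        url=f"{BASE_URL}/{index}/_doc/{doc_id}",
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def search_request(index, field, text):
    """Build a POST /<index>/_search request with a full-text match query."""
    body = {"query": {"match": {field: text}}}
    return Request(
        url=f"{BASE_URL}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical index name and document, in the spirit of the recipe story.
put = index_request("recipes", "1", {"title": "Tomato soup"})
query = search_request("recipes", "title", "soup")
```

Passing either object to urllib.request.urlopen would send it to a running node. In real projects a client such as elasticsearch-py wraps these calls and handles authentication and retries.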
It developed quickly, and it became obvious that more products were needed than just plain document storage with full-text search capabilities. That is why in 2012 ELK was created, as it nicely encompassed fetching (Logstash), storage (Elasticsearch), and the presentation layer (Kibana). For many years, Elastic was chosen because of this solution for parsing and analyzing logs. Later on, more lightweight clients for shipping data were introduced, as the Ruby-based Logstash was pretty heavy. That was when the Beats family was introduced, in 2015.
Nowadays, Elastic is a company offering a full stack of its products (called the Elastic Stack, BTW) and providing a SaaS offering at elastic.co, where you can run and host solutions without the need to take care of the underlying infrastructure. Some of its products make two major use cases easier to implement: bringing together documents from different sources and building search for e-commerce.
Enterprise Search is a solution that covers both of these use cases. It contains Workplace Search, which is a way to integrate and bring together different document sources, and App Search, which gives you a way to build an e-commerce solution where your main priority is to give customers results tailored to their needs.
When choosing the Elastic Stack, we have to be aware of its limitations. It is based on Java, so it can be very resource-greedy. The advantage of this solution is that Elastic is completely customizable, and you have full power over your data. If speed and easy customization are your priorities, you should check the next section.
To sum it up:
- Elastic shines where you need to have full customization over deployment and data storage
- It requires more technical knowledge to set it up
Algolia was created to satisfy the need for fast results in e-commerce. As opposed to its Java-based rival, it has a low entry barrier. All you need to start is to create an index and send documents to the endpoint exposed for you.
Once your documents are indexed, you can customize the look and feel for your clients. You have full control over results, sorting, and ordering. It is easy to create special offers and promote products in specific areas, at specific times, or for a specific persona.
As opposed to Elasticsearch, it works purely as SaaS, and you have no control over the place where your data is stored. What is more, you cannot self-host it or manage it in an isolated environment.
The main advantage of this solution is that it doesn’t require hiring dozens of people just to keep up with the development and management of search solutions. More than that, the quality of the apps it hosts is really high, so e.g. autocomplete and intelligent hints are available out of the box. Your customers, who are familiar with a good search experience on portals like Google, YouTube, and Amazon, can have a similar experience with you, which improves their overall opinion of your company.
- Algolia is great when you are searching for a tool that just works
- no specific setup is required to start running it
- it is excellent for online marketing and e-commerce
There are other search tools as well, so you might want to take a look at what they offer. Examples include Hawksearch, Swiftype, and Searchspring. Some of them are mostly e-commerce oriented, and some have roots in Lucene or Elasticsearch.
Search is not a simple problem. It requires a broad understanding of what the written word is and in which forms it exists in our lives. Before choosing a tool in this domain, you should thoroughly understand your needs. The next articles in the series will help you develop a solution based on Elastic.