Search engine vs Database in BI - 1

Structures are essential for data analytics systems.

5 min readJun 27, 2011

Note: this article was originally published in 2011, before the NoSQL revolution, but we still find it relevant today. One can easily replace the word search engine by NoSQL solution.

BI apps face a real data deluge these days. The engineers need new approaches to deal with this. Is a search engine, well known for handling humongous data sets, adapted to a BI context ?

Business intelligence (BI) is the art of gathering and structuring data to help making decisions. One major technical challenge is that BI generally involves dealing with large volume of data, which cause troubles to the classic database approach. The search engines (SEs) are a new approach to handling data. Is a search engine the future go-to solution for BI usage ?

SEs are particularly good for dealing with large amount of data. This quality is based upon smart architectural solutions to common problems. In essence a SE :

crawls raw data
indexes the different data (classify and interpret)
provides functionalities to look for data related to a theme (search)

To illustrate our talk, we’ll take as an example a BI program that target employee’s emails. All the company’s communications are classified and then one can search for emails and computes indicators on this data.

Structure == contraints

The classification process in BI have to understand the meaning of each email. If I write down this information:

We can interpret in different ways:

Baptiste is not here and notified Pierre; or
Pierre is not here and notified Baptiste; or
Baptiste and Pierre are not here.

Those 3 different interpretations.

Now if the data we received are:

Now we understand without ambiguity what is the related fact to the piece of information: Baptiste said to Pierre that he wasn’t here when he sent the email.

It is due to the new structure elements that we added: from, to.

Ambiguity is removed by adding structure on data.

Structured data like “baptiste.manson@inovia.fr sends to pierre.cornic@inovia.fr the message: I’m not here now.” can be expressed more formally as:

Type: Email
An email must contain a sender.
A sender is of type email address.
An email must contain a recipient.
A recipient is of type email adress as well.
An email may contain a message.
A message is of type text.

In essence, what we called structure can be expressed as constraints.

We tend to prefer the following notation, called a schema:

sender: email address (required)
recipient: email address (required)
message: free text

Those constraints have been at the heart of database development since the 80s, and many see the NoSQL/SEs revolution as an experiment in relaxing those contraints.

Soft constraints

Most Search engines started by developing the core of their features around unstructured data. A pretty important bulk of their performance come from relaxing schema constraints.

Due to some professional requirements, some SEs, like Exalead, Solr or ElasticSearch, decided to later support their own version of constraints.

They are called either soft constraints, business items, or even sometimes schema by analogy to a database schema. Check this article for an example.

Are soft constraints good enough for BI?

Attractive at first, those soft constraints were hard for us in all of our tentatives to use as a replacement of a database for analytics usage.

All the implementations we were confronted to had at least one of those limitations:

cannot support “UNIQUE”
cannot support finely grained number constraints
cannot support “FOREIGN KEYS”

For instance, while diagnosing some security breach through emails, it is important to make sure that an email has a sender, and this sender must exist in our user referential.

In all our successful BI projects, it was generally a better decision to reject 30% of the data due to a lack of conformity than hoping to reconcile later those values by different treatments. To illustrate this claim, let’s take the example of a report of the sales of the month per store. If someone input by mistake a negative transaction or a transaction clearly out of bound (like $1B for a clothing shop), all reports will be worthless. Missing one transaction, at large scale, would have no impact whatsoever in the decision made from the reports.

So if constraints reinforcements provide an important value to the report generated by the project , the constraints not supported by the engine must be implemented by the application. It does add a strong burden on the development team. In the light of our experience, we do think that this extra cost is usually superior to the extra value Search Engines currently give.

Search engines are surprisingly less agile for BI

It is common in BI to go back on already stored data and discover a new way to use them, a new interesting indicator to build. Like splitting the mail addresses in two different fields, one part for the name and another for the domain.

In 2010, BI developers spend @matterstech x2.1 more time on schema changes than traditional developers.

Usually, data inside a SE are harder to migrate to a new schema. Sure, Solr and Exalead provide such functionality but it is in a limited upsert way (cancel and replace). It means that if one wants to update data, it needs to extract every entries, change it and push again in the repository. This limitation causes headaches to engineers when the functional requirements are changing.

On the contrary, a database can easily update later data already stored, especially with emergent technologies like columnar databases.

Conclusion

Despite being really powerful for searching and discovery, we have yet to find a good use case for search engines in Business Intelligence. The difficulties in rejecting polluted entries, as well as the data migration challenges make two of the most important tasks of a Business Intelligence project unexpectedly difficult.

“Schema-less” technologies provided us tremendous value in data mining use cases, but our claim is that a SQL schema full database was usually a better starting point for any decision-making application.

Originally published at inovia.fr on June 27, 2011.