Text data is everywhere in the digital world, from tweets on Twitter to the contents of documents. Many machine-learning products rely on this text to power technologies such as sentiment analysis, topic modeling, and relation extraction.
Because text data is so widespread, it is important to build algorithms that retrieve it in minimal time and with high relevance. For example, suppose we type “Machine Learning algorithms pdf” into Google; we get back a list of documents related to that query.
The query returns relevant documents almost immediately. Two things matter in this process: the system should answer in minimal time, and the documents it returns should be relevant. A query about machine learning algorithms should not return psychology documents, because that subject has little to do with the query. Building such a system is difficult because of the many factors and trade-offs that arise when retrieving documents or text from a large corpus.
Given the complexity of these systems, let us look at the differences between text retrieval and database retrieval. Before that, here are definitions of the two retrieval modes.
Text retrieval is a task in which the system responds to a user’s query with relevant documents. It serves as a preprocessing step for text mining.
Database retrieval means obtaining data from a database management system (DBMS), such as an ODBMS (Wikipedia).
Now let us compare the two on several factors:
Structure of Data
- In text retrieval systems, information is stored as unstructured, free-form text. According to some sources, about 80% of the text data on the internet is unstructured.
- In database systems, data is well structured. For example, every record in a table has the same well-defined set of fields.
Ambiguity of Data
- In text retrieval systems, the data we mine is often ambiguous. For example, Twitter data contains many ambiguous words, and when we consider only one modality of information (text alone), it is harder to tell in which context a sentence was used.
- Because databases are well structured, the text stored in them has well-defined semantics, so the level of ambiguity is very low. For example, if we want a list of the students in a college database who have opted for a machine learning course, we can get it easily, because the columns are well defined and semantically separated.
-- SQL query
SELECT student_name FROM college
WHERE course_name = 'machine learning';
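To make the query above concrete, here is a minimal sketch that runs it against an in-memory SQLite database using Python's standard `sqlite3` module. The `college` table layout, the sample rows, and the helper name `students_in_course` are assumptions made up for illustration; they are not taken from any real schema.

```python
import sqlite3


def students_in_course(course_name):
    """Return the names of students enrolled in course_name, sorted by name."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        # Hypothetical schema: one row per (student, course) pair.
        conn.execute("CREATE TABLE college (student_name TEXT, course_name TEXT)")
        conn.executemany(
            "INSERT INTO college VALUES (?, ?)",
            [
                ("Asha", "machine learning"),
                ("Ben", "psychology"),
                ("Chen", "machine learning"),
            ],
        )
        # The column has well-defined semantics, so an equality test is enough.
        rows = conn.execute(
            "SELECT student_name FROM college "
            "WHERE course_name = ? ORDER BY student_name",
            (course_name,),
        ).fetchall()
        return [name for (name,) in rows]
    finally:
        conn.close()


print(students_in_course("machine learning"))  # ['Asha', 'Chen']
```

The query either matches a record or it does not; there is no notion of one record being "more relevant" than another.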
Specification of Query
- When we retrieve text from a vast information space, each of us phrases the query in our own way. There is no single definite way to ask for a particular piece of information; instead, we browse the information space and try to find relevant documents.
So the specification for getting certain text data is not well defined; in other words, the query is an incomplete specification.
- When we retrieve data from a database, the data is well structured, so we can write well-defined statements or queries. The query is therefore a complete specification. An SQL query is one example.
Results of the Query
- In the case of text retrieval systems, we get relevant documents as a result of a particular query.
- In the case of database retrieval systems, the database stores data records, so a query returns the records that match it.
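The contrast between relevant documents and matched records can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how a real search engine works: the three documents, the function names `tokenize`, `rank`, and `exact_match`, and the scoring rule (counting occurrences of query terms) are all hypothetical choices made for this example.

```python
import re
from collections import Counter

# Hypothetical toy corpus, invented for illustration.
DOCS = {
    "ml_notes": "Machine learning algorithms learn patterns from data",
    "ml_pdf": "A pdf guide to machine learning algorithms",
    "psych": "An introduction to psychology and human behaviour",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, docs):
    """Text-retrieval style: score each document by how often query terms
    occur in it, drop documents that score 0, and sort best-first."""
    query_terms = set(tokenize(query))
    scored = []
    for doc_id, text in docs.items():
        counts = Counter(tokenize(text))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((doc_id, score))
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


def exact_match(query, docs):
    """Database style: return only documents whose text equals the query."""
    return [doc_id for doc_id, text in docs.items() if text == query]


query = "Machine Learning algorithms pdf"
print(rank(query, DOCS))         # [('ml_pdf', 4), ('ml_notes', 3)]
print(exact_match(query, DOCS))  # []
```

The ranked list orders documents by an estimate of relevance and leaves out the psychology document, while the exact-match lookup returns nothing because no stored value equals the free-text query.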
These are some of the important differences to keep in mind when it comes to understanding retrieval systems.
I hope you find this article useful. Thank you.
C. Zhai and S. Massung, Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining, ACM Book Series, Morgan & Claypool Publishers, 2016.