Software works best with structured data, such as the data stored in relational databases. Structured storage is less noisy, and information can be retrieved with efficient, precise queries. This is one reason we can expect exact, repeatable answers from such systems. Organizations have taken advantage of this by storing as much of their data as they can in relational databases, and much of the world's economy runs on software built on them.
But you may be surprised to learn that the data most organizations store in relational databases makes up only about 10% of the total data available to them. So what is the other 90%?
- Natural Language: Data in the form of natural language is stored in formats such as PDF, Word, and plain text. For example, in healthcare, patient reports are written in natural language. Human language poses many problems for software systems: it is unstructured and varies widely. Most software systems are rule based, and human language would demand a practically unlimited number of rules, which is expensive to write and maintain.
- Images: In some organizations, such as hospitals and engineering firms, images and videos form the major chunk of the data. Dealing with images is not easy. Image processing software is mostly rule based and can understand images only to a certain extent. Gathering real intelligence from images requires probabilistic systems, which until recently were a distant dream.
- Speech: In call centers, human speech carries most of the information. Human speech is noisy, and understanding it requires a different kind of software paradigm, which until recently was not available. Even when speech is converted into text, the result is still unstructured. If analytics could be run on this information, it would directly affect customer satisfaction.
This data sits on organizations' servers and, most of the time, is never used once stored. It is called dark data because it is never analyzed: for most of the history of software systems, there were no tools that could handle unstructured data efficiently.
To deal with the complexity of unstructured data we need a new paradigm, a new method of processing. The expert systems of the 1970s attempted this, but at that time there were no efficient tools for handling such complexity.
So what has changed? How are we now able to deal with unstructured data? Thanks to progress in Artificial Intelligence, especially in Machine Learning and Deep Learning, we now have tools that can process many kinds of unstructured information. Knowledge base construction (KBC) does exactly this: it extracts structured information from dark data and stores it in a form that can be used in applications such as search and question answering.
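To make the idea of "extract, then store in a queryable form" concrete, here is a minimal sketch in Python. It is not how a real KBC system works internally: a hand-written regular expression stands in for the machine-learning extractor, and the report snippets, field names, table name, and helper functions are all invented for illustration.

```python
import re
import sqlite3

# Hypothetical free-text report snippets, invented for illustration only.
REPORTS = [
    "Patient: P001. Diagnosis: diabetes. Observation: fasting sugar is elevated.",
    "Patient: P002. Diagnosis: hypertension. Observation: advised a low-salt diet.",
    "Follow-up scheduled next week.",  # contains none of the fields we look for
]

# A simple pattern standing in for a learned extractor (an assumption of this
# sketch; real reports would not follow such a regular layout).
FIELD_PATTERN = re.compile(
    r"Patient:\s*(?P<patient>.+?)\.\s*"
    r"Diagnosis:\s*(?P<diagnosis>.+?)\.\s*"
    r"Observation:\s*(?P<observation>.+?)\.?$"
)


def extract_record(text):
    """Return a dict of extracted fields, or None if the text does not match."""
    match = FIELD_PATTERN.search(text)
    return match.groupdict() if match else None


def build_knowledge_base(reports):
    """Extract records from raw text and store them in an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE findings (patient TEXT, diagnosis TEXT, observation TEXT)"
    )
    records = [r for r in map(extract_record, reports) if r is not None]
    conn.executemany(
        "INSERT INTO findings VALUES (:patient, :diagnosis, :observation)",
        records,
    )
    return conn


def find_patients(conn, diagnosis):
    """Query the structured store: which patients have the given diagnosis?"""
    rows = conn.execute(
        "SELECT patient FROM findings WHERE diagnosis = ? ORDER BY patient",
        (diagnosis,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    kb = build_knowledge_base(REPORTS)
    print(find_patients(kb, "diabetes"))
```

Once the text has been turned into rows, the ordinary strengths of structured data apply again: the question "which patients have diabetes?" becomes a plain SQL query. The hard part, which this sketch skips, is extracting reliably from text that follows no fixed layout.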
In this series on KBC, we will look at how today's software is able to handle unstructured data.
Case in Point:
For many years, a clinic based in Mumbai stored its reports in Word format. These reports contain both text and images. The text is unstructured, as it follows no specific format. Different reports mention different diseases, and the observations a doctor makes are written in natural language. After realizing the potential of Artificial Intelligence, the clinic approached Cere Labs to find out whether analytics on those reports was possible. Because Cere Labs works on KBC, we were able to take on this challenging assignment. In future posts we will see how KBC makes it possible to work on such use cases.