Historically, organizations have mainly focused on structured data for any kind of visualization or analysis. Traditional systems and reporting still rely on this form of data.
However, with the rapid growth of semi-structured and unstructured data sources, businesses are evolving too! They are now exploring all three kinds of data, interpreting them and acting on the insights.
This article walks you through what structured, semi-structured and unstructured data are, with examples. It also presents a few differences between them to help you understand them better.
Before we proceed, let’s see what data and a data structure are.
Data is a set of facts, such as numbers, descriptions and observations, and it is fundamental to business decisions. It comes in many forms.
A data structure is a collection of data values, the relationships among them, and the operations that can be applied to the data.
Structured data is information that has been formatted and transformed into a well-defined data model. It resides in predefined formats and is typically stored in a relational database (RDBMS). This means it conforms to a tabular format with rows and columns.
This kind of data is created by both humans and machines. You will find structured data in financial data, bar codes, machine logs, customer star ratings and so on.
Example: Excel files and SQL databases
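To make the rows-and-columns idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The ratings table, its column names and the sample values are assumptions made up for illustration; they do not come from a real dataset.

```python
import sqlite3

# Hypothetical table and values, used only to illustrate structured data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ratings (customer_id INTEGER, product TEXT, stars INTEGER)"
)
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [(1, "kettle", 5), (2, "kettle", 3), (3, "toaster", 4)],
)

# Every row has the same columns, so a query can refer to them by name.
avg_stars = conn.execute(
    "SELECT product, AVG(stars) FROM ratings GROUP BY product ORDER BY product"
).fetchall()
print(avg_stars)
conn.close()
```

Because the schema is fixed up front, questions such as "average stars per product" can be answered with a single query.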
Unstructured data is stored in its natural format until it’s extracted for analysis. It lacks a predefined format and, because of this, it does not fit neatly into the rows and columns of a relational database. This lack of structure makes it more difficult to search, manage and analyze, so businesses have often left it unused. However, the ability to store and process unstructured data has grown greatly in recent years. It is typically stored in data lakes, which, unlike data warehouses, keep data in its raw form.
The amount of unstructured data is much larger than that of structured data. You will find unstructured data in social media activity, surveillance imagery, audio files, satellite imagery and so on.
Example: Word, Media logs, Text, PDF
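As a rough illustration of why unstructured data is harder to analyze, the sketch below works on a piece of free text. The review text, the email address and the regular expression are all made up for this example, and the pattern is deliberately crude rather than a complete email matcher.

```python
import re

# Hypothetical free-text review: there are no columns or keys to query.
review = (
    "Bought this kettle last week. Boils fast, but the lid feels flimsy. "
    "Contact me at jane@example.com if you need photos."
)

# Structure has to be extracted first, here with a simple pattern.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", review)

# A keyword check is about the simplest "analysis" possible on raw text.
mentions_lid = "lid" in review.lower()

print(emails, mentions_lid)
```

Nothing in the text says which part is the contact detail or the complaint; that meaning has to be recovered by the code that processes it.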
Data may not always be structured or unstructured; there is another category between them called semi-structured data. Semi-structured data is a mix of both structured and unstructured data. It has definite characteristics but does not conform to the rigid structure that a relational database requires.
Photos on a smartphone are a good example of semi-structured data. Every photo captured with a smartphone contains unstructured image content, but it is also tagged with a location, a timestamp and a device ID. Once saved on the device, it can be further tagged with a label such as pet or dog, or with a person's name. In this case, you find structure within unstructured data. Interesting, isn’t it?
Example: XML data, JSON documents, EDI and CSV
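The photo example above can be sketched as a small JSON document. The file names, field names and values below are assumptions invented for illustration. The point is only that each record carries named fields, yet the records need not share exactly the same set of fields.

```python
import json

# Hypothetical photo metadata; the second record has no location or tags.
photos_json = """
[
  {"file": "IMG_0001.jpg", "timestamp": "2021-04-27T10:15:00",
   "device_id": "phone-A",
   "location": {"lat": 12.97, "lon": 77.59}, "tags": ["dog"]},
  {"file": "IMG_0002.jpg", "timestamp": "2021-04-27T11:02:00",
   "device_id": "phone-A"}
]
"""

photos = json.loads(photos_json)

# Fields can be looked up by name, but code must tolerate missing ones.
tagged_dog = [p["file"] for p in photos if "dog" in p.get("tags", [])]
has_location = [p["file"] for p in photos if "location" in p]

print(tagged_dog, has_location)
```

Using `p.get("tags", [])` instead of `p["tags"]` is what lets the same code handle records with and without that field.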
To explain the topic further without any technical terms, let’s use the analogy of a job interview process.
For context, a structured interview is one in which the type of questions and the order in which they are asked are predetermined by the HR team. This remains the same for every candidate.
On the other hand, an unstructured interview is one in which the type of questions and the order in which they are asked are purely at the interviewer’s discretion and could be entirely different for each candidate.
In a semi-structured interview, the interviewer combines both approaches. Here you will find a few predetermined questions, but in a flexible order. The interviewer also has room for building rapport and asking follow-up questions.
Now let’s look at a few differences between structured, unstructured and semi-structured data.
*Schema influences how data is stored and accessed.
*A tuple is one of Python's four built-in collection types (alongside list, set and dictionary); it stores a collection of items that is ordered and unchangeable.
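The tuple note above can be illustrated with a short sketch. The row values are hypothetical and stand for one record of a table, with fields identified by position.

```python
# Hypothetical record: (customer_id, product, stars).
row = (1, "kettle", 5)

# Ordered: items are accessed by position.
first = row[0]

# Unchangeable: assigning to an item raises TypeError.
try:
    row[2] = 4
except TypeError as exc:
    error_message = str(exc)

# Unpacking gives each position a readable name.
customer_id, product, stars = row
print(first, product, stars, error_message)
```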
Businesses have access to massive amounts of structured, semi-structured and unstructured data from different sources in all formats. Separating the data according to its type is the first step toward getting the most out of it. Hope you enjoyed reading this article!
Originally published at https://www.numpyninja.com on April 27, 2021.