The Variety in Big Data

By Ellina
Published in CISS AL Big Data
Oct 25, 2022 · 6 min read

Big Data is commonly characterized by a set of V’s, including Vagueness, Variability, Volume, Velocity, Variety, Venue, Viscosity, Visualization, Volatility, Validity, Veracity, and Value (Arockia Panimalar.S., 2017). These characteristics serve as fundamental principles and help illustrate the emerging era of Big Data.

Figure 1: Major V’s in Big Data (https://dzone.com/articles/why-is-big-data-in-buzz)

Though each V is an important component of big data (Figure 1), big data would not be what it is without Variety. Big data analytics draws on every form of data that can be found: images, videos, text input, social media posts, IP addresses, and much more. This data is a mix of structured and unstructured data. What are these types of data, and where do we see them in daily life?

Structured data is data that has been quantified and organized into a usable, analyzable form. It refers to data that can be arranged and stored in databases, spreadsheets, and datasets (Marr, 2019). For example, a company might keep a table of its employees’ home addresses, phone numbers, salaries, and ages. Or, in a big data analysis project, a structured data set might have a column of numbers rating the safety of a city on a scale of one to ten. You may not have realized it, but structured data helps us every day. Say you want to find a good restaurant. How do you choose one? You check its ratings. And where does the average star rating come from? The rating each person gives the restaurant is structured data: it goes into the system as a number, usually from 1 to 5. From this structured data set, analytics software can easily compute the average star rating, helping us settle on our dinner choice (Owen, 2020). In Figure 2, the app uses structured data to tell customers about the restaurant’s food, service, and environment. Although structured data is much easier to manage and interpret, it accounts for only about 5% of the world’s information (Huh, 2021).

Figure 2: Star Reviews of a Restaurant — From DaZhongDianPing App
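
The star-rating example above can be sketched in a few lines of Python. This is only an illustration, not how any particular review app works: the restaurant names, the ratings, and the `average_stars` helper are all made up for the example.

```python
from statistics import mean

# Hypothetical ratings table: each row is one diner's 1-to-5 star rating.
ratings = [
    {"restaurant": "Noodle House", "stars": 5},
    {"restaurant": "Noodle House", "stars": 4},
    {"restaurant": "Noodle House", "stars": 4},
    {"restaurant": "Dumpling Bar", "stars": 3},
    {"restaurant": "Dumpling Bar", "stars": 2},
]


def average_stars(rows, restaurant):
    """Return the mean star rating for one restaurant, rounded to one decimal.

    Returns None when the restaurant has no ratings yet.
    """
    stars = [row["stars"] for row in rows if row["restaurant"] == restaurant]
    if not stars:
        return None
    return round(mean(stars), 1)


print(average_stars(ratings, "Noodle House"))
print(average_stars(ratings, "Dumpling Bar"))
```

Because every rating is already a number in a known column, the average takes one line to compute. That is the practical advantage of structured data.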

Unstructured data, on the other hand, can provide far more insight than structured data. It is data that is not organized in a specific format. The most prevalent example is the content on social media platforms. Posts with media and posts with only words both count as unstructured data because they cannot be tabulated directly. In other words, this is close to “raw data”: you need to turn it into a somewhat structured format before you can perform any analysis on it. Another common example is sales orders used for sales analytics. For instance, companies can look at a customer’s purchasing behavior: when certain products are ordered, how many are ordered, and what other products are ordered at the same time. These factors are treated as unstructured data, and they help companies find buying patterns and potentially forecast the demand for a product (Ham, 2020). The same data can create a better user experience, as when Amazon Books gives you personalized recommendations based on the books you’ve bought before (Figure 3). Unstructured data, though harder to quantify, provides much more information for identifying trends and patterns.

Figure 3: Personalized recommendations on Amazon based on unstructured data (https://medium.com/marketing-in-the-age-of-digital/amazons-marketing-personalization-fcbed690ffd8)
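
To make the idea of turning raw posts into a somewhat structured format more concrete, here is a minimal Python sketch. The sample posts, the `structure_post` helper, and the choice of features (word count and hashtags) are assumptions made for illustration; real pipelines use far richer text processing.

```python
import re
from collections import Counter

# Hypothetical social media posts: free text with no fixed format.
posts = [
    "Loved the new #ramen place downtown! #foodie",
    "Long queue again at the #ramen bar...",
    "Movie night with friends",
]


def structure_post(text):
    """Turn one free-text post into a flat record that could sit in a table."""
    lowered = text.lower()
    words = re.findall(r"[a-z']+", lowered)    # rough word split
    hashtags = re.findall(r"#(\w+)", lowered)  # tags without the '#'
    return {"word_count": len(words), "hashtags": hashtags}


# One structured row per post.
rows = [structure_post(post) for post in posts]

# Once the posts are rows, ordinary counting works, e.g. the most common tags.
hashtag_counts = Counter(tag for row in rows for tag in row["hashtags"])
print(hashtag_counts.most_common(1))
```

The posts themselves stay messy. Only the extracted fields are tabular, which is why this step is usually needed before any analysis.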

The type of data in between structured and unstructured is called semi-structured. It is a little more abstract to define: it has some of the organization of structured data but does not maintain a tabular structure. Businesses take advantage of semi-structured data along with unstructured data. One comprehensive example is your email inbox. Emails contain both unstructured and structured data. Just like social media posts, the text written in the message is unstructured data. But the information attached to each email, such as the email address, contact name, date, and time sent, is considered structured data. The email site itself is also organized in a structured manner, with folders such as inbox, drafts, junk, and trash (Figure 4). Businesses can use this data, again, to notice trends in customer experience. If there is a customer issue, the customer service department can review the emails sent to it. Using the content and the dates of the emails, which are the unstructured and structured data respectively, it can determine whether the issue was a one-time problem or a persistent one.

Figure 4: Semi-Structured Data on Email Sites (https://company.inbox.lv/2019/02/inbox-lv_email-personalisation-convenience/)
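
The split between the structured and unstructured parts of an email can be seen with Python's standard `email` module. The message below is invented for the example (sender, addresses, and text are all hypothetical), and this is only a sketch of the idea, not a description of how any email provider stores its data.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# Hypothetical customer email: headers first, a blank line, then free text.
raw_email = """\
From: Jamie Lee <jamie@example.com>
To: support@example.com
Subject: Order arrived damaged
Date: Tue, 25 Oct 2022 09:30:00 +0000

Hi, the box was crushed when it arrived. This is the second time.
"""

msg = message_from_string(raw_email)

# Structured part: named header fields with predictable formats.
name, address = parseaddr(msg["From"])
structured = {
    "name": name,
    "address": address,
    "subject": msg["Subject"],
    "sent_at": parsedate_to_datetime(msg["Date"]),
}

# Unstructured part: the body is plain free text.
body = msg.get_payload()

print(structured["address"], structured["sent_at"].date())
print(body.strip())
```

A support team could sort or filter on the structured fields, such as the date, and then read or analyze the bodies to see whether the same complaint keeps coming back.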

Altogether, these different data types form the Variety component of Big Data. Because multiple sources are used for analysis, there is a wide variety of information to process. Close consideration of this array can provide insights that could never have been found if only one data type were used. Take Google Translate, for example. Google uses all the data it can find, such as text messages, social media captions, research papers, and Reddit posts, to power its translation software. Even though the range of data analyzed is huge, all of it helps Google choose the right words for different contexts. It could not have achieved this with data from research papers alone, since such data is limited to academic language. With access to Reddit posts as well, Google gains insight into the language people use on informal occasions (Mayer-Schonberger & Cukier, 2014). To better understand the significance of variety, it helps to know the three main shifts from statistics to big data: “more”, “messy”, and “good enough”. Variety constitutes the “messy” component, meaning that the data used in big data is neither neatly organized nor uniformly high in quality. Reddit may not be the best source of grammatically correct sentences, which is where the messiness comes in, but with it Google can translate texts for both academic and conversational purposes. Thus, the element of variety is the best representation of the concept of big data.

Sources:

Arockia Panimalar.S. (2017, September). The 17 V’s Of Big Data. International Research Journal of Engineering and Technology.

Marr, B. (2019, October 18). What’s The Difference Between Structured, Semi-Structured And Unstructured Data? Forbes. https://www.forbes.com/sites/bernardmarr/2019/10/18/whats-the-difference-between-structured-semi-structured-and-unstructured-data/?sh=7f75fb062b4d

Owen, J. (2020, November 3). The Many Benefits to Using Structured Data. SEO Design Chicago. https://seodesignchicago.com/seo-blog/the-many-benefits-to-using-structured-data/

Mayer-Schonberger, V., & Cukier, K. (2014). Big Data. John Murray.

Huh, D. (2021, August 15). Semi-Structured Data: What It Is and Why It Matters. Actian. https://www.actian.com/blog/cloud-data-warehouse/the-importance-of-semi-structured-data/

Ham, Y. (2020, July 19). Unstructured Data vs Structured Data Explained with Real-life Examples. Medium. https://medium.com/@yuneeh/unstructured-data-vs-structured-data-explained-with-real-life-examples-a62dbadbb49d

Indeed Editorial Team. (2021, October). Semi-structured Data: Definition, Examples and Benefits. Indeed. https://www.indeed.com/career-advice/career-development/semi-structured-data
