The Vs of Big Data

Ryan Ge
Published in CISS AL Big Data
Oct 25, 2022

Big data refers to data sets so large that traditional data-processing software cannot handle them. The growing popularity and pervasiveness of online platforms, websites, and social media have made data more accessible, and this accessibility has allowed data to grow exponentially in size and number. The world’s technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s. According to the 2018 Forbes article “How Much Data Do We Create Every Day?”, 2.5 quintillion bytes of data were already being created every day at that time. That is 25 followed by 17 zeros! The International Data Corporation (IDC) predicts that by 2025, the global data volume will be 163 zettabytes, which is 163 followed by 21 zeros. Big data therefore requires a set of methods and technologies that can reveal insights in datasets that are diverse, complex, and immense in scale.
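To make these magnitudes concrete, here is a minimal Python sketch of the arithmetic. It assumes decimal (SI) units, and the names `daily_bytes`, `predicted_bytes`, and `growth_factor` are made up for illustration.

```python
# Decimal (SI) units, as used in the figures quoted above.
QUINTILLION = 10**18   # 1 followed by 18 zeros
ZETTABYTE = 10**21     # 1 followed by 21 zeros (bytes)

# 2.5 quintillion bytes per day, written as an exact integer.
daily_bytes = 25 * QUINTILLION // 10
# 163 zettabytes, the 2025 prediction quoted above.
predicted_bytes = 163 * ZETTABYTE


def growth_factor(months: float, doubling_months: float = 40.0) -> float:
    """Growth multiple after `months` if capacity doubles every `doubling_months`."""
    return 2.0 ** (months / doubling_months)


print(f"Daily volume:    {daily_bytes:,} bytes")
print(f"2025 prediction: {predicted_bytes:,} bytes")
print(f"Growth over 10 years: {growth_factor(120):.0f}x")
```

Ten years is 120 months, or three doubling periods of 40 months each, so storage capacity per person grows about eightfold in a decade at that rate.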

Figure 1: Example of big data (https://www.truedigitalpark.com/en/insights/articles/88/3-ways-to-manage-big-data)

Big data can be characterized by a set of Vs. These Vs are what qualify data and information as big data, and they also determine whether big data analysis succeeds. Different versions of the list contain different numbers of Vs, but the main ones are Volume, Variety, Velocity, Veracity, and Value. Volume refers to the large amount of data that needs to be analyzed; the term “big data” itself implies vast size. Variety refers to the different forms of data, or in other words, the nature of the data. Data and information are heterogeneous and exist in different forms, so it is important that we categorize them. Velocity refers to the high speed at which data accumulates. There is a massive and continuous flow of data, and velocity describes how fast data is generated and how fast it must be processed to meet demand. Veracity refers to the trustworthiness of data, that is, how many discrepancies and how much uncertainty it contains. Data and information can get very messy, so quality and precision are hard to control. Veracity problems also arise because information comes from separate data sources and exists in different dimensions and types. Value refers to the usefulness of the data. Data by itself is just numbers or words and has no importance, impact, or value. People need to convert it into information that is valuable to a company or government. Today we will be diving into one of the most important Vs, which is Variety.

Figure 2: Infographic on the 5 Vs (https://informationcatalyst.com/vision-experience/big-data-value/)

As briefly mentioned above, Variety refers to the different forms and types of data. Data can be images, videos, text, audio recordings, files, etc. All forms of data can be sorted into three categories: structured data, unstructured data, and semi-structured data.

Structured data refers to traditional data that is organized in a predefined format. Examples of structured data include names, addresses, credit card numbers, stock information, geographic location (latitude and longitude), bank notes, and a lot more. Because it is formatted, people can interpret and access the information easily and conveniently. A place where we meet structured data in everyday life is Google. According to an article published by skai.io named “How Many Google Searches Per Day Are There?”, Google handles 3.8 million searches per minute on average all over the world. That is a lot of information. Google Search uses structured data to support special search result features and enhancements. A recipe page with valid structured data is eligible to appear in a graphical search result, which shows the recipe’s rating, preparation time, views, calories, etc.
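As an illustration of the recipe example, the sketch below builds structured data for a recipe page in Python. It assumes the JSON-LD format with the schema.org vocabulary, which is one common way to express such markup; the article does not say which format a given page uses. The recipe and all of its values are made up.

```python
import json

# Hypothetical recipe; every value here is invented for illustration.
recipe = {
    "@context": "https://schema.org",
    "@type": "Recipe",
    "name": "Simple Pancakes",
    "totalTime": "PT20M",  # ISO 8601 duration: 20 minutes
    "aggregateRating": {
        "@type": "AggregateRating",
        "ratingValue": 4.6,
        "ratingCount": 128,
    },
    "nutrition": {
        "@type": "NutritionInformation",
        "calories": "250 calories",
    },
}


def to_json_ld(data: dict) -> str:
    """Serialize the dictionary as a JSON-LD string."""
    return json.dumps(data, indent=2)


markup = to_json_ld(recipe)
print(markup)
```

Every field has a fixed name and an expected type, which is what lets a search engine read the rating, time, and calories without having to interpret the page’s free text.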

Unstructured data refers to data that is unorganized. It cannot be ordered into rows and columns, and its items have no predefined relationship with each other. Examples of unstructured data are common in media and entertainment: images, videos, and audio files are all unstructured. One way to process unstructured data is to pull out information such as name, time, and location from it, and then use that information to categorize and sort it. Retailers and manufacturers use unstructured data to improve the customer experience and to target their marketing. Companies also use unstructured data to better understand their customers, including their attitudes toward products and their feelings about customer service.
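The idea of pulling out a name, time, and location can be sketched in a few lines of Python. This is a deliberately simple example that assumes the information appears in free text with a predictable phrasing; real images, videos, and audio would need far more capable tools. The notes, the pattern, and the function name are all hypothetical.

```python
import re
from datetime import datetime

# Hypothetical free-text descriptions attached to media files.
notes = [
    "Photo taken by Alice in Beijing on 2022-10-03.",
    "Voice memo recorded by Bob in Shanghai on 2021-05-17.",
    "Untitled clip with no details.",
]

# Assumed phrasing: "by <Name> in <Location> on <YYYY-MM-DD>".
PATTERN = re.compile(
    r"by (?P<name>[A-Z][a-z]+) in (?P<location>[A-Z][a-z]+) "
    r"on (?P<date>\d{4}-\d{2}-\d{2})"
)


def extract_fields(text):
    """Return name, location, and date found in the text, or None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    fields = match.groupdict()
    fields["date"] = datetime.strptime(fields["date"], "%Y-%m-%d").date()
    return fields


# Keep only the notes where extraction worked, then sort them by date.
records = [f for f in map(extract_fields, notes) if f is not None]
records.sort(key=lambda record: record["date"])

for record in records:
    print(record["date"], record["name"], record["location"])
```

The third note yields nothing, which shows the limit of this approach: information that is not expressed in the expected way stays locked inside the unstructured data.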

Figure 3: Apps that are examples of unstructured data (https://sensortower.com/blog/top-apps-worldwide-q3-2019)

Semi-structured data refers to data that has some structure but also contains information that does not fit a fixed format. Examples of semi-structured data include emails, zipped files, CSV files, etc. The example we all encounter every day and are most familiar with is email. According to an article published by tripwire.com named “Email and cybersecurity: Fraudsters are knocking”, approximately 333.2 billion emails are sent per day, which is about 3.9 million emails per second. Emails contain structured data such as the sender’s address and name, the date, the time, and the recipient. This allows us to use the search feature in Outlook or Gmail to search by keyword, sender, date, etc. However, an email also contains unstructured data in the form of its attachments, which can be images, videos, infographics, audio files, etc. The files can be in Word, Excel, PDF, JPEG, and many other formats.
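The two sides of an email can be seen with Python’s standard `email` package. The sketch below builds a made-up message, then reads back its structured header fields separately from its attachment. The addresses, subject, and file contents are placeholders.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Build a hypothetical email with one attachment.
msg = EmailMessage()
msg["From"] = "alice@example.com"
msg["To"] = "bob@example.com"
msg["Subject"] = "Quarterly report"
msg["Date"] = "Tue, 25 Oct 2022 09:30:00 +0000"
msg.set_content("Hi Bob, the report is attached.")
msg.add_attachment(
    b"%PDF-1.4 placeholder bytes",  # stand-in for a real file
    maintype="application",
    subtype="pdf",
    filename="report.pdf",
)

# Parse the raw bytes, as a mail client would on receipt.
parsed = message_from_bytes(msg.as_bytes(), policy=policy.default)

# Structured part: named header fields that can be searched directly.
headers = {name: str(parsed[name]) for name in ("From", "To", "Subject", "Date")}

# Unstructured part: attachments, known only by file name and type.
attachments = [
    (part.get_filename(), part.get_content_type())
    for part in parsed.iter_attachments()
]

print(headers)
print(attachments)
```

The headers come back as a simple dictionary, while the attachment is an opaque block of bytes that needs its own tool to interpret.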

Figure 4: Email apps that use semi-structured data (https://ru02.net/bhtp/gmail-outlook-inbox-mailer/)

Now that we have learned what Variety is, let’s explore why Variety is the most important V. The world we live in today is very different from the world of 20 years ago. Today, information and data are more diverse and exist in many different forms, and different forms of data need different methods and tools to be analyzed. For example, categorizing and sorting pictures and videos is very different from sorting Word documents and letters. You might reach the same results from different forms of data, but the analysis you do for each will be different. Therefore, the first step of the process is sorting the data into the correct category (structured, unstructured, or semi-structured). Applying an analysis designed for one category to data from another can produce false, misleading outcomes, which diminishes the value (another V) of the data. So, in order to maximize the value of the data and information that is collected, we need to first sort the data into the correct categories and then use the right analysis to draw conclusions.
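The “sort first, then analyze” idea can be sketched as a small Python lookup. It assumes, purely for illustration, that a file’s extension is enough to tell its category. The mapping follows the examples given in this article, the `.sqlite` entry is an added assumption, and all names are hypothetical.

```python
from pathlib import Path

# Hypothetical mapping based on the examples in this article.
CATEGORY_BY_EXTENSION = {
    ".sqlite": "structured",       # assumption: a relational database file
    ".eml": "semi-structured",     # email
    ".zip": "semi-structured",
    ".csv": "semi-structured",
    ".jpg": "unstructured",
    ".mp4": "unstructured",
    ".mp3": "unstructured",
}

# One broad analysis approach per category (descriptions only).
ANALYSIS_BY_CATEGORY = {
    "structured": "query the rows and columns directly",
    "semi-structured": "parse the structured fields, then handle the rest",
    "unstructured": "extract descriptive information first, then sort",
}


def categorize(filename: str) -> str:
    """Step 1: sort the file into a category by its extension."""
    return CATEGORY_BY_EXTENSION.get(Path(filename).suffix.lower(), "unknown")


def choose_analysis(filename: str) -> str:
    """Step 2: pick the analysis that matches the category."""
    category = categorize(filename)
    if category == "unknown":
        raise ValueError(f"Cannot categorize {filename!r}; refusing to guess.")
    return ANALYSIS_BY_CATEGORY[category]


for name in ("holiday.JPG", "inbox.eml", "customers.sqlite"):
    print(name, "->", categorize(name), "->", choose_analysis(name))
```

Raising an error for unknown files, instead of falling back to a default, reflects the point above: the wrong analysis for the wrong category is worse than none.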

In conclusion, data exists in numerous sizes and forms. To be able to interpret and use it, we sort data into three categories: structured, unstructured, and semi-structured. Only when data is sorted into the right category can it be used to generate value. Therefore, we need to fully understand the three categories of data because they are the basis of big data.

References:

Gutta, S. (2021, August 4). The 5 V’s of Big Data. Medium. https://medium.com/analytics-vidhya/the-5-vs-of-big-data-2758bfcc51d

Marr, B. (2018, May 21). How Much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read. Forbes. https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read/?sh=328a41b160ba

Reinsel, D., Gantz, J., & Rydning, J. (2018). The Digitization of the World From Edge to Core. https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf

Scorey, L. (2022, August 21). Email and cybersecurity: Fraudsters are knocking. Tripwire. https://www.tripwire.com/state-of-security/email-cybersecurity-fraudsters-are-knocking#:~:text=Emails%20are%20a%20necessary%20evil

Skai. (2019, February 25). How Many Google Searches Per Day? SEM Pros Should Know This! Skai. https://skai.io/monday-morning-metrics-daily-searches-on-google-and-other-google-facts/#:~:text=Although%20Google%20does%20not%20share

Wolff, R. (2020, November 16). What Is Semi-Structured Data? MonkeyLearn Blog. https://monkeylearn.com/blog/semi-structured-data/#:~:text=Email
