Big Data & Cloud Computing

Grace Kolawole · Published in Analytics Vidhya · 8 min read · Aug 16, 2021

The term big data arose amid the explosive increase of global data, describing technology able to store and process large and varied volumes of data.

Modern advancements are increasingly digitizing our lives, which has led to the rapid growth of data. The inability of traditional data architectures to handle these new data sets efficiently is what gave rise to the concept of big data. The 4V's of big data (volume, velocity, variety, and veracity) make data management challenging for traditional data warehouses and create new demands to store, transport, process, mine, and serve the data.

The volume of information captured by organizations from mobile devices and multimedia is increasing every moment and has roughly doubled every year. This sheer volume of data can be categorized as structured or unstructured, and much of it cannot be easily loaded into regular relational databases. Big data therefore requires pre-processing to convert raw data into a clean data set and make it feasible for analysis. Advances in data science, data storage, and cloud computing have made the storage and mining of big data possible.
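
To make the pre-processing step concrete, here is a minimal cleaning sketch in Python with pandas. The file name, column names, and cleaning rules are hypothetical stand-ins for whatever a real raw extract contains.

```python
import pandas as pd

# Load a raw extract (file and column names are hypothetical examples).
raw = pd.read_csv("raw_events.csv")

# Drop exact duplicates and rows missing critical fields.
clean = raw.drop_duplicates()
clean = clean.dropna(subset=["user_id", "timestamp"])

# Normalize types so downstream analysis doesn't choke on stray strings.
clean["timestamp"] = pd.to_datetime(clean["timestamp"], errors="coerce")
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")

# Discard rows whose timestamp could not be parsed, then persist.
clean = clean.dropna(subset=["timestamp"])
clean.to_csv("clean_events.csv", index=False)
```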

Cloud computing has brought increased parallel processing, scalability, accessibility, data security, resource virtualization, and integration with data storage.

Cloud computing has eliminated the infrastructure cost of investing in hardware, facilities, utilities, or building large data centers. Cloud infrastructure scales on demand to support fluctuating workloads, keeping pace with the data produced and consumed by big data applications. Cloud virtualization can create a virtual platform for server operating systems and storage devices, spawning multiple machines at the same time. This provides a way to share resources while isolating hardware, improving the access, management, analysis, and computation of data.

Big Data

Gartner defines big data as high-volume, high-velocity, and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.

Whoa, that’s a mouthful.

As it happens, big data is a hot topic, and the so-called "big" is a relative concept. Big data is also known as massive data or vast amounts of information because it involves data at such a scale that current mainstream data processing tools are incapable of acquiring, managing, and processing it in a reasonable time to assist enterprises in business decision-making. Big data emerged as a proper noun mainly because, with the rapid development of the internet, the Internet of Things, and cloud computing in recent years, data is produced all the time by ubiquitous mobile devices and wireless sensors; meanwhile, hundreds of millions of internet users enjoy internet services and produce a huge amount of interactive data at all times.

This means a huge amount of data needs to be processed, and it grows at a speed beyond imagination. For enterprises, competitive pressure and business needs impose a new requirement for effective, real-time data processing that previous data processing methods cannot meet, so big data technology was born at the right moment.

The 10V’s of Big Data

Building on Gartner's definition, the concept of big data and what it encompasses can be better understood through ten Vs:

Volume

Refers to the incredible amount of data generated every second from sources such as social media, cell phones, cars, credit cards, M2M sensors, photographs, and videos, which allows users to mine the hidden information and patterns found in them.

Velocity

Refers to the speed at which data is generated, transferred, collected, and analyzed. Data generated at an ever-accelerating pace must be analyzed quickly, and the speed of transmission and access must remain near-instantaneous to allow real-time access for the applications that depend on this data.
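
To make "real-time" concrete, here is a minimal streaming-consumer sketch using the kafka-python client. The broker address and the events topic are assumptions made for this example, not part of the article's setup.

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Subscribe to a hypothetical "events" topic on a local broker.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

# Each message is handled the moment it arrives, rather than waiting
# for a nightly batch job.
for message in consumer:
    event = message.value
    print(event)  # replace with real-time scoring, alerting, etc.
```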

Variety

Refers to data generated in different formats, either structured or unstructured. Structured data, such as names, phone numbers, addresses, and financials, can be organized within the columns of a database; this type of data is relatively easy to enter, store, query, and analyze. Unstructured data, which makes up about 80% of today's data, is more difficult to sort and to extract value from. Unstructured data includes text messages, audio, blogs, photos, video sequences, social media updates, log files, and machine and sensor data.
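
A small sketch makes the contrast tangible: structured records drop straight into rows and columns, while unstructured text must be parsed before any value can be extracted. Both the records and the log-line format below are hypothetical.

```python
import re
import pandas as pd

# Structured data slots straight into rows and columns.
customers = pd.DataFrame(
    {"name": ["Ada", "Tunde"], "phone": ["0803-111-2222", "0805-333-4444"]}
)
print(customers)

# Unstructured data (a free-text log line in a made-up format) has to
# be parsed first.
log_line = "2021-08-16 10:24:05 ERROR payment failed for user=ada42"
match = re.search(
    r"(?P<level>ERROR|WARN|INFO) (?P<msg>.+?) for user=(?P<user>\w+)", log_line
)
if match:
    print(match.groupdict())  # {'level': 'ERROR', 'msg': ..., 'user': ...}
```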

Veracity

Refers to the quality and reliability of the data source. Its importance lies in the context and the meaning it adds to the analysis. Knowing the data's veracity, in turn, helps in understanding the risks associated with analysis and business decisions based on that data set.

Value

Refers to the hidden value discovered from the data for decision making. Substantial value can be found in big data, including understanding your customers better, targeting them accordingly, optimizing processes, and improving machine or business performance.

Variability

Refers to the high inconsistency in data flow and its variation during peak periods. The variability is due to a multitude of data dimensions resulting from multiple disparate data types and sources. Variability can also refer to the inconsistent speed at which big data is ingested into the data stores.

Validity

Refers to the accuracy of the data being collected for its intended use. Proper data governance practices need to be adopted to ensure consistent data quality, common definitions, and metadata.
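
As one possible shape for such governance checks, here is a minimal validation sketch with pandas; the file, column names, and rules are illustrative assumptions.

```python
import pandas as pd

# Hypothetical extract; the column names are illustrative assumptions.
df = pd.read_csv("clean_events.csv")

# Validity: is the data accurate enough for its intended use?
issues = []
if df["user_id"].isna().any():
    issues.append("missing user_id values")
if (df["amount"] < 0).any():
    issues.append("negative transaction amounts")
if df.duplicated(subset=["event_id"]).any():
    issues.append("duplicate event_id values")

if issues:
    raise ValueError("validity checks failed: " + ", ".join(issues))
print("all validity checks passed")
```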

Vulnerability

Refers to the security aspects of the data being collected and stored.
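
One common mitigation, sketched here purely as an illustration, is encrypting sensitive fields at rest using the cryptography package's Fernet recipe; a real deployment would keep the key in a secrets manager, never in code.

```python
from cryptography.fernet import Fernet  # pip install cryptography

# Illustrative only: in production the key lives in a secrets manager.
key = Fernet.generate_key()
cipher = Fernet(key)

# Encrypt a sensitive field before it is written to storage...
token = cipher.encrypt(b"card_number=4111111111111111")

# ...and decrypt it only when an authorized process needs it back.
print(cipher.decrypt(token))
```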

Volatility

Refers to how long data is valid and the duration for which it needs to be stored historically before it is considered irrelevant to the current analysis.
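
Enforcing such a retention window can be as simple as filtering on a timestamp; the 90-day window, file, and column name below are hypothetical.

```python
import pandas as pd

# Hypothetical extract with a parsed timestamp column.
df = pd.read_csv("clean_events.csv", parse_dates=["timestamp"])

# Keep only rows inside the (assumed) 90-day retention window; older
# records are treated as irrelevant to the current analysis.
cutoff = pd.Timestamp.now() - pd.Timedelta(days=90)
current = df[df["timestamp"] >= cutoff]
print(f"kept {len(current)} of {len(df)} rows")
```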

Visualization

Refers to making data understandable to non-technical stakeholders and decision-makers. Visualization is the creation of complex graphs that transform the data into information, information into insight, insight into knowledge, and knowledge into an advantage for decision-making.
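
A minimal matplotlib sketch of that idea, using made-up monthly figures:

```python
import matplotlib.pyplot as plt

# Made-up monthly figures; in practice these come out of the analysis.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
revenue = [120, 135, 128, 160, 172, 190]

fig, ax = plt.subplots()
ax.plot(months, revenue, marker="o")
ax.set_title("Monthly revenue (illustrative data)")
ax.set_xlabel("Month")
ax.set_ylabel("Revenue ($k)")
plt.show()
```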

Cloud Computing Technology

Rather than keeping files on a proprietary hard drive or local storage device, cloud-based storage makes it possible to save them to a remote database. As long as an electronic device has access to the web, it has access to the data and the software programs to run it.

Cloud computing is named as such because the information being accessed is found remotely in the cloud or a virtual space. Companies that provide cloud services enable users to store files and applications on remote servers and then access all the data via the Internet. This means the user is not required to be in a specific place to gain access to it, allowing the user to work remotely.

Cloud computing takes all the heavy lifting involved in crunching and processing data away from the device you carry around or sit and work at. It also moves all of that work to huge computer clusters far away in cyberspace. The Internet becomes the cloud, and voilà — your data, work, and applications are available from any device with which you can connect to the Internet, anywhere in the world.

The array of available cloud computing services is vast, but most fall into one of the following categories:

Software-as-a-service [SaaS]

Software as a service represents the largest cloud market and the most commonly used business option in cloud services. SaaS delivers applications to users over the internet. Applications delivered through SaaS are maintained by third-party vendors, and their interfaces are accessed by clients through a browser. Since most SaaS applications run directly in the browser, there is no need for the client to download or install any software. In SaaS, vendors manage the applications, runtime, data, middleware, OS, virtualization, servers, storage, and networking, which makes it easy for enterprises to streamline their maintenance and support.

Platform-as-a-service [PaaS]

The Platform as a Service model provides hardware and software tools over the internet that developers use to build customized applications. PaaS makes the development, testing, and deployment of applications quick, simple, and cost-effective. This model allows businesses to design and create applications that are integrated into PaaS software components, while enterprise operations or third-party providers manage the OS, virtualization, servers, storage, networking, and the PaaS software itself. Because they inherit cloud characteristics, these applications are scalable and highly available.
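
To picture what actually lands on a PaaS, here is a minimal web app of the kind a developer would push to such a platform. Flask is used only as a familiar example, and the health endpoint is a hypothetical convention.

```python
from flask import Flask, jsonify  # pip install flask

app = Flask(__name__)

@app.route("/health")
def health():
    # PaaS platforms typically probe an endpoint like this to decide
    # whether an instance is healthy enough to receive traffic.
    return jsonify(status="ok")

if __name__ == "__main__":
    # Run locally for development; on a PaaS, the platform supplies the
    # runtime, scaling, and routing around this same code.
    app.run(port=8080)
```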

Infrastructure-as-a-service [IaaS]

The Infrastructure as a Service cloud computing model provides organizations with a self-service platform for accessing, monitoring, and managing remote data center infrastructure, such as compute, storage, and networking services, through virtualization technology. IaaS users are responsible for managing applications, data, runtime, middleware, and the OS, while providers manage virtualization, servers, hard drives, storage, and networking. IaaS provides the same capabilities as a data center without the need to maintain it physically.
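
As a sketch of that self-service model, here is how a virtual server might be provisioned with a single API call, using AWS's boto3 purely as an example provider. The AMI ID is a placeholder, and the call assumes credentials are already configured.

```python
import boto3  # pip install boto3; assumes AWS credentials are configured

ec2 = boto3.client("ec2", region_name="us-east-1")

# Provision a virtual server through an API call instead of racking
# hardware. The image ID below is a placeholder, not a real AMI.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # hypothetical machine image
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```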

Big Data in the Cloud: Why It Makes Perfect Sense

The benefits of moving to the cloud are well documented. But these benefits take on a bigger role when we talk of big data analytics.

To get a better picture of how big big data is, let’s review some statistics:

  • Over 1 billion Google searches are made and 300.4 billion emails are sent every day
  • Every minute, 65,972 Instagram photos are posted, 448,800 tweets are composed, and 500 hours worth of YouTube videos are uploaded.
  • By 2025, the number of smartphone users could reach 7.49 billion. And taking the Internet of Things (IoT) into account, there could be more than 26 billion connected devices by then.

For sure, big data is really big.

Big data involves manipulating petabytes (and perhaps soon, exabytes and zettabytes) of data, and the cloud’s scalable environment makes it possible to deploy data-intensive applications that power business analytics. The cloud also simplifies connectivity and collaboration within an organization, which gives more employees access to relevant analytics and streamlines data sharing.
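
As a sketch of such a data-intensive job, here is a minimal PySpark aggregation. The S3 paths and column names are illustrative assumptions; the point is that the same script runs unchanged whether the cluster behind it has two nodes or two hundred.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("events-rollup").getOrCreate()

# Hypothetical input path; on a cloud cluster the read, the aggregation,
# and the write all fan out across the worker nodes automatically.
events = spark.read.parquet("s3://example-bucket/events/")

daily = events.groupBy(F.to_date("timestamp").alias("day")).agg(
    F.count("*").alias("events"),
    F.sum("amount").alias("revenue"),
)
daily.write.parquet("s3://example-bucket/daily-rollup/")
```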

While it’s easy for IT leaders to recognize the advantages of putting big data in the cloud, it may not be as simple to get C-suite executives and other primary stakeholders on board. But there’s a business case to be made for the big data + cloud pairing because it gives executives a better view of the business and boosts data-driven decision-making.

Whatever perspective you may have, big data complemented by an agile cloud platform can effect significant change in the way organizational objectives are achieved.

A 2020 research survey by Forrester revealed that big data solutions delivered via cloud subscriptions will grow about 7.5 times faster than on-premises options. Many enterprises are already making the move!

Connect with me on LinkedIn: https://www.linkedin.com/in/grace-kolawole/ and Twitter: https://twitter.com/Graceblarc_
