How Big Data is Reshaping Computer Science
You may be unsure about the term Big Data. So, what is big data? Big data refers to data that is very large in size. Normally, we measure data in Megabytes (MB) and Gigabytes (GB), but data on the scale of Petabytes (10^15 bytes) is called big data.
The Four V’s of Big Data
Big data poses new challenges when designing algorithms or software systems that can deal with it. Big data is exciting and can change health care policy decisions and the way we do business. But to harness these benefits, we need to address several challenges first. First of all, I'll walk you through the four V's of big data, which capture the challenges you will face when working with big data. The four V's refer to volume, velocity, veracity and variety.
Volume in big data refers to a large amount of data that you have to deal with. Nowadays, data is produced in a very large quantity. Let’s take surveillance cameras installed in a major city as an example. The number of these cameras might be in the thousands and each of them is providing a constant video stream, resulting in massive amounts of data even within one day.
Velocity refers to the speed at which the data arrives. Again, if we consider surveillance cameras, they provide data at a constant speed and often at high resolution. This results in lots of data arriving at high speed. The internet also provides a vast amount of data at very high speed. A company's firewall system has to monitor the high-speed data that tries to enter its network. In the context of cyber security, it's crucial to cope with this high-velocity data and to make sure that it's not a cyber-attack. Due to the high velocity of the data, it might not be feasible to store or check all of it. To cope with this issue, we look at sampling techniques that store a representative fraction of the data.
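One classic sampling technique for high-velocity streams is reservoir sampling, which maintains a uniform random sample of fixed size without ever storing the whole stream. Here is a minimal sketch (the function name and the "packet" stream are made up for illustration):

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)       # fill the reservoir with the first k items
        else:
            j = random.randint(0, i)  # item i survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

# Sample 10 "packets" out of a million without keeping them all.
sample = reservoir_sample(range(1_000_000), 10)
```

The key property is that every item seen so far has the same chance of being in the sample, no matter how long the stream turns out to be.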
Veracity refers to the uncertainty of data. Often, data is incomplete and can be noisy, so you cannot completely rely on all of the data that arrives, because there may be abnormalities within it. Take location services on phones: even if every user provides their location, that location is usually not precise. The data may also be incomplete, as GPS coordinates cannot be obtained at some locations.
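To make the veracity problem concrete, here is a toy sketch of one way to handle such location data: skip missing readings and smooth jittery ones with a sliding median, which suppresses a single wild outlier. All names and data are invented for illustration.

```python
from statistics import median

def smooth_track(readings, window=3):
    """Smooth a list of (lat, lon) GPS readings; None marks a missing fix.

    Missing readings are dropped, and each remaining point is replaced by
    the median of its neighbours, which damps single-point outliers."""
    valid = [r for r in readings if r is not None]
    smoothed = []
    for i in range(len(valid)):
        lo = max(0, i - window // 2)
        hi = min(len(valid), i + window // 2 + 1)
        lats = [p[0] for p in valid[lo:hi]]
        lons = [p[1] for p in valid[lo:hi]]
        smoothed.append((median(lats), median(lons)))
    return smoothed

# One missing fix (None) and one wild outlier at (5.0, 5.0):
track = [(0.0, 0.0), (0.001, 0.001), None,
         (5.0, 5.0), (0.002, 0.002), (0.003, 0.003)]
clean = smooth_track(track)  # the outlier is pulled back toward its neighbours
```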
Lastly, the variety of big data refers to the different sources of data. Data can come in various forms such as images, videos, audios, and sensor data and so on. For a particular application, you might have to integrate data from various sources.
Together, the four V's of big data are key to understanding the challenges in big data. Here are some more examples of different sources of big data, and how you can analyze them with respect to the four V's.
Social Media
There are millions of people using Facebook and Twitter. All the data is produced in an online fashion, arriving in the form of a data stream. Users post a variety of data on Facebook, such as text, images and videos; similarly, Twitter carries short text messages. The data is high volume and arrives with high velocity at the Facebook/Twitter servers. Users may be tagged with their location using GPS coordinates. These coordinates are usually imprecise, which introduces veracity issues into the data.
Fraud Detection in Banking Transactions
Banks process millions of transactions per day, and these transactions have to be handled safely and reliably. A bank's transactions over a month amount to a vast volume of data. Fraud detection refers to finding bogus transactions that have been triggered by criminals, for example using a stolen credit card or even just its details. You see that for fraud detection you have to deal with large volumes of data, each transaction arriving rapidly, and a decision having to be made as soon as a transaction arrives. There are some indicators that can be used to identify fraud, for example a credit card used at an ATM in one country when all other transactions in the previous 2 days have been in another country. Detecting fraud is hard, and the information used to stop a transaction is usually not 100% reliable. You might even have observed this yourself when you tried to use your credit card in a different country and the card was rejected although you were its legitimate user.
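The country-based indicator above can be written as a simple rule. This is only a toy sketch with made-up field names; real fraud detection systems combine many such signals statistically rather than relying on one hard rule.

```python
def flag_suspicious(txn, recent, window_days=2):
    """Flag a transaction whose country differs from every recent transaction.

    txn: dict with 'country' and 'timestamp' (in days); recent: past transactions.
    """
    cutoff = txn["timestamp"] - window_days
    recent_countries = {t["country"] for t in recent if t["timestamp"] >= cutoff}
    # Suspicious only if we have recent history AND the country is new.
    return bool(recent_countries) and txn["country"] not in recent_countries

# All recent activity was in the UK, then a withdrawal appears in Brazil:
history = [{"country": "UK", "timestamp": 9.0},
           {"country": "UK", "timestamp": 9.5}]
flag_suspicious({"country": "BR", "timestamp": 10.0}, history)  # flagged
```

Note that this rule also explains the false positives mentioned above: a legitimate cardholder travelling abroad triggers exactly the same signal.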
Skype
There are millions of people using Skype. Skype offers various types of communication: text, voice calls and video calls. It's also possible to attach various types of data to text messages (PDF files, images, videos, etc.). At any given moment, millions of users from various locations around the world can be using Skype. In doing so, they produce a high volume of data that arrives at the Skype servers rapidly. This data has high variety in terms of the different forms used for communication.
Online Shopping
Online stores such as Amazon have millions of potential customers that buy a large variety of items from their online servers. These customers produce a very large number of transactions within a short time period. Mining these transactions to extract useful information (for example, to optimize advertising) has to deal with the large number of users and the variety of items that they have bought. Making a recommendation to one particular user takes into account what that user has bought so far. The knowledge gathered about a customer is incomplete, and a recommendation system has to rely on the imprecise information that it can obtain from the transaction data of customers and their behavior on the online store page.
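One very simple way to turn purchase histories into recommendations is item co-occurrence counting: recommend the items most frequently bought alongside what the user already owns. This is a toy sketch for illustration only, not how any real store's recommender actually works:

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence(baskets):
    """Count, for every ordered pair of items, how often they share a basket."""
    co = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            co[(a, b)] += 1
            co[(b, a)] += 1
    return co

def recommend(history, co, n=3):
    """Rank unseen items by how often they co-occur with the user's items."""
    scores = Counter()
    for item in history:
        for (a, b), count in co.items():
            if a == item and b not in history:
                scores[b] += count
    return [item for item, _ in scores.most_common(n)]

baskets = [["book", "pen"], ["book", "lamp"], ["pen", "lamp"], ["book", "pen"]]
co = build_cooccurrence(baskets)
recommend(["book"], co)  # "pen" ranks above "lamp": bought with books more often
```

Even this toy version illustrates the veracity point from the text: the scores are only as good as the incomplete purchase data they are counted from.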
Let’s take a break till my next story. Till then, happy coding :-)