3V’s of Big Data
Volume, Variety & Velocity
In the big data world, data is characterized by a few properties. These are known as the 3V’s of big data: Volume, Variety and Velocity. Let’s take a look at each in detail.
The price of storage per megabyte or gigabyte has decreased drastically over the last decade. Storing an amount of data that cost, say, $1 ten years ago might cost $0.01 now. The reliability of storage has also improved, helped by network storage architectures. This has resulted in a dramatic rise in the capture and storage of data. The amount of data stored is referred to as the ‘Volume’ of data, and it has increased.
Previously, when data storage was not so cheap, only the critical components of the data were stored, e.g. sales data, transactional data, etc.
Once storage space became abundantly available, data that was not so critical to the business, and that used to be generated and then thrown away, was stored as well, e.g. server logs, user behavior, etc.
An example of a server log entry:
188.8.131.52 - - [07/Mar/2004:16:10:02 -0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291
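Log entries like the one above only become useful once they are broken into fields. The following is a minimal sketch in Python, assuming the entry follows the Common Log Format (host, identity, user, timestamp, request, status, size). The names `LOG_PATTERN` and `parse_log_line` are hypothetical and chosen for illustration only.

```python
import re
from datetime import datetime

# Assumption: the line follows the Common Log Format.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Split one log line into named fields; return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the timestamp, status code and response size to typed values.
    # Note: %b expects English month abbreviations (default C locale).
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


sample = ('188.8.131.52 - - [07/Mar/2004:16:10:02 -0800] '
          '"GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291')
record = parse_log_line(sample)
print(record["path"], record["status"], record["size"])
```

Each parsed record is small, but a busy server writes one such line per request, which is how data that was once discarded adds up to significant volume.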
With the IT revolution across all fields, the data storage demands of medicine, IoT, manufacturing, social media, etc. have also increased, giving rise to increased volume in data storage.
Data is available today in a very wide range of formats: text, image, audio, video, or a combination of any of these. ‘Variety’ refers to this diversity in the representation of data.
One might argue that we do not need to store data in its original format, since it can always be transformed before storing, getting rid of the variety. But in doing so we lose information. For example, say there is a 5 minute audio conversation. We can convert it to text and store that, but it will not be as good as the original conversation: the emotions of the speaker, the tone and other signals that carry meaning in vocal communication are lost. And since storage is getting cheaper and cheaper, it is beneficial to store varied data.
Velocity is the rate at which data is generated. Big internet companies generate data on the order of terabytes per day or petabytes per week. We need to account for those needs and have systems in place that can store data at that velocity. Some data is generated quickly and some slowly, so the data velocity differs depending on the application in question.
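To get a feel for what such figures mean for a storage system, the daily or weekly volume can be converted into an average sustained write rate. The sketch below is a rough back-of-the-envelope calculation in Python. It assumes decimal units (1 TB = 1,000,000 MB) and a perfectly even load, and the function name `sustained_rate_mb_per_s` is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate_mb_per_s(terabytes_per_day):
    """Average write rate in MB/s needed to absorb the given daily volume.

    Assumes decimal units (1 TB = 1,000,000 MB) and evenly spread traffic.
    """
    megabytes_per_day = terabytes_per_day * 1_000_000
    return megabytes_per_day / SECONDS_PER_DAY


# One terabyte per day.
rate_tb_day = sustained_rate_mb_per_s(1)
# One petabyte per week = 1000 TB spread over 7 days.
rate_pb_week = sustained_rate_mb_per_s(1000 / 7)

print(f"{rate_tb_day:.2f} MB/s")   # prints "11.57 MB/s"
print(f"{rate_pb_week:.2f} MB/s")  # prints "1653.44 MB/s"
```

Real traffic is rarely even, so peak rates can be well above these averages, and a system would need headroom beyond them.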
The 3 V’s govern data, and it is good practice to keep them in mind while devising a solution.