What is Big Data?
Big data can be defined as a large volume of data, data streams that arrive at high velocity, and/or a large variety of data formats. SAS also defines two additional dimensions of big data: variability (inconsistent or periodic data flows) and complexity (data in different formats and types that are difficult to link, match and cleanse). Michael Driscoll defines big data as data that is distributed (i.e., data stored on multiple nodes across multiple databases). As a rule of thumb, big data is larger than 1 terabyte and is typically managed using Hadoop or distributed databases.
Small data, on the other hand, is comprehensible and accessible to people, not just computers. Wikipedia explains that big data is about machines, while small data is about people. In fact, big data is often too big to understand without breaking it into small chunks that can be analyzed and visualized by a person. Small data takes up less than 10 gigabytes, can be managed using Excel, R or Python, and can fit in one machine’s memory.
Michael Driscoll lists a third category, medium data, which is too large for memory but can fit on a single machine’s disk. This size of data is often managed with indexed files or a monolithic database (one that is self-contained and independent of other computing applications).
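To make the three categories concrete, here is a minimal Python sketch that sorts a data set into small, medium or big by its size alone. The function name classify_data_size is hypothetical, and the 10 GB to 1 TB range for medium data is an assumption inferred from the two thresholds quoted above. Real-world boundaries depend on the hardware, so treat the cutoffs as rough guides rather than hard rules.

```python
# Decimal units; the thresholds are the rule-of-thumb figures from the text.
GB = 10**9
TB = 10**12


def classify_data_size(size_bytes: int) -> str:
    """Return 'small', 'medium' or 'big' for a data set of the given size."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes < 10 * GB:
        return "small"   # fits in one machine's memory
    if size_bytes <= 1 * TB:
        return "medium"  # assumed range: fits on one machine's disk
    return "big"         # usually distributed across several nodes


print(classify_data_size(2 * GB))    # small
print(classify_data_size(500 * GB))  # medium
print(classify_data_size(5 * TB))    # big
```

Size is only one of the dimensions discussed above, so a small but fast-moving or messy data set may still call for big data tooling.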
What does all of this mean for us, analysts who want to work with data? Each scale (small, medium and big) requires its own specialization and training. As data sets grow, the work tends to get progressively more complex and less “human”. Working with data at the medium level requires the ability to write SQL queries or program in SAS. Working with larger data in Hadoop (or other distributed data storage and processing frameworks) requires knowledge of a general-purpose programming language such as Java, Python or Scala. Some analysts work at all levels, but it’s more common for a person to work with data at only one or two scales on a regular basis. My advice for beginners is to commit to one language (or software package) rather than trying to learn them all.
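For a feel of the SQL skill mentioned above for medium data, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is used here only as a stand-in for a single-machine database, and the sales table and its rows are invented for illustration. The database lives in memory so the example runs anywhere, whereas real medium data would sit in a file on disk.

```python
import sqlite3

# Hypothetical table and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("east", 120.0), ("west", 80.0), ("east", 50.0), ("west", 30.0)],
)

# Aggregate inside the database so only the small summary reaches Python.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()

print(totals)  # [('east', 170.0), ('west', 110.0)]
```

The point of the query is that the database does the heavy lifting: the full table never has to fit in memory, only the per-region totals do.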
Originally published at danstrong.tech.