Big Data Terminology: 80 Definitions Every Marketer Should Know

Hurree · Published in Geek Culture · Aug 31, 2021 · 15 min read

As a marketer, you likely have a rather extensive vocabulary when writing for your industry, and you’re always hot on the trail of any new marketing trends or buzzwords. We know this to be true because here at Hurree, we’re obsessed with marketers and how they work. More specifically, we’re obsessed with making their lives easier.

So we thought, why not help out with one of the most essential and most jargon-heavy elements of marketing today: big data. It’s something that, until recently, many marketers may not have known much about, but now they are faced with the urgent need to understand it.

That’s why we’ve created this bumper list of big data terminology that every marketer should know, from beginner-level phrases to highly technical definitions. It’s a handy guide to take with you on your travels throughout the big bad world of data-driven marketing so that you always have a marketer-friendly explanation for big data terms.

So, let’s get started…

Big Data Terminology: Definitions Every Marketer Should Know

1. Abstraction layer

A translation layer that transforms high-level requests into low-level functions and actions. Data abstraction hides the complex, unnecessary details of a system and exposes only the essential details needed to perform a function. The complexity is hidden from the client, and a simplified representation is presented instead.

A typical example of an abstraction layer is an API (application programming interface) between an application and an operating system.

2. API

API is an acronym used for Application Programming Interface, a software connection between computers or computer programs. APIs are not databases or servers but rather the code and rules that allow access to and sharing of information between servers, applications, etc.
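
To make the idea concrete, here is a minimal sketch of calling a web API from Python with the popular requests library; the URL, token, and field names are hypothetical placeholders, not a real service.

```python
import requests

# Hypothetical endpoint -- replace with a real API URL and credentials.
response = requests.get(
    "https://api.example.com/v1/campaigns",
    params={"status": "active"},              # query parameters sent with the request
    headers={"Authorization": "Bearer YOUR_TOKEN"},
)

response.raise_for_status()                   # fail loudly if the server returned an error
campaigns = response.json()                   # the API returns structured data (JSON)
print(len(campaigns), "active campaigns")
```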


3. Aggregation

Data aggregation refers to the process of collecting data and presenting it in a summarised format. The data can be gathered from multiple sources to be combined for a summary.
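
As an illustration, here is a minimal aggregation sketch using pandas; the sales figures are made up for the example.

```python
import pandas as pd

# Toy sales records gathered from multiple (hypothetical) sources.
sales = pd.DataFrame({
    "region": ["EU", "EU", "US", "US"],
    "revenue": [1200, 800, 1500, 700],
})

# Aggregate: collapse the row-level data into one summary figure per region.
summary = sales.groupby("region")["revenue"].sum()
print(summary)
```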

4. Algorithms

In computer science, an algorithm is a set of well-defined rules that solve a mathematical or computational problem when implemented. Algorithms are used to carry out calculations, data processing, machine learning, search engine optimisation, and more.
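
For example, here is a tiny algorithm written in Python: a fixed set of well-defined steps that always finds the largest number in a list.

```python
def largest(numbers):
    """A simple algorithm: scan the list once, keeping the biggest value seen so far."""
    best = numbers[0]
    for n in numbers[1:]:
        if n > best:
            best = n
    return best

print(largest([3, 17, 8, 42, 5]))  # 42
```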

5. Analytics

Systems and techniques of computational analysis and interpretation of large amounts of data or statistics. Analytics are used to derive insights, spot patterns, and optimise business performance.

6. Applications

An application is any computer software or program designed to be used by end-users to perform specific tasks. Applications or apps can be desktop, web, or mobile-based.

7. Avro

Avro (or Apache Avro) is an open-source data serialisation system. It stores data in a compact binary format alongside a schema, which has made it popular in Hadoop-based data pipelines.

8. Binary classification

Binary classification is a technique used to assign each item in a dataset to one of two groups based on a classification rule. For example, binary classification is used in medicine to determine whether or not a disease is present in patient data, and in search to decide whether a piece of content is relevant enough to be included in results.
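
A minimal sketch of the idea in Python, using a made-up rule that sorts emails into "spam" or "not spam" based on a score threshold (the scores and threshold are invented for illustration):

```python
def classify(spam_score, threshold=0.5):
    """Binary classification: every item ends up in exactly one of two groups."""
    return "spam" if spam_score >= threshold else "not spam"

for score in [0.1, 0.7, 0.45, 0.9]:
    print(score, "->", classify(score))
```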

9. Business Intelligence

Business intelligence is a process of collecting and preparing internal and external data for analysis; this often includes data visualisation techniques (graphs, pie charts, scatter plots, etc.) presented on business intelligence dashboards. By harnessing business intelligence, organisations can make faster, more informed business decisions.


10. Byte

In computing, a byte is a unit of data that is eight binary digits (bits) long. A byte is the smallest addressable unit of memory in most systems, so in practice we usually refer to much larger multiples such as gigabytes (GB, one billion bytes) and terabytes (TB, one trillion bytes).
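
The arithmetic is straightforward; a quick Python sketch using the decimal definitions mentioned above (the file size is a made-up example):

```python
bits_per_byte = 8
gigabyte = 10**9          # 1 GB = one billion bytes (decimal definition)
terabyte = 10**12         # 1 TB = one trillion bytes

file_size_bytes = 3_500_000_000          # a hypothetical 3.5 GB video file
print(file_size_bytes / gigabyte, "GB")  # 3.5
print(file_size_bytes * bits_per_byte, "bits")
```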

11. C

C is one of the oldest programming languages around. Despite its age, it remains one of the most prevalent, as large parts of operating systems such as Microsoft Windows and macOS are written in it.

12. CPU

This acronym stands for Central Processing Unit. A CPU is often referred to as the brains of a computer — you will find one in your phone, smartwatch, tablet, etc. Despite being one of many processing systems within a computer, a CPU is vitally important as it controls the ability to perform calculations, take actions and run programs.

13. Cascading

Cascading is a type of software designed for use with Hadoop for the creation of data-driven applications. Cascading software creates an abstraction layer that enables complex data processing workflows and masks the underlying complexity of MapReduce processes.

14. Cleaning data

Cleaning data improves data quality by removing errors, corruptions, duplications, and formatting inconsistencies from datasets.
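
A minimal cleaning sketch with pandas, assuming a small made-up contact list:

```python
import pandas as pd

contacts = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "B@X.COM ", None],
    "signups": [1, 1, 3, 2],
})

cleaned = (
    contacts
    .dropna(subset=["email"])                                       # remove rows with missing emails
    .assign(email=lambda df: df["email"].str.strip().str.lower())   # fix formatting inconsistencies
    .drop_duplicates(subset=["email"])                              # remove duplicate records
)
print(cleaned)
```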

15. Cloud

Cloud technology, or The Cloud as it is often referred to, is a network of servers that users access via the internet and the applications and software that run on those servers. Cloud computing has removed the need for companies to manage physical data servers or run software applications on their own devices — meaning that users can now access files from almost any location or device.

The cloud is made possible through virtualisation — a technology that mimics a physical server in purely digital form, known as a virtual machine.

16. Command

In computing, a command is a direction sent to a computer program ordering it to perform a specific action. Commands can be facilitated by command-line interfaces, via a network service protocol, or as an event in a graphical user interface.

17. Computer architecture

Computer architecture specifies the rules, standards, and formats of the hardware and software that make up a computer system or platform. The architecture acts as a blueprint for how a computer system is designed and which other systems it is compatible with.

18. Connected devices

Physical objects that connect with each other and other systems via the internet. Connected devices are most commonly monitored and controlled remotely by mobile applications, for example, via Bluetooth, WiFi, LTE or wired connection.

19. Data access

Data access is the ability to retrieve, modify, move or copy data on demand and on a self-service basis. In IT systems the data may be sensitive, so access often requires authentication and authorisation from the organisation that holds it.

There are two forms of data access:

  • Random access
  • Sequential access

20. Data capture

Data capture refers to collecting information from either paper or electronic documents and converting it into a format that a computer can read. Data capture can be automated to reduce the need for manual data entry and accelerate the process.

21. Data ingestion

Data ingestion is the process of moving data from various sources into a central repository such as a data warehouse where it can be stored, accessed, analysed, and used by an organisation.

22. Data integrity

The practice of ensuring data remains accurate, valid and consistent throughout the entire data life cycle. Data integrity incorporates logical integrity (a process) and physical integrity (a state).

23. Data lake

A data lake is a centralised repository that stores vast amounts of raw data — data that has not been prepared, processed, or manipulated to fit a particular schema. Data lakes house both structured and unstructured data and use an ‘on-read’ schema during data analysis.

24. Data management

Data management is an overarching strategy of data use that guides organisations to collect, store, analyse and use their data securely and cost-effectively via policies and regulations.

25. Data processing

The process of transforming raw data into a format that can be read by a machine or, in other words, turning data into something usable. Once processed, businesses can use data to glean insights and make decisions.

26. Data serialisation

Data serialisation is a translation process that converts complex or large data structures or object states into formats that can be more easily stored, transferred and distributed. The resulting byte sequence can later be used to recreate an identical clone of the original — a process known as deserialisation.
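
A minimal sketch with Python's built-in json module, serialising a small record and then deserialising it back into an identical copy (the record itself is made up):

```python
import json

record = {"name": "Ada", "channels": ["email", "push"], "opens": 12}

serialised = json.dumps(record)        # object -> text (an easily stored/transferred format)
restored = json.loads(serialised)      # text -> object (deserialisation)

print(serialised)
print(restored == record)              # True: an identical clone of the original
```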

27. Data storage

Refers to collecting and recording data to be retained for future use on computers or other devices. In its most common form, data storage occurs in three ways: file storage, block storage, and object storage.

28. Data tagging

Data tagging is a type of categorisation process that allows users to better organise types of data (websites, blog posts, photos, etc.) using tags or keywords.

29. Data visualisation

This process sees large amounts of data translated into visual formats such as graphs, pie charts, scatter charts, etc. Visualisations are easier for the human brain to interpret and accelerate the rate at which organisations can retrieve insights.
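
A minimal visualisation sketch using matplotlib, with made-up campaign figures:

```python
import matplotlib.pyplot as plt

channels = ["Email", "Social", "Search", "Display"]
conversions = [120, 95, 180, 40]       # hypothetical numbers for illustration

plt.bar(channels, conversions)         # a simple bar chart of conversions per channel
plt.title("Conversions by channel")
plt.ylabel("Conversions")
plt.show()
```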

30. Data warehouse

A centralised repository of information that enterprises can use to support business intelligence (BI) activities such as analytics. Data warehouses typically integrate historical data from various sources.

31. Decision trees

Decision trees are visual representations of processes and options that help machines make complex predictions or decisions when faced with many choices and outcomes. Decision trees are directed acyclic graphs made up of branch nodes, edges, and leaf nodes, with all data flowing in one direction.
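
Conceptually, a decision tree is just a chain of branching questions. A minimal hand-written sketch in Python — the rules and thresholds are invented for illustration:

```python
def suggest_channel(age, has_app):
    """A tiny decision tree: each branch node asks one question, each leaf is a decision."""
    if has_app:                        # branch node
        return "push notification"     # leaf
    if age < 30:                       # branch node
        return "social media ad"       # leaf
    return "email campaign"            # leaf

print(suggest_channel(age=24, has_app=False))  # social media ad
```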


32. Deep learning

Deep learning is a function of artificial intelligence and machine learning that mimics the processes of the human brain to make decisions, process data, and create patterns. It can be used to process huge amounts of unstructured data that would take human brains years to understand. Deep learning algorithms can recognise objects and speech, translate languages, etc.

33. ETL

An acronym used to describe a process within data integration: Extract, Transform and Load. Data is extracted from source systems, transformed into a consistent format, and then loaded into a target such as a data warehouse.

34. ELT

An acronym used to describe a process within data integration: Extract, Load, and Transform. Unlike ETL, the raw data is loaded into the target system first and transformed there.
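
A minimal sketch of the two orderings, with toy in-memory functions standing in for real source systems and warehouses:

```python
def extract():
    # Pretend these rows came from a CRM export.
    return [{"email": " Ada@X.com "}, {"email": "bob@y.com"}]

def transform(rows):
    # Standardise the data (here: tidy up email formatting).
    return [{"email": r["email"].strip().lower()} for r in rows]

def load(rows, warehouse):
    warehouse.extend(rows)

# ETL: transform before loading into the warehouse.
etl_warehouse = []
load(transform(extract()), etl_warehouse)

# ELT: load the raw data first, then transform it inside the warehouse.
elt_warehouse = []
load(extract(), elt_warehouse)
elt_warehouse[:] = transform(elt_warehouse)

print(etl_warehouse)
print(elt_warehouse)
```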

35. Encoding

In data preparation, encoding refers to assigning numerical values to categories so that algorithms can process them. For example, male and female might be encoded as 1 and 2 (see the sketch after the list below).

There are two main types of encoding:

  • Binary
  • Target-based
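
A minimal encoding sketch with pandas, mapping text categories to numbers; the mapping itself is arbitrary and just for illustration:

```python
import pandas as pd

customers = pd.DataFrame({"plan": ["free", "pro", "free", "enterprise"]})

# Assign each category a numerical code so algorithms can work with it.
customers["plan_code"] = customers["plan"].map({"free": 0, "pro": 1, "enterprise": 2})
print(customers)
```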

36. Fault tolerance

The term fault tolerance describes the ability of a system, for example, a computer or a cloud cluster, to continue operating uninterrupted despite one or more of its components failing.

Fault tolerance is built in to ensure high availability, so that the business is not disrupted by the failure of a critical system. It is achieved by utilising backup components in hardware, software, and power supplies.

37. Flume

Flume is open-source software that facilitates the collecting, aggregating and moving of huge amounts of unstructured, streaming data such as log data and events. Flume has a simple and flexible architecture, moving data from various servers to a centralised data store.

38. GPS

GPS is an acronym for Global Positioning System, which is a navigation system that uses data from satellites and algorithms to synchronise location, space, and time data. GPS utilises three key segments: satellites, ground control, and user equipment.


39. Granular Computing (GrC)

An emerging concept and technique of information processing within big data, granular computing divides data into information granules, also referred to as 'collections of entities'. The purpose of this division is to examine whether the data behaves differently at a granular level.

40. GraphX

An API from Apache Spark that is used for graphs and graph-parallel computing. GraphX facilitates faster, more flexible data analytics.

41. HCatalog

In its simplest form, HCatalog exists to provide an interface between Apache Hive, Pig and MapReduce. Since all three data processing tools have different systems for processing data, HCatalog ensures consistency. HCatalog supports reading and writing data on the grid in any format for which a SerDe (serialiser-deserialiser) can be written.

42. Hadoop

Hadoop is an open-source software framework of programs and procedures that are commonly used as the backbone for big data development projects. Hadoop is made up of 4 modules, each with its own distinct purpose:

  • Distributed File System (HDFS) — allows data to be easily stored in any format across a large number of storage devices.
  • MapReduce — reads and translates data into the right format for analysis (map) and carries out mathematical calculations on it (reduce); see the sketch after this list.
  • Hadoop Common — provides the baseline tools needed for user systems, e.g. Windows, to retrieve data from Hadoop.
  • YARN — a management module that handles the systems that carry out storage and analysis.
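
Real Hadoop MapReduce jobs are usually written in Java, but the underlying idea is easy to show in plain Python: a map step that emits key-value pairs and a reduce step that combines them. This is a conceptual sketch only, not Hadoop code:

```python
from collections import defaultdict

documents = ["big data big insights", "big wins"]

# Map: emit a (word, 1) pair for every word in every document.
pairs = [(word, 1) for doc in documents for word in doc.split()]

# Reduce: combine the pairs by key, summing the counts.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))   # {'big': 3, 'data': 1, 'insights': 1, 'wins': 1}
```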

43. Hardware

Hardware is the physical component of any computer system, for example, the wiring, circuit board, monitor, keyboard, mouse, desktop, etc.

44. High dimensionality

In statistics, dimensionality refers to how many attributes a dataset has. Thus, high dimensionality refers to a dataset with an exceedingly large number of attributes. With high-dimensional data, calculations become extremely difficult because the number of features outweighs the number of observations.

Website analysis (e.g. ranking, advertising and crawling) is a good example of high dimensionality.

45. Hive

Hive is an open-source data warehouse software system that allows developers to carry out advanced work on Hadoop distributed file systems (HDFS) and MapReduce. Hive makes working with these tools easier by facilitating the use of the simpler Hive Query Language (HQL), thus reducing the need for developers to know or write complex Java code.

46. Information retrieval (IR)

Information retrieval covers the software and processes that handle the organisation, storage, and retrieval of information, usually text, from large document repositories. A simple example of IR is the search engine queries that we all carry out on Google.

47. Integration

Integration is the process of combining data from multiple disparate sources to achieve a unified view of the data for easier, more valuable operations or business intelligence.

Data integration can take several forms; ETL and ELT (defined above) are two of the most common.

48. Internet of things (IoT)

The internet of things (IoT) refers to an ecosystem of physical objects that are connected to the internet and generate, collect, and share data. With advancing technologies enabling smaller and smaller microchips, the IoT has transformed previously inert, everyday objects into smart devices that can share data and insights without the need for human interaction.

49. Java

Java is a high-level programming language designed to have as few implementation dependencies as possible; it is also used as a computing platform in its own right. Java is widely regarded as fast, secure, and reliable.

50. Latency

Data latency refers to the time it takes for a data query to be fully processed by a data warehouse or business intelligence platform. There are three main types of data latency: zero-data latency (real-time), near-time data latency (batch consolidation), and some-time data latency (data is only accessed and updated when needed).

51. Machine learning

Machine learning is a branch of artificial intelligence in which computers automatically assess problems and build algorithmic models to solve them, learning from data rather than relying on constant human intervention.

52. Mining

Mining or data mining, as it is commonly known, refers to the practice of using computer programs to identify patterns, trends and anomalies within large amounts of data and using these findings to predict future outcomes.

53. NoSQL

NoSQL is also referred to as non-SQL or not-only SQL. It is a database design approach that extends storage and querying capabilities beyond what is possible from the traditional tabular structures found in a regular relational database.

Instead, many NoSQL databases store data as JSON-like documents that keep related information within one structure (others use key-value, column or graph models). This non-relational design can handle unstructured data as it does not require a fixed schema.
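
A minimal sketch of what a JSON-style document looks like, built and serialised in Python; the fields are made up for illustration:

```python
import json

# One self-contained document: related data kept together, no fixed table schema.
customer_doc = {
    "id": "c-102",
    "name": "Ada Lovelace",
    "tags": ["newsletter", "trial"],
    "last_purchase": {"item": "Pro plan", "amount": 49.0},
}

print(json.dumps(customer_doc, indent=2))
```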

54. Non-relational database

A database system that does not use the tabular system of rows and columns.

55. Neural networks

A set of algorithms that work to recognise relationships between huge sets of data by mimicking the processes of the brain. The word neural refers to neurons in the brain which act as information messengers.

Neural networks automatically adapt to change without the need to redesign their algorithms and thus have been widely taken up in the design of financial trading software.


56. Open-source

Open-source refers to the availability of certain types of code to be used, redistributed and even modified for free by other developers. This decentralised software development model encourages collaboration and peer production.

57. Pattern recognition

One of the cornerstones of computer science, pattern recognition uses algorithms and machine learning to identify patterns in large amounts of data.

58. Pig

Pig is a high-level scripting language that is used to create programs that run on Hadoop.

59. Pixel

Tracking pixels are small snippets of HTML code, typically loading a tiny 1x1 image, that are used to track users’ behaviour online, for example when they visit a website or open an email.

60. Programming language

A programming language is a formal language made up of sets of instructions that tell a computer to perform specific tasks. Programmers use languages to develop applications. There are numerous programming languages; among the most common are Python and Java.

61. Python

Python is a high-level programming language with dynamic semantics used to develop applications at a rapid pace. Python prioritises readability, making it easier to learn and cheaper to maintain.
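
A tiny example of that readability: even without programming experience, the intent of this snippet is easy to follow (the figures are invented).

```python
monthly_signups = [320, 410, 390, 505]

total = sum(monthly_signups)
average = total / len(monthly_signups)

print(f"Total signups: {total}, average per month: {average:.0f}")
```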

62. Query

In computing, a query is a request for information or a question directed toward a database, most often written in SQL (Structured Query Language). The results may be returned as tables of data or as visualisations such as graphs and other pictorial representations.

63. R

R is a free software environment for statistical computing and graphics.

64. RAM

An acronym used for Random Access Memory, which is essentially the short-term memory of a computer. RAM stores the information a computer needs right now or in the near future; in practice, that is everything currently running on the device, for example a web browser you have open or a game you’re playing.

RAM’s fast-access capabilities make it ideal for short-term storage, unlike a hard drive, which is slower but preferred for long-term storage.

65. Relational database

A relational database exists to house and identify data items that have pre-defined relationships with one another. Relational databases can be used to gain insights into data in relation to other data via sets of tables with columns and rows. In a relational database, each row in the table has a unique ID referred to as a key.

66. SQL

SQL stands for Structured Query Language and is used to communicate with a database. SQL is the standard language used for a relational database.
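
A minimal sketch of a relational table and a SQL query, using Python's built-in sqlite3 module; the table and values are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")            # a throwaway in-memory database
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT)")
conn.executemany(
    "INSERT INTO customers (country) VALUES (?)",
    [("UK",), ("UK",), ("IE",)],
)

# SQL query: count customers per country.
query = "SELECT country, COUNT(*) FROM customers GROUP BY country ORDER BY country"
for row in conn.execute(query):
    print(row)            # ('IE', 1) then ('UK', 2)
```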

67. Scalability

Scalability in databases refers to the ability to accommodate rapidly changing data volumes and processing demands. Scalability concerns both rapid increases in demand (scaling up) and decreases (scaling down), ensuring that processing performance stays consistent regardless of the volume of data being handled.

68. Schema on-read

A method of data analysis that applies a schema to data sets as they are extracted from a database rather than when they are pulled into that database. A data lake applies an on-read schema, allowing it to house unstructured data.

69. Schema on-write

A method of data analysis that applies a schema to data sets as they are ingested into a database. A data warehouse uses an on-write schema, meaning that data is transformed into a standardised format for storage and is ready for analysis.

70. Semi-structured data

Semi-structured data does not reside in a relational database (rows and columns); however, it still has some form of organisational formatting that enables it to be more easily processed, such as semantic tags.

71. Software

The opposite of hardware, software is a virtual set of instructions, codes, data, or programs used to perform operations via a computer.

72. Spark

Spark is a data processing and analysis framework that can quickly perform processing tasks on very large data sets or distribute tasks across multiple computers.

Spark’s architecture consists of two main components:

  • Drivers — convert the user’s code into tasks to be distributed across worker nodes
  • Executors — run on those nodes and carry out the tasks assigned to them
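
A minimal PySpark sketch of the kind of work Spark does, assuming the pyspark package is installed; the data is made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()

df = spark.createDataFrame(
    [("EU", 1200), ("EU", 800), ("US", 1500)],
    ["region", "revenue"],
)

# The driver turns this into tasks; executors run them across worker nodes.
df.groupBy("region").sum("revenue").show()

spark.stop()
```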

73. Structured data

Data that can be formatted into rows and columns, and whose elements can be mapped into clear, pre-defined fields. Typical examples of structured data are names, addresses, telephone numbers, geolocations, etc.

74. Unstructured data

Unstructured data does not have a pre-defined structure or data model and is not organised in a predefined format. Examples include images, video files, audio files, etc.

75. User Interface (UI)

A user interface, or UI, is the point of human-computer interaction: the display screens at the front end of an application that mask the code working behind the scenes. A user interface is designed with usability in mind, ensuring that any user can easily understand and navigate it, as this directly impacts user experience.

76. Variety

Part of the 4Vs of big data, variety refers to the wide range of formats that data can now exist in.

77. Velocity

Part of the 4Vs of big data, velocity refers to the rapid speed at which large amounts of data can be processed.

78. Veracity

Part of the 4Vs of big data, veracity refers to the trustworthiness of big data in terms of integrity, accuracy, privacy, etc.

79. Volume

Part of the 4Vs of big data, volume refers to the huge amount of data being generated globally each day.

80. Workflows

A data science workflow defines the phases or steps to be carried out to complete a development project. In data-driven business fields, workflows are also used and referred to in terms of automating processes, marketing or sales campaigns, or internal communications.

Big data is a vast and complex field that is constantly evolving, and for that reason, it’s important to understand the basic terms and the more technical vocabulary so that your marketing can evolve with it.

Now go forth and flaunt your new knowledge to impress your colleagues and improve your content.


Hurree is a Pinboard for your Analytics 📍 Collect data from across all of your tools to create effortless company reports on one dashboard ➡ www.hurree.co