Glossary of terms — Data Architecture

Anup Moncy · Published in Data Engineering · Apr 4, 2023

Data Modelling

  • Entity Relationship Diagram (ERD) — Shows how entities are related.
  • Normalization — Organizes data into separate tables to minimize data redundancy (see the sketch after this list).
  • Dimensional Modelling — Organizes data into fact and dimension tables to support business intelligence.
  • Data Dictionary — Documents data elements, relationships, and meanings.
  • Logical Data Model — Shows how data is organized and related, independent of any particular database technology.
  • Physical Data Model — Shows how that data is actually implemented in a specific database system: tables, columns, data types, and indexes.
  • Validation — Ensures the accuracy and completeness of a data model.
  • Optimization — Improves the performance of a data model.
  • Tools — Examples include ER/Studio, ERwin, and Oracle Designer.
  • UML (Unified Modeling Language) — Provides a standardized way to create visual models of software systems.
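
To make the normalization entry above concrete, here is a minimal Python sketch; the table and column names are made up for illustration. Repeated customer details are split out of a denormalized order list into their own table, which the orders then reference by key:

```python
# Denormalized rows (hypothetical data): customer details repeat per order.
denormalized = [
    {"order_id": 1, "customer_id": 10, "customer_name": "Acme", "amount": 250},
    {"order_id": 2, "customer_id": 10, "customer_name": "Acme", "amount": 120},
    {"order_id": 3, "customer_id": 11, "customer_name": "Globex", "amount": 90},
]

# Customer attributes move to their own table, keyed by customer_id,
# so each fact about a customer is stored exactly once.
customers = {r["customer_id"]: {"customer_name": r["customer_name"]}
             for r in denormalized}

# Orders keep only a foreign key back to the customers table.
orders = [{"order_id": r["order_id"],
           "customer_id": r["customer_id"],
           "amount": r["amount"]}
          for r in denormalized]

print(customers)  # {10: {'customer_name': 'Acme'}, 11: {'customer_name': 'Globex'}}
print(orders)
```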

Data Integration

  • Extract, Transform, Load (ETL): The process of extracting data from multiple sources, transforming it, and loading it into a target system (a minimal sketch follows this list).
  • Data Integration Platforms: Examples include Talend, Informatica, and IBM InfoSphere DataStage.
  • Data Profiling: The process of analyzing data to understand its structure, quality, and completeness.
  • Data Mapping: The process of identifying the relationships between data elements in different systems.
  • Data Cleansing: The process of identifying and correcting or removing inaccurate, incomplete, or irrelevant data.
  • Master Data Management (MDM): The process of managing and maintaining the master data entities of an organization.
  • Data Quality: The degree to which data meets the requirements of its intended use.
  • Data Federation: The process of combining data from multiple sources into a single virtual view, without physically moving or copying the data.
  • Data Replication: The process of copying data from one system to another in real-time or near-real-time.
  • Change Data Capture (CDC): The process of capturing and tracking changes to data in a source system, and replicating those changes to a target system.
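
As referenced in the ETL entry above, the sketch below walks the three stages end to end in plain Python. The source data, column names, and in-memory SQLite target are hypothetical stand-ins for real source and target systems:

```python
import csv
import io
import sqlite3

# Extract: read raw rows from the source (here, a hypothetical CSV extract).
source_csv = io.StringIO("id,name,amount\n1,alice,100\n2,bob,not_a_number\n")
rows = list(csv.DictReader(source_csv))

# Transform: standardize names, coerce types, and drop rows that fail.
clean = []
for row in rows:
    try:
        clean.append((int(row["id"]), row["name"].title(), float(row["amount"])))
    except ValueError:
        pass  # a real pipeline would log or quarantine bad rows instead

# Load: write the transformed rows into the target system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER, name TEXT, amount REAL)")
conn.executemany("INSERT INTO target VALUES (?, ?, ?)", clean)
print(conn.execute("SELECT * FROM target").fetchall())  # [(1, 'Alice', 100.0)]
```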

Data Warehousing

  • Data Warehouse — A large, centralized repository of data used for reporting and analysis.
  • Data Mart — A subset of a data warehouse that is designed to serve a particular business unit or department.
  • Online Analytical Processing (OLAP) — A technology used to perform complex queries and analysis of data stored in a data warehouse.
  • Star Schema — A data warehouse schema in which a central fact table joins directly to denormalized dimension tables, optimized for analytical queries over large amounts of data (see the sketch after this list).
  • Snowflake Schema — A variation of the star schema where dimension tables are normalized into multiple related tables.
  • Fact table — A table in a data warehouse that stores measures or metrics about a specific event or transaction.
  • Dimension table — A table in a data warehouse that contains descriptive attributes about a specific object or event.
  • Slowly Changing Dimensions (SCD) — A technique used to manage changes to dimension data in a data warehouse over time.
  • Extract, Load, Transform (ELT) — A data integration process where data is first loaded into a data warehouse, and then transformed and cleaned as needed.
  • Columnar Database — A type of database that stores data in columns rather than rows, which can improve performance for analytical queries.
  • In-Memory Database — A type of database that stores data in memory for faster access and processing.
  • Data Warehouse Automation — The use of software tools to automate the design, development, and maintenance of a data warehouse.
  • Data Lineage — The ability to trace the origin and movement of data through a data warehouse.
  • Star Join Optimization — A technique used to optimize performance for complex queries on star schema databases.
  • Data Mining — The process of analyzing large datasets to identify patterns, relationships, and trends.
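
The star schema entry above points here: a minimal sketch using SQLite, with hypothetical fact_sales and dim_product tables. A typical warehouse query joins the central fact table to a dimension and aggregates the measures:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dimension table: descriptive attributes, keyed by product_id.
conn.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT)")
conn.executemany("INSERT INTO dim_product VALUES (?, ?)", [(1, "Books"), (2, "Games")])

# Fact table: one row per sale, holding measures plus dimension keys.
conn.execute("CREATE TABLE fact_sales (product_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", [(1, 12.0), (1, 8.0), (2, 30.0)])

# Star join: fact joined to a dimension, measures aggregated by attribute.
for row in conn.execute("""
        SELECT d.category, SUM(f.amount)
        FROM fact_sales f JOIN dim_product d USING (product_id)
        GROUP BY d.category"""):
    print(row)  # ('Books', 20.0) then ('Games', 30.0)
```

A snowflake schema would further normalize dim_product (for example, splitting category into its own table), and a Type 2 slowly changing dimension would add effective-date columns to dim_product so that history is preserved rather than overwritten.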

Big Data

  • Hadoop — An open-source software framework used for distributed storage and processing of large datasets.
  • MapReduce — A programming model used to process and analyze large datasets in parallel across a distributed system (a toy illustration follows this list).
  • NoSQL — A class of databases that do not require a fixed relational schema and are designed to handle unstructured or semi-structured data.
  • Apache Spark — An open-source data processing engine used for distributed computing and data processing.
  • Hive — A data warehouse infrastructure built on top of Hadoop for querying and analyzing large datasets.
  • Pig — A high-level platform on Hadoop whose scripting language, Pig Latin, is used to process large datasets.
  • HBase — A distributed, scalable NoSQL database built on top of Hadoop.
  • Cassandra — A distributed, scalable NoSQL database designed for high availability and fault tolerance.
  • Kafka — A distributed streaming platform used for building real-time data pipelines and streaming applications.
  • Big Data Analytics — The process of examining large and complex data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences, and other useful information.
  • YARN — Yet Another Resource Negotiator, a resource management layer in Hadoop that allows for more efficient resource utilization.
  • Mahout — An open-source machine learning library used to build scalable, distributed machine learning algorithms.
  • Spark Streaming — A component of Apache Spark used for real-time processing of streaming data.
  • Storm — A distributed real-time computation system used for processing and analyzing streaming data.
  • Flume — A distributed, reliable, and available service used for efficiently collecting, aggregating, and moving large amounts of log data.
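
The toy illustration promised in the MapReduce entry: the classic word count, run here in a single Python process. Real MapReduce distributes the map and reduce phases across a cluster; the point of the model is that each phase can run in parallel on independent chunks of data:

```python
from collections import defaultdict

documents = ["big data big ideas", "data pipelines move data"]

# Map: each document independently emits (word, 1) pairs, which is what
# lets a real cluster run this phase in parallel across machines.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all values belonging to the same key together.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine each key's values into the final result.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'big': 2, 'data': 3, 'ideas': 1, 'pipelines': 1, 'move': 1}
```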

Data Governance

  • Data Governance — The overall management of the availability, usability, integrity, and security of the data used in an organization.
  • Data Stewardship — The role responsible for ensuring the proper use and management of an organization’s data assets.
  • Data Catalog — A metadata management tool used to document and track the data assets within an organization.
  • Data Lineage — The ability to trace the origin and movement of data throughout its lifecycle within an organization.
  • Data Classification — The process of categorizing data based on its sensitivity, value, or regulatory requirements (a small sketch follows this list).
  • Data Privacy — The protection of personal and sensitive information by an organization.
  • Data Security — The protection of data from unauthorized access, disclosure, or destruction.
  • Data Retention — The policies and procedures governing the retention and disposal of an organization’s data.
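
A small sketch of data classification, as noted above. The rules and field names here are hypothetical; in practice the mapping comes from the organization's governance policy, and code merely applies it:

```python
# Hypothetical policy: field name -> sensitivity label.
RULES = {
    "email": "PII",
    "ssn": "Restricted",
    "order_total": "Internal",
}

record_fields = ["email", "order_total", "page_views"]

# Tag each field; anything without a rule is flagged for steward review.
classified = {f: RULES.get(f, "Unclassified: review") for f in record_fields}
print(classified)  # {'email': 'PII', 'order_total': 'Internal', 'page_views': 'Unclassified: review'}
```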

ETL Processes

  • ETL (Extract, Transform, Load) — A data integration process used to extract data from source systems, transform it to fit business needs, and load it into a target data storage system.
  • Data Mapping — The process of identifying how data from source systems should be transformed and loaded into a target system.
  • Data Profiling — The process of analyzing source data to determine its structure, quality, and completeness.
  • Data Cleansing — The process of identifying and correcting or removing inaccurate or incomplete data.
  • Change Data Capture (CDC) — The process of identifying and capturing only the changes made to source data since the last extract.
  • Data Integration — The process of combining data from multiple sources into a single, unified view.
  • Data Warehouse — A large, centralized repository of integrated data used for reporting and analysis.
  • Data Mart — A subset of a data warehouse focused on a specific business function or department.
  • ETL Pipeline — The sequence of steps in an ETL process that extract, transform, and load data.
  • Data Validation — The process of ensuring that data is accurate, complete, and consistent during the ETL process (a minimal check is sketched after this list).
  • ETL Tools — Software applications used to automate the ETL process and manage the movement of data between systems.
  • Data Replication — The process of copying data from one system to another in near-real time.
  • Data Transformation — The process of changing the structure or format of data during the ETL process.
  • ETL Architecture — The design of an ETL process, including the hardware, software, and data storage components.
  • ETL Framework — A set of pre-defined standards and processes used to guide the development and execution of ETL processes.
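
The minimal data validation check referenced above. The rules here (a required name, a non-negative amount) are hypothetical examples of the kinds of constraints an ETL pipeline enforces before loading:

```python
rows = [
    {"id": 1, "name": "Alice", "amount": 100.0},
    {"id": 2, "name": "", "amount": -5.0},
]

def validate(row):
    """Return a list of rule violations; an empty list means the row passes."""
    errors = []
    if not row.get("name"):
        errors.append("name is required")
    if row.get("amount", 0) < 0:
        errors.append("amount must be non-negative")
    return errors

valid = [r for r in rows if not validate(r)]
rejected = [(r, validate(r)) for r in rows if validate(r)]
print(len(valid), "valid;", len(rejected), "rejected")  # 1 valid; 1 rejected
```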
