Glossary of terms — Data Architecture

Anup Moncy · Published in Data Engineering · Apr 4, 2023

Data Modelling

  • Entity Relationship Diagram (ERD) — Shows how entities are related.
  • Normalization — Organizes data into separate tables to minimize data redundancy (see the sketch after this list).
  • Dimensional Modelling — Organizes data into fact and dimension tables to support business intelligence.
  • Data Dictionary — Documents data elements, relationships, and meanings.
  • Logical Data Model — Shows how data is organized and related, independent of any particular database technology.
  • Physical Data Model — Shows how that data is actually implemented in a specific database system: tables, columns, data types, and indexes.
  • Validation — Ensures the accuracy and completeness of a data model.
  • Optimization — Improves the performance of a data model.
  • Tools — Examples include ER/Studio, ERwin, and Oracle Designer.
  • UML (Unified Modeling Language) — Provides a standardized way to create visual models of software systems.
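
To make the normalization entry above concrete, here is a minimal Python sketch; the table and column names are made up for illustration. Repeated customer details are split out of a denormalized order list into their own table, which the orders then reference by key:

```python
# Denormalized rows (hypothetical data): customer details repeat per order.
denormalized = [
    {"order_id": 1, "customer_id": 10, "customer_name": "Acme", "amount": 250},
    {"order_id": 2, "customer_id": 10, "customer_name": "Acme", "amount": 120},
    {"order_id": 3, "customer_id": 11, "customer_name": "Globex", "amount": 90},
]

# Customer attributes move to their own table, keyed by customer_id,
# so each fact about a customer is stored exactly once.
customers = {r["customer_id"]: {"customer_name": r["customer_name"]}
             for r in denormalized}

# Orders keep only a foreign key back to the customers table.
orders = [{"order_id": r["order_id"],
           "customer_id": r["customer_id"],
           "amount": r["amount"]}
          for r in denormalized]

print(customers)  # {10: {'customer_name': 'Acme'}, 11: {'customer_name': 'Globex'}}
print(orders)
```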

Data Integration

  • Extract, Transform, Load (ETL): The process of extracting data from multiple sources, transforming it, and loading it into a target system (a minimal sketch follows this list).
  • Data Integration Platforms: Examples include Talend, Informatica, and IBM InfoSphere DataStage.
  • Data Profiling: The process of analyzing data to understand its structure, quality, and completeness.
  • Data Mapping: The process of identifying the relationships between data elements in different systems.
  • Data Cleansing: The process of identifying and correcting or removing inaccurate, incomplete, or irrelevant data.
  • Master Data Management (MDM): The process of managing and maintaining the master data entities of an organization.
  • Data Quality: The degree to which data meets the requirements of its intended use.
  • Data Federation: The process of combining data from multiple sources into a single virtual view, without physically moving or copying the data.
  • Data Replication: The process of copying data from one system to another in real-time or near-real-time.
  • Change Data Capture (CDC): The process of capturing and tracking changes to data in a source system, and replicating those changes to a target system.
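
As referenced in the ETL entry above, the sketch below walks the three stages end to end in plain Python. The source data, column names, and in-memory SQLite target are hypothetical stand-ins for real source and target systems:

```python
import csv
import io
import sqlite3

# Extract: read raw rows from the source (here, a hypothetical CSV extract).
source_csv = io.StringIO("id,name,amount\n1,alice,100\n2,bob,not_a_number\n")
rows = list(csv.DictReader(source_csv))

# Transform: standardize names, coerce types, and drop rows that fail.
clean = []
for row in rows:
    try:
        clean.append((int(row["id"]), row["name"].title(), float(row["amount"])))
    except ValueError:
        pass  # a real pipeline would log or quarantine bad rows instead

# Load: write the transformed rows into the target system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER, name TEXT, amount REAL)")
conn.executemany("INSERT INTO target VALUES (?, ?, ?)", clean)
print(conn.execute("SELECT * FROM target").fetchall())  # [(1, 'Alice', 100.0)]
```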

Data Warehousing

  • Data Warehouse — A large, centralized repository of data used for reporting and analysis.
  • Data Mart — A subset of a data warehouse that is designed to serve a particular business unit or department.
  • Online Analytical Processing (OLAP) — A technology used to perform complex queries and analysis of data stored in a data warehouse.
  • Star Schema — A data warehouse schema in which a central fact table joins directly to denormalized dimension tables, optimized for analytical queries over large amounts of data (see the sketch after this list).
  • Snowflake Schema — A variation of the star schema where dimension tables are normalized into multiple related tables.
  • Fact table — A table in a data warehouse that stores measures or metrics about a specific event or transaction.
  • Dimension table — A table in a data warehouse that contains descriptive attributes about a specific object or event.
  • Slowly Changing Dimensions (SCD) — A technique used to manage changes to dimension data in a data warehouse over time.
  • Extract, Load, Transform (ELT) — A data integration process where data is first loaded into a data warehouse, and then transformed and cleaned as needed.
  • Columnar Database — A type of database that stores data in columns rather than rows, which can improve performance for analytical queries.
  • In-Memory Database — A type of database that stores data in memory for faster access and processing.
  • Data Warehouse Automation — The use of software tools to automate the design, development, and maintenance of a data warehouse.
  • Data Lineage — The ability to trace the origin and movement of data through a data warehouse.
  • Star Join Optimization — A technique used to optimize performance for complex queries on star schema databases.
  • Data Mining — The process of analyzing large datasets to identify patterns, relationships, and trends.
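
The star schema entry above points here: a minimal sketch using SQLite, with hypothetical fact_sales and dim_product tables. A typical warehouse query joins the central fact table to a dimension and aggregates the measures:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dimension table: descriptive attributes, keyed by product_id.
conn.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT)")
conn.executemany("INSERT INTO dim_product VALUES (?, ?)", [(1, "Books"), (2, "Games")])

# Fact table: one row per sale, holding measures plus dimension keys.
conn.execute("CREATE TABLE fact_sales (product_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", [(1, 12.0), (1, 8.0), (2, 30.0)])

# Star join: fact joined to a dimension, measures aggregated by attribute.
for row in conn.execute("""
        SELECT d.category, SUM(f.amount)
        FROM fact_sales f JOIN dim_product d USING (product_id)
        GROUP BY d.category"""):
    print(row)  # ('Books', 20.0) then ('Games', 30.0)
```

A snowflake schema would further normalize dim_product (for example, splitting category into its own table), and a Type 2 slowly changing dimension would add effective-date columns to dim_product so that history is preserved rather than overwritten.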

Big Data

  • Hadoop — An open-source software framework used for distributed storage and processing of large datasets.
  • MapReduce — A programming model used to process and analyze large datasets in parallel across a distributed system (a toy illustration follows this list).
  • NoSQL — A class of databases that do not require a fixed relational schema and are designed to handle unstructured or semi-structured data.
  • Apache Spark — An open-source data processing engine used for distributed computing and data processing.
  • Hive — A data warehouse infrastructure built on top of Hadoop for querying and analyzing large datasets.
  • Pig — A high-level platform on Hadoop whose scripting language, Pig Latin, is used to process large datasets.
  • HBase — A distributed, scalable NoSQL database built on top of Hadoop.
  • Cassandra — A distributed, scalable NoSQL database designed for high availability and fault tolerance.
  • Kafka — A distributed streaming platform used for building real-time data pipelines and streaming applications.
  • Big Data Analytics — The process of examining large and complex data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences, and other useful information.
  • YARN — Yet Another Resource Negotiator, a resource management layer in Hadoop that allows for more efficient resource utilization.
  • Mahout — An open-source machine learning library used to build scalable, distributed machine learning algorithms.
  • Spark Streaming — A component of Apache Spark used for real-time processing of streaming data.
  • Storm — A distributed real-time computation system used for processing and analyzing streaming data.
  • Flume — A distributed, reliable, and available service used for efficiently collecting, aggregating, and moving large amounts of log data.
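
The toy illustration promised in the MapReduce entry: the classic word count, run here in a single Python process. Real MapReduce distributes the map and reduce phases across a cluster; the point of the model is that each phase can run in parallel on independent chunks of data:

```python
from collections import defaultdict

documents = ["big data big ideas", "data pipelines move data"]

# Map: each document independently emits (word, 1) pairs, which is what
# lets a real cluster run this phase in parallel across machines.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all values belonging to the same key together.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine each key's values into the final result.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'big': 2, 'data': 3, 'ideas': 1, 'pipelines': 1, 'move': 1}
```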

Data Governance

  • Data Governance — The overall management of the availability, usability, integrity, and security of the data used in an organization.
  • Data Stewardship — The role responsible for ensuring the proper use and management of an organization’s data assets.
  • Data Catalog — A metadata management tool used to document and track the data assets within an organization.
  • Data Lineage — The ability to trace the origin and movement of data throughout its lifecycle within an organization.
  • Data Classification — The process of categorizing data based on its sensitivity, value, or regulatory requirements (a small sketch follows this list).
  • Data Privacy — The protection of personal and sensitive information by an organization.
  • Data Security — The protection of data from unauthorized access, disclosure, or destruction.
  • Data Retention — The policies and procedures governing the retention and disposal of an organization’s data.
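
A small sketch of data classification, as noted above. The rules and field names here are hypothetical; in practice the mapping comes from the organization's governance policy, and code merely applies it:

```python
# Hypothetical policy: field name -> sensitivity label.
RULES = {
    "email": "PII",
    "ssn": "Restricted",
    "order_total": "Internal",
}

record_fields = ["email", "order_total", "page_views"]

# Tag each field; anything without a rule is flagged for steward review.
classified = {f: RULES.get(f, "Unclassified: review") for f in record_fields}
print(classified)  # {'email': 'PII', 'order_total': 'Internal', 'page_views': 'Unclassified: review'}
```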

ETL Processes

  • ETL (Extract, Transform, Load) — A data integration process used to extract data from source systems, transform it to fit business needs, and load it into a target data storage system.
  • Data Mapping — The process of identifying how data from source systems should be transformed and loaded into a target system.
  • Data Profiling — The process of analyzing source data to determine its structure, quality, and completeness.
  • Data Cleansing — The process of identifying and correcting or removing inaccurate or incomplete data.
  • Change Data Capture (CDC) — The process of identifying and capturing only the changes made to source data since the last extract.
  • Data Integration — The process of combining data from multiple sources into a single, unified view.
  • Data Warehouse — A large, centralized repository of integrated data used for reporting and analysis.
  • Data Mart — A subset of a data warehouse focused on a specific business function or department.
  • ETL Pipeline — The sequence of steps in an ETL process that extract, transform, and load data.
  • Data Validation — The process of ensuring that data is accurate, complete, and consistent during the ETL process (a minimal check is sketched after this list).
  • ETL Tools — Software applications used to automate the ETL process and manage the movement of data between systems.
  • Data Replication — The process of copying data from one system to another in near-real time.
  • Data Transformation — The process of changing the structure or format of data during the ETL process.
  • ETL Architecture — The design of an ETL process, including the hardware, software, and data storage components.
  • ETL Framework — A set of pre-defined standards and processes used to guide the development and execution of ETL processes.
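
The minimal data validation check referenced above. The rules here (a required name, a non-negative amount) are hypothetical examples of the kinds of constraints an ETL pipeline enforces before loading:

```python
rows = [
    {"id": 1, "name": "Alice", "amount": 100.0},
    {"id": 2, "name": "", "amount": -5.0},
]

def validate(row):
    """Return a list of rule violations; an empty list means the row passes."""
    errors = []
    if not row.get("name"):
        errors.append("name is required")
    if row.get("amount", 0) < 0:
        errors.append("amount must be non-negative")
    return errors

valid = [r for r in rows if not validate(r)]
rejected = [(r, validate(r)) for r in rows if validate(r)]
print(len(valid), "valid;", len(rejected), "rejected")  # 1 valid; 1 rejected
```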
