From Database to AI: The Evolution of Data Platforms

Mariusz Kujawski
13 min read · Jan 8, 2024


The Kardashev Scale is a method for measuring a civilization’s technological advancement based on its ability to harness and utilize energy. This method classifies civilizations on a scale ranging from primitive cave dwellers to entities capable of manipulating entire galaxies. Inspired by this concept, could we apply a similar scale to assess the development of data platforms?

Assuming it’s possible, the question arises: what should serve as an indicator of our data platform’s evolution? While the amount of data processed could be a valid metric in some cases, it might not be comprehensive. For instance, processing terabytes of data may not necessarily imply the integration of machine learning.

Let’s explore potential stages of data platform evolution to derive meaningful conclusions and identify the criteria needed to transition to higher levels of development.

Level 0

At the initial stage of data platform evolution, which I consider Level 0, reports are generated by directly accessing transaction systems and querying data from them. Tools such as Excel, Power BI, or similar applications are employed to collect and present data in the form of tables or charts. While this approach allows quick access to data, it comes with vulnerabilities related to changes in data structure, source performance, and latency. The reporting process tends to be slow, particularly when manual data manipulation is required to extract the necessary information.

Level 0

Level 1

At this stage, both your reports and applications are struggling with performance issues tied to heavy analytical queries that hit your transaction system daily. In response, your IT department decides it’s time to move analytics to a separate environment.

To address the performance challenges, you have several options. You can invest in a new server, create a Virtual Machine (VM), or leverage services offered by cloud providers like Azure, AWS, or GCP. Whether you opt for a VM with a manually installed database engine or choose a serverless service bundled with a database engine depends on your specific requirements.

The primary goal at this stage is to extract data from your transaction database and store it in a separate database or files. This extraction helps mitigate issues associated with heavy analytical queries that might otherwise hinder critical systems such as your accounting application, CRM, or the transaction system itself.

Data extraction can be performed in its raw form or through data replication processes. Your IT department plays a crucial role in facilitating the data transfer to your designated machine. Timing is key; data extraction is typically carried out during periods of low activity in the transaction system. Once the data is on your machine, you refresh your predefined reports in tools like Excel or Power BI.
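As a minimal sketch of this stage, the extract below copies recent rows from a transactional table into Parquet files that reports can read instead of the live system. The connection string, table, and paths are illustrative placeholders, not a specific setup.

```python
# Illustrative nightly extract: pull recent orders from the transaction
# database and store them as Parquet for reporting tools to consume.
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; in practice credentials come from a secret store.
engine = create_engine("postgresql://reporting_user:***@transactions-db/sales")

# Run during a low-activity window, e.g. triggered by a nightly scheduler.
orders = pd.read_sql(
    "SELECT * FROM orders WHERE order_date >= CURRENT_DATE - 1",
    engine,
)
orders.to_parquet("extracts/orders_incremental.parquet", index=False)
```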

Level 1

Level 2

At Level 2, your organization is struggling with inconsistent data and data quality issues, intensified by a high demand for ad-hoc reports that your analytical team struggles to meet. Sharing collected data within the organization proves challenging due to the need for detailed business knowledge to analyze raw application data, and historical data may not be consistently stored.

In response to these challenges, your organization decides to take a significant step forward by establishing a data warehouse. The primary objective of the data warehouse is to create a unified, consolidated source for data analytics, ensuring that measures with the same name convey consistent meanings. Its content is designed to be understandable and fast to query. The design is adaptive and flexible, accommodating new data sources without disrupting existing data.

The data warehouse serves as the foundation for decision-making, ensuring that it contains the right data to support informed choices. This transformation requires upgrading your environment, with considerations for selecting between a public cloud or on-premise server and choosing an appropriate data warehouse engine. Popular options for public cloud include AWS Redshift, BigQuery, Databricks, Microsoft Synapse, and Snowflake. On-premise choices may include Microsoft SQL Server, Oracle, and PostgreSQL. The selection of the engine depends on factors such as data volume, the expertise of your data team, and specific use cases.

Level 2

In addition to selecting a data warehouse engine, choosing a data modeling methodology — such as Kimball’s, Inmon’s, or Data Vault — is crucial. The distinctions between these data modeling architectures are explored in detail in my article ‘Data Modeling Techniques for Data Warehouse.’ A well-developed data model is vital for adapting to new data sources, conducting data analysis, and leveraging business intelligence tools.

Another critical aspect is the selection of tools and the development of ELT/ETL processes to integrate data sources, clean and enrich data, populate the data model, and enhance data quality. Programming languages like Python or Scala can be utilized, and there are ready-to-use tools such as SSIS, Informatica, Talend, Nifi, Fivetran, Matillion, among others. The dbt framework is an intriguing option, allowing transformations using SQL.
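To make the transformation step concrete, here is a small pandas sketch that consolidates customer records from two extracted sources before loading a customer dimension. The file paths, column names, and surrogate-key logic are assumptions for illustration, not the output of any particular tool.

```python
# Illustrative transform: standardize and deduplicate customer records from
# two source systems, then write them out as a customer dimension table.
import pandas as pd

crm = pd.read_parquet("extracts/crm_customers.parquet")
erp = pd.read_parquet("extracts/erp_customers.parquet")

customers = pd.concat([crm, erp], ignore_index=True)
customers["email"] = customers["email"].str.strip().str.lower()  # basic cleansing
customers = customers.drop_duplicates(subset=["email"])
customers["customer_key"] = range(1, len(customers) + 1)  # simple surrogate key

customers.to_parquet("warehouse/dim_customer.parquet", index=False)
```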

These steps centralize operations on data, creating a ‘single source of truth’ in your data warehouse. With the data warehouse in place, you can begin sharing data with power users, fostering the development of more reliable reports and dashboards.

However, as you work with your new data platform, you’ll encounter new challenges related to data governance. Questions may arise, such as where to find data for specific reporting purposes, who can access sensitive data and how to mask it, how to manage authorization, and how to consolidate data from systems like ERP, CRM, and external sources.

The maturity of your solution will necessitate the implementation of data management tools, including a data catalog, master data management, data classification, and data lineage.

  • Data Catalog: A repository containing metadata with tools for data management and search functionalities, assisting analysts and other data consumers in locating specific data.
  • Master Data Management (MDM): Aims to create master records for clients, merchandise, and business entities from internal and external data sources, ensuring consistency and reliability.
  • Data Classification: Organizes data into groups based on sensitivity, such as PII, PHI, and financial data, defining each group’s level of importance.
  • Data Lineage: Illustrates the flow of data from sources through ETL processes, aiding in error investigation, enhancing understanding, and supporting data classification.
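As a toy illustration of how data classification can drive masking, the sketch below keeps a small classification dictionary (in practice this metadata would live in a data catalog) and hashes PII columns before data is shared with analysts, reusing the hypothetical customer dimension from the earlier sketch. The column names and labels are assumptions.

```python
# Toy example: column-level classification metadata used to mask PII columns.
import hashlib

import pandas as pd

classification = {
    "email": "PII",
    "phone": "PII",
    "revenue": "Finance",
    "segment": "Public",
}

def mask_pii(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with PII columns replaced by short hashes."""
    masked = df.copy()
    for column, label in classification.items():
        if label == "PII" and column in masked.columns:
            masked[column] = masked[column].astype(str).map(
                lambda value: hashlib.sha256(value.encode()).hexdigest()[:12]
            )
    return masked

analyst_view = mask_pii(pd.read_parquet("warehouse/dim_customer.parquet"))
```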

Level 3

In organizations dealing with massive amounts of data, real-time data, and diverse sources, a traditional data warehouse may prove insufficient for analytical purposes. Data warehouses typically require time-consuming data modeling, schema definition (schema-on-write), and structured data. Additionally, the sheer volume of data can become a limiting factor for some data warehouse engines. This poses a challenge, especially when the market demands swift actions to stay ahead of competitors.

Fortunately, the introduction of the data lake addresses these challenges. A data lake can efficiently store vast volumes of data at a low cost, including unstructured data. Unlike traditional data warehouses, a data lake adopts a schema-on-read approach, allowing data to be stored without predefined schema definitions. To process this data, Apache Spark is often employed, working on a compute cluster to handle large-scale data processing.

While a modern data lake may still require data cleansing and file format unification, these tasks are generally less labor-intensive than activities associated with a full ETL (extract, transform, load) process. This flexibility supports the rapid onboarding and analysis of new data sources.
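A minimal PySpark sketch of this schema-on-read pattern: raw JSON events are landed as-is, the schema is inferred only when the data is read, and a lightly cleansed copy is written back to the lake as Parquet. The paths and column names are illustrative.

```python
# Schema-on-read sketch: read raw JSON from the lake, clean it lightly,
# and write the result back as partitioned Parquet.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("raw-to-clean").getOrCreate()

# Schema is inferred at read time rather than defined up front.
raw_events = spark.read.json("s3://datalake/raw/clickstream/2024/01/")

clean_events = (
    raw_events
    .dropDuplicates(["event_id"])
    .withColumn("event_date", F.to_date("event_timestamp"))
    .filter(F.col("user_id").isNotNull())
)

(
    clean_events.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://datalake/clean/clickstream/")
)
```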

In this setup, the data engineering team is responsible for ingesting, cleaning, and enriching data in the data lake. Subsequently, the data science team can analyze this data to extract vital business information, aiding in making informed decisions or launching new products. Another notable feature of the data lake is its centralization within the organization, facilitating the breakdown of data silos, promoting data democratization, and reducing duplication of data solutions in large organizations.

In this evolved data platform, it’s important to note that the data lake doesn’t replace the data warehouse; instead, the two can coexist, complementing each other in a two-tier architecture. The data lake excels at storing vast volumes of diverse and unstructured data with a schema-on-read approach, facilitating rapid onboarding and analysis of new data sources. Meanwhile, the data warehouse remains pivotal for storing aggregated and modeled data. This structured data from the data warehouse can seamlessly integrate with business intelligence tools, providing a comprehensive solution for analytical purposes. The symbiotic relationship between the data lake and data warehouse enhances the organization’s ability to handle diverse data types and meet both real-time and structured analytical needs.

Level 3

An alternative approach to the two-tier architecture is the lakehouse, where a data model is built directly over a data lake without the need for a separate data warehouse engine. The intricacies and distinctions between these architectures are explored in detail in my article ‘Data Lakehouse vs Data Warehouse vs Data Lake — Comparison of Data Platforms’.

Level 3 — Lakehouse

An integral aspect of this advanced architecture is the incorporation of streaming analytics, which necessitates the adoption of new tools like Apache Kafka or public cloud native solutions such as Azure Event Hub, Azure Stream Analytics, GCP Dataflow, Pub/Sub, and Amazon Kinesis. Streaming analytics provides a real-time perspective on data and processes, enabling organizations to monitor activities in real time and respond swiftly compared to traditional batch processing systems, where data is ‘refreshed’ once or a few times a day.

Implementing streaming allows you to transmit changes from transaction systems in real time, typically through a ‘Change Data Capture’ (CDC) process that streams changes from the transaction system’s database. In a finance company, for example, this enables a response to suspicious transactions within seconds.
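Below is a hedged sketch of consuming such CDC events from a Kafka topic with Spark Structured Streaming and appending them to the data lake. The broker address, topic name, and payload schema are assumptions, and the Kafka connector package for Spark must be available on the cluster.

```python
# Sketch: stream CDC events from Kafka into the data lake as Parquet.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("cdc-stream").getOrCreate()

# Assumed shape of the change events; adjust to the actual CDC payload.
payload_schema = StructType([
    StructField("transaction_id", StringType()),
    StructField("account_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("changed_at", TimestampType()),
])

changes = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "transactions.cdc")
    .load()
    .select(F.from_json(F.col("value").cast("string"), payload_schema).alias("c"))
    .select("c.*")
)

query = (
    changes.writeStream
    .format("parquet")
    .option("path", "s3://datalake/raw/transactions_cdc/")
    .option("checkpointLocation", "s3://datalake/_checkpoints/transactions_cdc/")
    .outputMode("append")
    .start()
)
```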

Another avenue is the utilization of Internet of Things (IoT) processes, which can be employed to gather information from sensors in various environments, such as factories. This approach allows for the detection of unwanted events like equipment failures and environmental changes such as temperature fluctuations or pipeline leaks.

Level 4

At Level 4 of the data platform evolution, we leverage state-of-the-art technologies, prominently featuring the use of data lakes or lakehouses to store vast datasets in modern formats like Delta Lake, Iceberg, Apache Hudi, or Parquet. These formats let data engineers store data in a highly compressed columnar form that supports analytical queries; all except plain Parquet also support transactions and merge/update/delete operations. The adoption of these formats represents a significant advancement in data storage efficiency.
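As an example of these merge capabilities, an upsert into a Delta table can be expressed as a MERGE, as in the sketch below, assuming the delta-spark package and a session configured with the Delta extensions; Iceberg and Hudi offer comparable merge semantics. The paths and join condition are illustrative.

```python
# Sketch: upsert daily changes into a Delta table in the lakehouse.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder.appName("delta-upsert")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# New and changed customer rows produced by an upstream extract.
daily_updates = spark.read.parquet("s3://datalake/raw/customers_changes/")

customers = DeltaTable.forPath(spark, "s3://datalake/silver/customers")
(
    customers.alias("t")
    .merge(daily_updates.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```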

Furthermore, our platform excels in analyzing real-time data feeds from diverse sources through streaming technologies. While presenting real-time insights in reports, we’ve also embarked on harnessing the power of Machine Learning (ML) models. These models play a crucial role in detecting anomalies, predicting equipment failures, identifying fraudulent activities, forecasting sales trends, and classifying clients.

At this advanced level, decision-making goes beyond relying solely on current numbers; we integrate forecasts derived from machine learning models. This transformative approach enables us to not only respond to present circumstances but also proactively plan based on predictive analytics.

Level 4

Machine Learning Models

Machine learning models come in various types, each designed for specific tasks:

  • Classification: Used for pattern recognition, it can label entities like client types, images, and document categories.
  • Regression: Investigates the relationship between independent variables (features) and a dependent variable (outcome), commonly employed for apartment price prediction, sales forecasting, and similar applications.
  • Clustering: Groups similar data points into clusters or unlabeled groups, beneficial for tasks like anomaly detection and market segmentation (see the sketch after this list).
  • Time-series: Predicts future numerical values based on time-series data, applicable to scenarios such as forecasting sales, demand, calls, or web traffic.
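As a tiny, self-contained example of the clustering type above, scikit-learn’s KMeans can group customers by spend and visit frequency; the feature values here are made up for illustration.

```python
# Toy clustering example: segment customers by monthly spend and visit count.
import numpy as np
from sklearn.cluster import KMeans

features = np.array([
    [120.0, 3], [95.0, 2], [1500.0, 25], [1720.0, 30], [20.0, 1], [15.0, 1],
])  # columns: [monthly_spend, visits]

segments = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(features)
print(segments)  # one cluster label per customer, e.g. [0 0 1 1 2 2]
```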

To deploy a machine learning process, the following steps are typically involved:

  1. Define the use case: Collaborate with business users to determine what the model should predict or recognize.
  2. Get the data: Identify data sources and ingest the data into a data lake or lakehouse.
  3. Prepare the data: Explore and clean the data, transforming it according to the model’s requirements.
  4. Train the model: Choose an algorithm and hyperparameters, then evaluate the results.
  5. Deploy the model: Utilize the trained model to generate the requested results.

Training a model can be accomplished using open-source libraries and on-premise or public cloud infrastructure. For instance, in Python, tools like Pandas, Scikit-Learn, PyTorch, and TensorFlow can be used. Public cloud services like Azure Machine Learning, BigQuery ML, Vertex AI, Amazon SageMaker, and Redshift ML are also valuable options. Additionally, machine-learning models can be dockerized and hosted in the public cloud. To orchestrate ML processes, tools such as Azure Data Factory, Airflow, Prefect, GCP Workflows, and AWS Step Functions can be employed.
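A minimal end-to-end sketch of these steps with scikit-learn might look like the following; the dataset path, column names, and model choice are assumptions for illustration.

```python
# Sketch: train, evaluate, and persist a churn classification model.
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Features prepared earlier in the lake/lakehouse (hypothetical path and columns).
data = pd.read_parquet("lake/features/customer_churn.parquet")
X = data.drop(columns=["churned"])
y = data["churned"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

joblib.dump(model, "models/churn_model.joblib")  # hand off to the deployment step
```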

AI services and Generative AI

The utilization of public cloud services grants access to ready-to-use AI services, featuring pre-trained models with customizable APIs. Examples of these services span various domains, including natural language processing for conversations, search, monitoring, translation, speech, and vision. Leveraging these services empowers businesses to develop cutting-edge products that automate and enhance various processes. For instance, speech-to-text conversion can be employed to analyze customer questions, and vision services can be used to analyze images.

While traditional AI solutions focused on understanding and recommending information, the new generation of Generative AI goes a step further by enabling the creation of entirely new content. Built on technologies like large language models (LLMs), which are trained on extensive textual data, Generative AI can generate not only text but also images, videos, and audio. As of this writing, notable Generative AI offerings include OpenAI’s ChatGPT, Google’s Gemini, Bard, and Duet AI. These services can be used as built-in assistants or integrated into applications. As built-in assistants, they can enhance tasks such as writing queries in BigQuery, aiding code creation in Visual Studio Code, or improving marketing materials.
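Integrating such a model into an application can be as simple as a single API call. The sketch below assumes the openai Python package (v1+) and an API key in the environment; the model name and prompt are illustrative, and other providers expose similar chat-style APIs.

```python
# Sketch: call a hosted LLM from an application.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",  # illustrative model choice
    messages=[
        {"role": "system", "content": "You are an assistant for data analysts."},
        {"role": "user", "content": "Summarize last week's sales anomalies in plain language."},
    ],
)
print(response.choices[0].message.content)
```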

Level 5

While we may not have reached Level 5 yet, ongoing innovation in AI, especially Generative AI, is poised to shape the next stage of data platform evolution. As progress continues, I foresee a significant share of tasks and analyses involving effortless interaction with AI: users will ask questions about the data that interests them and receive prompt, insightful answers. This evolution promises to vastly improve access to information within organizations. The only question that remains is to what extent we can push the boundaries of AI utilization, and at what cost. The intersection of AI capabilities and cost-effectiveness will likely define the limits of this transformative journey.

Level 5 — my hypothetical assumption

Data Mesh and Fabric

In the ongoing evolution of analytical platforms, it’s crucial to consider emerging architectural trends, with particular attention to concepts like Data Mesh and Fabric. Data Mesh, introduced by Zhamak Dehghani in 2019, proposes a decentralized data platform model where domain-oriented teams are responsible for delivering, maintaining, and ensuring the quality of their respective data products. The idea is to foster collaboration between domain teams, improving development speed and allowing teams to focus on their areas of expertise. While some publications suggest that Data Mesh represents a new generation of data platforms, it’s essential to acknowledge that its decentralized nature may be more suitable for larger organizations with multiple data teams. In contrast, it may not be as effective in medium or small organizations where a single data team handles all reporting needs.

On the other hand, Fabric emphasizes the centralization of activities associated with a data platform, encompassing data ingestion, ETL, reporting, and data governance. This centralized approach aims to address challenges related to data silos, promote data sharing, and enhance collaboration. However, it’s essential to be mindful that a centralized model might introduce bottlenecks in delivery, especially as an organization experiences rapid growth.

For organizations considering both concepts, a careful evaluation of the benefits and disadvantages of each is crucial. It’s important to select the approach that aligns best with the organization’s specific needs and challenges, rather than blindly following the latest trends.

Summary

The development of a data platform involves a combination of various indicators, including architecture approach, data governance, scalability, real-time data processing, big data processing, advanced analytics, and more. Our journey through different stages of data platform development reveals that starting with a simple solution and progressively adding new components or advanced analytics such as ML and AI allows us to adapt and cover evolving business needs.

In data platform development, maturity is measured by how you utilize data and technology. If you rely solely on querying a single transaction system, you are at the initial stage of this journey. Incorporating ML and AI, however, places you at the forefront, enabling you to address more sophisticated business needs and outperform competitors. Of course, the extent of your possibilities is closely linked to your organization’s size and development. Moving to a higher level involves not only implementing technology but also fostering “data education” within your company.

If you found this article insightful, I invite you to express your appreciation by liking it on LinkedIn and clicking the ‘clap’ button. Your support is greatly valued. For any questions or advice, feel free to reach out to me on LinkedIn.
