End-to-End Real-Time Data Engineering: A Project Template Guide with Kafka, Spark, PostgreSQL, and Dash | Part 3: Interactive Visualizations with Plotly Dash

Fatih KIR
9 min read · May 19, 2024


Final Dashboard

Introduction

In the first part of our series, “End-to-End Real-Time Data Engineering,” we laid the foundation for our data pipeline by setting up data production and messaging using Kafka. We created a Kafka Python producer to simulate log data and configured Kafka with Docker Compose. Moving on to Part 2, we delved into real-time data processing and storage by integrating Apache Spark for stream processing and PostgreSQL for persistent storage.

Now, in Part 3, we will complete our real-time data engineering journey by creating interactive visualizations using Plotly Dash and build the dashboard in the above GIF. Interactive dashboards play a crucial role in making data insights accessible and actionable, allowing stakeholders to monitor key metrics in real time.

Here’s what we’ll cover in this final part of the series:

  • Building the Dash Application: We’ll develop the logic for our Dash application, connecting it to our PostgreSQL database to read and visualize real-time data. This will involve creating interactive charts and graphs to monitor key metrics.
  • Setting Up Plotly Dash with Docker Compose: We’ll add a Dash service to our Docker Compose setup, ensuring seamless integration with our existing services. This approach simplifies deployment and ensures consistency across different environments.
  • Deploying and Accessing the Dashboard: After building the Dash application, we’ll deploy it using Docker Compose and demonstrate how to access the dashboard through a web browser. We’ll verify that the real-time visualizations are working correctly.

Building the Dash Application

Our Dash application requires an application script, app.py, that contains the main logic for the dashboard; a requirements.txt file for the dependencies that need to be installed inside the Docker Compose environment; and a Dockerfile that lets us integrate our Dash service with the other services. Create a separate folder for this service (we will call it dash_app, matching the build path used later in Docker Compose) to keep the program modular. The folder structure should look like this:
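
dash_app/
├── app.py            # main dashboard logic
├── requirements.txt  # Python dependencies for the Dash service
└── Dockerfile        # image definition used by Docker Compose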

Let’s start with app.py by importing the necessary libraries:

import dash
from dash import dcc, html
from dash.dependencies import Input, Output
import dash_bootstrap_components as dbc
import plotly.graph_objs as go
import pandas as pd
import sqlalchemy
from dash_bootstrap_templates import load_figure_template
  • dash, dcc, html, Input, Output: The core components and functions from the Dash library. dcc (Dash Core Components) and html (Dash HTML Components) are used to create interactive user interfaces. Input and Output are used for setting up callbacks to update the app dynamically.
  • dash_bootstrap_components: Provides Bootstrap-themed components for Dash applications. It helps in creating visually appealing layouts with minimal effort.
  • plotly.graph_objs: Part of the Plotly library, which is used for creating interactive plots and charts.
  • pandas: A powerful data manipulation and analysis library for Python.
  • sqlalchemy: A SQL toolkit and Object-Relational Mapping (ORM) library for Python. It is used here to manage database connections and queries.
  • dash_bootstrap_templates: Used to load Bootstrap themes for the Dash app.

Then, load the theme for the dashboard and initialize the application:

# Load the dark theme for the entire app
load_figure_template("DARKLY")

app = dash.Dash(__name__, external_stylesheets=[dbc.themes.DARKLY])
  • load_figure_template("DARKLY"): Loads the "DARKLY" theme for the entire app, giving it a dark mode appearance.
  • dash.Dash: Initializes the Dash application with the “DARKLY” Bootstrap theme.

Now, we can define the layout of our application:

app.layout = html.Div(
    [
        dbc.NavbarSimple(
            brand="Real-time Application Analysis",
            brand_href="#",
            color="dark",
            dark=True,
        ),
        dbc.Container(
            [
                dbc.Row(
                    [
                        dbc.Col(dcc.Graph(id="app-1-gauge"), width=4),
                        dbc.Col(dcc.Graph(id="app-2-gauge"), width=4),
                        dbc.Col(dcc.Graph(id="app-3-gauge"), width=4),
                    ],
                    align="center",
                ),
                dbc.Row([dbc.Col(dcc.Graph(id="error-trends"), width=12)]),
                dbc.Row(
                    [
                        dbc.Col(dcc.Graph(id="latency-metrics"), width=6),
                        dbc.Col(dcc.Graph(id="request-distribution"), width=6),
                    ]
                ),
                dcc.Interval(
                    id="interval-component", interval=10 * 1000, n_intervals=0
                ),
            ],
            fluid=True,
        ),
    ]
)
  • html.Div: A container for all HTML elements.
  • dbc.NavbarSimple: A simple navigation bar with a brand name “Real-time Application Analysis” and a dark color theme.
  • dbc.Container: A Bootstrap container to hold the layout elements, set to be fluid to take the full width of the screen.
  • dbc.Row: A Bootstrap row to organize the layout into columns.
  • dcc.Graph: Dash core component to render interactive graphs.
  • dcc.Interval: A component to update the graphs at regular intervals, here set to update every 10 seconds.

After that, we should define a callback function to fill this layout:

@app.callback(
    [
        Output("app-1-gauge", "figure"),
        Output("app-2-gauge", "figure"),
        Output("app-3-gauge", "figure"),
        Output("error-trends", "figure"),
        Output("latency-metrics", "figure"),
        Output("request-distribution", "figure"),
    ],
    Input("interval-component", "n_intervals"),
)
def update_metrics(n):
    # Create database connection
    engine = sqlalchemy.create_engine("postgresql://admin:admin@postgres:5432/logs")

    # Using context managers for handling database connections
    with engine.connect() as conn:
        df = pd.read_sql(
            "SELECT * FROM application_metrics ORDER BY enddate DESC LIMIT 3", conn
        )

        error_data = pd.read_sql(
            """
            SELECT application_id, enddate, error_rate
            FROM application_metrics
            ORDER BY application_id, enddate
            LIMIT 300;
            """,
            conn,
        )
  • @app.callback: A decorator to define the callback function that updates the graphs. It takes the output components (the graphs) and input components (the interval component).
  • update_metrics(n): The function that fetches the data from the PostgreSQL database and updates the graphs. It is triggered every time the interval component updates.
  • sqlalchemy.create_engine: Creates the engine that manages connections to the PostgreSQL database (see the note after this list).
  • pd.read_sql: Reads data from the database into a pandas DataFrame.
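
Note that, because the callback fires on every interval tick, this version builds a brand-new SQLAlchemy engine every 10 seconds. If you would rather create the engine once and reuse its connection pool across callback runs, a minimal sketch of that variation (same connection string, same queries, the @app.callback decorator unchanged) would be:

# Create the engine once at module level; SQLAlchemy pools connections behind it
engine = sqlalchemy.create_engine("postgresql://admin:admin@postgres:5432/logs")


def update_metrics(n):
    # Borrow a pooled connection instead of rebuilding the engine on every tick
    with engine.connect() as conn:
        df = pd.read_sql(
            "SELECT * FROM application_metrics ORDER BY enddate DESC LIMIT 3", conn
        )
        # ... the rest of the callback body stays exactly the same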

After handling the database connection and retrieving the data, we can define our visualizations. Let’s start with the gauges at the top of the dashboard:

    # Create gauges for the latest error rates as percentages
    gauges = []
    for i in range(1, 4):
        app_data = df[df["application_id"] == f"app_{i}"]
        error_rate = app_data["error_rate"].values[0] if not app_data.empty else 0
        gauge = go.Figure(
            go.Indicator(
                mode="gauge+number",
                value=error_rate,
                domain={"x": [0, 1], "y": [0, 1]},
                title={"text": f"App {i} Error Rate", "align": "center"},
                gauge={"axis": {"range": [None, 100]}},
            )
        )
        gauge.update_layout(template="plotly_dark")
        gauges.append(gauge)
  • gauges: A list to hold the gauge figures.
  • for i in range(1, 4): Loop through three applications.
  • app_data = df[df["application_id"] == f"app_{i}"]: Filter the DataFrame for each application.
  • go.Indicator: A Plotly component to create gauge charts.
  • gauge.update_layout: Apply the “plotly_dark” template to the gauge chart.

Now let’s continue with the line chart in the middle:

    # Error Trends Plot
    error_trends = go.Figure()
    for app_id in error_data["application_id"].unique():
        app_data = error_data[error_data["application_id"] == app_id]
        error_trends.add_trace(
            go.Scatter(
                x=app_data["enddate"],
                y=app_data["error_rate"],
                mode="lines+markers",
                name=f"App {app_id}",
            )
        )

    error_trends.update_layout(
        title="Error Rate Trends Over Time",
        xaxis_title="Time",
        yaxis_title="Error Rate (%)",
        legend_title="Application ID",
        template="plotly_dark",
    )
  • error_trends = go.Figure(): Initialize a Plotly figure for error trends.
  • for app_id in error_data["application_id"].unique(): Loop through each unique application ID.
  • go.Scatter: A Plotly component to create scatter plots.
  • error_trends.add_trace: Add each application’s error rate data as a trace to the figure.
  • error_trends.update_layout: Set the layout properties, including titles and the “plotly_dark” template.

And we wrap up our plots with the bar charts at the bottom of the dashboard:

    # Latency Metrics Plot
    latency_metrics = go.Figure(
        go.Bar(
            x=df["application_id"],
            y=df["average_latency"],
            text=df["average_latency"],
            textposition="auto",
        )
    )
    latency_metrics.update_layout(
        title="Average Latency per Application",
        xaxis_title="Application ID",
        yaxis_title="Latency (ms)",
        template="plotly_dark",
    )

    # Request Distribution Plot
    request_distribution = go.Figure()
    request_types = ["get_requests", "post_requests", "put_requests", "delete_requests"]
    for request_type in request_types:
        request_distribution.add_trace(
            go.Bar(
                x=df["application_id"],
                y=df[request_type],
                name=request_type.split("_")[0].upper(),
            )
        )

    request_distribution.update_layout(
        barmode="group",
        title="Request Distribution per Application",
        xaxis_title="Application ID",
        yaxis_title="Number of Requests",
        template="plotly_dark",
    )
  • latency_metrics = go.Figure(go.Bar()): Initialize a Plotly figure with a bar chart for latency metrics.
  • go.Bar: A Plotly component to create bar charts.
  • latency_metrics.update_layout: Set the layout properties, including titles and the “plotly_dark” template.
  • request_distribution = go.Figure(): Initialize a Plotly figure for request distribution.
  • for request_type in request_types: Loop through each request type (GET, POST, PUT, DELETE).
  • go.Bar: Create a bar for each request type.
  • request_distribution.add_trace: Add each request type as a trace to the figure.
  • request_distribution.update_layout: Set the layout properties, including titles and the “plotly_dark” template.

Finally, we return all of our plots with:

    return gauges + [error_trends, latency_metrics, request_distribution]

That’s it for our callback function. In your own applications, you can change the callback function and layout to create your own interactive, live dashboards. As the last touch on our app.py file, we need to start the application by adding these lines at the end of the script:

if __name__ == "__main__":
    app.run_server(debug=False, host="0.0.0.0")
  • if __name__ == "__main__": Ensures the server runs only when this script is executed directly.
  • app.run_server(debug=False, host="0.0.0.0"): Starts the Dash server with debug mode off and the host set to "0.0.0.0", so the app is reachable from outside the container through the port that Docker will publish.

The major part of the Dash application is now done. All that remains is to add this requirements.txt file and a Dockerfile so that the service can be built as a Docker image in our overall Docker Compose setup:

dash
pandas
plotly
sqlalchemy
psycopg2-binary
dash-bootstrap-components
dash-bootstrap-templates

Dockerfile:

# Use an official Python runtime as a parent image
FROM python:3.9-slim

# Set the working directory to /app
WORKDIR /app

# Copy the current directory contents into the container at /app
COPY . /app

# Install any needed packages specified in requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Make port 8050 available to the world outside this container
EXPOSE 8050

# Run app.py when the container launches
CMD ["python", "app.py"]

That’s it. Our Dash service is ready; all we need to do now is add it to the Docker Compose file.

Setting Up Plotly Dash with Docker Compose

To add this service to docker-compose.yml and let it interact with the other services we created earlier, we just need to add the following service definition under the services section of our docker-compose file:

  dash_app:
    build: ./dash_app
    ports:
      - "8050:8050"
    depends_on:
      - kafka
      - spark-master
      - spark-worker
      - kafka-producer
      - postgres
      - spark_job
    networks:
      bridge:
        aliases:
          - dash_app
  • dash_app: The name of the service. This name is used to reference the service within the Docker Compose file and network.
  • build: ./dash_app: Specifies the build context for the service. The path ./dash_app points to the directory containing the Dockerfile and application code for the Dash service; Docker Compose uses it to build the Docker image.
  • ports: - "8050:8050": Maps port 8050 on the host machine to port 8050 in the container, making the Dash application accessible via http://localhost:8050 on the host machine.
  • depends_on: Lists the services that the Dash service depends on. Docker Compose ensures these services are started before the Dash service.
  • - kafka: The Kafka broker service, which is part of the data pipeline.
  • - spark-master: The master node of the Spark cluster, which coordinates the Spark jobs.
  • - spark-worker: The worker node(s) of the Spark cluster, which execute the Spark tasks.
  • - kafka-producer: The custom service we built that sends data into Kafka.
  • - postgres: The PostgreSQL database service, which stores the processed data.
  • - spark_job: The Spark streaming job that processes data from Kafka and writes to PostgreSQL.
  • networks: Specifies the network(s) that the service is connected to.
  • bridge: Indicates that the service is connected to the bridge network, the custom Docker network defined in the Docker Compose file that lets the services communicate with each other (see the sketch after this list).
  • aliases: - dash_app: Sets dash_app as a network alias for the service within the bridge network, so other services on the same network can refer to it by the name dash_app.
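
For reference, the bridge network used here is the custom network declared at the top level of the same docker-compose.yml in the earlier parts of this series. A minimal sketch of that top-level declaration, assuming the default bridge driver, looks like this:

networks:
  bridge:
    driver: bridge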

With the addition of our service to the docker compose, we can now run our application.

Deploying and Accessing the Dashboard

We have everything set up; now all we have to do to start all of our services is run:

docker-compose up --build

This will start all of the services one by one. You can access the dashboard at localhost:8050. You may need to wait for the services to start and for data to stream into the database before the dashboard fills up.
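
If the dashboard comes up empty at first, a quick way to verify that everything started correctly is to list the Compose services and tail the logs of the dash_app service we defined above:

docker-compose ps
docker-compose logs -f dash_app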

Final words

In this comprehensive series, “End-to-End Real-Time Data Engineering,” we built a robust real-time data pipeline using Kafka, Spark, PostgreSQL, and Dash. Our goal was to create an end-to-end solution that handles real-time data ingestion, processing, and visualization.

In Part 1, we focused on data production and messaging. We set up Kafka and created a Kafka Python producer to simulate log data, establishing a scalable messaging infrastructure.

In Part 2, we moved to real-time data processing and storage. We integrated Apache Spark for stream processing and used PostgreSQL for persistent storage. Our Spark Streaming application analyzed the data produced by our Kafka producers.

In Part 3, we developed interactive visualizations with Plotly Dash. We built a Dash application connected to our PostgreSQL database, visualizing data through dynamic charts and graphs. We integrated the Dash service into our Docker Compose setup, ensuring seamless deployment.

Throughout this series, we:

  • Built the Dash Application: Developed the logic for our Dash app, connected it to PostgreSQL, and created interactive charts and graphs.
  • Set Up Plotly Dash with Docker Compose: Added the Dash service to Docker Compose, ensuring integration with existing services and simplifying deployment.
  • Deployed and Accessed the Dashboard: Deployed the Dash application, accessed it via a web browser, and verified the real-time visualizations.

By following this series, you now have a complete template for building a real-time data engineering pipeline. This template can be customized and expanded for specific needs.

Thank you for joining this journey. I hope you found this series informative and practical. Connect with me on LinkedIn for more discussions on data engineering. The full project is available on GitHub. Let’s keep exploring and innovating in real-time data engineering!
