Writing critical syslog events to Apache Iceberg for analysis

Tim Spann
Cloudera
Jun 25, 2023

A few weeks have passed since you built your data flow with DataFlow Designer to filter out critical syslog events to a dedicated Kafka topic. Now that everyone has better visibility into real-time health, management wants to do historical analysis on the data. Your company is evaluating Apache Iceberg to build an open data lakehouse and you are tasked with building a flow that ingests the most critical syslog events into an Iceberg table.

Ensure your table is built and accessible.

Create an Apache Iceberg Table

  1. From the Home page, click Data Hub Clusters, then navigate to oss-kudu-demo in the Data Hubs list.
  2. Navigate to Hue from the Kudu Data Hub.
  3. Inside Hue you can now create your table. You will have your own database to work with. To get to your database, click the ‘<’ icon next to the default database. You should see your database in the format <YourEmailWithUnderscores>_db. Click on your database to open the SQL Editor.
  4. Create your Apache Iceberg table with the SQL below, clicking the play icon to execute the query. Note that the table name must be prefixed with your Workload User Name (userid).

CREATE TABLE <<userid>>_syslog_critical_archive
(priority int, severity int, facility int, version int, event_timestamp bigint, hostname string,
 body string, appName string, procid string, messageid string,
 structureddata struct<sdid:struct<eventid:string,eventsource:string,iut:string>>)
STORED BY ICEBERG;
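
If you want to double-check that the table was built and is accessible before wiring up the flow, a quick look in the same Hue SQL editor works; SHOW TABLES and DESCRIBE are standard in both Hive and Impala (the table name below is the same placeholder used above):

-- list the tables in your database, then inspect the new table's schema
SHOW TABLES;
DESCRIBE <<userid>>_syslog_critical_archive;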

  5. Once you have sent data to your table, you can query it; a sample query is sketched below.
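
A sketch of such a query, assuming the schema created above (swap in your own userid prefix; exact behavior depends on whether you run it from Hive or Impala in Hue):

-- count the archived critical events, then look at the most recent ones
SELECT count(*) FROM <<userid>>_syslog_critical_archive;

SELECT event_timestamp, hostname, severity, facility, appName, body
FROM <<userid>>_syslog_critical_archive
ORDER BY event_timestamp DESC
LIMIT 100;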

2.1 Open ReadyFlow & start Test Session

  1. Navigate to DataFlow from the Home Page
  2. Navigate to the ReadyFlow Gallery
  3. Explore the ReadyFlow Gallery
  4. Search for the “Kafka to Iceberg” ReadyFlow.
  5. Click “Create New Draft” to open the ReadyFlow in the Designer. Name it yourid_kafkatoiceberg, e.g. tim_kafkatoiceberg.
  6. Start a Test Session by either clicking the Start a test session link in the banner or going to Flow Options and selecting Start in the Test Session section.
  7. In the Test Session creation wizard, select the latest NiFi version and click Start Test Session. Notice how the status at the top now says “Initializing Test Session”.

2.2 Modifying the flow to read syslog data

The flow consists of three processors and looks very promising for our use case. The first processor reads data from a Kafka topic, the second batches events into larger files, and the third, PutIceberg, writes those files out to Iceberg.
All we have to do now to reach our goal is customize the flow's configuration for our use case.

  1. Provide values for the predefined parameters.
  2. Navigate to Flow Options → Parameters.
  3. Select all parameters that show No value set and provide the following values:

Name | Description | Value
CDP Workload User | CDP Workload User | <Your own workload user name>
CDP Workload User Password | CDP Workload User Password | <Your own workload user password>
