Optimizing Data Pipelines with Cribl Stream: An Overview

Muhammed Yusuf Özkan
Published in Trendyol Tech · Jan 17, 2024

In today’s data-driven world, effectively managing, processing, and using data is essential for cybersecurity. Security teams need to collect and audit a wide range of data sources to stay on top of the threat landscape, and they have to make sure that every critical system and application is logged and monitored for potential threats.

Appliances, servers, applications, and third-party products need to be audited for both security and regulatory reasons. For large organizations, managing these data flows is complex in itself: processing, transforming, filtering, and routing data coming from a myriad of sources is a full-time job.

In recent years, we have seen new products enter the market aimed at making this process a little easier. Cribl Stream is one of them. In this article, we will take a quick look at Cribl Stream and its capabilities.

So, What is Cribl Stream?

Cribl is relatively new in the data streaming space. It has a suite of products: Cribl Edge, Cribl Stream, Cribl Search, and Cribl Cloud. This article focuses on Cribl Stream.

Cribl Stream is a data streaming tool designed to help with managing, processing, and routing data. It sits between your data sources and destinations and lets you collect, filter, clone, and shape your data before routing it to the designated destination(s).

What makes Cribl Stream impressive is:

  • ingestion from a wide variety of systems and applications
  • data processing and transformation capabilities
  • flexible and easily manageable data route management
  • a wide variety of routing options in terms of platforms, tools, and systems
  • ease of scalability

Data Flow with Cribl Stream (cribl.io)

In an example scenario, we can forward data from a Splunk Universal Forwarder to Cribl Stream; filter, enrich, and transform the data; and then send part of it to S3 buckets and the rest to Splunk Indexers. All from the UI, without editing a single configuration file.

Getting Data In

A full view of the supported data input methods can be seen in the pictures below.

Input methods 1
Input methods 2

We can see that all common data input methods are available, such as Syslog, file monitoring, SNMP, HTTP, S3 bucket reads, databases, and scripted inputs. It is also possible to collect data from other data platforms, for example by exporting data from a Splunk Search Head or the Elasticsearch API.

What sets Cribl Stream apart is its ability to collect logs from other platforms’ agents: Splunk Universal Forwarder, Beats, OpenTelemetry, and so on. This makes it possible to consider Cribl Stream as a replacement for a Splunk Heavy Forwarder or Logstash.

For example, on this menu, we’ll opt for Splunk TCP to collect logs from the Splunk Universal Forwarder.

Setting up Splunk TCP Input

Setting up the Splunk TCP input is pretty straightforward. We have to name the input so that we can reference it in Pipelines or Routes. The Address section lets us whitelist the host IPs that are allowed to send data through this port.

Setting up persistent queue

This section lets us set up a persistent queue for the input, meaning that if the destination is unavailable for any reason, Stream can store data until the queue reaches the “Max queue size”. Setting up event breakers, fields, and pre-processing is also a breeze with the sections on the left-hand side.

After setting up Cribl Stream to listen to Splunk TCP data on port 9997, we need to configure Splunk Universal Forwarders to send data to Stream instead of Splunk Indexers.

Configuring outputs.conf
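
For reference, the relevant part of outputs.conf on a Universal Forwarder looks roughly like the sketch below. The output group name and worker addresses are placeholders; point them at your own Cribl Stream worker(s) listening on the Splunk TCP port we opened (9997):

    # outputs.conf on the Splunk Universal Forwarder (placeholder values)
    [tcpout]
    defaultGroup = cribl_stream

    [tcpout:cribl_stream]
    # one or more Cribl Stream workers listening on the Splunk TCP input
    server = cribl-worker-1.example.local:9997, cribl-worker-2.example.local:9997

After a forwarder restart, events should start flowing to Stream instead of the Indexers.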

We can check Live Data under the Data > Sources section to see whether we’re receiving data from the Splunk Universal Forwarders. We can also use the “Capture” button to capture and save a portion of the data, so we can test our pipelines with it later.

Checking data input via Live Data

Pipelines, Routes and Destinations

After getting data in via sources, we’ll be using Routes and Pipelines before shipping data to destinations.

An overview of the architecture looks like this:

An overview of Cribl Stream in detail

Pipelines are the operations we define to transform, filter, and enrich data. We build pipelines from functions predefined by Cribl, and we attach them to Routes.

We can think of Routes as production lines for data: on a Route, we define the transformations (via Pipelines) that will be applied to the data and the destination the data is sent to after transformation.

Routes can have multiple Pipelines attached to them, and they can send data to multiple Destinations.

Setting up Destinations

Before setting up Pipelines and Routes, we need to set up some destinations. There are plenty to choose from.

Destinations 1
Destinations 2

For demonstration purposes, we’ll be configuring two of them: Syslog and Splunk Indexers.

Splunk Load Balanced Destination configuration

Here, we configure the Splunk Load Balanced Destination to forward data to the Splunk Indexers. We can either define the Indexer IPs manually or enable indexer discovery, in which case we use the Cluster Master IP instead. Setting up a persistent queue is also possible in this window; the difference is that, in case of a problem with the Splunk Indexers, the processed data is persisted rather than the pre-processed data.

Configuration screen for the Syslog Destination can be seen below:

Syslog Destination configuration

After setting up Destinations, we can see their status inside the “Manage Destinations” screen:

Destination Status

Creating Pipelines

Having set up Sources and Destinations, we can start to create pipelines.

There are predefined pipelines that come out of the box with Cribl Stream. Generally, the naming of the pipelines is pretty self-explanatory:

devnull: Sends logs to a null queue. We can use the devnull pipeline to drop logs instead of sending them to a destination.

main: The default pipeline that comes out of the box; we can configure this if we are planning to apply a default transformation for sources.

passthru: As the name suggests, passes logs through without any transformation.

If we want to create a Pipeline, we are greeted with the screen below:

Pipeline Creation Page

The left side of the page is where we define the functions that will transform or filter the data. The right side shows the real-time output of the defined functions. We can capture real-time data to preview our Pipeline, or we can use previously captured data from the “Samples” section. There is also “Datagens” to generate sample data. After choosing a data sample, we can preview the Pipeline by clicking the “Sample Preview” or “Full Preview” buttons.

There are plenty of functions that we can use to create a Pipeline; a complete list of them and their functionality can be found here.

Creating A Pipeline on Cribl Stream.

Above, we created a pipeline for the Splunk internal logs, which masks IP addresses in the _raw and host_ip fields (not sure why but hey, again, for demonstration purposes) and extracts the component info into a new field. The effect of the pipeline can be observed in the panel on the right. Neat…
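
The exact settings live in the UI, but a rough sketch of the two functions behind such a pipeline could look like the following. The regexes and field names here are illustrative (they are not copied from the screenshot); the idea is simply a Mask function that rewrites anything shaped like an IPv4 address, plus a Regex Extract function that pulls the component name out of _raw via a named capture group:

    Function: Mask
      Fields:              _raw, host_ip
      Match regex:         (\d{1,3}(?:\.\d{1,3}){3})
      Replace expression:  'xxx.xxx.xxx.xxx'

    Function: Regex Extract
      Source field:        _raw
      Regex:               ^\S+\s+\S+\s+\S+\s+\w+\s+(?<component>[^\s\[]+)

If the masked values still need to be correlatable, the replace expression can hash the captured group instead of substituting a literal.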

We can also create a new pipeline for the Linux internal audit logs.

Here, we are again masking IP addresses and extracting the comm field for the sourcetype “linux:audit”.
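
A minimal sketch of this second pipeline, assuming the audit records carry a comm="..." key-value pair (the regex is illustrative, not taken from the screenshot), is the same Mask rule as before plus one Regex Extract function:

    Function: Regex Extract
      Source field:        _raw
      Regex:               comm="(?<comm>[^"]+)"

After saving our newly created pipelines, we are ready to set up our Routes.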

Setting Up Routes

I mentioned that Routes are like production lines for data. With their help, we decide where the data goes and which operations (Pipelines) are applied to it along the way. In the screenshot below, we can see a sample Route configuration.

Sample Route configuration in Cribl Stream

On the first route, we filter the sourcetype “splunkd” from the Splunk TCP input, apply the pipeline that we’ve created for Splunk internal logs, and route the result to the Splunk Indexers to be ingested by Splunk. On the second route, we filter the “linux:audit” logs from the Splunk TCP input, apply the pipeline that we created for Linux audit logs, and forward them to the Syslog output that we’ve created.
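
Laid out as a table, the two routes boil down to something like this (the route, pipeline, and output names are placeholders; the filters are plain JavaScript expressions, which is how Cribl Stream evaluates route filters):

    Route            Filter                      Pipeline         Output
    splunk_internal  sourcetype=='splunkd'       splunk_internal  splunk_indexers
    linux_audit      sourcetype=='linux:audit'   linux_audit      syslog_out

Each route also has a Final flag; when it is enabled, events matched by that route are not evaluated against the routes below it.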

The main takeaway here is that we can filter the exact data we like and apply the operations we want.

Wrapping Up

Cribl Stream is a powerful tool that makes data management, transformation, and routing easy. Moving the hard part of data transformation outside of your data platform also helps prevent vendor lock-in. Its flexibility in integrating with various systems and platforms, combined with the many functions it provides for transforming data through a user-friendly UI, makes it a compelling option.

Join Us

We’re building a team of the brightest minds in our industry. Interested in joining us? Visit the pages below to learn more about our open positions.

https://jobs.lever.co/trendyol
