Network Insights in a Distributed Environment

By: Kalman Meth and Eran Raichstein
6 February 2023

Suppose we run an online e-shop with a presence in two or three data centers and customers all around the world. We want to extend our geographical presence to improve latency and response time and to meet customer demand. It would be greatly beneficial to understand our customers’ network behavior toward our e-shops, so that we can efficiently deploy more instances of the e-shop in optimal locations. We want to know where our customers are located, what latency they experience, how much traffic and how many connections we get from each location, and so on. It would be nice if our network toolbox could provide us with a map of our customer connections, along with various metrics for the required information. For example, such a map could look like the following figure.

In this map, we see the locations of customers who connected to our e-shop. How did we obtain this map? What do we need from our network toolbox to gain this insight?

Flowlogs Pipeline

In most computing environments, various types of operational data are continuously generated for diverse purposes such as performance monitoring and analysis, detection of anomalous behavior, regulatory requirements, etc. Among the countless types of network-related operational data (bulk counters, packet capture, network device logs), flow logs are:

  • More informative than bulk counters, while more concise and manageable than full packet capture;
  • Often must be collected anyway due to regulatory and compliance requirements, such as auditing.

Typically, flow logs are generated by NetFlow collectors that capture data either from network devices or via hooks into the software (e.g. eBPF).

Flowlogs Pipeline (a.k.a. FLP) is an observability tool that consumes logs from various inputs, transforms them, and exports logs and/or metrics to a chosen target (e.g. export logs to loki and/or time series metrics to prometheus). This data may then be used to perform various kinds of analyses for performance, security, accounting, alerting, or other purposes. We use the term pipeline to describe the sequence of operations performed on the data: ingest, transform, extract, encode, etc. The use of a multi-phase pipeline makes it easier to reuse specific phases while implementing new features.

In the example above of obtaining customer locations for an online service, FLP takes the provided flow logs and generates customized metrics ready for consumption by visualization and other tools.

FLP is light-weight, implemented in Go, available as open source on github, and is used in the recently released OpenShift Network Observability Operator. It provides out-of-the-box enrichment with kubernetes data, and is easily configurable to produce a variety of metrics that give insight into what is happening in your network.

Architecture

The pipeline is constructed of a sequence of stages. Each stage is classified into one of the following types:

  • ingest — obtain flows from some source, typically one entry per line;
  • transform — convert entries into a standard format; enrich the entry with contextual information; may include multiple transform stages;
  • extract — derive a set of metrics from the ingested flows;
  • encode/write — provide the means to write the data to some target, e.g. loki, prometheus, kafka, object store, standard output, etc.

The first stage in a pipeline must be an ingest stage. Each stage (other than an ingest stage) specifies the stage it follows. Multiple stages may follow from a particular stage, thus allowing the same data to be consumed by multiple parallel pipelines. For example, multiple transform stages may be performed and the results may be output to different targets.

Let’s look at our online service example.

  • (1) In this example, data is ingested in IPFIX format by the first stage (ingest_collector).
  • (2) The data is passed to a transform stage (transform_generic) to convert the names of some data fields to standardized names. Note that multiple flow log entries may apply to data associated with the same connection.
  • (3) The data is then passed to a data enrichment stage (transform_network) which adds fields related to subnets, kubernetes related identifiers, etc, to each of the flow log entries.
  • (4), (5) The resulting data is then forwarded to two distinct stages (write_loki and extract_aggregate).
  • (4) One stage (write_loki) simply outputs the flow logs to a loki repository.
  • (5) The other stage (extract_aggregate) performs some analysis of the flow logs and aggregates them according to some rules.
  • (6) The result of this aggregation is then forwarded to a stage (encode_prom) to provide the data in a format to be consumed by prometheus.

The parameters to customize each of the defined stages are specified in a yaml configuration file. By providing different configuration parameters and by arranging stages in the desired order, a user can obtain different types of metrics, aggregations, and analytics.

Configuration

A configuration file consists of two sections. The first section (called pipeline) describes the high-level flow of information between the stages, giving each stage a name and building the graph of consumption and production of information between stages. The second section (called parameters) provides the definition of specific configuration parameters for each one of the named stages. A configuration file for the example above might look like the following (with some explanations provided below).

pipeline:
- name: ingest_collector
- name: transform_generic
  follows: ingest_collector
- name: transform_network
  follows: transform_generic
- name: extract_aggregate
  follows: transform_network
- name: encode_prom
  follows: extract_aggregate
- name: write_loki
  follows: transform_network
parameters:
- name: ingest_collector
  #<ingest_collector configuration parameters>
- name: transform_generic
  #<transform_generic configuration parameters>
- name: transform_network
  #<transform_network configuration parameters>
- name: extract_aggregate
  #<extract_aggregate configuration parameters>
- name: encode_prom
  #<encode_prom configuration parameters>
- name: write_loki
  #<write_loki configuration parameters>
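
As an illustration, the parameters entry for the ingest_collector stage might be filled in roughly as follows. This is a minimal sketch; the hostName and port values are hypothetical, and the authoritative schema is in the FLP github repo under api.

- name: ingest_collector
  ingest:
    type: collector
    collector:
      # address and port on which to listen for incoming
      # NetFlow/IPFIX packets (hypothetical values)
      hostName: 0.0.0.0
      port: 2055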

For our specific example, suppose the provided flow log entries look like the following.

{
  "Bytes": 20800,
  "DstAddr": "10.128.2.13",
  "DstMac": "0a:58:0a:80:02:0d",
  "DstPort": 38222,
  "Etype": 2048,
  "FlowDirection": 1,
  "Packets": 400,
  "Proto": 6,
  "SamplingRate": 0,
  "SequenceNum": 13751,
  "SrcAddr": "10.129.2.5",
  "SrcMac": "0a:58:0a:80:02:01",
  "SrcPort": 9154,
  "TCPFlags": 0,
  "TimeFlowStart": 0,
  "TimeReceived": 1637501830,
  "Type": 4
}

The fields that are most relevant to our discussion are: SrcAddr, SrcPort, DstAddr, DstPort, Proto, Bytes, Packets. The subsequent stages take these fields and manipulate them to produce more insightful metrics.
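
Note that these raw field names (SrcAddr, DstPort, etc.) differ from the standardized lower-case names (srcAddr, dstPort, etc.) that the later stages reference; performing that renaming is the job of transform_generic. A sketch of its parameters might look like the following (the rule list here is illustrative and covers only a few fields; see the api docs for the full schema).

- name: transform_generic
  transform:
    type: generic
    generic:
      # replace the original field names with the names given in output
      policy: replace_keys
      rules:
      - input: SrcAddr
        output: srcAddr
      - input: DstAddr
        output: dstAddr
      - input: DstPort
        output: dstPort
      - input: Proto
        output: proto
      - input: Bytes
        output: bytes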

Discussion of some of the configuration parameters

A complete description of configuration parameters can be found in the FLP github repo under api. We highlight here a subset of the configuration options.

Transform Network

The transform_network stage provides specific functionality that is useful for transformation of network flow logs. These include:

  • Resolve subnet from IP addresses;
  • Resolve known network service names from port numbers and protocols;
  • Compute geo-location from IP addresses;
  • Resolve kubernetes information from IP addresses.

The resulting information is added as additional fields in the flow log entry.

The configuration parameters for transform_network look like the following.

- name: transform_network
  transform:
    type: network
    network:
      rules:
      - input: dstPort
        output: service
        type: add_service
        parameters: proto
      - input: srcAddr
        output: srcSubnet
        type: add_subnet
        parameters: /16
      - input: srcAddr
        output: srcK8S
        type: add_kubernetes
        parameters: srcK8S_labels
      - input: srcAddr
        output: srcLocation
        type: add_location

The field named by output is added to the flow log entry. Thus, in the first rule (of type add_service), the field service is added to the flow log, with a textual indication of the service used by the connection referenced by the flow log (based on the destination port number and protocol). In the second rule (of type add_subnet), the field srcSubnet is added to the flow log.

The add_location type of transform_network maps an IP address to the geographic coordinates at which that IP address is registered, based on information in a location database. The visualization of the add_location step (in transform_network) gives us the map shown at the beginning of our discussion.

We can see the density of source and/or target locations of the network traffic.
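
To make this concrete, after transform_network a single flow log entry might hypothetically carry added fields along these lines (the values, and the exact names of the location fields, are illustrative only):

{
  "srcAddr": "10.129.2.5",
  "dstPort": 443,
  "service": "https",
  "srcSubnet": "10.129.0.0/16",
  "srcLocation_CountryName": "US",
  "srcLocation_Latitude": "37.751",
  "srcLocation_Longitude": "-97.822"
}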

Extract Aggregate

Aggregates are used to combine the results of common flow logs based on specified rules and to export the results as metrics. Aggregates are dynamically created based on defined values from fields in the flow logs and on mathematical functions to be performed on these values. The resulting aggregate is added as another field in the flow log with the specified name.

The configuration parameters for extract_aggregate look like the following.

- name: extract_aggregate
  extract:
    type: aggregates
    aggregates:
    - name: bandwidth_network_service
      groupByKeys:
      - service
      operationType: sum
      operationKey: bytes
    - name: bandwidth_source_subnet
      groupByKeys:
      - srcSubnet
      operationType: sum
      operationKey: bytes
    - name: src_connection_count
      groupByKeys:
      - srcSubnet
      operationType: count

In this example, the service field produced by the transform_network stage is used to identify all flow logs that used a particular service (e.g. https), and the number of bytes transferred to that service is summed to produce a new metric called bandwidth_network_service. In the second rule, the number of bytes of all flow logs with the same srcSubnet is summed to produce the new field bandwidth_source_subnet. In addition to the specified name for the new field, additional fields are produced that provide information about the aggregate (e.g. recent_count, recent_op_value, total_count, total_value, operation_result), which may be used by subsequent stages (e.g. encode_prom). The following operationTypes are defined: sum, min, max, count, avg, and raw_values.
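
To expose one of these aggregates as a prometheus metric, the encode_prom stage might be configured along the following lines. This is only a sketch: the port, prefix, and label choices are hypothetical, and the metric definition schema should be checked against the api docs in the FLP repo.

- name: encode_prom
  encode:
    type: prom
    prom:
      port: 9102     # hypothetical port on which metrics are exposed for scraping
      prefix: flp_   # prefix prepended to each exported metric name
      metrics:
      - name: bandwidth_per_network_service
        type: counter
        # select the entries produced by the bandwidth_network_service aggregate
        filter: {key: name, value: bandwidth_network_service}
        # take the metric value from the aggregate's recent_op_value field
        valueKey: recent_op_value
        labels:
        - by
        - aggregate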

Flexibility of stage parameters

As mentioned earlier, each stage is configurable. Stages are written with sufficient generality to allow a user (e.g. network administrator) to customize the stage parameters to fit the needs of the specific environment that is being monitored. Developers may also write their own stages (e.g. some new kind of analytics) that conform to the stage interface and add them to the collection of available stages. More extensive details on how to configure FLP for different needs are available in the docs section on github.

Currently available stages

As of this writing, the following stages are implemented.

Ingest

  • ingest_collector — a standard NetFlow / IPFIX collector
  • ingest_kafka — receive data from a specified kafka topic
  • ingest_grpc — receive flow logs over GRPC from the Network Observability eBPF Agent
  • ingest_file — receive flow logs from a file (used mainly for testing and debugging)

Transform

  • transform_generic — convert field names
  • transform_network — supplement flow logs with network-derived information
  • transform_filter — remove entries (or fields) that match some condition

Extract

  • extract_aggregate — aggregate multiple flow logs into higher-level metrics (sum, max, min, avg)
  • extract_conntrack — connection tracking; combine flow logs related to the same connection
  • extract_timebased — report on topK (or bottomK) metrics over specified time intervals

Encode/Write

  • encode_prom — export metrics to prometheus with specified labels
  • encode_kafka — send flow logs to a specified kafka topic
  • encode_s3 — send flow logs to a specified bucket in an S3 repository
  • write_loki — send flow logs with labels to a loki repository
  • write_ipfix — export ipfix-formatted flow logs to a specified host target
  • write_stdout — print flow logs to stdout (for testing and debugging)

A pipeline begins with an ingest stage and ends with either an encode or a write stage. Other types of stages (transform, extract) are middle stages, which require at least one input and at least one output. Typically, the output of a stage is channeled to the input of one or more other stages. Transform stages and extract stages may appear in a pipeline in any order, and may appear multiple times. For example, a transform_generic1 stage might run right after an ingest stage to give all the fields standard names, followed by some extract stage, followed by a transform_generic2 stage that again renames fields before feeding them into an output stage, as sketched below.
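
The pipeline section for such an arrangement might look as follows (a sketch; transform_generic1 and transform_generic2 are simply labels chosen for this example):

pipeline:
- name: ingest_file
- name: transform_generic1    # give all fields standard names
  follows: ingest_file
- name: extract_aggregate
  follows: transform_generic1
- name: transform_generic2    # rename fields for the output stage
  follows: extract_aggregate
- name: write_stdout
  follows: transform_generic2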

New types of stages (ingest, transform, extract, encode/write) that fit the defined interfaces can be written by developers and easily incorporated into a custom pipeline deployment.

Scalability

It is easy to string together FLP instances from multiple sources with the help of Kafka. For example, an FLP instance can process flow logs on an individual (physical or virtual) machine, and feed the results into a configurable Kafka topic. Multiple FLP instances can send results to the same Kafka topic. The results sent to a Kafka topic can then be passed as input to another FLP instance (or other tool), which may then combine the results to obtain global insights.
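
As a sketch of this pattern, each per-node FLP instance could end its pipeline with an encode_kafka stage, and a central FLP instance could start its pipeline with an ingest_kafka stage reading the same topic. The broker address and topic name below are hypothetical, and the exact kafka parameter schema is documented under api in the repo.

# per-node pipeline: collect flows locally and publish them to a kafka topic
pipeline:
- name: ingest_collector
- name: encode_kafka
  follows: ingest_collector
parameters:
- name: encode_kafka
  encode:
    type: kafka
    kafka:
      address: kafka-broker:9092   # hypothetical broker address
      topic: network-flows         # hypothetical topic name

# central pipeline: consume the same topic and derive global insights
pipeline:
- name: ingest_kafka
- name: extract_aggregate
  follows: ingest_kafka
parameters:
- name: ingest_kafka
  ingest:
    type: kafka
    kafka:
      brokers: [kafka-broker:9092]
      topic: network-flows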

Summary

In this blog we introduced the Flowlogs Pipeline (FLP) framework, which processes flow logs and produces insights into what is happening in a network. We went into the details of a specific example and showed how to build a pipeline from a library of existing stages. We invite the reader to experiment with the FLP framework and to use it to produce relevant metrics. We welcome feedback on how to improve the framework, additional recommended analytics, and code contributions.

Additional information and details can be found in the github repository: https://github.com/netobserv/flowlogs-pipeline.

You can, of course, deploy the OpenShift Network Observability Operator, available on OperatorHub, to quickly experience some of the network monitoring capabilities of FLP.

Acknowledgements: Thanks to Julien Pinsonneau, Steven Lee, Joel Takvorian, and Sara Thomas for feedback and input in preparing this blog.
