Collecting NetFlow Records with Cloudera DataFlow

Published in Cloudera · 11 min read · Jul 27, 2023

Background

Cisco created the initial version of NetFlow in 1996 to provide a structured method for exporting, collecting, and analyzing Internet Protocol communication. NetFlow Version 1 supported Internet Protocol Version 4 and defined a binary representation for traffic between source and destination systems, including addresses, ports, and bytes transferred. After several internal iterations, NetFlow Version 5 extended the format to include routing and address masking fields, along with additional header information. RFC 3954 documented NetFlow Version 9 as a common standard, introducing extensible record templates to support a wide range of use cases. With the introduction of record templates, NetFlow 9 added support for Internet Protocol Version 6, allowing implementations to define export records according to applicable hardware or software capabilities. RFC 7011 introduced IP Flow Information Export (IPFIX) as a generalized successor to NetFlow, but NetFlow 5 and NetFlow 9 remain common formats for network communication collection and analysis.

Introduction

Based on Apache NiFi, Cloudera DataFlow provides enterprise support and additional components for integrating with a wide array of systems and services. Cloudera DataFlow now includes a Processor named ListenNetFlow for collecting and parsing NetFlow Versions 9, 5, and 1. The ListenNetFlow Processor provides seamless and scalable collection from multiple sources, producing batches of records using a configurable output Record Writer. With record-oriented batching, ListenNetFlow translates packets into a standard schema that can be exported as Avro, CSV, JSON, Parquet, XML, or any other format available through the configurable Record Writer.

Building on the ListenNetFlow Processor, flow definitions can support security alerting, aggregation, and historical analysis using existing NiFi components. Processors such as QueryRecord support filtering and basic transformation to align with operational requirements. Other Processors such as ForkEnrichment and JoinEnrichment enable merging input records with information from sources such as databases or web services. Still other Processors support sending records to streaming systems such as Apache Kafka, and persisting records in relational databases or object storage services.

With an understanding of protocol characteristics, configuration options, and record structure, flow developers can customize ListenNetFlow to meet the particular needs of many environments.

NetFlow Protocol Characteristics

Exporting systems send NetFlow information over User Datagram Protocol using structured binary packets. Although RFC 3954 does not specify a registered port number for NetFlow, Cisco and other vendors commonly use UDP port 2055. As a connectionless protocol, UDP does not provide delivery guarantees, but the simple approach to transmission supports high volumes with minimal overhead. Each packet includes a header section that begins with the NetFlow version number, enabling NetFlow collectors to support conditional processing for multiple NetFlow versions. The content of NetFlow packets differs based on the version, but all versions use binary field encoding for concise message formatting.
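
As a rough illustration of version-based dispatch, the following Python sketch reads the version number from the first two bytes of a datagram. The function names and the `SUPPORTED_VERSIONS` set are hypothetical and chosen for this example; they do not describe the ListenNetFlow implementation.

```python
import struct

# Versions named in this article as supported by ListenNetFlow
SUPPORTED_VERSIONS = {1, 5, 9}


def read_netflow_version(datagram: bytes) -> int:
    """Return the version number stored in the first two bytes of a packet."""
    if len(datagram) < 2:
        raise ValueError("datagram too short to contain a NetFlow version")
    # The version is an unsigned 16-bit integer in network byte order
    (version,) = struct.unpack_from("!H", datagram, 0)
    return version


def is_supported(datagram: bytes) -> bool:
    """Report whether the packet version is one this sketch would parse."""
    return read_netflow_version(datagram) in SUPPORTED_VERSIONS
```

A collector built along these lines would check the version first and then hand the remaining bytes to a version-specific parser.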

Common Field Structure in NetFlow Version 5

NetFlow packets include a standard header section with a predefined size and set of fields. Following the version number field, the header contains a count of the flow records included, and the uptime of the sending system. Additional fields depend on the specific NetFlow version, with version 5 adding a sequence number that can be used to track the history of a specific exporting system.

NetFlow versions 1 and 5 use a predefined binary structure, with each flow record containing the same number of fields and having the same total size. This common structure enables simplified packet construction and parsing, but also limits the ability of NetFlow exporters to include additional details. Although the standard set of fields in NetFlow 5 and earlier enabled stateless parsing of NetFlow packets, the structural limitations could not keep pace with evolving network technologies.
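
The fixed layout makes stateless parsing straightforward. The sketch below decodes a NetFlow 5 packet with Python's `struct` module, assuming the widely documented layout of a 24-byte header followed by 48-byte flow records. The function name `parse_v5` is hypothetical, and the output keys borrow the IPFIX-style names used later in this article; this is an illustration, not the ListenNetFlow decoder.

```python
import struct
from ipaddress import IPv4Address

# Header: version, count, sysUptime, unix_secs, unix_nsecs,
# flow_sequence, engine_type, engine_id, sampling_interval (24 bytes)
V5_HEADER = struct.Struct("!HHIIIIBBH")
# Record: srcaddr, dstaddr, nexthop, input, output, dPkts, dOctets, first,
# last, srcport, dstport, pad, tcp_flags, prot, tos, src_as, dst_as,
# src_mask, dst_mask, pad (48 bytes)
V5_RECORD = struct.Struct("!4s4s4sHHIIIIHHxBBBHHBB2x")


def parse_v5(datagram: bytes) -> list:
    """Decode every flow record in a NetFlow 5 packet into a dictionary."""
    version, count = struct.unpack_from("!HH", datagram, 0)
    if version != 5:
        raise ValueError(f"expected version 5, found {version}")
    if len(datagram) < V5_HEADER.size + count * V5_RECORD.size:
        raise ValueError("packet shorter than the record count implies")

    records = []
    for index in range(count):
        offset = V5_HEADER.size + index * V5_RECORD.size
        (src, dst, nexthop, in_if, out_if, packets, octets, first, last,
         src_port, dst_port, tcp_flags, protocol, tos, src_as, dst_as,
         src_mask, dst_mask) = V5_RECORD.unpack_from(datagram, offset)
        records.append({
            "sourceIPv4Address": str(IPv4Address(src)),
            "destinationIPv4Address": str(IPv4Address(dst)),
            "ipNextHopIPv4Address": str(IPv4Address(nexthop)),
            "ingressInterface": in_if,
            "egressInterface": out_if,
            "packetDeltaCount": packets,
            "octetDeltaCount": octets,
            "flowStartSysUpTime": first,
            "flowEndSysUpTime": last,
            "sourceTransportPort": src_port,
            "destinationTransportPort": dst_port,
            "tcpControlBits": tcp_flags,
            "protocolIdentifier": protocol,
            "ipClassOfService": tos,
            "bgpSourceAsNumber": src_as,
            "bgpDestinationAsNumber": dst_as,
            "sourceIPv4PrefixLength": src_mask,
            "destinationIPv4PrefixLength": dst_mask,
        })
    return records
```

Because every record has the same size, the parser needs nothing beyond the packet itself, which is the stateless property described above.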

Template Field Structure in NetFlow Version 9

NetFlow version 9 introduced significant changes to enable adoption from a diverse set of vendors, supporting more rapid adaptation to network communication changes. Version 9 retained the common header section, but moved away from a common field structure for flow record reporting.

Instead of a common structure and limited set of fields, NetFlow 9 introduced the concept of templates, allowing exporters to describe the fields contained in subsequent packets. To maintain efficiency, NetFlow exporters do not send templates in every packet. This requires NetFlow collectors to maintain a stateful cache of record templates. The template definition includes type and length information for each field, without which the NetFlow collector cannot parse flow data records. It is the responsibility of the NetFlow exporter to send template information on a periodic basis so that the NetFlow collector can interpret the binary field lengths and values correctly.
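
The stateful behavior can be sketched in a few lines of Python. The example assumes the template record layout from RFC 3954: a template identifier and field count, followed by pairs of field type and field length, all as unsigned 16-bit integers. The `TemplateCache` class and its method names are hypothetical, and the cache key of exporter address plus template identifier is a simplification for illustration.

```python
import struct
from typing import Dict, List, Optional, Tuple

Template = List[Tuple[int, int]]  # (field type, field length in bytes)


def parse_template_record(body: bytes) -> Tuple[int, Template]:
    """Read a template identifier and its (type, length) field pairs."""
    template_id, field_count = struct.unpack_from("!HH", body, 0)
    fields = [struct.unpack_from("!HH", body, 4 + 4 * i) for i in range(field_count)]
    return template_id, fields


class TemplateCache:
    """In-memory cache keyed by exporter address and template identifier."""

    def __init__(self) -> None:
        self._templates: Dict[Tuple[str, int], Template] = {}

    def learn(self, exporter: str, body: bytes) -> int:
        """Store a template sent by an exporter and return its identifier."""
        template_id, fields = parse_template_record(body)
        self._templates[(exporter, template_id)] = fields
        return template_id

    def decode(self, exporter: str, flow_set_id: int,
               record: bytes) -> Optional[Dict[int, bytes]]:
        """Split a data record into raw field values, or return None
        when no template has been received for this exporter yet."""
        template = self._templates.get((exporter, flow_set_id))
        if template is None:
            return None
        values: Dict[int, bytes] = {}
        offset = 0
        for field_type, length in template:
            values[field_type] = record[offset:offset + length]
            offset += length
        return values
```

The `None` result models the situation described above: until the exporter sends the template, the collector has no way to know where one field ends and the next begins.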

NetFlow 9 also introduced the concept of Option Data Records to differentiate exporter metadata from Flow Data Records. As indicated by the name, Option Data Records may or may not be present in NetFlow 9 flows. Option Data Records also use templates to describe field types and lengths.

ListenNetFlow Configuration

The ListenNetFlow Processor includes several configuration properties with reasonable default values. The initial version of ListenNetFlow supports NetFlow 9, 5, and 1 based on the version field of each NetFlow packet header. It is the responsibility of the device exporting NetFlow packets to set the output version. ListenNetFlow will log a warning when receiving NetFlow packets with a version outside the scope of supported versions.

The default Run Schedule for ListenNetFlow is 25 milliseconds, which aims to provide a minimum amount of time for gathering a batch of NetFlow records. The Run Schedule can be increased to provide more time for the Processor to build up a queue for message output batches. However, the Run Schedule should be low enough to avoid unnecessary memory consumption due to queuing large numbers of records. The best setting depends on the expected data volume for the particular instance of ListenNetFlow.

Network Socket Properties

The Address and Port properties control the socket address on which ListenNetFlow receives NetFlow packets.

The Address property defaults to 0.0.0.0, indicating that ListenNetFlow can receive packets on any available network interface. For NiFi servers with multiple network interfaces, the property can be set to a specific IPv4 or IPv6 address.

The Port property defaults to 2055, which is a conventional standard based on historical configuration of Cisco NetFlow collectors. This value can be set to a maximum port number of 65535, following standard UDP ranges. When configuring multiple instances of the ListenNetFlow Processor on the same NiFi server, each instance of the Processor must use a unique port number to avoid conflicts that result in bind exceptions when attempting to start the Processor.
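
The bind conflict is ordinary socket behavior rather than anything specific to NiFi. The following Python sketch, using hypothetical helper names and handling IPv4 only, shows the same failure mode: a second attempt to bind a UDP socket to an address and port already in use raises an error.

```python
import socket


def open_udp_listener(address: str = "0.0.0.0", port: int = 2055) -> socket.socket:
    """Bind an IPv4 UDP socket, closing it again if the bind fails."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((address, port))
    except OSError:
        sock.close()
        raise
    return sock


def can_bind(address: str, port: int) -> bool:
    """Check whether a UDP listener could currently bind to the address."""
    try:
        open_udp_listener(address, port).close()
    except OSError:
        return False
    return True
```

A check such as `can_bind("0.0.0.0", 2055)` before starting a second Processor instance is one way to confirm that the chosen port is not already taken on the server.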

Packet Handling Properties

The Worker Threads property controls the number of Java threads responsible for decoding NetFlow packets and placing records on an internal queue for subsequent framework processing.

The default value for Worker Threads is 2. This value is independent of the standard Concurrent Tasks setting common to every NiFi Processor. The reason for having separate control over Worker Threads is to optimize both network packet handling and subsequent framework processing. Having a separate pool of Worker Threads allows the ListenNetFlow Processor to receive and decode NetFlow packets while the NiFi framework is running other Processors. This minimizes congestion for incoming NetFlow packets and supports batching output records. As a general rule, the number of Worker Threads should not exceed the number of available CPU cores, and should usually be less than half that number. This guideline leaves CPU cycles available for framework processing, while allowing increased packet handling rates in some scenarios.

The Queue Capacity and Batch Size properties should be evaluated together with the number of Worker Threads.

The Queue Capacity defines the number of records that ListenNetFlow can decode and prepare for subsequent batch distribution. Worker Threads receive NetFlow packets and place decoded records on the queue, then NiFi framework threads call ListenNetFlow to distribute record batches through the success relationship. The Queue Capacity should not be too large, as each record on the queue consumes available heap memory. On the other hand, if the Queue Capacity is too small, Worker Threads will block and not be able to handle more incoming packets. The Queue Capacity should be large enough to handle peak throughput, which is a function of the number of sending devices, the amount of network statistics collected, and the relative performance capabilities of the NiFi system. The default Queue Capacity of 10,000 supports a reasonable amount of network activity, but this value should be increased for environments with large numbers of exporting devices.

The Batch Size controls the maximum number of records per FlowFile that ListenNetFlow sends to the success relationship. The default value of 1,000 provides a reasonable starting point, but this value can be increased to support higher volumes. It is important to note that the Batch Size is a maximum and that FlowFiles will contain a smaller number of records for flows with lower data volumes.
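
The relationship between Queue Capacity and Batch Size can be modeled with a bounded queue. This Python sketch uses hypothetical function names and the standard `queue` module; it uses a non-blocking insert so the full-queue condition is visible, whereas the Worker Threads described above block instead.

```python
import queue


def offer(records: queue.Queue, record: dict) -> bool:
    """Try to enqueue a decoded record; return False when the queue is full."""
    try:
        records.put_nowait(record)
        return True
    except queue.Full:
        return False


def drain_batch(records: queue.Queue, batch_size: int) -> list:
    """Remove up to batch_size records, returning fewer when the queue runs dry."""
    batch = []
    while len(batch) < batch_size:
        try:
            batch.append(records.get_nowait())
        except queue.Empty:
            break
    return batch
```

With a queue created as `queue.Queue(maxsize=10_000)` and a batch size of 1,000, a drain call returns at most 1,000 records, and returns a shorter list whenever fewer records are waiting.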

Record Writer Configuration

The Record Writer property controls the output format for serialized NetFlow records. The JsonRecordSetWriter is a standard solution for writing arrays of JSON objects. The AvroRecordSetWriter is another standard service that supports the binary Apache Avro standard with associated schema definition. In addition to standard Record Writer services, this property enables custom output formats based on Controller Services that implement the shared RecordWriterFactory interface definition.

ListenNetFlow Record Schema Definition

The ListenNetFlow Processor uses a hybrid of standard and dynamic elements to describe NetFlow records. The schema definition provides the basis for subsequent processing, allowing other Processors to restructure records as needed. The specifics of each field depend on the NetFlow packet version, so flows designed to process multiple versions may need to perform conditional normalization.

Standard Schema Fields

ListenNetFlow produces NetFlow records that contain the following fields regardless of the NetFlow packet version:

exporterAddress
exporterPort
exporterUptime
exported
packetVersion
packetSequenceNumber
packetSourceId
flowSetId
dataRecordType
collected

The following JSON object provides a sample representation of these fields for a NetFlow Version 9 Data Record:

{
  "exporterAddress": "127.0.0.1",
  "exporterPort": 50000,
  "exporterUptime": 86400,
  "exported": "2000-01-01T00:00:00Z",
  "packetVersion": 9,
  "packetSequenceNumber": 32,
  "packetSourceId": 0,
  "flowSetId": 256,
  "dataRecordType": "FLOW",
  "collected": "2000-01-01T00:00:00Z"
}

The exporter fields contain information about the sending NetFlow system, including source address, the uptime from the packet header, and the timestamp when the sending system exported the record.

The packetVersion field contains the version number that can be used for selective filtering or routing. The packetSequenceNumber comes from the NetFlow header in versions 9 and 5, but defaults to 0 for NetFlow version 1. The sequence number can be used for basic tracking and duplicate detection in the context of a specific NetFlow exporting system. The packetSourceId is specific to NetFlow 9 and differentiates the observation domain, allowing a single NetFlow exporter to provide distinct sets of flows. The packetSourceId defaults to 0 for NetFlow 5 and 1.

The flowSetId is specific to NetFlow 9 and indicates the template identifier associated with the record. ListenNetFlow uses the Flow Set Identifier as part of the cache key for tracking template definitions. This field defaults to 0 for earlier NetFlow versions.

The dataRecordType is an enumerated field with a value of either FLOW or OPTIONS. Most NetFlow records have a dataRecordType of FLOW, but this field will be set to OPTIONS for NetFlow 9 records containing optional exporter metadata. This field should be used for filtering in environments that are not interested in exporter metadata information.

The collected field contains the timestamp when ListenNetFlow collected the NetFlow record. This field can be used in conjunction with the exported field to evaluate latency between exporting systems and NiFi servers.
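
A latency calculation along these lines might look like the following Python sketch. It assumes the timestamps are serialized as ISO 8601 strings with a trailing `Z`, as in the sample JSON above; the actual representation depends on the configured Record Writer, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing Z for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def export_latency(record: dict) -> timedelta:
    """Return the time between export by the device and collection."""
    return parse_timestamp(record["collected"]) - parse_timestamp(record["exported"])
```

Clock differences between the exporting device and the NiFi server affect the result, so a negative or unusually large value may indicate clock drift rather than network delay.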

Dynamic Schema Fields

ListenNetFlow uses a nested map of names and values to describe informational fields in a NetFlow record. The fields element will contain the same set of names and values for NetFlow versions 5 and 1. The content of the fields element varies for NetFlow 9 records, based on the template field definition.

Each version of the NetFlow protocol includes common field name definitions; however, the field names changed from one version to another. RFC 7012 introduced a shared registry of field names and data types for IPFIX, providing a backward compatible naming strategy.

ListenNetFlow uses the IPFIX Information Elements registry for translating numeric element identifiers to standard field names, regardless of the NetFlow packet version. This approach enables subsequent Processors and storage solutions to refer to the same field across different NetFlow versions. Although NetFlow 9 packets may contain different sets of fields, common field names will match names and types from NetFlow 5 and 1.

For example, the IPFIX registry defines Field Type 8 as sourceIPv4Address and Field Type 12 as destinationIPv4Address. ListenNetFlow uses these field names for NetFlow records across packet versions.
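
The translation amounts to a lookup from numeric identifier to registry name. The Python sketch below includes only a small subset of the IPFIX Information Elements registry, and the fallback name for unregistered identifiers is a hypothetical convention for this example rather than documented ListenNetFlow behavior.

```python
# Partial mapping of IPFIX Information Element identifiers to names
IPFIX_NAMES = {
    1: "octetDeltaCount",
    2: "packetDeltaCount",
    4: "protocolIdentifier",
    5: "ipClassOfService",
    6: "tcpControlBits",
    7: "sourceTransportPort",
    8: "sourceIPv4Address",
    9: "sourceIPv4PrefixLength",
    10: "ingressInterface",
    11: "destinationTransportPort",
    12: "destinationIPv4Address",
    13: "destinationIPv4PrefixLength",
    14: "egressInterface",
    15: "ipNextHopIPv4Address",
    16: "bgpSourceAsNumber",
    17: "bgpDestinationAsNumber",
    21: "flowEndSysUpTime",
    22: "flowStartSysUpTime",
    27: "sourceIPv6Address",
    28: "destinationIPv6Address",
}


def element_name(field_type: int) -> str:
    """Translate a numeric field type to its registry name."""
    # Hypothetical fallback so unknown identifiers remain distinguishable
    return IPFIX_NAMES.get(field_type, f"unknown_{field_type}")
```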

NetFlow Version 9 Data Records can contain any set of elements in the fields section, although many vendors use a similar template of core fields. Collecting and evaluating NetFlow 9 packets is an essential part of flow development. Defining expected behavior in the presence or absence of particular fields ensures data consistency for subsequent processing.

The following JSON object provides a sample NetFlow Version 5 Data Record using the common field names:

{
  "exporterAddress": "127.0.0.1",
  "exporterPort": 50000,
  "exporterUptime": 86400,
  "exported": "2000-01-01T00:00:00Z",
  "packetVersion": 5,
  "packetSequenceNumber": 32,
  "packetSourceId": 0,
  "flowSetId": 0,
  "dataRecordType": "FLOW",
  "collected": "2000-01-01T00:00:00Z",
  "fields": {
    "sourceIPv4Address": "127.0.0.1",
    "destinationIPv4Address": "127.0.0.2",
    "ipNextHopIPv4Address": "127.0.0.3",
    "ingressInterface": 1,
    "egressInterface": 2,
    "packetDeltaCount": 1,
    "octetDeltaCount": 64,
    "flowStartSysUpTime": 3600,
    "flowEndSysUpTime": 3600,
    "sourceTransportPort": 50000,
    "destinationTransportPort": 443,
    "tcpControlBits": 16,
    "protocolIdentifier": 6,
    "ipClassOfService": 0,
    "bgpSourceAsNumber": 0,
    "bgpDestinationAsNumber": 0,
    "sourceIPv4PrefixLength": 32,
    "destinationIPv4PrefixLength": 32
  }
}

NetFlow Version 1 Data Records contain a similar set of elements, but do not include the following properties in the fields section:

bgpSourceAsNumber
bgpDestinationAsNumber
sourceIPv4PrefixLength
destinationIPv4PrefixLength
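
Flows that handle both versions may want every record to carry the same keys. One possible normalization step, sketched in Python with a hypothetical function name, fills the four missing properties with a default so that downstream consumers see a uniform structure.

```python
# Properties present in NetFlow 5 records but absent from NetFlow 1 records
V5_ONLY_FIELDS = (
    "bgpSourceAsNumber",
    "bgpDestinationAsNumber",
    "sourceIPv4PrefixLength",
    "destinationIPv4PrefixLength",
)


def normalize_fields(record: dict, default=None) -> dict:
    """Return a copy of the record with version 5 properties always present."""
    fields = dict(record.get("fields", {}))
    for name in V5_ONLY_FIELDS:
        # Keep existing values and only fill in what is missing
        fields.setdefault(name, default)
    return {**record, "fields": fields}
```

Using `None` as the default keeps a missing value distinguishable from a reported value of 0; whether that distinction matters depends on the destination system.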

Based on custom template field sets, NetFlow 9 incorporates support for Internet Protocol Version 6, among many other features. ListenNetFlow represents IPv6 addresses as strings using standard hexadecimal notation. Collecting NetFlow 9 packets from devices that support both IPv4 and IPv6 requires evaluating different field names for source and destination addresses.
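
A consumer that needs a single source address regardless of address family can check both field names. The Python sketch below uses a hypothetical function name and assumes the IPFIX names `sourceIPv4Address` and `sourceIPv6Address` appear in the fields element; the actual fields present depend on the exporter's template.

```python
from ipaddress import ip_address


def source_address(record: dict):
    """Return the source address as an ipaddress object, or None if absent."""
    fields = record.get("fields", {})
    value = fields.get("sourceIPv4Address") or fields.get("sourceIPv6Address")
    return ip_address(value) if value else None
```

The returned object exposes a `version` attribute of 4 or 6, which can drive any family-specific handling further along the flow.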

ListenNetFlow Implementation Details

The Netty framework provides foundational structure for the ListenNetFlow Processor. Netty includes standard bootstrap classes for creating scalable network services, enabling components such as ListenNetFlow to focus on protocol processing. Building on shared event transport components from the Apache NiFi library, ListenNetFlow implements custom Netty handlers that translate binary datagrams into structured NetFlow record objects.

Netty also provides a flexible byte buffer abstraction that supports efficient memory handling. Netty byte buffers simplify NetFlow field decoding with a large number of convenience methods for reading signed and unsigned values. With the NetFlow specification focused on efficient data transmission, numeric fields can have different binary lengths, making it essential to read the correct number of bytes in order to return the correct value for a given field.
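
The importance of field length can be shown without Netty. This Python sketch, which is not the Netty API and uses a hypothetical function name, reads an unsigned big-endian integer of a given length from a byte sequence.

```python
def read_unsigned(buffer: bytes, offset: int, length: int) -> int:
    """Read an unsigned big-endian integer of the given byte length."""
    if offset < 0 or length <= 0 or offset + length > len(buffer):
        raise ValueError("field extends beyond the end of the buffer")
    return int.from_bytes(buffer[offset:offset + length], byteorder="big", signed=False)
```

Reading two bytes where a template declares four would return the wrong value and also leave every following field misaligned, which is why the declared length must be honored for each field.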

ListenNetFlow uses the NiFi framework State Manager abstraction to support caching NetFlow 9 template records. The Processor has a hybrid approach to caching, using memory-based storage for regular processing and also persisting templates in the local State Provider. This strategy minimizes the potential for parsing failures when restarting ListenNetFlow Processor instances, or restarting the entire NiFi system. ListenNetFlow populates the initial memory cache from the local State Manager and then updates both memory and State Manager caches when receiving new templates. The NiFi user interface supports viewing and clearing the State Manager cache, which can be helpful for testing and troubleshooting. The cache key includes the exporter address and template identifier, providing another way of observing sending systems.

Conclusion

NetFlow collection is an important part of any network observation strategy. Whether supporting cybersecurity monitoring or providing operational awareness, NetFlow statistics enable a number of critical use cases. With the introduction of ListenNetFlow, Cloudera DataFlow combines available Apache NiFi components with enterprise-ready NetFlow packet collection.

Building on the Netty framework and providing a standard record schema, ListenNetFlow highlights the flexibility of Apache NiFi. With new extensions supporting additional integration strategies, Cloudera DataFlow illustrates how the Apache NiFi framework can meet the needs of various deployment environments.