Refining Network Message Segmentation with Principal Component Analysis

Stephan Kleber
Mercedes-Benz Tech Innovation
8 min readJan 13, 2023

Reverse engineering of undocumented protocols is a common task in security analyses of networked services. The communication itself, captured in traffic traces, contains much of the necessary information to perform such a protocol reverse engineering. One major challenge is to discover probable fields in a message as the basis for further analyses. Given a set of messages, split into segments of bytes by an existing segmenter, we propose a method to refine the approximation of the field inference. We use principle component analysis to discover linearly correlated variance between sets of message segments.

We presented our research paper at the 2022 IEEE Conference on Communications and Network Security (CNS) and it is available as preprint at arXiv.

Protocol Reverse Engineering

Analyzing the threat posed by botnet and malware communication, validating the correct and secure design and implementation of network services, and efficiently configuring smart fuzzers requires the understanding of the involved network protocols. In case of malware and proprietary systems, the protocols are often undocumented and first require protocol reverse engineering (PRE) to uncover data exfiltration or vulnerabilities in the network services. The analysis of binary protocols that are not human-readable and do not exhibit structural features like keywords and separators that can be seen in Figure 1 is considerably harder than of textual protocols.

Figure 1: Segments of textual and binary protocols.

We propose a segmentation refinement for the binary-protocol segmenter NemeSYS and also derive a new segmenter that works independent of NemeSYS. While previous work typically requires to select protocol-dependent analysis parameters, which is difficult for an unknown protocol, our refinement and segmenter are independent of any parameters. This work is embedded into the larger context of our framework for NEtwork MEssage Reverse Engineering, NemeRE, outlined in Figure 2.

Figure 2: Process of NemeRE: NEtwork MEssage Reverse Engineering.

Clustering of Segments

Illustrated in Figure 1, a “segment” is a subsequence of bytes of a message that approximates a field’s boundaries from the protocol definition in terms of byte positions. In an optimal inference, the segments match the true field boundaries from the unknown syntax. Due to the lack of the true protocol specification with the message format, for PRE we require heuristic segmenters to approximate fields in unknown binary messages.

Figure 3: Collecting similar segments from one trace of multiple messages M.

To obtain sets of comparable, related segments, which we subsequently intend to analyze together, we cluster segments based on their dissimilarity as illustrated in Figure 3. We base our clustering approach on our previous proposal to use Density-Based Spatial Clustering of Applications with Noise (DBSCAN). The used measure is the Canberra dissimilarity which extends the better-known Canberra distance to vectors of differing dimensions. We use the Canberra dissimilarity between segments as affinity measure to guide the clustering. The clustering results in concise sets of segments that overlay best, matching each others’ values measured by their smallest dissimilarity.

Byte-wise Segment Variance Analysis

Inter- and intra-message variance reveals details about the message structure that is not visible otherwise. We determine which bytes vary together in sets of segments that we interpret as data vectors to identify probable field boundaries. We separate vector components, variance-locked to each other, from linearly independent vector components by principal component analysis (PCA). PCA shows which bytes from each set of segments are associated with each other and, thus, comprise one field.

Figure 4: PCA process overview.

As illustrated in Figure 4, before we can start the PCA, we need to prepare groups of segments by a recursive clustering step and after the PCA, we interpret its results to adjust the boundaries of the raw input segments.

The PCA calculates the multivariate main direction of variance and its scale. The multivariate variance identifies the components that vary together, i. e., linearly dependent. The data vectors are collected into a matrix X with each vector as row, in our case corresponding to the bytes of one segment. The variance of X is described by the covariance matrix C. The eigenvalues and eigenvectors of the covariance matrix C are the foundation of the PCA. The first principal component (PC) is the highest eigenvalue λ and its associated eigenvector w_i and it intuitively states the direction of the prevalent variance in the data. We gain the strength of the variance lock, respectively the linear dependence of the relative byte offsets in one set of segments from these eigenvectors. The variance of offsets is visualized in the plots in Figure 6.

Interpretation of the PCA

The covariance shows transitions between related message parts in the byte sequences of a set of segments provided by a cluster. To get an impression about how the covariance matrix C represents such related parts, regard Figure 5. In this heatmap of covariance values, each row and each column correspond to the relative offsets in the scope of the overlayed set of segments of one cluster. White lines are the relative positions of true boundaries in this set. We interpret the PCA by two rules to refine field boundaries.

Figure 5: Covariance matrix as heat map.

Rule A: The first rule is governed by the observation that the data of a field with a common numerical data type exhibits a rise in variance towards the field’s end. Using the same cluster as in Figure 5 as basis, Figure 6, left, illustrates this by the eigenvectors of the significant PCs for segments of 25 bytes length. We now search for a considerable relative drop in the absolute covariance after the peak at the end of a suspected field.

Rule B: The second rule detects the start of a new field in situations where a prolonged sequence of byte positions do not significantly vary and the next byte suddenly peaks in its variance, illustrated in Figure 6, right. This typically happens at the transition between constant fields or such with few possible values, like enumerations, commands, or flags.

Figure 6: Example for rule matches of rule A (left) and B (right): significant eigenvectors of one cluster. The dot denotes the match, the vertical lines the true boundaries.

Evaluation

Using our proof-of-concept implementation in Python 3, we evaluate our approach. Additionally to the protocols DHCP, DNS, NBNS, NTP, and SMB that we used to select the parameters for our algorithm, we also use our own traces of the two proprietary protocols Apple Wireless Direct Link (AWDL) and Auto Unlock (AU). AWDL is a Wi-Fi-based link-layer protocol for peer-to-peer communication. AU implements a proprietary distance bounding protocol. Both protocols were undocumented until they have manually been reverse engineered recently. The reverse-engineered specification of AWDL, including a dissector, is publicly available, and we had access to a private Wireshark dissector of the AU protocol. Thus, both protocols are typical PRE use cases while we have ground truth available. As the source of the ground truth, we parse the Wireshark dissectors’ output for each message.

The immediate effect we expect of our refinement is that the segment boundaries will more accurately match the field boundaries of the protocol specification. To measure this, we use the Format Match Score (FMS) that we proposed together with NemeSYS. The results, illustrated in Figure 7, show that traces of 1 000 and 100 messages are almost identical, confirming that decent results can be produced with even small traces.

Figure 7: Results of the refinement of message segmentation: Median improvement: 85 % (100 messages) and 74 % (1 000 messages)

In addition, we used our novel Null-Bytes Segmenter that works without NemeSYS and applied the PCA refinement to it. The resulting FMS values on average are considerably lower than those of NemePCA, but similar to the baseline. This shows the advantage of using NemeSYS as a heuristic method and that the complex effort of refining its segmentation is worthwhile.

A second analysis based on the refined segments that we use as evaluation aspect is the clustering of field data types from NemeSYS’ message segments. Our refined segments result in an increased recall of the clustered field types, as is plotted in Figure 8, while we reduce the number of unintelligible segments.

We discuss further evaluation aspects in the paper.

Figure 8: Clustering quality results in terms of precision (P) and recall (R) of field type clustering from PCA-refined segments and NemeSYS for comparison.

Conclusion

Our novel, dynamic PCA method measures the byte-wise variance of segment contents to increase the accuracy of the field boundary detection in unknown binary protocols over previous work. By combining PCA with the simple Null-Bytes Segmenter, we show that our refinement is not limited to be effective in conjunction with NemeSYS but is applicable to arbitrary segmentations.

The inference of the message format is a crucial task in protocol reverse engineering. Our method provides knowledge about the placement of field boundaries in protocol messages, which is necessary to correlate information to field values. Among other use cases, it helps to determine portions of the message to inspect, e. g., by Fuzz testing, to identify features for the fingerprinting of protocols and to train intrusion detection systems.

Acknowledgment

We would like to thank Milan Stute for his support and the AWDL traces, as well as Steffen Klee for providing us with a Wireshark dissector and traces for Apple’s Auto Unlock (AU) protocol.

References

  1. Stephan Kleber and Frank Kargl, “Refining Network Message Segmentation with Principal Component Analysis”. In: CNS. 2022.
  2. Stephan Kleber, Rens Wouter van der Heijden, and Frank Kargl. “Message Type Identification of Binary Network Protocols using Continuous Segment Similarity”. In: INFOCOM. 2020.
  3. Stephan Kleber, Henning Kopp, and Frank Kargl. “NEMESYS: Network Message Syntax Reverse Engineering by Analysis of the Intrinsic Structure of Individual Messages”. In: WOOT. 2018.
  4. Stephan Kleber, Lisa Maile, and Frank Kargl. “Survey of Protocol Reverse Engineering Algorithms: Decomposition of Tools for Static Traffic Analysis”. In: IEEE Communications Surveys and Tutorials 21.1 (Feb. 2019). Firstquarter.
  5. Stephan Kleber, Milan Stute, Matthias Hollick, and Frank Kargl. “Network Message Field Type Classification and Recognition for Unknown Binary Protocols”. In: DCDS. 2022.
  6. Milan Stute, David Kreitschmann, and Matthias Hollick. “One Billion Apples’ Secret Sauce: Recipe for the Apple Wireless Direct Link Ad hoc Protocol”. In: MobiCom ’18. 2018.

--

--