6 metrics for HTTP file extraction

Vivek Rajagopal
Jul 27, 2017 · 4 min read

This is a technical note on the metrics to consider when implementing live, high-speed extraction of files from HTTP streams. It will be of interest to those managing Network Security Monitoring (NSM) systems like Trisul, Suricata, Bro, etc.

Here is the biggest problem you will face with HTTP File Extraction.

If not managed carefully, the CPU requirements can explode!

On an older 16-CPU HP ProLiant G6 we found that a ~1.2Gbps stream of HTTP traffic can use 700–800% CPU with the Trisul “Save Binaries App” enabled, but only 250–300% without it.

The HTTP Layer does two things while transferring content.

  1. Chunking: breaks content into smaller chunks and transfers each chunk separately. Indicated by Transfer-Encoding: chunked
  2. Compression: compresses content using gzip before sending it across. The receiver, typically a web browser, gunzips it and then renders the content. Indicated by Content-Encoding: gzip
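To make these two mechanisms concrete, here is a minimal Python sketch (illustrative only, not Trisul’s actual Lua code) of reassembling a chunked body and streaming it through a gzip decompressor, the way an NSM tool must do for every tracked stream:

```python
import gzip
import zlib


def parse_chunked(body: bytes) -> bytes:
    """Reassemble a Transfer-Encoding: chunked body into raw content."""
    out, pos = b"", 0
    while True:
        crlf = body.index(b"\r\n", pos)
        size = int(body[pos:crlf], 16)        # each chunk starts with a hex size line
        if size == 0:                          # a zero-size chunk terminates the body
            return out
        out += body[crlf + 2:crlf + 2 + size]
        pos = crlf + 2 + size + 2              # skip chunk data and its trailing CRLF


def gunzip_stream(chunks):
    """Decompress gzip content incrementally instead of buffering the whole file."""
    d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # +16 => expect a gzip header
    for chunk in chunks:
        yield d.decompress(chunk)
    yield d.flush()


# Simulate a chunked, gzip-compressed HTTP response body.
payload = gzip.compress(b"hello network monitor")
body = (b"%x\r\n" % len(payload)) + payload + b"\r\n0\r\n\r\n"
content = b"".join(gunzip_stream([parse_chunked(body)]))
```

The streaming decompressor (`decompressobj`) is the important part: it lets the monitor hand back decompressed bytes chunk by chunk, which is what keeps memory bounded even for large files.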

Handling the chunking requires some memory to hold the pieces and some housekeeping. Decompression is the real CPU killer for NSM tools. If you are tracking a thousand gzip-compressed HTTP streams, you are essentially doing the decompression work of a thousand browsers.

The key to practical HTTP file extraction is to optimize how often the decompressor is used.

Selecting candidates for extraction

A bit of detail about how Trisul selects files for extraction; most other tools follow a roughly similar strategy. You cannot rely solely on Content-Type to select files. The trick is to inspect the actual content. Trisul uses libmagic to examine the top of each file and then applies a regex to the returned file-magic string to pick files to save to disk. See the source code of the “Save Binaries App”, save_exe_streaming.lua, for details.

Trisul uses two-stage filtering to select the final candidates.

  1. Stage 1: Content-Type filter. As an exception, Trisul skips the text/html and text/css Content-Types. These two content types constitute the overwhelming majority of the files seen. The bigger reason, though, is that these two are almost always compressed, and we would need to start a decompressor each time. Trisul’s Lua FileExtraction API allows you to filter just by observing the HTTP header, which does not start the decompressor.
  2. Stage 2: libmagic regex. When the libmagic string matches the regex, we start extracting in a streaming manner. This allows Trisul to save files of ANY size without using a proportionate amount of memory.
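The two-stage filter can be sketched roughly as follows. This is a hypothetical Python illustration: `magic_string` is a tiny stand-in for libmagic, and the regex is just an example pattern for Windows executables, not the one the app ships with:

```python
import re

# Stage 1: Content-Types that are skipped outright (cheap, header-only check).
SKIP_CONTENT_TYPES = {"text/html", "text/css"}

# Stage 2: regex over the file-magic string (example pattern, tune to taste).
MAGIC_REGEX = re.compile(r"PE32|MS-DOS executable")


def magic_string(head: bytes) -> str:
    """Stand-in for libmagic: classify a file from its first bytes."""
    if head.startswith(b"MZ"):
        return "PE32 executable (console) Intel 80386, for MS Windows"
    if head.startswith(b"\x7fELF"):
        return "ELF 64-bit LSB executable"
    return "data"


def should_extract(content_type: str, head: bytes) -> bool:
    if content_type in SKIP_CONTENT_TYPES:      # Stage 1: no decompressor started
        return False
    return bool(MAGIC_REGEX.search(magic_string(head)))   # Stage 2: content check
```

The point of the ordering is that Stage 1 needs only the HTTP header, so the expensive decompress-and-inspect work in Stage 2 runs only for the minority of streams that survive the first filter.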

Here are the 6 metrics that Trisul NSM collects.

1. File Extraction Rate

The number of files extracted per minute and placed in the output directory. Trisul feeds this metric into the streaming analytics pipeline. The file extraction rate can itself serve as a second-order time-series metric on which you can set deviation-based alerts.

2. Extracted Bandwidth

Disk throughput of the extracted file contents. This depends on the number of files extracted, the network speed, and the size of the files. In the image shown below, the metric reveals when a 760MB CentOS ISO download was being extracted by Trisul.

3. Top file types

Top-K file types of the extracted files, as classified by libmagic. This sample is from a Flash-rich enterprise. You may wish to tune the regex if you find yourself being overwhelmed by a particular file type. See the “Save Binaries App” GitHub page.
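Computing a Top-K over the libmagic labels is straightforward; a hypothetical sketch with made-up counts:

```python
from collections import Counter

# Hypothetical stream of libmagic labels for extracted files.
labels = (["Macromedia Flash data"] * 50
          + ["PE32 executable"] * 5
          + ["Zip archive data"] * 3)

top = Counter(labels).most_common(2)   # the K most frequent file types
```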

4. File types trend over time

A stacked-area time-series chart of the file types. This builds baseline awareness of what’s happening with the various types. The chart shown below was taken from a Flash-rich enterprise, so the other types aren’t very visible.

5. Skipped — Magic vs Content-Type

Shows how much work was saved by skipping whitelisted file types.

The metric below shows how many files were filtered by Content-Type and how many by the Magic string. See the explanation of the Stage-1 and Stage-2 filtering above.

6. Decompressor starts and skips

This metric tells you how many times the decompressor:

  • was started to examine the initial chunk of the file, vs
  • was skipped. The decompressor is skipped for two reasons: 1) the Content-Encoding isn’t “gzip”, or 2) the file was whitelisted by Content-Type.
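The decision and the two skip reasons can be sketched as follows (illustrative Python, not Trisul’s code; the counter keys are made up):

```python
from collections import Counter

stats = Counter()   # tallies for the starts-vs-skips metric


def needs_decompressor(content_type: str, content_encoding: str) -> bool:
    """Decide whether to spin up a gzip decompressor for this HTTP body."""
    if content_type in ("text/html", "text/css"):   # whitelisted: header-only skip
        stats["skip_content_type"] += 1
        return False
    if content_encoding != "gzip":                  # nothing to decompress
        stats["skip_not_gzip"] += 1
        return False
    stats["started"] += 1
    return True
```

Charting `started` against the two skip counters over time gives exactly this metric.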

The chart below shows an operational normal: the decompressor was run for only about 15–20% of the files.

Conclusion

Live extraction of files transferred over HTTP is a very important piece of a Network Security Monitoring toolkit. The other pieces of the tooling involve arranging to get these files into malware analysis platforms like LAIKA or YARA.

If you want to take Trisul Network Analytics file extraction for a spin on your network:

  1. Install Trisul on any CentOS or Ubuntu system. (No sign-ups; it’s free.)
  2. Install the “Save Binaries” Trisul App and the “Save Binaries Metrics Dashboards” App.

Until next time. Happy hunting.
