Generate and Analyze (encrypted) Web Browsing Traffic like a PRO

Without any hassle, in a simple, isolated Docker container with built-in browsers, VNC, and Wireshark.

Published in CodeX · 9 min read · Mar 2, 2023

Despite the rise of encrypted protocols (e.g., DNS-over-HTTPS, TLS 1.3 with Encrypted Client Hello (ECH), QUIC), website fingerprinting has become an interesting research topic.
Why? Because we tend to forget that the application we use most often is our web browser. Be it simple web content consumption, reading your favorite news channels, listening to music on Spotify, watching cooking videos on YouTube, or even using Microsoft web applications to write reports for a project in Linux environments, your daily productivity depends heavily on a browser.
Accordingly, if a third party (e.g., your ISP, a malicious entity on the same public WiFi network, an authoritarian regime) can eavesdrop on your web browsing traffic and identify the domains you are visiting, they can profile you. And since content changes so often today, it no longer matters much what exactly you consume at a given time…it is enough to see where that content is coming from.

Today, even if you tunnel everything through the Tor network (in other words, you use the Tor Browser), it is still possible to identify the website/domain you are visiting with (relatively) high accuracy from the encrypted packet trace that any intermediate node between your PC and the destination might have captured. Here, I don’t want to jump into machine learning and discuss how it is done; I would rather focus on how you can see, capture, and understand your web traffic, which can also be beneficial if you want to kickstart a research career in machine learning and website fingerprinting.

All I see is a mess

What can you do if you are unfamiliar with the website fingerprinting concept, how your traffic looks on the wire, and how much data you reveal daily just by accessing the World Wide Web? You might install Wireshark, start it, and select your wireless/wired interface to capture on. And you quickly see hundreds of thousands of packets going back and forth in a very short time. You can feel that if you do not stop the packet capture within a few minutes, Wireshark might eat up all your memory and crash.

So you stop it before it is too late, stare at the screen, and quickly come to the conclusion that you are in a different league. You cannot even distinguish which of the thousands of packets were generated by your browser and which ones come from your operating system and other background processes (e.g., OS updates, applications, preceding DNS packets, WiFi control packets, Address Resolution Protocol packets, mDNS packets in your local network), not to mention that most packets will be encrypted anyway.

Why not Docker?

This is why lightweight isolation techniques, such as Docker containers, come in handy. A container isolates the application space, its storage, and its network. Therefore, anything you run in a container can be kept under control. The traffic leaving and entering the container is easy to capture because a virtual Ethernet (veth) interface is created on the host for exactly this container. This also means you won’t see any background traffic/noise of your host system: by selecting the virtual interface, the filtering is already done.

From now on, I assume you know what I am talking about when I mention Docker. Accordingly, I will not write another post praising containerization and its benefits; you can find plenty of those online. I further assume you have installed the required Docker packages on your system and that docker-compose is ready to use.

Instead, I will walk you through quickly generating, capturing, and analyzing web traffic. I developed a container for this purpose. This container has a graphical user interface. Instead of running it with tricky, platform-dependent docker arguments to get the GUI visible on your screen (e.g., DISPLAY environment variables, --network=host), I found it easier to add VNC support to it.

However, my container is still lightweight: it does not have a full-fledged desktop environment.

The source of the container, in case you want to modify it or rebuild it yourself with the latest browsers, is available on GitHub. The image is also available on Docker Hub, so you can easily pull it.

Let’s see how it works

You can either clone the GitHub repository to get the docker-compose.yml file or copy-paste it from here.

version: '3.6'

services:
  firefox:
    image: cslev/webbrowser_docker:latest
    container_name: webbrowser
    ports:
      - target: 5900
        published: 5555
        protocol: 'tcp'
    volumes:
      - '/etc/localtime:/etc/localtime:ro'
      - './container_data/cache/:/root/.cache:rw'
      - './container_data/mozilla:/root/.mozilla:rw'
      - './container_data/config:/root/.config:rw'
      - './container_data/SSL:/root/SSL:rw'
    dns:
      - 9.9.9.9
      - 1.1.1.1
    shm_size: '2gb'

Save this file as docker-compose.yml, then run the container by issuing the following command in the same directory where the docker-compose.yml is saved.

# docker-compose up -d
Creating network "webbrowser_docker_default" with the default driver
Creating webbrowser ... done

Let’s briefly explain what this set of instructions does. It launches the container with the name webbrowser. The VNC port (5900) in the container will be exposed as TCP port 5555 on your localhost (modify it if you already have a service on your host listening on port 5555). As a result, a VNC session you initiate to localhost:5555 will be forwarded to the container’s VNC server.

Then, the next instruction (/etc/localtime) sets the container’s timezone to your current timezone. The remaining volume entries create sub-directories on your host system and bind them to the container’s directories where the browsers store their data. The purpose is to have easily accessible persistent storage (i.e., the created sub-directories next to your docker-compose.yml) for any changes you make to the browsers. For instance, say you set Firefox to use a DNS-over-HTTPS resolver. If you stop and remove the container later, all your modifications will still be stored on your host machine. Hence, when you recreate/restart the container from scratch, your previous changes will still be in effect.

Last but not least, the dns settings configure the DNS resolvers to use within the container. You can freely modify them according to your needs. And shm_size increases the shared memory (Docker’s default is 64MB) to 2GB to make the browsers run smoothly.
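Before opening a VNC client, you may want to check that the port mapping really leads to a VNC server. The sketch below is only an illustration, not part of the container: it assumes the published port 5555 from the compose file above and relies on the fact that an RFB (VNC) server greets every client with a 12-byte version string such as "RFB 003.008". The function names are made up for this example.

```python
import socket


def parse_rfb_banner(data: bytes):
    """Parse the 12-byte RFB ProtocolVersion greeting, e.g. b"RFB 003.008\\n".

    Returns (major, minor), or None if the bytes are not an RFB greeting.
    """
    if len(data) != 12 or not data.startswith(b"RFB "):
        return None
    if data[7:8] != b"." or data[11:12] != b"\n":
        return None
    try:
        major = int(data[4:7].decode("ascii"))
        minor = int(data[8:11].decode("ascii"))
    except ValueError:
        return None
    return major, minor


def read_rfb_banner(host="127.0.0.1", port=5555, timeout=3.0):
    """Connect to the published VNC port and read the server greeting."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        data = b""
        while len(data) < 12:
            chunk = sock.recv(12 - len(data))
            if not chunk:  # server closed the connection early
                break
            data += chunk
    return parse_rfb_banner(data)


if __name__ == "__main__":
    try:
        version = read_rfb_banner()
        print("RFB version:", version)
    except OSError as exc:
        print(f"VNC port not reachable: {exc}")
```

A bare TCP connect may succeed as long as anything is listening on the published port, so reading the greeting is the more telling check.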

Once the container is up and running, grab your favorite VNC client and connect to localhost:5555. I will be using Remmina, as it is a full-fledged remote desktop client I already have installed on my system for other purposes.

Add a new connection to Remmina and set the VNC server to be 127.0.0.1:5555.

After hitting Save and Connect, you should see the following desktop tunneled over VNC.

No desktop environment is installed; only an xterm is shown in the graphical user interface.

As you can see, no desktop environment is installed. You can also observe the absence of a window manager. Accordingly, please do not quit the xterm (by accidentally pressing Ctrl+D, or typing quit/logout/exit), otherwise you get stuck in the Matrix…and have to restart the container.

Just type firefox into the xterm window to run the Firefox browser. For Chrome or Brave, use the following:

# google-chrome-stable --no-sandbox
# brave-browser-stable --no-sandbox

Since the container has no non-root users, we run the browsers as root. As the purpose is only web traffic generation, I did not put any effort into creating non-root users, etc. Chromium-based browsers such as Chrome and Brave refuse to start as root with their sandbox enabled, so just use the --no-sandbox option, and the browsers will still come up, ready to use.

In any case, if you want to quit a browser, just close all its tabs. Recall that due to the lack of a desktop environment and window manager, there is no X you could click on to close the app in the usual way.

Capture web traffic

Now, on your host, run Wireshark and select the interface of the container. This step can be tricky, however, especially if you have several containers running in the background. In the past, I created a simple-to-use shell script for this very issue. Please download the script via:

$ git clone https://github.com/cslev/find_veth_docker
$ cd find_veth_docker

Then, you can launch it, and it will list the containers you run and their corresponding interfaces. The script, however, relies on some tools being installed in the containers, too. Hence, for some of your containers, you might not see what you are looking for. This is not an issue in our case, as the webbrowser_docker image has all such tools installed. After running the script, we can see the name of the virtual Ethernet device the container is using.

$ ./find_veth_docker.sh 
VETH@HOST CONTAINER
vethb09bc7f webbrowser

Select this interface in Wireshark.
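If you are curious how such a mapping can be derived by hand, here is a minimal sketch. It is not necessarily how the script works; it illustrates one common technique. Inside the container, /sys/class/net/eth0/iflink holds the interface index of the host-side end of the veth pair, and on the host every interface exposes its own index in /sys/class/net/<name>/ifindex. The sketch assumes the container’s interface is called eth0 (the usual name on Docker’s default bridge network) and that cat is available in the container; the function names are made up for this example.

```python
import subprocess
from pathlib import Path


def container_peer_ifindex(container="webbrowser", iface="eth0"):
    """Ask the container for the ifindex of the host-side peer of its interface."""
    out = subprocess.run(
        ["docker", "exec", container, "cat", f"/sys/class/net/{iface}/iflink"],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.strip())


def host_ifindexes(sys_net=Path("/sys/class/net")):
    """Map every host interface name to its interface index."""
    result = {}
    for entry in sys_net.iterdir():
        idx_file = entry / "ifindex"
        if idx_file.is_file():
            result[entry.name] = int(idx_file.read_text().strip())
    return result


def find_veth(peer_ifindex, ifindexes):
    """Return the host interface whose ifindex equals peer_ifindex, or None."""
    for name, idx in ifindexes.items():
        if idx == peer_ifindex:
            return name
    return None


if __name__ == "__main__":
    try:
        print(find_veth(container_peer_ifindex(), host_ifindexes()))
    except (OSError, ValueError, subprocess.CalledProcessError) as exc:
        print(f"lookup failed: {exc}")
```

The lookup itself is kept in a pure function (find_veth), so it can be tried out with made-up index values without Docker running.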

Now, if you open and use any of the browsers in the container to visit a domain, you will be able to capture its corresponding traffic.

First of all, you will instantly see a lot of traffic regarding your VNC connection. Hence, the very first thing we have to do before visiting a domain is to apply the following simple filter in Wireshark to filter those unwanted VNC packets out.

!(tcp.port == 5900)
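The display filter above only hides the VNC packets; they are still captured and kept in memory. If you prefer the command line, a capture filter drops them before they are stored. The following is a sketch under the assumption that tshark (Wireshark’s command-line companion) is installed on the host; the interface name is the one found above, while the output file name and the 30-second duration are arbitrary example values.

```python
import shlex


def tshark_capture_cmd(interface, out_file, vnc_port=5900, seconds=30):
    """Build a tshark command that captures everything except VNC traffic."""
    return [
        "tshark",
        "-i", interface,                   # capture on the container's veth
        "-f", f"not tcp port {vnc_port}",  # capture (BPF) filter: skip VNC
        "-a", f"duration:{seconds}",       # stop automatically after N seconds
        "-w", out_file,                    # write raw packets to this file
    ]


if __name__ == "__main__":
    cmd = tshark_capture_cmd("vethb09bc7f", "google.pcapng")
    # Print the command so it can be pasted into a shell on the host.
    print(shlex.join(cmd))
```

Note that capture filters use the BPF syntax (not tcp port 5900), which differs from Wireshark’s display filter syntax used above.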

Now, run Firefox, for instance, and visit google.com. The corresponding trace will look similar to the following.

The captured traffic trace when visiting google.com using Firefox within the container

What about the encrypted packets?

If you have reached this point, you are indeed interested in understanding how your browser communicates and what messages it sends to where. One crucial line in the docker-compose.yml above was not discussed in detail.
That line is the following.

- './container_data/SSL:/root/SSL:rw'

In a nutshell, browsers like Firefox honor an environment variable called $SSLKEYLOGFILE. If this variable is set, either in the user’s .bashrc or profile (as it is in our container) or temporarily in the terminal before starting a browser, the browser writes the TLS session secrets (i.e., the secrets from which the symmetric keys encrypting the communication between the client and the visited servers are derived) to the file the variable points to. In the container, the .bashrc of the root user already sets $SSLKEYLOGFILE to an ssl-key.log file in the /root/SSL/ directory, and the above line in the docker-compose.yml makes that directory accessible on the host under ./container_data/SSL. So, you are done here.
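To get a feeling for what ends up in that file, here is a small sketch that summarizes it. It assumes the NSS key log format that browsers write: one entry per line, consisting of a label, the client random in hex (64 characters), and a secret in hex, with lines starting with # being comments. The function name is made up for this example, and the path is the host-side location from the compose file above.

```python
from collections import Counter
from pathlib import Path


def summarize_keylog(text):
    """Count key log entries per label, skipping comments and malformed lines."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) != 3:
            continue
        label, client_random, secret = parts
        # The client random is 32 bytes, i.e. 64 hex characters.
        # (Other legacy line types are deliberately ignored in this sketch.)
        if len(client_random) != 64:
            continue
        try:
            bytes.fromhex(client_random)
            bytes.fromhex(secret)
        except ValueError:
            continue
        counts[label] += 1
    return counts


if __name__ == "__main__":
    keylog = Path("./container_data/SSL/ssl-key.log")
    if keylog.is_file():
        for label, n in summarize_keylog(keylog.read_text()).most_common():
            print(f"{n:6d}  {label}")
    else:
        print(f"{keylog} not found; visit a site in the container first")
```

TLS 1.2 sessions show up as CLIENT_RANDOM entries, while TLS 1.3 sessions produce several labels per connection, such as CLIENT_HANDSHAKE_TRAFFIC_SECRET and SERVER_TRAFFIC_SECRET_0.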

This file can then be fed to Wireshark, which will decrypt all packets for which the keys are available in the file. To do this, go to Edit -> Preferences and select Protocols on the left-hand side.

Edit -> Preferences: Select Protocols on the left-hand side.

Scroll down until you find TLS, browse for the ssl-key.log file in the (Pre)-Master-Secret log filename field, then hit OK.

Browse the corresponding ssl-key.log file

Afterward, many lines in the Wireshark output will turn green (with Wireshark’s default coloring rules, green marks HTTP traffic). Those lines are the packets decrypted using the keys stored in the ssl-key.log file.

Decrypted HTTP messages after connecting to cloudflare.com.
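The same decryption also works without the GUI. The sketch below builds a tshark command that reads a saved capture and sets the TLS key log preference on the command line; tls.keylog_file is the preference behind the (Pre)-Master-Secret log filename field in Wireshark 3.0 and later. It assumes tshark is installed on the host; the capture file name is a placeholder, and the function name is made up for this example.

```python
import shlex


def tshark_decrypt_cmd(pcap_file, keylog_file, display_filter="http || http2"):
    """Build a tshark command that decrypts TLS using a key log file."""
    return [
        "tshark",
        "-r", pcap_file,                         # read packets from a saved capture
        "-o", f"tls.keylog_file:{keylog_file}",  # same setting as in the GUI dialog
        "-Y", display_filter,                    # show only decrypted HTTP traffic
    ]


if __name__ == "__main__":
    cmd = tshark_decrypt_cmd("capture.pcapng", "./container_data/SSL/ssl-key.log")
    # Print the command so it can be pasted into a shell on the host.
    print(shlex.join(cmd))
```

Without matching keys in the log file, the filter simply yields no HTTP lines for that connection, which is a quick way to tell whether decryption worked.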

In the next blog post, I will introduce you to a new set of tools that can be used to automate the whole process you just learned from this post.
In particular, I will show you how to run a container that visits thousands of domains sequentially, using the encryption protocols (e.g., DNS-over-HTTPS, QUIC) you prefer or even disallowing them, capturing the corresponding packet traces, and processing them straightaway into lightweight .csv files instead of heavyweight PCAP files.
Later on, those .csv files can be fed into a machine learning application to train a model and do many interesting things afterward.