Extracting the payload directly from Tshark using Python - Part II

Vera Worri

11 min read · Feb 9, 2017

I am continuing to work on the program I created here, building a UI for it. I have also realized that some of the legitimate headers were being dropped. Now, I want to make the sniff continuous.

Working with Pycharm on a Mac, with Python 2.7.12 from Anaconda.

The program is meant to run for a long time. The program will sit, sniff, and save the packets that have a specified content-type (forms is default).

The GitHub repo can be found here.

The basic functionality of the program right now is to:

sniff traffic using tshark and save it to a pcap => (Eavesdrop.py)
read the pcap and break out the HTTP payload => (Eavesdrop.py)
break only the headers out of the http_payloads => (getHTTPHeaders.py)
turn the headers into dictionaries for easy handling => (getHTTPHeaders.py)
get the ones with a “Content-Type” key => (getHTTPHeaders.py)
match the “Content-Type” key with the wanted content type (text, image, form, etc.) => (getHTTPHeaders.py)
break out the payload associated with the matched header => (getHTTPHeaders.py)
save the payload into a file => (Eavesdrop.py)

Eavesdrop.py

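from scapy.all import *
from getHTTPHeaders import HTTPHeaders, extractText
import os
import subprocess
import time


# Longer capture, kept for reference:
# os.system("tshark  -T fields -e _ws.col.Info -e http -e frame.time -e  "
# "data.data -w Eavesdrop_Data.pcap -c 1000")
# x is the tshark command, y is the pcap it writes (both must name the same file),
# and T is the wanted content type
x = "tshark  -T fields -e _ws.col.Info -e http -e frame.time -e data.data -w Eavesdrop_Data.pcap -c 1"
y = 'Eavesdrop_Data.pcap'
T = "form"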


def eavesdrop(x,y,T):
    # run tshark and block until the capture finishes
    #subprocess.Popen(x, shell=True)
    subprocess.call(x, shell=True)
    # wait for the pcap to show up on disk
    while not os.path.exists(y):
        time.sleep(1)
    if os.path.isfile(y):
        a = rdpcap(y)
        sessions = a.sessions()
        text_file = open("Output.txt", "w")
        for session in sessions:
            # stitch together the HTTP payload of every packet in the session
            http_payload = ""
            for packet in sessions[session]:
                # skip anything that is not TCP on port 80
                if TCP in packet and (packet[TCP].dport == 80 or packet[TCP].sport == 80):
                    http_payload += str(packet[TCP].payload)

            headers = HTTPHeaders(http_payload)
            if headers is None:
                continue
            text = extractText(headers,http_payload,T)

            if text is not None:
                try:
                    text_file.write("Payload::  " + '\n' + text + '\n')
                except Exception:
                    text_file.write("Something went wrong" + "\n")
        text_file.close()
eavesdrop(x,y,T)

getHTTPHeaders.py

import re
import zlib


def HTTPHeaders(http_payload):
    try:
        # isolate headers: everything up to the blank line that ends them
        headers_raw = http_payload[:http_payload.index("\r\n\r\n") + 2]
    except ValueError:
        # no blank line, so there is no complete header block
        return None
    regex = r"(?:[\r\n]{0,1})(\w+\-\w+|\w+)(?:\ *:\ *)([^\r\n]*)(?:[\r\n]{0,1})"
    headers = dict(re.findall(regex, headers_raw))
    # only headers that declare a content type are useful downstream
    if 'Content-Type' not in headers:
        return None
    return headers


def extractText(headers, http_payload, type):
    text = None
    try:
        if type in headers['Content-Type']:
            # the body starts after the 4-character "\r\n\r\n" separator
            text = http_payload[http_payload.index("\r\n\r\n") + 4:]
            try:
                # Content-Encoding (not Accept-Encoding) describes this body
                encoding = headers.get('Content-Encoding', '')
                if encoding == "gzip":
                    text = zlib.decompress(text, 16 + zlib.MAX_WBITS)
                elif encoding == "deflate":
                    text = zlib.decompress(text)
            except zlib.error:
                # leave the body as it is if it will not decompress
                pass
    except (KeyError, ValueError):
        return None
    return text
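To make the header/body split concrete, here is a small sketch run on a hand-written HTTP request. It is written for Python 3 and works on bytes, unlike the Python 2.7 code above. The name split_http_message and the sample request are made up for illustration and are not part of the repo.

```python
import gzip
import zlib


def split_http_message(raw):
    """Split raw HTTP bytes into (headers dict, body bytes), or return None."""
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        return None  # no blank line yet, so the header block is incomplete
    headers = {}
    # skip the request/status line, then read "Name: value" pairs
    for line in head.split(b"\r\n")[1:]:
        name, colon, value = line.partition(b":")
        if colon:
            headers[name.strip().decode("latin-1")] = value.strip().decode("latin-1")
    # Content-Encoding says how the body itself was compressed
    encoding = headers.get("Content-Encoding", "").lower()
    if encoding == "gzip":
        body = zlib.decompress(body, 16 + zlib.MAX_WBITS)
    elif encoding == "deflate":
        body = zlib.decompress(body)
    return headers, body


sample = (b"POST /login HTTP/1.1\r\n"
          b"Host: example.com\r\n"
          b"Content-Type: application/x-www-form-urlencoded\r\n"
          b"Content-Encoding: gzip\r\n"
          b"\r\n" + gzip.compress(b"user=vera&pw=secret"))

headers, body = split_http_message(sample)
if "form" in headers.get("Content-Type", ""):
    print(body.decode())
```

Using partition instead of index avoids the off-by-one arithmetic around the blank line, since the separator is consumed whole.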

UserInt.py (the UI that the program is wrapped in)


import npyscreen
from Eavesdrop import eavesdrop
i=0
v = ""


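# same order as the 'Capture Type' choices in the form below;
# tshark expects no space after the colon, e.g. -a duration:60
captureDurationTypeDict =[' -c ',
                           ' -a filesize:' ,
                           ' -a duration:',
                          " -a files:"]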
captureFieldsDict = {
0: ' -e _ws.col.Info ',
    1:' -e http ' ,
    2: ' -e frame.number ',
    3:" -e ip.addr ",

}

outputType = {
0: 'text',
1:'forms' ,
2: 'image',
3:" audio",

}

class EavesdropForm(npyscreen.Form):
    def create(self):
        self.outputType = self.add(npyscreen.TitleSelectOne, max_height=6, name='Wanted Output',
                                     values=['Text','Forms', 'Images', 'Audio'], scroll_exit=True)
        self.captureDurationType  = self.add(npyscreen.TitleSelectOne,max_height=6, name='Capture Type', values=['Capture By Packet Count',"Capture By File Size" ,'Capture By Time Limit', "Capture By Number of Files"],scroll_exit=True)
        self.duration = self.add(npyscreen.TitleText, name="Duration Value: ")
        self.fileName = self.add(npyscreen.TitleFilename, name="Filename:" )
        self.capFields = self.add(npyscreen.TitleMultiSelect, max_height=6, name='Tshark Fields', values=['Info Column', 'HTTP', 'frame Number', 'IP Address'], scroll_exit=True)
        self.capProm= self.add(npyscreen.TitleSelectOne, max_height=6, name='!WARNING! Promiscuous Mode',values=['Promiscuous Mode'], scroll_exit=True)
        self.stats = self.add(npyscreen.TitleMultiSelect, max_height=6, name='Capture Statistics',
                                     values=['Conversations', 'http', 'DNS', 'Endpoints','Follow TCP/UDP'], scroll_exit=True)
def convertToComand(type,dur,fields,filename):
    v="tshark -T fields "
    for items in fields:
         v= v + (captureFieldsDict[items])
    v = v + captureDurationTypeDict[type[0]] + dur + " -w " + filename + ".pcap"
    return v


def myFunction(*args):
    F = EavesdropForm(name = "Eavesdrop")
    F.edit()
    t= F.captureDurationType.value
    wantedOutputIndex = F.outputType.value
    wantedOutput = outputType[wantedOutputIndex[0]]
    val= F.duration.value
    fields = F.capFields.value
    name = F.fileName.value
    d = name + ".pcap"
    stats =F.stats.value
    command = convertToComand(t,val,fields,name)
    output = eavesdrop(command,d, wantedOutput)
    str(output)
    return output

if __name__ == '__main__':
    print (npyscreen.wrapper_basic(myFunction))

Now, it will be a bit simpler in some respects. Because I just need the raw payloads and do not need to decode or parse them in any way, I can forget about getHTTPHeaders.py and the last part of Eavesdrop.py (for now). I can also remove some menu options from UserInt.py. Now it looks like this:

UserInt.py

import npyscreen
from Eavesdrop import eavesdrop
i=0
v = ""


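# same order as the 'Capture Type' choices in the form below;
# tshark expects no space after the colon, e.g. -a duration:60
captureDurationTypeDict =[' -c ',
                           ' -a filesize:' ,
                           ' -a duration:',
                          " -a files:",
                          ""]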
captureFieldsDict = {
0: ' -e _ws.col.Info ',
    1:' -e http ' ,
    2: ' -e frame.number ',
    3:" -e ip.addr ",
    4: ''

}

outputType = {
0: 'text',
1:'forms' ,
2: 'image',
3:" audio",
4: ''
}


class EavesdropForm(npyscreen.Form):
    def create(self):
        self.captureDurationType  = self.add(npyscreen.TitleSelectOne,max_height=6, name='Capture Type', values=['Capture By Packet Count',"Capture By File Size" ,'Capture By Time Limit', "Capture By Number of Files", 'none'],scroll_exit=True)
        self.duration = self.add(npyscreen.TitleText, name="Duration Value: ")
        self.fileName = self.add(npyscreen.TitleFilename, name="Filename:" )
        self.capFields = self.add(npyscreen.TitleMultiSelect, max_height=6, name='Tshark Fields', values=['Info Column', 'HTTP', 'frame Number', 'IP Address', 'none'], scroll_exit=True)
        self.capProm = self.add(npyscreen.TitleSelectOne, max_height=6, name='!WARNING! Promiscuous Mode',values=['Promiscuous Mode', 'none'], scroll_exit=True)






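def convertToComand(type,dur,fields,filename,prom):
    v="tshark"
    # tshark's -p flag turns promiscuous mode OFF, so add it only when
    # 'Promiscuous Mode' (index 0) was not selected
    if prom != [0]:
        v += ' -p'
    if fields != []:
        v += " -T fields "
        for items in fields:
            v += " " + (captureFieldsDict[items])
    if type != []:
        v = v + captureDurationTypeDict[type[0]] + dur
    # only write a pcap when a filename was actually entered
    if filename and filename.strip():
        v = v + " -w " + filename + ".pcap"
    return v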


def myFunction(*args):
    F = EavesdropForm(name = "Eavesdrop")
    F.edit()
    t= F.captureDurationType.value
    promisc = F.capProm.value
    val= F.duration.value
    fields = F.capFields.value
    name = F.fileName.value
    # the pcap that convertToComand tells tshark to write
    d = name + ".pcap"
    command = convertToComand(t,val,fields,name, promisc)
    output = eavesdrop(command,d)
    return output

if __name__ == '__main__':
    npyscreen.wrapper_basic(myFunction)
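As an aside, the same command can be assembled as a list of arguments instead of one long string, which sidesteps spacing mistakes and lets subprocess run it without shell=True. The sketch below is written for Python 3. The names build_tshark_args, AUTOSTOP and FIELDS are hypothetical and only mirror the menu order used above.

```python
# Hypothetical option tables, in the same order as the form's menus
AUTOSTOP = [("-c", ""), ("-a", "filesize:"), ("-a", "duration:"), ("-a", "files:")]
FIELDS = ["_ws.col.Info", "http", "frame.number", "ip.addr"]


def build_tshark_args(stop_type=None, stop_value="", fields=(), filename="", promiscuous=True):
    """Return the tshark command as a list suitable for subprocess.call(args)."""
    args = ["tshark"]
    if not promiscuous:
        args.append("-p")  # -p asks tshark NOT to use promiscuous mode
    if fields:
        args += ["-T", "fields"]
        for index in fields:
            args += ["-e", FIELDS[index]]
    if stop_type is not None and stop_value:
        flag, prefix = AUTOSTOP[stop_type]
        args += [flag, prefix + stop_value]
    if filename:
        args += ["-w", filename + ".pcap"]
    return args


args = build_tshark_args(stop_type=0, stop_value="100", fields=[0, 2], filename="capture")
print(" ".join(args))
```

The list can be handed straight to subprocess.call(args), so nothing the user types into the form is ever interpreted by a shell.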

Eavesdrop.py;

from scapy.all import *
import subprocess


def eavesdrop(x,y):
    # run the capture; subprocess.call returns tshark's exit code once it is done
    p = subprocess.call(x, shell=True)
    data = y
    a = rdpcap(data)
    sessions = a.sessions()
    # collect the raw TCP payload of every session, whatever the port
    http_payload = ""
    for session in sessions:
        for packet in sessions[session]:
            if TCP in packet:
                http_payload += str(packet[TCP].payload)

    text_file = open("Output.txt", "w")
    text_file.write("Payload:: %s" % http_payload)
    text_file.close()

As you can see, in eavesdrop.py, I just write the http_payload directly into the file. I also do not bother about the port of the TCP payload.

Now, I have to figure out how to get the packets into python as tshark sniffs them and avoid having to save them into a pcap file.

First, I used the -l flag from the tshark documentation, which flushes standard output after each packet so the output can be piped as it arrives. Then I figured I could read it from the subprocess stdout.
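Reading a child process line by line looks roughly like the sketch below (Python 3). To keep it runnable without tshark installed, a tiny Python child stands in for the sniffer; stream_lines and fake_sniffer are illustrative names, not code from the repo.

```python
import subprocess
import sys


def stream_lines(args):
    """Yield each line of a child process's stdout as soon as it is written."""
    proc = subprocess.Popen(args, stdout=subprocess.PIPE, universal_newlines=True)
    try:
        for line in proc.stdout:
            yield line.rstrip("\n")
    finally:
        # make sure the pipe is closed and the child is reaped
        proc.stdout.close()
        proc.wait()


# Stand-in for something like ["tshark", "-l", "-V"]: a child that prints three lines
fake_sniffer = [sys.executable, "-c", "for i in range(3): print('packet', i)"]
for line in stream_lines(fake_sniffer):
    print(line)
```

With a real capture the loop simply never ends until tshark exits, which is the behaviour a continuous sniff needs.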

I want to start modularizing my code a bit, so I create an Eavesdrop class and start adding methods to it. This will make it easier when I start building on my app (the UI).

import subprocess


class Eavesdrop():
    #print device list
    def __init__(self):
        p = subprocess.Popen("tshark -D", stdout=subprocess.PIPE, shell=True)
        self.devList = p.communicate()[0]

    # use to print tshark versions and options then exits
    def info(self):
        p = subprocess.Popen("tshark -h", stdout=subprocess.PIPE, shell=True)
        result = p.communicate()
        self.devList = result[0]

    def contSniff(self):
        p = subprocess.Popen("tshark -V  -p -l  -S '::::END OF PACKET::::::' ", stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE, shell=True)
        data = ""
        # readline returns '' once tshark exits, which ends the loop
        for line in iter(p.stdout.readline, ''):
            if ('::::END OF PACKET::::::' not in line):
                # still inside the current packet
                data += line
            else:
                # separator reached: print the finished packet and start over
                print data
                data = ""

I changed the command I used for tshark to mark the end of each packet as well as to allow piping the output to Python.

The output for the contSniff() instance looks like the above picture.

Lovely! This output is the same as if you opened up all of the sections in a wireshark sniff… and I mean all of them.
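Since every packet is now delimited by the separator string, the stream of lines can be regrouped into one chunk of text per packet. A minimal Python 3 sketch of that grouping follows; split_packets and the sample lines are made up for illustration.

```python
SEPARATOR = "::::END OF PACKET::::::"


def split_packets(lines, separator=SEPARATOR):
    """Group an iterable of text lines into one string per packet."""
    current = []
    for line in lines:
        if separator in line:
            if current:
                yield "\n".join(current)
            current = []
        else:
            current.append(line)
    if current:
        yield "\n".join(current)  # trailing packet with no closing separator


sample_lines = [
    "Frame 1: 66 bytes on wire",
    "    Frame Length: 66 bytes",
    SEPARATOR,
    "Frame 2: 54 bytes on wire",
    "    Frame Length: 54 bytes",
]
for packet in split_packets(sample_lines):
    print(packet)
    print("-" * 20)
```

Because it is a generator, it can sit directly on top of a live stdout reader and hand over each packet the moment its separator shows up.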

If you use tshark or the like, you will notice that (if you are only capturing a certain quantity) when the capture is finished, it reports how many packets were captured and how many were dropped. If I understand correctly, sometimes the packets come in so quickly that they cannot be captured in their entirety.

Because I am reading directly from the stdout, I get EVERYTHING. Luckily, these packets are labeled “Malformed”. I can add a metric to keep track of them so we can see if there is a problem at any time.
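A metric like that could be as small as the following Python 3 sketch. It assumes the verbose output marks such packets with a label along the lines of "[Malformed Packet: HTTP]", and it matches case-insensitively to be safe; count_malformed and the sample packets are illustrative.

```python
def count_malformed(packets):
    """Count packets whose verbose text carries a 'malformed packet' label."""
    return sum(1 for packet in packets if "malformed packet" in packet.lower())


packets = [
    "Frame 1\n    [Protocols in frame: eth:ip:tcp:http]",
    "Frame 2\n[Malformed Packet: HTTP]",
    "Frame 3\n    [Protocols in frame: eth:ip:udp:dns]",
]
print(count_malformed(packets), "of", len(packets), "packets malformed")
```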

I have been thinking about how to get the packets in a form that would allow for exploration and syphoning. I am having trouble with scope and also parsing the text into something like a dictionary. Ideally, it would be like in Wireshark where all the information is nested and you can explore depending on the headers.

If I just do something like:

if 'form-data' in data:
    print data

This code does pull out the forms that come in and out but I want something a bit more precise but flexible.
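One way to be more precise is to look only at the Content-Type line rather than anywhere in the packet text. The Python 3 sketch below assumes the verbose output prints each header on its own indented line and shows the trailing line ending as a literal \r\n; content_types, wants and both sample packets are made up for illustration.

```python
import re

# capture the media type up to a ';', a literal backslash, or the end of the line
CONTENT_TYPE = re.compile(r"^\s*Content-Type:\s*([^;\\\r\n]+)", re.IGNORECASE | re.MULTILINE)


def content_types(packet_text):
    """Return the media types named on Content-Type lines, lower-cased."""
    return [match.strip().lower() for match in CONTENT_TYPE.findall(packet_text)]


def wants(packet_text, wanted="multipart/form-data"):
    """True when some Content-Type line starts with the wanted media type."""
    return any(ct.startswith(wanted) for ct in content_types(packet_text))


form_packet = (
    "Hypertext Transfer Protocol\n"
    "    POST /upload HTTP/1.1\\r\\n\n"
    "    Content-Type: multipart/form-data; boundary=----abc\\r\\n\n"
)
page_packet = (
    "Hypertext Transfer Protocol\n"
    "    Content-Type: text/html\\r\\n\n"
    "    Line-based text data: how to post form-data\n"
)
print(wants(form_packet), wants(page_packet))
```

The second sample is the case the plain substring test gets wrong: the page merely mentions form-data in its body, so it should not be saved as a form.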

While researching the best way to get this done, I revamp the UI (still using npyscreen). I delete all of the other options and modify the Eavesdrop class. Now, I can call Eavesdrop().getList() to get a list of all devices available to tshark.

The command line tshark -D returns a string form of a numbered list:

1. en0 (Wi-Fi) 2. bridge0 (Thunderbolt Bridge) 3. p2p0 4. awdl0 5. utun0 6. en1 (Thunderbolt 1) 7. en2 (Thunderbolt 2) 8. lo0 (Loopback) 9. gif0 10. stf0 11. cisco (Cisco remote capture) 12. randpkt (Random packet generator) 13. ssh (SSH remote capture)

If you just split the string, you would end up with a funky list, and because I am not sure what kind of devices the program may return, I use regex (everyone's favorite) to split the string on each number and period.

re.compile("[0-9]+[.] ").split(devlist)  # [0-9]+ so that two-digit numbers such as "10." are consumed whole

I feed the result into an instance of npyscreen.TitleSelectOne in the value parameter and voilà!
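A slightly stricter alternative is to match each line instead of splitting the whole string, which also keeps the number, name and description apart. This Python 3 sketch assumes tshark -D prints one numbered interface per line; parse_device_list is a hypothetical helper and the sample is a shortened copy of the list above.

```python
import re

# "<number>. <name>" optionally followed by " (<description>)"
DEVICE_LINE = re.compile(r"^\s*(\d+)\.\s+(\S+)(?:\s+\((.*)\))?\s*$")


def parse_device_list(text):
    """Return (number, name, description) tuples; description may be None."""
    devices = []
    for line in text.splitlines():
        match = DEVICE_LINE.match(line)
        if match:
            number, name, description = match.groups()
            devices.append((int(number), name, description))
    return devices


sample = (
    "1. en0 (Wi-Fi)\n"
    "2. bridge0 (Thunderbolt Bridge)\n"
    "3. p2p0\n"
    "10. stf0\n"
    "11. cisco (Cisco remote capture)\n"
)
for number, name, description in parse_device_list(sample):
    print(number, name, description)
```

The display strings for the menu can then be built from the tuples, while the bare name stays available for tshark's -i option.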

Eavesdrop.py:

import re
import subprocess


class Eavesdrop():
    #print device list
    def __init__(self):
        pass

    def getList(self):
        p = subprocess.Popen("tshark -D", stdout=subprocess.PIPE, shell=True)
        devlist = p.communicate()[0]
        # [0-9]+ so that two-digit numbers such as "10." are consumed whole
        self.devList = re.compile("[0-9]+[.] ").split(devlist)
        self.devList.pop(0)
        return self.devList

    # use to print tshark versions and options then exits
    def info(self):
        p = subprocess.Popen("tshark -h", stdout=subprocess.PIPE, shell=True)
        result = p.communicate()
        self.devList = result[0]
    def contSniff(self):
        p = subprocess.Popen("tshark -V  -p -l  -S '::::END OF PACKET::::::' ", stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE, shell=True)
        data = ""
        # readline returns '' once tshark exits, which ends the loop
        for line in iter(p.stdout.readline, ''):
            if ('::::END OF PACKET::::::' not in line):
                data += line
            else:
                # separator reached: print the finished packet and start over
                print data
                data = ""

UserInt.py

import npyscreen
from Eavesdrop import Eavesdrop


class EavesdropForm(npyscreen.Form):
    def create(self):
        self.captureDevice = self.add(npyscreen.TitleSelectOne,max_height=20, name='Capture Device', values= Eavesdrop().getList(),scroll_exit=True)


def myFunction(*args):
    F = EavesdropForm(name = "Eavesdrop")
    F.edit()
    dev = F.captureDevice.value
    return dev

if __name__ == '__main__':
    print npyscreen.wrapper_basic(myFunction)

I get this in the terminal:

I make the max_height parameter larger to accommodate the length of the list. I may take it down later but the scroll will always be set to True. When I enter an option and press ok, I get the index of the option that corresponds to the original list.
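That index still has to be turned back into something tshark understands. A minimal Python 3 sketch, assuming the widget's value is a one-element list such as [1] and that the device strings look like the ones getList() returns; interface_args is a hypothetical helper.

```python
def interface_args(devices, selection):
    """Map an npyscreen-style selection such as [1] to tshark's -i option."""
    if not selection:
        return []  # nothing chosen: let tshark pick its default interface
    entry = devices[selection[0]].strip()
    name = entry.split()[0]  # "en0 (Wi-Fi)" -> "en0"
    return ["-i", name]


devices = [" en0 (Wi-Fi)\n", " bridge0 (Thunderbolt Bridge)\n", " p2p0\n"]
print(["tshark"] + interface_args(devices, [1]))
```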

Now, back to the bigger problem. The packet output is a “tabbed” tree. This means that the nesting I implement will be based on how many “tabs” precede the line. I say “tabbed” because they are not really tabs, they are just four spaces.

I do not see any way to do this without regex. *throws tantrum*

First, I try to circumvent this unfortunate inevitability with json.dumps(). And…

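The output doesn’t look right. When I check for type, I get back <type ‘str’>, because json.dumps() only serializes what it is given into a JSON string; it does not parse anything. Shucks!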
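A quick Python 3 illustration of why that happens: json.dumps() always returns a string, and it only produces a nested structure when it is handed one. The sample text and the tree dict are made up.

```python
import json

raw = "Frame 1: 66 bytes\n    Arrival Time: Feb  9, 2017"

# Dumping a plain string just wraps it in quotes and escapes the newline
as_json = json.dumps(raw)
print(type(as_json))
print(json.loads(as_json) == raw)

# The nesting has to exist in Python first; json only writes it out
tree = {"Frame 1: 66 bytes": {"Arrival Time": "Feb  9, 2017"}}
print(json.dumps(tree, indent=2))
```

So the parsing step cannot be skipped; json is only useful afterwards, for displaying or saving the parsed result.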

So I went in with a hammer to iterate through the returned string looking for how many spaces the information has been indented.

First, I made variables to house the key, first value, and second value. Then I put in if statements to take care of the conditions.

I. No tabs at all

II. Only one tab (4 spaces)

III. Two tabs (8 spaces)

The function looks like this:

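import re
from collections import OrderedDict


def parsePacket(pkt):
    patt = re.compile("[^\n]+")
    x = patt.findall(pkt)
    key = None
    val1 = None
    que = OrderedDict()
    for l in x:
        # count the leading spaces; a plain "in" test cannot tell 4 from 8
        indent = len(l) - len(l.lstrip(" "))
        if indent == 0:
            # no tabs: a new top-level section
            key = l.strip()
            val1 = None
            que[key] = OrderedDict()
        elif key is None:
            # indented line before any section header, nothing to attach it to
            continue
        elif indent <= 4 or val1 is None:
            # one tab: first-level value under the current key
            val1 = l.strip()
            que[key][val1] = []
        else:
            # two tabs (or more): second-level value under val1
            val2 = l.strip()
            que[key][val1].append(val2)
    return que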

When I defined que as a regular dict, the ordering was not uniform so I had to use an ordered dict.

If anyone knows of a more elegant solution, give me a shout!
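One candidate, sketched here in Python 3 and not tested against real captures, is to keep a stack of the currently open branches so that any indentation depth is handled by the same few lines. It assumes every level is indented by four spaces; parse_indented and the sample text are made up, and repeated lines under the same parent would overwrite each other.

```python
from collections import OrderedDict


def parse_indented(text, indent_width=4):
    """Turn indented text into nested OrderedDicts keyed by the stripped line."""
    root = OrderedDict()
    stack = [(-1, root)]  # (depth, children dict) from the root down to the open branch
    for line in text.splitlines():
        if not line.strip():
            continue
        depth = (len(line) - len(line.lstrip(" "))) // indent_width
        # close every branch that is at the same depth or deeper
        while stack[-1][0] >= depth:
            stack.pop()
        children = OrderedDict()
        stack[-1][1][line.strip()] = children
        stack.append((depth, children))
    return root


sample = (
    "Frame 1\n"
    "    Arrival Time: now\n"
    "    Frame Length: 66\n"
    "Ethernet II\n"
    "    Destination: aa:bb\n"
    "        Address: aa:bb\n"
    "            LG bit: 0\n"
)
tree = parse_indented(sample)
print(list(tree["Ethernet II"]["Destination: aa:bb"]["Address: aa:bb"]))
```

Leaves end up as empty dicts, which keeps every level uniform and makes the result easy to walk recursively.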

The loop, after running for a few minutes, maxes out at 0.00813007354736 in nanoseconds (the program will run continuously).

Now, I want to give some love to my UI. I use npyscreen.NPSAppManaged because I want to add different screens and more functionality. Then I move the myFunction block into on_ok(), a method that npyscreen's action forms call when the user presses the OK button, so it lets me dictate what happens at that point. I also create an array of possible content types for the user to select, but I may just make it a text input to give the user more control over the syphoning of relevant packets.

UserInt.py

import npyscreen
from Eavesdrop import Eavesdrop

contentType = ['multipart/form-data','text','video','audio', 'image']


class EavesdropApp(npyscreen.NPSAppManaged):
    def onStart(self):
        self.addForm("MAIN", EavesdropForm, name="Sniffing Parameters")


class EavesdropForm(npyscreen.ActionFormV2):
    def afterEditing(self):
        self.parentApp.setNextForm(None)
    def create(self):
        self.captureDevice = self.add(npyscreen.TitleSelectOne,max_height=20, name='Capture Device', values= Eavesdrop().getList(),scroll_exit=True)
        self.Contenttype = self.add(npyscreen.TitleSelectOne, max_height=20, name='Content Type',
                                      values=contentType, scroll_exit=True)


    def on_ok(self):
        #npyscreen.notify_confirm("Starting Sniff:     ")
        F = EavesdropForm(name="Eavesdrop")
        F.edit()
        dev = F.captureDevice.value
        type = F.Contenttype.value
        Sniffoutput = Eavesdrop().contSniff()
        return Sniffoutput





if __name__ == '__main__':
    EavesdropApp().run()

Eavesdrop.py:

import re
import subprocess
from collections import OrderedDict


class Eavesdrop():
    # print device list
    def __init__(self):
        pass

    def getList(self):
        p = subprocess.Popen("tshark -D", stdout=subprocess.PIPE, shell=True)
        devlist = p.communicate()[0]
        # [0-9]+ so that two-digit numbers such as "10." are consumed whole
        self.devList = re.compile("[0-9]+[.] ").split(devlist)
        self.devList.pop(0)
        return self.devList

    # use to print tshark versions and options then exits
    def info(self):
        p = subprocess.Popen("tshark -h", stdout=subprocess.PIPE, shell=True)
        result = p.communicate()
        self.devList = result[0]

    # Parse the packets into ordered dict
    def parsePacket(self, pkt):
        patt = re.compile("[^\n]+")
        x = patt.findall(pkt)
        key = None
        val1 = None
        que = OrderedDict()
        for l in x:
            # count the leading spaces; a plain "in" test cannot tell 4 from 8
            indent = len(l) - len(l.lstrip(" "))
            if indent == 0:
                key = l.strip()
                val1 = None
                que[key] = OrderedDict()
            elif key is None:
                continue
            elif indent <= 4 or val1 is None:
                val1 = l.strip()
                que[key][val1] = []
            else:
                val2 = l.strip()
                que[key][val1].append(val2)
        return que

    def contSniff(self):
        count = 0
        data = ""
        p = subprocess.Popen("tshark -V  -l -p  -S '::::END OF PACKET::::::' ", stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE, shell=True)
        # readline returns '' once tshark exits, which ends the loop
        for line in iter(p.stdout.readline, ''):
            if ('::::END OF PACKET::::::' not in line):
                data += line
            else:
                packet = data
                data = ""
                # keep a tally of packets that tshark flags as malformed
                if "malformed" in packet.lower():
                    count += 1
                print(self.parsePacket(packet))

I am nowhere near done and I will keep writing about this process. At this point, I have two functional problems and a long list of expansions I would like to make.

Functional problems:

-[ ] The TUI will not run the sniff if I press the ok button but it will run if I press the cancel button
-[ ] there *may* be indents greater than 8 spaces to worry about

Expansions in the works:

-[ ] Better search for the packets
-[ ] let the user choose what type of payload to save
-[ ] let the user go in and see how many of each kind of request and response the device has been receiving
-[ ] choose where the payload is being stored through the TUI
-[ ] a kill switch for the sniff
-[ ] count/ see a graph of malformed packets to see if Hal is misbehaving
...etc
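For the kill switch, one possible starting point is to keep the Popen object around and stop it on request. The Python 3 sketch below uses a sleeping Python child as a stand-in for a long-running tshark so it can run anywhere; stop_sniff is a hypothetical helper, not something that exists in the repo yet.

```python
import subprocess
import sys


def stop_sniff(proc, timeout=5):
    """Ask a running capture process to stop, and force it if it does not."""
    if proc.poll() is None:  # still running
        proc.terminate()
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
    return proc.returncode


# Stand-in for a long-running tshark: a child that sleeps for a minute
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
stop_sniff(proc)
print("still running?", proc.poll() is None)
```

In the TUI this could be wired to a button or a key binding, with the form holding on to the process object between screens.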

If you have any ideas/advice feel free to comment or contact me:

Github =>github.com/vworri

personal website => vworri.github.io

I will post part III soon… hopefully.

Written by Vera Worri