Jupyter Notebook: Log Parsing and Regex Utilization

Nowadays, we are provided with many tools to parse a variety of logs, whether to conduct a threat hunt, a forensic investigation, or just to browse around the environment. There are Autopsy, OSForensics, Splunk, and many more tools to ease our job. However, there are times when such tools are not available during an investigation. In lieu of those tools, Python or a Jupyter Notebook may be utilized to parse the logs.

Jupyter Notebook

In order to parse these logs, we have to understand how they are written and the pattern of each log. To that end, I’ll be using some sample cases:

  1. Windows event log
  2. Linux login/logout
  3. Fortiweb

Before we move on to each example, I’ll explain the steps I took in approaching this problem. The first is knowing the format of the log file we’re going to work with: whether it is the kind that can be opened with a basic text editor, or a special one that can’t be, such as a Windows event log, Linux utmp, etc. Python can open a basic text log right away, and for files like XML, JSON, etc., although you can read them with the basic open-file command, there are also Python libraries to parse those formats. The same goes for logs that can’t be read using a basic text editor: Python has a variety of libraries capable of parsing those too.
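As a minimal illustration of this first step (the file names here are placeholders), a plain-text log can be read with the basic open-file command, while a JSON log is better handed to the json library:

import json

#Plain-text log: the basic open file command is enough
with open('sample.log', 'r', errors='ignore') as f:
    lines = f.read().split('\n')

#JSON log: let the json library separate the key-value pairs for us
with open('sample.json', 'r') as f:
    data = json.load(f)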

The second is understanding the pattern in which each log entry is written. Understanding the log pattern allows us to choose which parts of each entry we need to extract for our case. Files parsed using a specific library generally come out as a dictionary/tree variable type, which means the key-value pairs are already separated nicely. However, if the file was read using the basic open-file function, we’ll need to parse the entries with regex to get the parameters ourselves. The extracted data is then saved into a separate object: a list, another dictionary, or, my preference, a pandas dataframe.
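For example, assuming a made-up entry format like 2023-03-25 10:00:01 LOGIN user=alice src=10.0.0.5 (purely illustrative, not a real log format), a regex can pull the parameters out and a pandas dataframe can hold them:

import re
import pandas as pd

entries = [
    '2023-03-25 10:00:01 LOGIN user=alice src=10.0.0.5',
    '2023-03-25 10:05:42 LOGIN user=bob src=10.0.0.9',
]

rows = []
for entry in entries:
    #each pair of parentheses captures one parameter
    m = re.search(r'^(\S+ \S+) LOGIN user=(\S+) src=(\S+)$', entry)
    if m:
        rows.append({'Time': m.group(1), 'User': m.group(2), 'Source': m.group(3)})

df = pd.DataFrame(rows)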

The last step is creating the macro function and utilizing the extracted data. This step only compiles what we’ve done before and wraps it up nicely so that we can reuse and edit it easily. As for utilizing the extracted data, it is up to you how you want to use it, whether you generate a chart straight in the notebook or save it to a spreadsheet to do other data manipulation there.
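To make the idea concrete, here is a rough sketch of such a macro function; the name parse_log and its parameters are my own invention for illustration, not a standard API:

import re
import pandas as pd

def parse_log(path, pattern, columns):
    #read a text log, extract fields with a regex, return a dataframe
    rows = []
    with open(path, 'r', errors='ignore') as f:
        for line in f:
            m = re.search(pattern, line)
            if m:
                rows.append(m.groups())
    return pd.DataFrame(rows, columns=columns)

Changing the log source or the fields to extract then only means calling it with a different path, pattern, or column list.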

Now that we’ve established the steps, we can move on to the first example: parsing a Windows event log. A Windows event log, even though it uses an XML format internally, is the kind of log that cannot be read using Python’s basic open-file command. Instead, there is an evtx parser library for Python we can utilize.

import evtx
import re
from datetime import datetime
import pandas as pd

evtx_file = 'F:/W/Security.evtx' #replace with your evtx file address

parser = evtx.PyEvtxParser(evtx_file)
parse_json = list(parser.records_json())

#time range from d1 to d2
d1 = datetime(2023, 3, 25, 0, 0, 0, 0)
d2 = datetime(2023, 3, 30, 0, 0, 0, 0)

time = []
uname = []
for pj in parse_json:
    sys_time = pj['timestamp']
    date_time_obj = datetime.strptime(sys_time, '%Y-%m-%d %H:%M:%S.%f %Z')
    #keep successful logons (event ID 4624) inside the date range
    if ('"EventID": 4624' in pj['data']) and (date_time_obj >= d1) and (date_time_obj < d2):
        #extract the logged-on account name from the record's JSON string
        x = re.findall(r"TargetUserName\": \"(.*)\",\n", pj['data'])
        if len(x) > 0:
            time += [date_time_obj]
            uname += [x[0]]

pd.DataFrame(data={'Time': time, 'Account Name': uname}).sort_values(by=['Time'])

The code snippet above uses the evtx library to parse the Security.evtx log file. From there, the log is filtered by the date range and by whether the entry is a successful logon event, which uses event ID 4624. If an entry meets all the criteria, we save its timestamp and TargetUserName (parsed using regex), which is the logged-on user account, into a pandas dataframe.

Resulting dataframe of parsed evtx

The figure above shows the resulting dataframe for the parsed evtx file, sorted from the oldest timestamp.
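Since pj['data'] is itself a JSON string, an alternative to the regex is to load it with json.loads and read the fields from the resulting dictionary. This is a sketch, not part of the snippet above; the field layout follows typical evtx JSON output and may vary between records:

import json

record = json.loads(pj['data'])  #pj taken from the loop above
event = record['Event']
#field names assume the usual System/EventData layout of evtx JSON records
if event['System']['EventID'] == 4624:
    uname = event['EventData']['TargetUserName']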

Linux login logs such as utmp, wtmp, and btmp are another kind that can’t be read directly, but Python has parsers for these files too.

import utmp
import datetime

#time range from d1 to d2
d1 = datetime.datetime(2021, 9, 1, 0, 0, 0, 0)
d2 = datetime.datetime(2021, 9, 15, 0, 0, 0, 0)

rectime = []
rectype = []
rec = []

#Change path to wtmp file
with open('F:/wtmp', 'rb') as fd:
    buf = fd.read()
    for entry in utmp.read(buf):
        #keep entries inside the date range that have a username
        if entry.time >= d1 and entry.time < d2 and entry.user != '':
            rectime += [entry.time]
            rectype += [entry.type]
            rec += [entry]
            print('Username:', entry.user)
            print('Host\t:', entry.host)
            print('Time\t:', entry.time.strftime('%Y-%m-%d %H:%M:%S.%f'))
            print('\n------------------------------------------\n')

The code snippet above filters the log for the specified date range, then gets the username and host parameters.

Linux wtmp parsed

The figure above shows sample output for the Linux snippet. In this case, since there are only a few entries in the file, the information was printed directly instead of being saved into a dataframe, to demonstrate that you can present the data any way you want, not always putting it in a table.
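That said, if a table were preferred, the lists collected in the loop could be turned into a dataframe just like in the evtx example (a small sketch reusing the variables above):

import pandas as pd

pd.DataFrame(data={'Time': rectime, 'Type': rectype,
                   'User': [e.user for e in rec],
                   'Host': [e.host for e in rec]}).sort_values(by=['Time'])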

The last example I’ll be using is a Fortiweb log, to show how regex can be utilized.

import re
import pandas as pd

file_name = 'F:/Fortiweb/alog.txt'

f = open(file_name, 'r', errors='ignore')
r = f.read().split('\n')
f.close()

AlertDeny_Event = {}
Alert_Event = {}
Erase_Event = {}

All_Source = {}
Source_Country = {}
All_Destination = {}

for rs in r:
    x = re.findall(r"\",\"action=([^\"]*).*dst=(.*)\",\"dst_port.*signature_subclass=(.*),\"src=(.*)\",\"src_port",
                   rs)
    if len(x) > 0:
        #Source IP counter
        if '","srccountry=' in rs:
            y = rs.split('srccountry=')[1].split('",')[0].strip('"')

            if x[0][3] in All_Source.keys():
                All_Source[x[0][3]] += 1
                if Source_Country[x[0][3]] == "Unknown": Source_Country[x[0][3]] = y
            else:
                All_Source[x[0][3]] = 1
                Source_Country[x[0][3]] = y
        else:
            y = "Unknown"
            if x[0][3] in All_Source.keys(): All_Source[x[0][3]] += 1
            else:
                All_Source[x[0][3]] = 1
                Source_Country[x[0][3]] = y

        #Destination IP counter
        if x[0][1] in All_Destination.keys(): All_Destination[x[0][1]] += 1
        else: All_Destination[x[0][1]] = 1

        #Get subtype if signature N/A
        if x[0][2].strip('"') == 'N/A':
            subtype = re.findall(r"subtype=(.*),\"action=", rs)
            signature = "Sub-type - " + subtype[0].strip('"')
        else: signature = x[0][2].strip('"')

        #Counter based on event and action:

        #Alert_Deny event
        if x[0][0] == 'Alert_Deny':
            if signature in AlertDeny_Event.keys(): AlertDeny_Event[signature] += 1
            else: AlertDeny_Event[signature] = 1

        #Erase event
        if x[0][0] == 'Erase':
            if signature in Erase_Event.keys(): Erase_Event[signature] += 1
            else: Erase_Event[signature] = 1

        #Alert event
        if x[0][0] == 'Alert':
            if signature in Alert_Event.keys(): Alert_Event[signature] += 1
            else: Alert_Event[signature] = 1

df_alertdeny = pd.DataFrame(data={'Attack Event': list(AlertDeny_Event.keys()), 'Action': 'Alert Deny',
                                  'Event Count': list(AlertDeny_Event.values())}, index=None)
df_erased = pd.DataFrame(data={'Attack Event': list(Erase_Event.keys()), 'Action': 'Erased',
                               'Event Count': list(Erase_Event.values())}, index=None)
df_alert = pd.DataFrame(data={'Attack Event': list(Alert_Event.keys()), 'Action': 'Alert',
                              'Event Count': list(Alert_Event.values())}, index=None)

df_all = pd.concat([df_alertdeny, df_erased, df_alert])
df_all['Event Count'] = pd.to_numeric(df_all['Event Count'], downcast='integer')

df_source = pd.DataFrame(data={'Country': list(Source_Country.values()), 'Source IP': list(All_Source.keys()),
                               'IP Count': list(All_Source.values())}, index=None)

In the code snippet above, there are several items I extract using a single regex. Take a look at the pattern "\",\"action=([^\"]*).*dst=(.*)\",\"dst_port.*signature_subclass=(.*),\"src=(.*)\",\"src_port". Each pair of parentheses is a capture group, and re.findall saves the string matched inside each one. Accordingly, we are extracting the following items from each entry (see the demonstration after this list):

  1. action (action that the WAF took)
  2. dst (destination IP of the event)
  3. signature_subclass (the event signature)
  4. src (source IP of the event)
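To see the capture groups in action, here is the pattern run against a fabricated, shortened entry (real Fortiweb lines are longer; this one only keeps the matched fields):

import re

#fabricated Fortiweb-style fragment, for illustration only
rs = ('date=2023-03-25,"msg=demo","action=Alert_Deny","dst=192.0.2.10","dst_port=443",'
      '"signature_subclass=SQL Injection","src=198.51.100.7","src_port=51544"')

pattern = r"\",\"action=([^\"]*).*dst=(.*)\",\"dst_port.*signature_subclass=(.*),\"src=(.*)\",\"src_port"
x = re.findall(pattern, rs)
print(x)  #[('Alert_Deny', '192.0.2.10', 'SQL Injection"', '198.51.100.7')]

Note the trailing quote left in the third group; that is why the main snippet calls .strip('"') on the signature.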

From here, we create five dataframes. Three of them are lists of events based on the action taken by the WAF (Erased, Alert, or Alert Deny), which are then combined into a fourth dataframe. The fifth dataframe is used to find the origin of the events.

Event Log Count Based on Action

The figure above shows the events that occurred and their counts from the log, filtered based on the action taken by the WAF. Notice that the ‘Event Count’ column is not something we got directly from the log file; instead, we processed what we extracted from the log to generate new information.
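As a side note, the manual if/else bookkeeping used for these counts can be written more compactly with collections.Counter from the standard library (a sketch with hypothetical values, not the original code):

from collections import Counter

signatures = ['SQL Injection', 'XSS', 'SQL Injection']  #hypothetical extracted values

counts = Counter()
for sig in signatures:
    counts[sig] += 1  #missing keys start at 0, no if/else needed

print(counts)  #Counter({'SQL Injection': 2, 'XSS': 1})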

IP Source and Country Count

This last dataframe lists the source IPs the events came from, along with their countries of origin and counts. Using this information, if we have no business arrangement with a particular country, we may conclude that the IP needs to be looked into further and blocked if necessary. This last example also shows that each entry can be processed just once to generate multiple information tables.

Conclusions:

Jupyter Notebook is an alternative tool that can be used to parse a wide variety of logs in case a particular log reader is unavailable. There are several pointers for using Jupyter Notebook to parse logs:

  1. Know the log file formats
  2. Understand how the log entries are written
  3. Extract and utilize each entry parameters accordingly
  4. Create the macro and tidy up the codes so it can be reused and edited easily
