Statistical Analysis of Windows EventLogs with pandas

Dmitrijs Trizna
Riga Data Science Club
May 15, 2020

Introduction

Recently, we performed analysis on a high volume of Windows Event Logs in XML format. The provided file was truncated, i.e. it had an incorrect XML structure. The standard Python xml library does not provide any options to deal with such a tagging problem. Moreover, XML is difficult to query from interfaces like iPython notebooks for EDA (exploratory data analysis) purposes.

This article describes both the transformation of this data into the pandas DataFrame format and a few of the most interesting event analysis examples.

TL;DR

Helper functions for Windows Event Log transformation from XML to pd.DataFrame are here.

Exploratory Data Analysis (EDA) iPython notebook with code samples on pandas query syntax.

Dealing with XML

XML is one of the default ways Microsoft allows you to save your logs, either from within Event Viewer or by chaining Get-WinEvent with Export-CliXml in PowerShell.

If you can, work with CSV: it is more widely supported by various data analysis tools. Sometimes, however, you cannot affect this choice, as in my scenario.

By utilizing the lxml library, it is possible to recover the correct XML tagging:

from lxml import etree

def read_xml(FILENAME):
    # recover=True lets lxml repair the broken/truncated tagging
    parser = etree.XMLParser(recover=True)
    with open(FILENAME) as file:
        data = file.readlines()
    # ignore the XML documentation tag (1st line), so take only data[1:]
    # and join the lines back into a single string for the parser
    raw = etree.fromstring(''.join(data[1:]), parser=parser)
    return raw

Still, the resulting XML Element class is really inconvenient for data queries.

On the other hand, multivariate data such as Windows Event Logs fits the pandas DataFrame structure ideally, although pandas by default does not let you read an XML file, especially a corrupted one.

Therefore, the following helper function was created to transform the data from XML into a DataFrame. The result appears to be universal for Windows Event Log parsing, so anyone interested should be able to use it:

import pandas as pd
from lxml import etree

def events_to_df(eventlist):
    rows = []
    # you may need to tune this tag according to your XML format
    tag = '{http://schemas.microsoft.com/win/2004/08/events/event}'
    for idx, event in enumerate(eventlist):
        edict = {}
        for element in event.iterdescendants():
            # skip container elements that carry no useful values of their own
            if any(x in element.tag for x in ['Provider',
                                              'System',
                                              'Correlation']):
                pass
            elif any(x in element.tag for x in ['TimeCreated',
                                                'Execution',
                                                'Security']):
                # these elements store their values in attributes
                for key, value in element.items():
                    edict[key] = value
            elif 'Data' in element.tag:
                # EventData fields: the attribute holds the field name,
                # the element text holds the value
                for key, value in element.items():
                    edict[value] = element.text
            else:
                edict[element.tag.replace(tag, '')] = element.text

        # add the raw text of the event to always be able
        # to access the full value of the event log record
        edict['raw'] = etree.tostring(event, pretty_print=True).decode()

        rows.append(pd.DataFrame(edict, index=[idx]))

    return pd.concat(rows, sort=True)

To combine the two functions, just collect all events by iterating over the XML Element object (the return value of read_xml) and feed this list into the events_to_df function:

# get all events in a list
events = []
# we see a namespace prefix on every tag, so predefine it;
# you may need to tune this according to your XML format
tag = '{http://schemas.microsoft.com/win/2004/08/events/event}'
for element in raw.iter(tag + 'Event'):
    events.append(element)

df = events_to_df(events)

As a result, the analytical capabilities when working with this corpus of data are incomparable to those you could achieve when working with raw XML in Python.

As an example for pandas newcomers, this is how you can filter events by a specific parameter/value pair:

df[df.PARAMETER == VALUE]

Here we see that only 4 events fulfill our query parameters, with the ability to easily access any sub-parameter of the filtered events.
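
As a hypothetical concrete example (the column names come from events_to_df above; the values are purely illustrative), a filter on Sysmon network connection events could look like this:

# Sysmon EventID 3 (network connection) events to an illustrative port
subset = df[(df.EventID == '3') & (df.DestinationPort == '8080')]
print(len(subset))
print(subset[['UtcTime', 'Image', 'DestinationIp']])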

Finding a C&C channel via statistical analysis

Here comes an analytical example for proof-of-concept purposes.

Below we perform statistical analysis of 'powershell' network connections. First of all, we filter the events that were initiated by PowerShell:

psdf = df[df.Image.str.contains('powershell', na=False)]

Then we take a look at the IP addresses and services PowerShell connected to, and at the request distribution.
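
A minimal sketch of such a query, using the column names produced by events_to_df above:

# distribution of destination IP/port pairs for PowerShell connections
print(psdf.groupby(['DestinationIp', 'DestinationPort']).size())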

Only one IP and TCP port appears as the target: 192.168.124.135:8080. Let's take a look at the statistical parameters of this communication:

Here we (see the sketch after this list):

  • filter the specific DestinationIp we are interested in
  • take the connection request times from the UtcTime value
  • sort them from earliest to latest and take .diff() to find the difference between every consecutive request
  • call .describe() to return statistical information on these values
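
A minimal sketch of that query (column names as produced by events_to_df; the timestamps come from Sysmon's UtcTime field):

# time deltas between consecutive connections to the suspicious host
c2 = psdf[psdf.DestinationIp == '192.168.124.135']
deltas = pd.to_datetime(c2.UtcTime).sort_values().diff()
print(deltas.describe())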

From the .describe() output we can conclude the following:

  • the maximal delay between connections is 5 s 29 ms
  • only a few connections strongly deviate from this norm; 75%+ of requests happen with an inter-request delay between 5 s 1 ms and 5 s 29 ms
  • the standard deviation from this trend is minimal, only 1.775 seconds
  • much of this deviation comes from the minimal delay sample being only 2 ms long; the prevailing part of the traffic varies by only ~30 ms

From this we can conclude that it is definitely an automated process. By correlating this information with Sysmon's EventID 1 (process creation) and, consequently, the process tree, we identify the malware's Command and Control (C&C) channel with a delay of 5 seconds and a jitter of 10%.
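
A hedged sketch of that correlation, assuming Sysmon's ProcessGuid, ParentImage and CommandLine fields were parsed into their own columns by events_to_df:

# link network events back to the process creation (EventID 1) records
# sharing the same Sysmon ProcessGuid
net = psdf[psdf.DestinationIp == '192.168.124.135']
proc = df[df.EventID == '1']
origin = proc[proc.ProcessGuid.isin(net.ProcessGuid.unique())]
print(origin[['ParentImage', 'CommandLine']].drop_duplicates())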

Communication visualisation

If the arguments above didn't convince you or don't make much sense yet, you can always visualise the data to grasp it in a more intuitive way.

Here is a code sample where we display the communication described above:

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))
sns.countplot(x='UtcTime',
              data=psdf[psdf.DestinationPort == '8080'],
              palette=sns.color_palette("Blues"))
plt.show()

The results leave no doubt that the requests are made by an automated process and allow us to conclude that the observations fit the C&C channel hypothesis.

A few more data analysis examples

I don't want to overload this article, so here are only a few additional, easy-to-grasp techniques.

EventID distribution

With just one method we can see the full list of unique EventIDs appearing in the dataset and their counts across all events.
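
A one-line sketch, assuming the EventID column produced by events_to_df above:

# count of every unique EventID across the whole dataset
print(df.EventID.value_counts())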

Here we see more than 3k events with EventID 7, but only one event each with EventID 20 and 12; these may be interesting to examine as anomalous.
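
To pull those rare events out for a closer look (a sketch; EventID is stored as a string, as in the queries above):

# isolate the rare, potentially anomalous events for manual review
rare = df[df.EventID.isin(['12', '20'])]
print(rare)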

Unique CommandLines executed on host

You may want to take a quick look at all unique commands executed:

for cmd in df[df.EventID == '1'].CommandLine.unique():
    print(cmd)

This aggregates all repeating, "service-like" CommandLines and provides a much more easily observable list of what happened on the system.

Access raw XML value

The transformation function events_to_df described above explicitly adds the raw content of each Event in XML format, so we can still access it at any moment.
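
A minimal sketch (the filter itself is just an illustration):

# print the full XML of the first event matching a filter of interest
print(df[df.EventID == '20'].iloc[0].raw)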

Epilogue

These are only a few examples that hopefully show the power of (ab)using pandas methods for Event Log analysis. You can find more in eda.ipynb.

Additionally, this type of work can easily be expanded to feed Event Logs into Machine Learning or Neural Network algorithms (stay tuned for data preprocessing in a separate article).

Overall, there are many valuable techniques that may lead to powerful heuristics, depending on your Event Log content.

Happy hunting!

