Log Analysis in DFIR Using Jupyter Notebook

Jupyter Notebook is an open-source web application that allows you to write documents containing computer code (e.g. Python) and rich text elements (paragraphs, equations, figures, links, etc.). Its uses include data cleaning and transformation, numerical simulation, statistical modelling, data visualization, machine learning, and much more. Simply put, it is a tool to visualize what you are doing with data and lets you write explanations alongside the visualization.

Hello world!! in a Jupyter Notebook cell

The figure above shows what a Jupyter Notebook file looks like. We write and run our lines of code inside what is called a cell. There are three types of cells: code, raw, and markdown. Each cell type produces a different output, as shown in the figure. A code cell produces output based on the code written in it, which in this example is a print command. A raw cell is left unmodified. A markdown cell is rendered as HTML and is usually used for explanatory text.

Jupyter Notebook is often used by data scientists to analyze and visualize statistical data. It turns out that this capability can also be used in digital forensics. Digital forensics deals with application logs most of the time. The common problem with these logs is that they usually contain collections of events spanning months or years, which makes them quite large for text files. There are also cases where a log cannot be read by a common text editor, for example reading a UNIX utmp file on a Windows system. In this article I'm using CyberDefenders' Hammered — Log Analysis challenge (https://cyberdefenders.org/labs/42) as an example of how we can use Jupyter Notebook to process logs.

A log is basically just a text file, so we can use simple text processing, provided the file isn't too huge. In this part, daemon.log and auth.log will be processed to answer some of the questions.

Question 13: The database displayed two warning messages; provide the most important and dangerous one. The warning messages are located inside the daemon.log file and usually start with the word "WARNING". So the first thing to do is to read the file and filter the lines that contain the word "WARNING".

Read and filter daemon.log warning

The figure above shows the code for reading the file and filtering the lines. The result is 13 lines with 2 unique warnings: "mysql.user contains 2 root accounts without password!" and "mysqlcheck has found corrupt tables". Since we are asked for the most dangerous one, the answer should be the warning about root accounts without passwords.

Notice that there's another function used between reading the file and filtering the lines. The deque function is used to get lines from the end of the file. This doesn't help much for a small file, since it probably contains only a few hundred lines, but some log files may contain millions of lines, and what we actually need is just a portion of the most recent events, which are located at the end of the file.
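Since the original notebook cell is only shown as a screenshot, here is a minimal sketch of the equivalent code; the file path and the maxlen value are assumptions and may differ from the original:

```python
from collections import deque

# Read daemon.log and keep only the most recent lines
# (deque with maxlen behaves like `tail` and keeps memory bounded).
with open("daemon.log", "r", errors="ignore") as f:
    recent = deque(f, maxlen=10000)

# Keep only the lines that contain the word "WARNING".
warning_lines = [line.strip() for line in recent if "WARNING" in line]

print(len(warning_lines))
for line in warning_lines:
    print(line)
```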

Question 8: Which IP address successfully logged into the system the most number of times? This information is contained in the auth.log file, in lines such as "Accepted password for user…".

Read and filter auth.log log in

The figure above shows the same steps that were done for the previous question, with a different file and filter. The next step is counting the successful logins for each unique IP address. To do that, we have to extract the IP address from each line, which is done using a regex.
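A sketch of the read-and-filter cell for auth.log might look like this (again, the path and the maxlen value are assumptions):

```python
from collections import deque

# Read auth.log from the end and keep only successful password logins.
with open("auth.log", "r", errors="ignore") as f:
    recent = deque(f, maxlen=100000)

accepted = [line.strip() for line in recent if "Accepted password for" in line]
print(len(accepted))
```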

IP extraction and counter

This figure shows how to extract a substring using a Python regex. The regular expression used to extract the IP addresses was '\d+[.]\d+[.]\d+[.]\d+'. Each extracted IP address is put into a list and the unique IPs are counted using the Counter class. The result shows that the greatest number of successful logins by a single IP was 4, by 219.150.161.20 and 188.131.23.37. Thus, the answer should be one of these IPs.
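A sketch of that extraction and counting step, assuming the `accepted` list from the previous cell:

```python
import re
from collections import Counter

ip_pattern = re.compile(r"\d+[.]\d+[.]\d+[.]\d+")

# Pull the first IP-looking substring out of every accepted-login line.
ips = []
for line in accepted:
    match = ip_pattern.search(line)
    if match:
        ips.append(match.group())

# Count successful logins per unique IP, most frequent first.
print(Counter(ips).most_common(5))
```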

Question 12: When was the last login from the attacker with IP 219.150.161.20? The answer to this question can be acquired using the same file and filter as in question 8. From the previous question we know that the IP 219.150.161.20 logged in 4 times. The first 5 lines of the filtered output already show the last time this IP successfully logged in, which was on April 19th at 05:56:05.
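For completeness, here is one way to narrow the filtered lines down to that single IP, reusing the `accepted` list from question 8; since those lines are still in file (chronological) order, the last match is the most recent login:

```python
# All successful logins for the attacker's IP; the final entry is the latest one.
attacker_logins = [line for line in accepted if "219.150.161.20" in line]
print(attacker_logins[-1])
```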

Question 9: How many requests were sent to the Apache server? This is a simple question. We just need to count how many lines there are in the access.log file, which is 365, as shown in the figure below.

Request count from apache www-access.log file
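Counting the lines is essentially a one-liner (the file name here is taken from the figure caption and may differ in your copy of the challenge archive):

```python
# Each line in the Apache access log is one request, so counting lines counts requests.
with open("www-access.log", "r", errors="ignore") as f:
    print(sum(1 for _ in f))
```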

Many log files can be processed with basic text processing, as has been done so far in this article. But most logs have their own format, and figuring out the format of each log can be beneficial when processing them. For example, Python has a library that can parse the parameters from the Apache access.log.

access.log parser and pandas environment setup

The figure above shows which libraries are used and how the environment is set up for processing the log later on. Here is a list of what these libraries are used for:

1. deque, to get the last lines of a file, similar to the Linux tail command.

2. apachelogs, to parse the access.log file.

3. pandas, for data manipulation and analysis.

4. warnings, to suppress some of the warnings.

Not all of these need to be used, such as warnings and deque. The warnings library is used to suppress the annoying warnings that sometimes appear as you run code and make the output a little untidy. The deque helps us fetch the most recent events from the log by reading the file from the end. You could also skip the other libraries and devise your own code, but it is much simpler to use what is provided. After importing the libraries, we set up the environment for processing the log. In this example I change one of the pandas settings and set up the parser to use the common Apache access.log format, which means we have to throw away the last 2 parameters from each line.
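A sketch of that setup cell, assuming the apachelogs package; the pandas option shown is only an illustration, since the exact setting changed in the original screenshot is not visible here:

```python
import warnings
from collections import deque

import pandas as pd
from apachelogs import LogParser, COMMON

# Silence noisy warnings so the notebook output stays tidy.
warnings.filterwarnings("ignore")

# Example display tweak (illustrative; the original changes one pandas option).
pd.set_option("display.max_colwidth", 120)

# Parser for the common log format: %h %l %u %t "%r" %>s %b
# (no referer / user-agent fields).
parser = LogParser(COMMON)
```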

Convert raw log lines to a pandas dataframe

The figure above shows the code to parse the parameters and put them into a pandas DataFrame so they can be shown in a tabular view. From this point, you can manipulate the DataFrame in any way you like to get the data you want, such as finding every IP that sent requests to the server or checking whether there were attempts to send malicious POST requests.
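A sketch of that conversion, reusing the parser set up above; because the log lines are in combined format while the parser expects common format, the trailing referer and user-agent fields are dropped first, as described earlier (the file name and the quote-splitting trick are assumptions):

```python
# Read the raw access log (file name from the challenge archive; adjust as needed).
with open("www-access.log", "r", errors="ignore") as f:
    raw_lines = [line.strip() for line in f if line.strip()]

rows = []
for line in raw_lines:
    # Drop the last two quoted fields (referer and user-agent) so the line
    # matches the common format. Assumes no escaped quotes inside those fields.
    common_part = line.rsplit('"', 4)[0].rstrip()
    entry = parser.parse(common_part)
    rows.append(
        {
            "remote_host": entry.remote_host,
            "time": entry.request_time,
            "request": entry.request_line,
            "status": entry.final_status,
            "bytes": entry.bytes_sent,
        }
    )

df = pd.DataFrame(rows)
df.head()
```

With the DataFrame in place, something like `df["remote_host"].value_counts()` lists which IPs sent the most requests, and `df[df["request"].str.contains("POST", na=False)]` surfaces POST requests worth a closer look.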

Conclusions:

Jupyter Notebook can certainly help process logs, visualize the data, and extract information from them. However, an understanding of the Python programming language is necessary to maximize its potential.

What is provided in this article only shows basic ways in which Jupyter Notebook can assist in digital forensics. There are many ways this can be developed to make the job easier, such as automatically detecting behavior (e.g. the scanning shown in this article) or using text similarity to find multiple IPs trying to perform the same malicious activity.
