Parsing MongoDB Logs

I recently had the opportunity to analyze some MongoDB logs, hunting for suspicious activity. I wanted to share some of the trials and tribulations of the analysis, as well as some tools I created during the process. I’ve talked about MongoDB logs in a previous post; however, there are a few items to note that sometimes make analysis difficult (or scripting more fun!). Note that all of this analysis was done on 2.x MongoDB logs.

Let’s look at a quick sample:

2016-01-09T19:06:29.731+0000 [initandlisten] connection accepted from #1000 (13 connections now open) 
2016-01-09T19:06:35.770+0000 [conn1000] end connection (12 connections now open)

When attempting to parse/normalize these logs, a few things jumped out to me:

  • When initiated, connections are assigned a connection number.
  • IP addresses and the corresponding port numbers are logged.
  • Activity from active sessions is preceded by a timestamp and the relevant connection number.
  • Initiated and ending log entries are fairly static and straightforward; these are easy to wrap a parser around.
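Since the accepted/end entries are so static, a couple of regular expressions cover them. Below is a minimal sketch assuming the 2.x line formats from the sample above; the pattern names are my own, and the IP:port portion is made optional since it may not appear in every entry:

```python
import re

# Hedged sketch: patterns for the 2.x "connection accepted" and
# "end connection" lines shown in the sample. The IP:port group is
# optional, as some entries may omit it.
ACCEPT_RE = re.compile(
    r"^(?P<ts>\S+) \[initandlisten\] connection accepted from "
    r"(?:(?P<ip>[\d.]+):(?P<port>\d+) )?#(?P<conn>\d+)"
)
END_RE = re.compile(
    r"^(?P<ts>\S+) \[conn(?P<conn>\d+)\] end connection"
)

line = ("2016-01-09T19:06:29.731+0000 [initandlisten] "
        "connection accepted from #1000 (13 connections now open)")
m = ACCEPT_RE.match(line)
print(m.group("conn"))  # 1000
```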

So essentially, there are a bunch of data points that can be used to build timeframes of connectivity/activity, as well as checksums. A couple of problems I identified:

  • If logs rotate (the server/process is restarted, logs are forced to roll, or whatever the issue may be), connection numbers may reset. So cat-ing all the logs together may cause duplication issues.
  • Capturing and profiling activity during each session is not as straightforward as other log entries. We’ll come back to these.
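One hedged way around the first problem is to key each session on the log file it came from plus the connection number, rather than the number alone. The helper below is purely illustrative (not part of the actual scripts):

```python
# Hedged sketch: tag each session with its source file so that connection
# numbers reused after a log rotation or restart don't collide.
def keyed_sessions(parsed_files):
    """parsed_files: iterable of (filename, rows), where each row starts
    with a connection id. Returns {(filename, conn_id): row}."""
    sessions = {}
    for fname, rows in parsed_files:
        for row in rows:
            sessions[(fname, row[0])] = row
    return sessions

data = [
    ("mongod.log.1", [("1000", "2016-01-08T10:00:00", "2016-01-08T10:05:00")]),
    ("mongod.log.2", [("1000", "2016-01-09T19:06:29", "2016-01-09T19:06:35")]),
]
print(len(keyed_sessions(data)))  # 2 distinct sessions, despite the same id
```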

To support these scripts and future MongoDB development, I put a new subfolder in my tools GitHub repo, just for scripts like this. You can find both scripts below at that link.

I’d also like to note that there are some fantastic MongoDB log tools over at this repository as well.

Profiling Connection Timeframes

I wanted to be able to analyze timeframes of each connection to look for anything suspicious, but as you can tell the logs overlap like crazy.

usage: [-h] [-o FILE] [-l LOG] 
Parse connection timeframes from MongoDB 2.x logs
optional arguments: 
-h, --help show this help message and exit
-l LOG, --log LOG log to parse
-o FILE, --out FILE File to output to [default is <log_name>_timeframes.csv]

As you can guess, the script will parse through a MongoDB log file and concatenate session information. Our sample information from above would instead look like this:

Connection ID,Start Time,End Time,IP Address,Port 

Much cleaner, and easier for analysis. I’ll be updating the script shortly to handle timezones better (I was lucky to have UTC), additional output options, and connection length calculations.
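The core pairing logic can be sketched roughly like this; this is a minimal sketch assuming the 2.x line formats shown earlier, and the function and pattern names are my own rather than the actual script's:

```python
import re

# Hedged sketch: pair "connection accepted" entries with their matching
# "end connection" entries to build per-connection timeframes.
ACCEPT_RE = re.compile(r"^(?P<ts>\S+) \[initandlisten\] connection accepted from "
                       r"(?:(?P<ip>[\d.]+):(?P<port>\d+) )?#(?P<conn>\d+)")
END_RE = re.compile(r"^(?P<ts>\S+) \[conn(?P<conn>\d+)\] end connection")

def timeframes(lines):
    """Return (conn, start, end, ip, port) rows from raw log lines."""
    open_conns = {}   # conn id -> (start ts, ip, port)
    rows = []
    for line in lines:
        if (m := ACCEPT_RE.match(line)):
            open_conns[m["conn"]] = (m["ts"], m["ip"] or "", m["port"] or "")
        elif (m := END_RE.match(line)):
            start, ip, port = open_conns.pop(m["conn"], ("", "", ""))
            rows.append((m["conn"], start, m["ts"], ip, port))
    return rows

log = [
    "2016-01-09T19:06:29.731+0000 [initandlisten] connection accepted from #1000 (13 connections now open)",
    "2016-01-09T19:06:35.770+0000 [conn1000] end connection (12 connections now open)",
]
for row in timeframes(log):
    print(",".join(row))
```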

Link to script

Profiling Connection Activity

Along with the timeframes, I also wanted to profile what may be going on during a session. This script was a bit trickier to write (and is still a work in progress) as fields may change drastically. Here’s a sample:

2016-01-09T19:06:32.731+0000 [conn1000] query collection1.person query: { $query: { age: 20 }, $orderby: { date: 1 } } planSummary: COLLSCAN cursorid:287245572650 ntoreturn:0 ntoskip:0 nscanned:15000 nscannedObjects:15000 scanAndOrder:1 keyUpdates:0 numYields:1 r:178121 nreturned:101 reslen:45000 133ms

Above, we have a query issued by my session, connection ID 1000. See above for more details. In seeking to answer “what happened during the session”, a few fields jump out to me:

  • query — this is the “action” taking place. This can also be other commands such as delete, update, etc.
  • collection1.person — this is the collection the query was accessing. If I have a malicious session, it may be useful to know what was being accessed.
  • nreturned:101 — nreturned corresponds to the number of documents returned by the query. While this number may be changed based on pagination and DB defaults, I can get an idea of how many records the session may have had access to. I’ll flip it another way; if I have an attacker that accessed only 101 records out of 100,000, my job just got a lot easier.
  • reslen:45000 — reslen corresponds to the number of bytes returned by the query, which gives me an idea of how much data the session was viewing. This may not be as useful for a simple query; however, once we get into fs.chunks (GridFS, i.e., raw files stored in MongoDB), byte count may be very important.

As previously stated, I wanted to begin interpreting what was going on during these sessions. I started with the query fields, and will be expanding to include others (such as delete, update, getmore, etc.).

usage: [-h] [-o FILE] [-l LOG]
Parse session statistics from MongoDB 2.x logs
optional arguments:
-h, --help show this help message and exit
-l LOG, --log LOG Log to Parse
-o FILE, --out FILE File to output to [default is <log_name>.session_stats.csv]

The output is fairly rudimentary currently:

Connection ID,Collections Accessed 
1000,collection1.person
1107,"collection1.person,collection2.address"

I am working on cleaning up the bytes and documents returned output, which of course makes the file even larger (one row per connection + collection combination). It may be worth adding a detail flag, so that anyone who wants all of that can get it. I’ll keep working on it, and will have these updates posted shortly to GitHub.
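The collapse from per-query records into one row per connection can be sketched like this; the function name and input shape are my own choices, assuming (connection id, namespace) pairs have already been parsed out of the query lines:

```python
from collections import defaultdict

# Hedged sketch: collapse per-query records into one CSV row per
# connection, matching the "Connection ID,Collections Accessed" output.
def session_rows(events):
    """events: iterable of (conn_id, namespace) tuples."""
    by_conn = defaultdict(list)
    for conn, ns in events:
        if ns not in by_conn[conn]:
            by_conn[conn].append(ns)   # preserve first-seen order, dedupe
    rows = []
    for conn in sorted(by_conn, key=int):
        cols = ",".join(by_conn[conn])
        # Quote the field only when it holds multiple collections,
        # as in the sample output above.
        rows.append(f'{conn},"{cols}"' if len(by_conn[conn]) > 1 else f"{conn},{cols}")
    return rows

events = [("1000", "collection1.person"),
          ("1107", "collection1.person"),
          ("1107", "collection2.address")]
print("\n".join(session_rows(events)))
```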

Link to script


That’s a wrap for this post, although I will continue to improve the parsing scripts. I’ll also be sure to introduce new scripts, as my opportunity to analyze these logs continues. At the moment I’m keeping the scripts separate for ease of troubleshooting, but will probably develop a more robust DFIR tool as time is freed up.

As always, I welcome any feedback/comments/concerns. Thanks!

Originally published on January 29, 2016.
