Analyze MongoDB Logs Using PySpark

Himanshu Sharma
HackerNoon.com
5 min read · Jul 27, 2019


“Without big data, you are blind and deaf and in the middle of a freeway.” — Geoffrey Moore

These days, big data has become an essential part of the computing world. One important task is to read and analyze the huge amount of logs generated by the different databases, services, and products we use. Since MongoDB entered our ecosystem, it has become vital to understand the Mongo logs and extract useful information from them. Reading and researching how to evaluate these tens of millions of lines led me to PySpark.

In this article, I am assuming that PySpark is already installed (if not, I have added URLs at the end of the article that will help you install PySpark). MongoDB log messages are in the following format:
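```
<timestamp> <severity> <component> [context] <message>
```

Here severity is a single letter (for example I for informational or W for warning), component is a value such as COMMAND, WRITE, or NETWORK, and context is usually the connection name (for example conn24971548).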

You can read more details about MongoDB log messages at this link; it will give you a better idea of the messages and the regular expressions we are going to build.
First of all, load the MongoDB log file into the program:
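A minimal sketch of that step, assuming the log file is called mongod.log and sits in the working directory (adjust the path to your setup):

```python
from pyspark.sql import SparkSession

# Start a Spark session for the analysis
spark = SparkSession.builder.appName("mongodb-log-analysis").getOrCreate()

# Load the raw log file; each line becomes a row with a single 'value' column
base_df = spark.read.text("mongod.log")
```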

We will be using regular expressions to segregate the different sections of each message.
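The expressions below are one possible way to split a line into its sections; the pattern names and exact patterns are my own, so adjust them if your log lines differ:

```python
# Patterns for the individual sections of a MongoDB log line
timestamp_pattern = r'^(\S+)'                 # 2019-07-15T05:55:52.236+0000
severity_pattern  = r'^\S+\s+([A-Z])\s+'      # I, W, E, F or D
component_pattern = r'^\S+\s+[A-Z]\s+(\w+)'   # COMMAND, WRITE, NETWORK, ...
context_pattern   = r'\[(\S+)\]'              # conn24971548
message_pattern   = r'\[\S+\]\s+(.*)$'        # everything after the context
```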

After this, we will build a tabular data frame on which we can apply different filters and run queries.
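A sketch of that step, reusing the patterns defined above; regexp_extract pulls the captured group out of each raw line:

```python
from pyspark.sql.functions import regexp_extract

logs_df = base_df.select(
    regexp_extract('value', timestamp_pattern, 1).alias('timestamp'),
    regexp_extract('value', severity_pattern, 1).alias('severity'),
    regexp_extract('value', component_pattern, 1).alias('component'),
    regexp_extract('value', context_pattern, 1).alias('context'),
    regexp_extract('value', message_pattern, 1).alias('message')
)
print(logs_df)
```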

Following is the output of the above two statements:
DataFrame[timestamp: string, severity: string, component: string, context: string, message: string]

Output of logs_df.show(10, truncate=True)

Now that we have the complete file in a data frame, we can perform different operations, such as:

a. MongoDB logs every query that takes more than 100 ms to execute, so we can filter out the queries that take the longest. Following is the log of one such query:

|2019-07-15T05:55:52.236+0000|I |COMMAND |conn24971548 | command <dbname>.<collection_name> command: find { find: "<collection_name>", filter: { requestId: "0286aedf-e042-4961-8ffc-578847fabf15" }, projection: { _id: 0 }, limit: 1, singleBatch: true, lsid: { id: UUID("70382cde-19a2e2f74fbfd") }, $clusterTime: { clusterTime: Timestamp(1563170121, 8), signature: { keyId: 665826205825, hash: BinData(0, 095B915FDE2D3597D708E54E) } }, $db: <db_name>, $readPreference: { mode: "primaryPreferred" } } planSummary: COLLSCAN keysExamined:0 docsExamined:175254 cursorExhausted:1 numYields:1397 nreturned:1 reslen:2271 locks:{ Global: { acquireCount: { r: 1398 } }, Database: { acquireCount: { r: 1398 } }, Collection: { acquireCount: { r: 1398 } } } storage:{ data: { bytesRead: 516242193, timeReadingMicros: 2598279 } } protocol:op_msg 3666ms

As you can see in the log, the execution time is logged at the end of the statement, so we can use it to filter out such queries. We can use the following code to keep only the queries that take longer than 3000 ms (change the threshold according to your requirement). Also, the 'component' in these types of log lines is either "COMMAND" or "WRITE", so we need to filter on these two components as well.
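One way to write such a filter (the duration_ms column name and the 3000 ms threshold are my own choices; tweak them as needed):

```python
from pyspark.sql.functions import regexp_extract, col

slow_queries_df = (
    logs_df
    # keep only the components that carry query timings
    .filter(col('component').isin('COMMAND', 'WRITE'))
    # the execution time is logged at the very end of the message, e.g. "3666ms"
    .withColumn('duration_ms',
                regexp_extract('message', r'(\d+)ms$', 1).cast('integer'))
    # keep only the queries that took longer than 3000 ms
    .filter(col('duration_ms') > 3000)
)

slow_queries_df.show(10, truncate=False)
```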

Showing 10 rows of the filtered queries. Pass truncate=False to show() to see the complete queries.

Following is the complete program:
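The sketch below puts the pieces from the earlier snippets together; the file path, pattern names, and the 3000 ms threshold are assumptions to adapt to your own environment:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract, col

spark = SparkSession.builder.appName("mongodb-log-analysis").getOrCreate()

# 1. Load the raw MongoDB log file (adjust the path to your environment)
base_df = spark.read.text("mongod.log")

# 2. Regular expressions for the sections of a log line
timestamp_pattern = r'^(\S+)'
severity_pattern  = r'^\S+\s+([A-Z])\s+'
component_pattern = r'^\S+\s+[A-Z]\s+(\w+)'
context_pattern   = r'\[(\S+)\]'
message_pattern   = r'\[\S+\]\s+(.*)$'

# 3. Build a tabular data frame out of the raw lines
logs_df = base_df.select(
    regexp_extract('value', timestamp_pattern, 1).alias('timestamp'),
    regexp_extract('value', severity_pattern, 1).alias('severity'),
    regexp_extract('value', component_pattern, 1).alias('component'),
    regexp_extract('value', context_pattern, 1).alias('context'),
    regexp_extract('value', message_pattern, 1).alias('message')
)
logs_df.show(10, truncate=True)

# 4. Keep only COMMAND/WRITE entries that took longer than 3000 ms
slow_queries_df = (
    logs_df
    .filter(col('component').isin('COMMAND', 'WRITE'))
    .withColumn('duration_ms',
                regexp_extract('message', r'(\d+)ms$', 1).cast('integer'))
    .filter(col('duration_ms') > 3000)
)
slow_queries_df.show(10, truncate=False)
```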

Since we have the complete data frame, we can filter the logs according to our requirements. Do let me know if you have any questions about filtering the logs or run into any other issues. Also, comment if I can improve this or if there is a problem in the program.
