Spark: Web Server Logs Analysis with Scala

achraf el gdaouni
Nov 24 · 11 min read
Web Server Log Analysis

Hello, and welcome to the Spark cookbook series. In the previous article we built an ETL (Extract, Transform and Load) example: we loaded data from a CSV file, cleaned it, applied some transformations, and finally loaded it into a PostgreSQL database and exported it as JSON files as well.

In this article we will show you how simple it is to parse web server log files and perform some analysis on them.

In general a server log is a log file (or several files) automatically created and maintained by a server consisting of a list of activities it performed. A web server log for example maintains a history of page requests. The W3C maintains a standard format (the Common Log Format) for web server log files.

For this lab, we will use a dataset from the Kaggle website that contains one year of HTTP requests.

The Common Logfile Format :

The common log file format is as follows:

remotehost rfc931 authuser [date] "request" status bytes
  • remotehost : Remote hostname (or IP address if the DNS hostname is not available, or if DNSLookup is off).
  • rfc931 : The remote logname of the user.
  • authuser : The username with which the user has authenticated.
  • [date] : Date and time of the request.
  • “request” : The request line exactly as it came from the client.
  • status : The HTTP status code returned to the client.
  • bytes : The content-length of the document transferred.

Let’s begin :)


Prerequisites :

This example was made using IntelliJ IDEA; in order to use Spark, make sure to add its libraries to the build.sbt file as follows :

name := "SparkLabs"

version := "0.1"

scalaVersion := "2.11.0"

// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.3"

// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.3"

// https://mvnrepository.com/artifact/org.postgresql/postgresql
libraryDependencies += "org.postgresql" % "postgresql" % "42.2.8"

I ) Loading the log file :

Load web server text file with Scala using Spark
Result : 
root
|-- value: string (nullable = true)
+---------------------------------------------------------------+
|value |
+---------------------------------------------------------------+
|IP,Time,URL,Staus |
|10.128.2.1,[29/Nov/2017:06:58:55,GET /login.php HTTP/1.1,200 |
|10.128.2.1,[29/Nov/2017:06:59:02,POST /process.php HTTP/1.1,302|
+---------------------------------------------------------------+
only showing top 3 rows

Since our data is unstructured, we used the text function to read the file, which produces a DataFrame with a single column named “value”.

Next, we have to parse the data into individual columns. For this we will use the special function regexp_extract(), which takes as parameters the column to parse, the regular expression, and the index of the group to capture, as follows :

def regexp_extract(e: Column, exp: String, groupIdx: Int): Column
Extract a specific group matched by a Java regex, from the specified string column. If the regex did not match, or the specified group did not match, an empty string is returned.
Since 1.5.0

If you are not familiar with regular expressions, I suggest brushing up on them before continuing.

You probably already noticed that our log file contains only four fields :

  • remote host : 10.128.2.1
  • date : [29/Nov/2017:06:58:55
  • request : GET /login.php HTTP/1.1
  • status : 200

II) Parsing the log file :

Parsing web server log file using the regexp_extract function with Spark and Scala

We have now split our single “value” column into four individual columns: host, timestamp, path and status. The regular expressions use the triple-quote syntax so we can write special regex characters (\d, \s, …) without escaping backslashes. ^([^(\s|,)]+) starts matching from the beginning of the line with ‘^’, then captures every character that is not a whitespace or a comma with ‘[^(\s|,)]+’; \s matches any whitespace character, and placing ‘^’ first inside the square brackets negates the character class.
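The same capture groups can be exercised outside Spark with plain Scala regexes. Below is a minimal sketch: the extract helper is a hypothetical stand-in for regexp_extract (returning the captured group, or an empty string on no match), applied to the sample line from the output above.

```scala
// Hypothetical helper mirroring Spark's regexp_extract: returns the
// captured group, or "" when the pattern does not match.
def extract(pattern: String, s: String, group: Int = 1): String =
  pattern.r.findFirstMatchIn(s).map(_.group(group)).getOrElse("")

val line = "10.128.2.1,[29/Nov/2017:06:58:55,GET /login.php HTTP/1.1,200"

println(extract("""^([^(\s|,)]+)""", line))                             // host: 10.128.2.1
println(extract("""^.*\[(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})""", line)) // timestamp
println(extract("""^.*\w+\s+([^\s]+)\s+HTTP.*""", line))                // path: /login.php
println(extract("""^.*,([^\s]+)$""", line))                             // status: 200
```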

The result of our parsing is as follow :

+----------+--------------------+------------------------+------+
|host |timestamp |path |status|
+----------+--------------------+------------------------+------+
|IP | | |null |
|10.128.2.1|29/Nov/2017:06:58:55|/login.php |200 |
|10.128.2.1|29/Nov/2017:06:59:02|/process.php |302 |
|10.128.2.1|29/Nov/2017:06:59:03|/home.php |200 |
|10.131.2.1|29/Nov/2017:06:59:04|/js/vendor/moment.min.js|200 |
+----------+--------------------+------------------------+------+
only showing top 5 rows
root
|-- host: string (nullable = true)
|-- timestamp: string (nullable = true)
|-- path: string (nullable = true)
|-- status: integer (nullable = true)

III) Data cleaning :

i) Null check :

Data cleaning

Here we checked whether our initial dataset contains any null values, and then we checked every column of our parsed dataset. We found that the parsed dataset contains 219 bad rows.

To count the number of null values in each column, we created a function count_null that takes a column as a parameter, converts null values to 1, and then sums them up to get the total number of nulls.
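The idea behind count_null can be sketched without Spark on a plain Scala collection, where None plays the role of a SQL null (the sample values below are hypothetical):

```scala
// Hypothetical in-memory stand-in for a DataFrame column; None is null.
val status: Seq[Option[Int]] = Seq(Some(200), None, Some(302), None)

// Convert "is null" to 1/0 and sum them up -- the same trick count_null
// uses with isNull.cast("int") and sum().
val nullCount = status.map(v => if (v.isEmpty) 1 else 0).sum
println(nullCount) // 2
```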

Result :

Number of bad row in the initial dataset : 0
Number of bad rows : 219
+----+---------+----+------+
|host|timestamp|path|status|
+----+---------+----+------+
| 0| 0| 0| 219|
+----+---------+----+------+
Number of bad rows : 219
+--------------------+
| bad_status|
+--------------------+
| IP,Time,URL,Staus|
|chmod:,cannot,'a....|
|chmod:,cannot,'er...|
|rm:,cannot,'*.o':,No|
|rm:,cannot,'a.out...|
+--------------------+

ii) Fix rows with null status :

drop rows with null value using na package

Here we chose to drop all rows containing a null value.

Result :

The count of null value : 0
Before : 16008 | After : 15789

iii) Parsing the timestamp :

Parsing string to timestamp
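The helper used in the appendix rebuilds the CLF date into the yyyy-MM-dd HH:mm:ss shape that to_timestamp expects; stripped of Spark, it is plain Scala:

```scala
val month_map = Map(
  "Jan" -> 1, "Feb" -> 2, "Mar" -> 3, "Apr" -> 4, "May" -> 5, "Jun" -> 6,
  "Jul" -> 7, "Aug" -> 8, "Sep" -> 9, "Oct" -> 10, "Nov" -> 11, "Dec" -> 12)

// Reorders the fixed-width CLF fields:
// "29/Nov/2017:06:58:55" -> "2017-11-29 06:58:55"
def parse_clf_time(s: String): String =
  "%3$s-%2$s-%1$s %4$s:%5$s:%6$s".format(
    s.substring(0, 2), month_map(s.substring(3, 6)), s.substring(7, 11),
    s.substring(12, 14), s.substring(15, 17), s.substring(18))

println(parse_clf_time("29/Nov/2017:06:58:55")) // 2017-11-29 06:58:55
```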

Result :

+--------------------+
|to_date(`timestamp`)|
+--------------------+
| null|
| null|
+--------------------+
only showing top 2 rows
root
|-- host: string (nullable = true)
|-- path: string (nullable = true)
|-- status: integer (nullable = true)
|-- time: timestamp (nullable = true)
+----------+------------+------+-------------------+
| host| path|status| time|
+----------+------------+------+-------------------+
|10.128.2.1| /login.php| 200|2017-11-29 06:58:55|
|10.128.2.1|/process.php| 302|2017-11-29 06:59:02|
+----------+------------+------+-------------------+
only showing top 2 rows

IV) Analysis walk-through :

i) Distribution of our dataset :

data distribution

Result :

status column statistics :
+-------+------------------+
|summary| status|
+-------+------------------+
| count| 15789|
| mean|230.19469250744189|
| stddev| 50.05853522906924|
| min| 200|
| max| 404|
+-------+------------------+
HTTP status analysis :
+------+-----+
|status|count|
+------+-----+
| 200|11330|
| 206| 52|
| 302| 3498|
| 304| 658|
| 404| 251|
+------+-----+
Frequent Hosts :
+----------+-----+
| host|count|
+----------+-----+
|10.131.2.1| 1626|
|10.128.2.1| 4257|
|10.130.2.1| 4056|
|10.131.0.1| 4198|
|10.129.2.1| 1652|
+----------+-----+

Visualizing Paths :
+--------------------+-----+
| path|count|
+--------------------+-----+
| /login.php| 3298|
| /home.php| 2653|
|/js/vendor/modern...| 1417|
| /| 862|
|/contestproblem.p...| 467|
| /css/normalize.css| 408|
|/css/bootstrap.mi...| 404|
|/css/font-awesome...| 399|
| /css/style.css| 395|
| /css/main.css| 394|
|/js/vendor/jquery...| 387|
|/bootstrap-3.3.7/...| 382|
| /process.php| 317|
| /contest.php| 249|
| /archive.php| 246|
|/fonts/fontawesom...| 245|
| /robots.txt| 224|
| /img/ruet.png| 213|
|/bootstrap-3.3.7/...| 191|
|/js/vendor/moment...| 173|
+--------------------+-----+
only showing top 20 rows
Top Paths :
+--------------------+-----+
| path|count|
+--------------------+-----+
| /login.php| 3298|
| /home.php| 2653|
|/js/vendor/modern...| 1417|
| /| 862|
|/contestproblem.p...| 467|
| /css/normalize.css| 408|
|/css/bootstrap.mi...| 404|
|/css/font-awesome...| 399|
| /css/style.css| 395|
| /css/main.css| 394|
+--------------------+-----+
only showing top 10 rows

ii) Analyzing Web Server log file :

data exploration

Result :

Top Ten Error Paths :
+--------------------+-----+
| path|count|
+--------------------+-----+
| /home.php| 2167|
| /| 741|
| /process.php| 317|
| /robots.txt| 224|
| /action.php| 83|
|/contestproblem.p...| 74|
|/js/vendor/jquery...| 73|
|/js/vendor/modern...| 72|
|/css/bootstrap.mi...| 72|
| /css/main.css| 68|
+--------------------+-----+
only showing top 10 rows
Unique hosts : 5
Number of Unique Daily Hosts :
+---+----+-----+
|day|year|count|
+---+----+-----+
|311|2017| 1|
|312|2017| 5|
|313|2017| 5|
|314|2017| 5|
|315|2017| 5|
+---+----+-----+
only showing top 5 rows
Average Number of Daily Requests per Host :
+---+----+------------------------+
|day|year|avg_req_per_host_per_day|
+---+----+------------------------+
|335|2017| 93.6|
|327|2017| 76.0|
| 60|2018| 10.333333333333334|
|350|2017| 51.666666666666664|
| 46|2018| 6.666666666666667|
+---+----+------------------------+
only showing top 5 rows
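The day/year pair driving these groupings comes from Spark's dayofyear and year functions; the same values can be derived with java.time from the standard library (no Spark needed), here on the first timestamp of the dataset:

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Same shape as the "time" column produced earlier.
val t = LocalDateTime.parse("2017-11-29 06:58:55",
  DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"))

// Day-of-year and year, as used for the daily grouping keys.
println((t.getDayOfYear, t.getYear)) // (333,2017)
```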

iii) Exploring 404 status codes :

Let’s drill down and explore the 404 status records. We’ve all seen those “404 Not Found” web pages; 404 errors are returned when the server cannot find the resource (page or object) that the browser client requested.
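The filter behind this section (logs_df.where($"status" === 404) in the appendix) boils down to the following on a plain collection of (path, status) pairs; the sample records here are hypothetical:

```scala
// Hypothetical sample of (path, status) records.
val records = Seq(
  ("/robots.txt", 404), ("/login.php", 200),
  ("/robots.txt", 404), ("/favicon.ico", 404))

// Keep the 404s, then count per path -- the collection analogue of
// where($"status" === 404).groupBy("path").count()
val notFound = records.filter { case (_, status) => status == 404 }
val perPath = notFound
  .groupBy { case (path, _) => path }
  .map { case (path, rs) => path -> rs.size }

println(perPath("/robots.txt")) // 2
```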

404 status code analysis

Result :

Counting 404 Response Codes :
found 251 404 Urls
Listing 404 Status Code Records :
+---------------------------------------+
|path |
+---------------------------------------+
|/css/bootstrap.min.css.map |
|/robots.txt |
|/djs/vendor/bootstrap-datetimepicker.js|
|/favicon.ico |
+---------------------------------------+
Listing The Top Twenty 404 Response Code Paths :
+---------------------------------------+-----+
|path |count|
+---------------------------------------+-----+
|/css/bootstrap.min.css.map |1 |
|/djs/vendor/bootstrap-datetimepicker.js|7 |
|/favicon.ico |19 |
|/robots.txt |224 |
+---------------------------------------+-----+
+--------------------+--------------------+-------------+
| path| collect_list(host)|count(status)|
+--------------------+--------------------+-------------+
|/css/bootstrap.mi...| [10.130.2.1]| 1|
|/djs/vendor/boots...|[10.131.0.1, 10.1...| 7|
| /favicon.ico|[10.128.2.1, 10.1...| 19|
| /robots.txt|[10.131.0.1, 10.1...| 224|
+--------------------+--------------------+-------------+
+--------------------+--------------------+-------------+
| path| collect_set(host)|count(status)|
+--------------------+--------------------+-------------+
|/css/bootstrap.mi...| [10.130.2.1]| 1|
|/djs/vendor/boots...|[10.130.2.1, 10.1...| 7|
| /favicon.ico|[10.130.2.1, 10.1...| 19|
| /robots.txt|[10.130.2.1, 10.1...| 224|
+--------------------+--------------------+-------------+
Listing the Top Twenty-five 404 Response Code Hosts :
+----------+-----+
|host |count|
+----------+-----+
|10.128.2.1|67 |
|10.131.0.1|61 |
|10.130.2.1|52 |
|10.129.2.1|41 |
|10.131.2.1|30 |
+----------+-----+
Listing 404 Errors per Day :
+---+----+-----+
|day|year|count|
+---+----+-----+
|312|2017| 8|
|313|2017| 10|
|314|2017| 6|
|315|2017| 12|
|316|2017| 6|
|317|2017| 10|
|318|2017| 18|
|319|2017| 8|
|320|2017| 10|
|321|2017| 5|
+---+----+-----+
only showing top 10 rows

Summary :

In this lab we learned how to load data and parse it with the regexp_extract function; we then cleaned the data with the help of the na subpackage, converted the date column to a valid timestamp using a UDF, and finally explored the data and performed some analysis on it.

Appendix :

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{regexp_extract,sum,col,to_date,udf,to_timestamp,desc,dayofyear,year}


object WebLog extends App {

Logger.getLogger("org").setLevel(Level.OFF)
val spark = SparkSession.builder().appName("WebLog").master("local[*]").getOrCreate()
import spark.implicits._
//this will produce a dataframe with a single column called value
val base_df = spark.read.text("/home/achraf/Desktop/labo/SparkLab/datasets/weblog.csv")
base_df.printSchema()

//let's look at some of the data
base_df.show(3,false)

/* So our data is organized as follows
|remote host : 10.128.2.1
| date : [29/Nov/2017:06:58:55
| request : GET /login.php HTTP/1.1
| status : 200
*/

/*
Parsing the log file
*/
val parsed_df = base_df.select(regexp_extract($"value","""^([^(\s|,)]+)""",1).alias("host"),
regexp_extract($"value","""^.*\[(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})""",1).as("timestamp"),
regexp_extract($"value","""^.*\w+\s+([^\s]+)\s+HTTP.*""",1).as("path"),
regexp_extract($"value","""^.*,([^\s]+)$""",1).cast("int").alias("status"))
parsed_df.show(5,false)
parsed_df.printSchema()

/*
Data cleaning
*/
// check if the initial dataset contain any null values
println("Number of bad row in the initial dataset : " + base_df.filter($"value".isNull).count())

// lets see our parsed dataset
val bad_rows_df = parsed_df.filter($"host".isNull || $"timestamp".isNull || $"path".isNull || $"status".isNull)
println("Number of bad rows : " + bad_rows_df.count())
// same result as the previous statement but with different syntax
//val bad_rows_df = parsed_df.filter($"host".isNull.or($"timestamp".isNull).or($"path".isNull)
// .or($"status".isNull)).count

// lets count number of null values in each column
// we create a function that convert null value to one and then we count the number of one value
def count_null(col_name: Column): Column = sum(col_name.isNull.cast("int")).alias(col_name.toString())
val t = parsed_df.columns.map(col_name => count_null(col(col_name)))
parsed_df.select(t: _*).show()

// So all the null values are in status column, let's check what does it contain
val bad_status_df = base_df.select(regexp_extract($"value","""([^\d]+)$""",1).as("bad_status"))
.filter($"bad_status".notEqual(""))
println("Number of bad rows : " + bad_status_df.count())
// So the bad content corresponds to error output; in our case it is just polluting our logs and our results
bad_status_df.show(5)


/*
Fix the rows with null status
*/

// we have two options: replace the null values with some other meaningful value, or drop the whole row
// we will go with the second option, since those rows have no value for us
// we will use the na subpackage on a dataframe
val cleaned_df = parsed_df.na.drop()

// let's check that we don't have any null value
println("The count of null value : " + cleaned_df.filter($"host".isNull || $"timestamp".isNull || $"path".isNull
|| $"status".isNull).count())
// count before and after
println("Before : " + parsed_df.count() + " | After : " + cleaned_df.count())

/*
Parsing the timestamp
*/
// let's try to cast the timestamp column to date
// surprise! we get null values: when Spark is not able to convert a date value,
// it just returns null
cleaned_df.select(to_date($"timestamp")).show(2)

// Let's fix this by converting the timestamp column to the format spark knows
val month_map = Map("Jan" -> 1, "Feb" -> 2, "Mar" -> 3, "Apr" -> 4, "May" -> 5, "Jun" -> 6, "Jul" -> 7, "Aug" -> 8
, "Sep" -> 9, "Oct" -> 10, "Nov" -> 11, "Dec" -> 12)
def parse_clf_time(s: String) ={
"%3$s-%2$s-%1$s %4$s:%5$s:%6$s".format(s.substring(0,2),month_map(s.substring(3,6)),s.substring(7,11)
,s.substring(12,14),s.substring(15,17),s.substring(18))
}

val toTimestamp = udf[String, String](parse_clf_time(_))
val logs_df = cleaned_df.select($"*",to_timestamp(toTimestamp($"timestamp")).alias("time"))
.drop("timestamp")
logs_df.printSchema()
logs_df.show(2)
// We cache the dataset so the next action would be faster
logs_df.cache()

// ====< Analysis walk-through >====

/*
status column statistics
*/
logs_df.describe("status").show()
/*
HTTP status analysis
*/
logs_df.groupBy("status").count().sort("status").show()

/*
Frequent Hosts
*/
logs_df.groupBy("host").count().filter($"count" > 10).show()
/*
Visualizing Paths
*/
logs_df.groupBy("path").count().sort(desc("count")).show()
/*
Top Paths
*/
logs_df.groupBy("path").count().sort(desc("count")).show(10)

// ====< Analyzing Web Server Log File >====

/*
Top Ten Error Paths
*/
logs_df.filter($"status" =!= 200).groupBy("path").count().sort(desc("count"))
.show(10)
/*
Number of unique Hosts
*/
val unique_host_count = logs_df.select("host").distinct().count()
println("Unique hosts : %d".format(unique_host_count))
/*
Number of Unique Daily Hosts :
*/
val daily_hosts_df = logs_df.withColumn("day",dayofyear($"time")).withColumn("year",year($"time"))
.select("host","day","year").distinct().groupBy("day","year")
.count().sort("year","day").cache()
daily_hosts_df.show(5)
/*
Average Number of Daily Requests per Host
*/
val total_req_per_day_df = logs_df.withColumn("day", dayofyear($"time"))
.withColumn("year", year($"time"))
.groupBy("day", "year").count()
val avg_daily_request_per_host_df = total_req_per_day_df.join(daily_hosts_df,
total_req_per_day_df("day") === daily_hosts_df("day")
&& total_req_per_day_df("year") === daily_hosts_df("year"))
.select(daily_hosts_df("day"), daily_hosts_df("year"),
(total_req_per_day_df("count") / daily_hosts_df("count")).alias("avg_req_per_host_per_day")).cache()
avg_daily_request_per_host_df.show(5)

// ====< Exploring 404 status codes >====
/*
Let's drill down and explore the error 404 status records, We've all seen those "404 Not Found" web pages.
404 errors are returned when the server cannot find the resource (page or object) the browser client requested.
*/

// Counting 404 Response Codes
val not_found_df = logs_df.where($"status" === 404).cache()
println("found %d 404 Urls".format(not_found_df.count()))

// Listing 404 Status Code Records
not_found_df.select("path").distinct().show(40,false)

// Listing The Top Twenty 404 Response Code Paths :
not_found_df.groupBy("path").count().sort("count").show(20,false)
not_found_df.groupBy("path").agg("host" -> "collect_list","status" -> "count")
.sort("count(status)").show(20)
not_found_df.groupBy("path").agg("host" -> "collect_set","status" -> "count")
.sort("count(status)").show(20)

// Listing the Top Twenty-five 404 Response Code Hosts
not_found_df.groupBy("host").count().sort(desc("count")).show(truncate = false)

// Listing 404 Errors per Day
val errors_by_date_pair_df = not_found_df.withColumn("day", dayofyear($"time"))
.withColumn("year", year($"time")).groupBy("day","year").count()
errors_by_date_pair_df.sort($"year",$"day").show(10)

}

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Written by achraf el gdaouni

Spark and Scala Developer and a Data Science enthusiast
