Spark: Web Server Logs Analysis with Scala

achraf el gdaouni
Nov 24 · 11 min read
Web Server Log Analysis

Hello, and welcome to the Spark cookbook series. In the previous article we built an ETL (Extract, Transform and Load) example: we loaded data from a CSV file, cleaned it, applied some transformations, and finally loaded it into a PostgreSQL database and exported it as JSON files as well.

In this article we will show you how simple it is to parse web server log files and perform some analysis on them.

In general a server log is a log file (or several files) automatically created and maintained by a server consisting of a list of activities it performed. A web server log for example maintains a history of page requests. The W3C maintains a standard format (the Common Log Format) for web server log files.

For this lab, we will use a dataset from the Kaggle website that contains one year of HTTP requests.

The Common Logfile Format :

The common log file format is as follows:

remotehost rfc931 authuser [date] "request" status bytes
  • remotehost : Remote hostname (or IP address if the DNS hostname is not available, or if DNSLookup is off).
  • rfc931 : The remote logname of the user.
  • authuser : The username with which the user has authenticated.
  • [date] : Date and time of the request.
  • “request” : The request line exactly as it came from the client.
  • status : The HTTP status code returned to the client.
  • bytes : The content-length of the document transferred.

Let’s begin :)


Prerequisites :

This example was made using IntelliJ IDEA; in order to use Spark, make sure to add its libraries to the build.sbt file as follows :

name := "SparkLabs"

version := "0.1"

scalaVersion := "2.11.0"

// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.3"

// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.3"

// https://mvnrepository.com/artifact/org.postgresql/postgresql
libraryDependencies += "org.postgresql" % "postgresql" % "42.2.8"

I ) Loading the log file :

Load web server text file with Scala using Spark
Result : 
root
|-- value: string (nullable = true)
+---------------------------------------------------------------+
|value |
+---------------------------------------------------------------+
|IP,Time,URL,Staus |
|10.128.2.1,[29/Nov/2017:06:58:55,GET /login.php HTTP/1.1,200 |
|10.128.2.1,[29/Nov/2017:06:59:02,POST /process.php HTTP/1.1,302|
+---------------------------------------------------------------+
only showing top 3 rows

Since our data is unstructured, we used the text function to read the file, which produces a DataFrame with a single column named “value”.

Next, we have to parse the data into individual columns. For this we will use the special function regexp_extract(), which takes as parameters the column to parse, the regular expression, and the index of the group to capture, as follows :

def regexp_extract(e: Column, exp: String, groupIdx: Int): Column
Extract a specific group matched by a Java regex, from the specified string column. If the regex did not match, or the specified group did not match, an empty string is returned.
Since 1.5.0

If you are not familiar with regular expressions, I suggest brushing up on them before continuing.

You probably already noticed that our log file contains only four fields :

  • remote host : 10.128.2.1
  • date : [29/Nov/2017:06:58:55
  • request : GET /login.php HTTP/1.1
  • status : 200

II) Parsing the log file :

Parsing web server log file using the regexp_extract function with Spark and Scala

We have now split our single “value” column into four individual columns: host, timestamp, path and status. The regular expressions use the triple-quote syntax so we can write special regex characters (\d, \s, …) without escaping backslashes. ^([^(\s|,)]+) starts matching from the beginning of the line with ‘^’, then captures every character that is not a whitespace or a comma with ‘[^(\s|,)]+’; \s matches any whitespace character, and placing ‘^’ first inside the square brackets negates the character class.
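The same capture groups can be exercised outside Spark with plain Scala regexes. Below is a minimal sketch: the extract helper is a hypothetical stand-in for regexp_extract (returning the captured group, or an empty string on no match), applied to the sample line from the output above.

```scala
// Hypothetical helper mirroring Spark's regexp_extract: returns the
// captured group, or "" when the pattern does not match.
def extract(pattern: String, s: String, group: Int = 1): String =
  pattern.r.findFirstMatchIn(s).map(_.group(group)).getOrElse("")

val line = "10.128.2.1,[29/Nov/2017:06:58:55,GET /login.php HTTP/1.1,200"

println(extract("""^([^(\s|,)]+)""", line))                             // host: 10.128.2.1
println(extract("""^.*\[(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})""", line)) // timestamp
println(extract("""^.*\w+\s+([^\s]+)\s+HTTP.*""", line))                // path: /login.php
println(extract("""^.*,([^\s]+)$""", line))                             // status: 200
```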

The result of our parsing is as follow :

+----------+--------------------+------------------------+------+
|host |timestamp |path |status|
+----------+--------------------+------------------------+------+
|IP | | |null |
|10.128.2.1|29/Nov/2017:06:58:55|/login.php |200 |
|10.128.2.1|29/Nov/2017:06:59:02|/process.php |302 |
|10.128.2.1|29/Nov/2017:06:59:03|/home.php |200 |
|10.131.2.1|29/Nov/2017:06:59:04|/js/vendor/moment.min.js|200 |
+----------+--------------------+------------------------+------+
only showing top 5 rows
root
|-- host: string (nullable = true)
|-- timestamp: string (nullable = true)
|-- path: string (nullable = true)
|-- status: integer (nullable = true)

III) Data cleaning :

i) Null check :

Data cleaning

Here we checked whether our initial dataset contains any null values, and then we checked every column of our parsed dataset. We found that the parsed dataset contains 219 bad rows.

To count the number of null values in each column, we created a function count_null that takes a column as a parameter, converts null values to 1, and then sums them up to get the total number of nulls.
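The idea behind count_null can be sketched without Spark on a plain Scala collection, where None plays the role of a SQL null (the sample values below are hypothetical):

```scala
// Hypothetical in-memory stand-in for a DataFrame column; None is null.
val status: Seq[Option[Int]] = Seq(Some(200), None, Some(302), None)

// Convert "is null" to 1/0 and sum them up -- the same trick count_null
// uses with isNull.cast("int") and sum().
val nullCount = status.map(v => if (v.isEmpty) 1 else 0).sum
println(nullCount) // 2
```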

Result :

Number of bad row in the initial dataset : 0
Number of bad rows : 219
+----+---------+----+------+
|host|timestamp|path|status|
+----+---------+----+------+
| 0| 0| 0| 219|
+----+---------+----+------+
Number of bad rows : 219
+--------------------+
| bad_status|
+--------------------+
| IP,Time,URL,Staus|
|chmod:,cannot,'a....|
|chmod:,cannot,'er...|
|rm:,cannot,'*.o':,No|
|rm:,cannot,'a.out...|
+--------------------+

ii) Fix rows with null status :

drop rows with null value using na package

Here we chose to drop all rows containing a null value.

Result :

The count of null value : 0
Before : 16008 | After : 15789

iii) Parsing the timestamp :

Parsing string to timestamp
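The helper used in the appendix rebuilds the CLF date into the yyyy-MM-dd HH:mm:ss shape that to_timestamp expects; stripped of Spark, it is plain Scala:

```scala
val month_map = Map(
  "Jan" -> 1, "Feb" -> 2, "Mar" -> 3, "Apr" -> 4, "May" -> 5, "Jun" -> 6,
  "Jul" -> 7, "Aug" -> 8, "Sep" -> 9, "Oct" -> 10, "Nov" -> 11, "Dec" -> 12)

// Reorders the fixed-width CLF fields:
// "29/Nov/2017:06:58:55" -> "2017-11-29 06:58:55"
def parse_clf_time(s: String): String =
  "%3$s-%2$s-%1$s %4$s:%5$s:%6$s".format(
    s.substring(0, 2), month_map(s.substring(3, 6)), s.substring(7, 11),
    s.substring(12, 14), s.substring(15, 17), s.substring(18))

println(parse_clf_time("29/Nov/2017:06:58:55")) // 2017-11-29 06:58:55
```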

Result :

+--------------------+
|to_date(`timestamp`)|
+--------------------+
| null|
| null|
+--------------------+
only showing top 2 rows
root
|-- host: string (nullable = true)
|-- path: string (nullable = true)
|-- status: integer (nullable = true)
|-- time: timestamp (nullable = true)
+----------+------------+------+-------------------+
| host| path|status| time|
+----------+------------+------+-------------------+
|10.128.2.1| /login.php| 200|2017-11-29 06:58:55|
|10.128.2.1|/process.php| 302|2017-11-29 06:59:02|
+----------+------------+------+-------------------+
only showing top 2 rows

IV) Analysis walk-through :

i) Distribution of our dataset :

data distribution

Result :

status column statistics :
+-------+------------------+
|summary| status|
+-------+------------------+
| count| 15789|
| mean|230.19469250744189|
| stddev| 50.05853522906924|
| min| 200|
| max| 404|
+-------+------------------+
HTTP status analysis :
+------+-----+
|status|count|
+------+-----+
| 200|11330|
| 206| 52|
| 302| 3498|
| 304| 658|
| 404| 251|
+------+-----+
Frequent Hosts :
+----------+-----+
| host|count|
+----------+-----+
|10.131.2.1| 1626|
|10.128.2.1| 4257|
|10.130.2.1| 4056|
|10.131.0.1| 4198|
|10.129.2.1| 1652|
+----------+-----+

Visualizing Paths :
+--------------------+-----+
| path|count|
+--------------------+-----+
| /login.php| 3298|
| /home.php| 2653|
|/js/vendor/modern...| 1417|
| /| 862|
|/contestproblem.p...| 467|
| /css/normalize.css| 408|
|/css/bootstrap.mi...| 404|
|/css/font-awesome...| 399|
| /css/style.css| 395|
| /css/main.css| 394|
|/js/vendor/jquery...| 387|
|/bootstrap-3.3.7/...| 382|
| /process.php| 317|
| /contest.php| 249|
| /archive.php| 246|
|/fonts/fontawesom...| 245|
| /robots.txt| 224|
| /img/ruet.png| 213|
|/bootstrap-3.3.7/...| 191|
|/js/vendor/moment...| 173|
+--------------------+-----+
only showing top 20 rows
Top Paths :
+--------------------+-----+
| path|count|
+--------------------+-----+
| /login.php| 3298|
| /home.php| 2653|
|/js/vendor/modern...| 1417|
| /| 862|
|/contestproblem.p...| 467|
| /css/normalize.css| 408|
|/css/bootstrap.mi...| 404|
|/css/font-awesome...| 399|
| /css/style.css| 395|
| /css/main.css| 394|
+--------------------+-----+
only showing top 10 rows

ii) Analyzing Web Server log file :

data exploration

Result :

Top Ten Error Paths :
+--------------------+-----+
| path|count|
+--------------------+-----+
| /home.php| 2167|
| /| 741|
| /process.php| 317|
| /robots.txt| 224|
| /action.php| 83|
|/contestproblem.p...| 74|
|/js/vendor/jquery...| 73|
|/js/vendor/modern...| 72|
|/css/bootstrap.mi...| 72|
| /css/main.css| 68|
+--------------------+-----+
only showing top 10 rows
Unique hosts : 5
Number of Unique Daily Hosts :
+---+----+-----+
|day|year|count|
+---+----+-----+
|311|2017| 1|
|312|2017| 5|
|313|2017| 5|
|314|2017| 5|
|315|2017| 5|
+---+----+-----+
only showing top 5 rows
Average Number of Daily Requests per Host :
+---+----+------------------------+
|day|year|avg_req_per_host_per_day|
+---+----+------------------------+
|335|2017| 93.6|
|327|2017| 76.0|
| 60|2018| 10.333333333333334|
|350|2017| 51.666666666666664|
| 46|2018| 6.666666666666667|
+---+----+------------------------+
only showing top 5 rows
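The day/year pair driving these groupings comes from Spark's dayofyear and year functions; the same values can be derived with java.time from the standard library (no Spark needed), here on the first timestamp of the dataset:

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Same shape as the "time" column produced earlier.
val t = LocalDateTime.parse("2017-11-29 06:58:55",
  DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"))

// Day-of-year and year, as used for the daily grouping keys.
println((t.getDayOfYear, t.getYear)) // (333,2017)
```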

iii) Exploring 404 status codes :

Let’s drill down and explore the 404 status records. We’ve all seen those “404 Not Found” web pages; 404 errors are returned when the server cannot find the resource (page or object) that the browser client requested.
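The filter behind this section (logs_df.where($"status" === 404) in the appendix) boils down to the following on a plain collection of (path, status) pairs; the sample records here are hypothetical:

```scala
// Hypothetical sample of (path, status) records.
val records = Seq(
  ("/robots.txt", 404), ("/login.php", 200),
  ("/robots.txt", 404), ("/favicon.ico", 404))

// Keep the 404s, then count per path -- the collection analogue of
// where($"status" === 404).groupBy("path").count()
val notFound = records.filter { case (_, status) => status == 404 }
val perPath = notFound
  .groupBy { case (path, _) => path }
  .map { case (path, rs) => path -> rs.size }

println(perPath("/robots.txt")) // 2
```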

404 status code analysis

Result :

Counting 404 Response Codes :
found 251 404 Urls
Listing 404 Status Code Records :
+---------------------------------------+
|path |
+---------------------------------------+
|/css/bootstrap.min.css.map |
|/robots.txt |
|/djs/vendor/bootstrap-datetimepicker.js|
|/favicon.ico |
+---------------------------------------+
Listing The Top Twenty 404 Response Code Paths :
+---------------------------------------+-----+
|path |count|
+---------------------------------------+-----+
|/css/bootstrap.min.css.map |1 |
|/djs/vendor/bootstrap-datetimepicker.js|7 |
|/favicon.ico |19 |
|/robots.txt |224 |
+---------------------------------------+-----+
+--------------------+--------------------+-------------+
| path| collect_list(host)|count(status)|
+--------------------+--------------------+-------------+
|/css/bootstrap.mi...| [10.130.2.1]| 1|
|/djs/vendor/boots...|[10.131.0.1, 10.1...| 7|
| /favicon.ico|[10.128.2.1, 10.1...| 19|
| /robots.txt|[10.131.0.1, 10.1...| 224|
+--------------------+--------------------+-------------+
+--------------------+--------------------+-------------+
| path| collect_set(host)|count(status)|
+--------------------+--------------------+-------------+
|/css/bootstrap.mi...| [10.130.2.1]| 1|
|/djs/vendor/boots...|[10.130.2.1, 10.1...| 7|
| /favicon.ico|[10.130.2.1, 10.1...| 19|
| /robots.txt|[10.130.2.1, 10.1...| 224|
+--------------------+--------------------+-------------+
Listing the Top Twenty-five 404 Response Code Hosts :
+----------+-----+
|host |count|
+----------+-----+
|10.128.2.1|67 |
|10.131.0.1|61 |
|10.130.2.1|52 |
|10.129.2.1|41 |
|10.131.2.1|30 |
+----------+-----+
Listing 404 Errors per Day :
+---+----+-----+
|day|year|count|
+---+----+-----+
|312|2017| 8|
|313|2017| 10|
|314|2017| 6|
|315|2017| 12|
|316|2017| 6|
|317|2017| 10|
|318|2017| 18|
|319|2017| 8|
|320|2017| 10|
|321|2017| 5|
+---+----+-----+
only showing top 10 rows

Summary :

In this lab we learned how to load data and parse it with the regexp_extract function; we then cleaned the data with the help of the na subpackage, converted the date column to a valid timestamp using a UDF, and finally explored the data and performed some analysis on it.

Appendix :

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{regexp_extract,sum,col,to_date,udf,to_timestamp,desc,dayofyear,year}


object WebLog extends App {

Logger.getLogger("org").setLevel(Level.OFF)
val spark = SparkSession.builder().appName("WebLog").master("local[*]").getOrCreate()
import spark.implicits._
//this will produce a dataframe with a single column called value
val base_df = spark.read.text("/home/achraf/Desktop/labo/SparkLab/datasets/weblog.csv")
base_df.printSchema()

//let's look at some of the data
base_df.show(3,false)

/* So our data is organized as follows
|remote host : 10.128.2.1
| date : [29/Nov/2017:06:58:55
| request : GET /login.php HTTP/1.1
| status : 200
*/

/*
Parsing the log file
*/
val parsed_df = base_df.select(regexp_extract($"value","""^([^(\s|,)]+)""",1).alias("host"),
regexp_extract($"value","""^.*\[(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})""",1).as("timestamp"),
regexp_extract($"value","""^.*\w+\s+([^\s]+)\s+HTTP.*""",1).as("path"),
regexp_extract($"value","""^.*,([^\s]+)$""",1).cast("int").alias("status"))
parsed_df.show(5,false)
parsed_df.printSchema()

/*
Data cleaning
*/
// check if the initial dataset contain any null values
println("Number of bad row in the initial dataset : " + base_df.filter($"value".isNull).count())

// lets see our parsed dataset
val bad_rows_df = parsed_df.filter($"host".isNull || $"timestamp".isNull || $"path".isNull || $"status".isNull)
println("Number of bad rows : " + bad_rows_df.count())
// same result as the previous statement but with different syntax
//val bad_rows_df = parsed_df.filter($"host".isNull.or($"timestamp".isNull).or($"path".isNull)
// .or($"status".isNull)).count

// lets count number of null values in each column
// we create a function that convert null value to one and then we count the number of one value
def count_null(col_name: Column): Column = sum(col_name.isNull.cast("int")).alias(col_name.toString())
val t = parsed_df.columns.map(col_name => count_null(col(col_name)))
parsed_df.select(t: _*).show()

// So all the null values are in status column, let's check what does it contain
val bad_status_df = base_df.select(regexp_extract($"value","""([^\d]+)$""",1).as("bad_status"))
.filter($"bad_status".notEqual(""))
println("Number of bad rows : " + bad_status_df.count())
// So the bad content corresponds to error output; in our case it is just polluting our logs and our results
bad_status_df.show(5)


/*
Fix the rows with null status
*/

// we have two options: replace the null values with some other meaningful value, or drop the whole row
// we will go with the second option, since those rows have no value for us
// we will use the na subpackage on a dataframe
val cleaned_df = parsed_df.na.drop()

// let's check that we don't have any null value
println("The count of null value : " + cleaned_df.filter($"host".isNull || $"timestamp".isNull || $"path".isNull
|| $"status".isNull).count())
// count before and after
println("Before : " + parsed_df.count() + " | After : " + cleaned_df.count())

/*
Parsing the timestamp
*/
// let's try to cast the timestamp column to date
// surprise! we get null values: when Spark is not able to convert a date value,
// it just returns null
cleaned_df.select(to_date($"timestamp")).show(2)

// Let's fix this by converting the timestamp column to the format spark knows
val month_map = Map("Jan" -> 1, "Feb" -> 2, "Mar" -> 3, "Apr" -> 4, "May" -> 5, "Jun" -> 6, "Jul" -> 7, "Aug" -> 8
, "Sep" -> 9, "Oct" -> 10, "Nov" -> 11, "Dec" -> 12)
def parse_clf_time(s: String) ={
"%3$s-%2$s-%1$s %4$s:%5$s:%6$s".format(s.substring(0,2),month_map(s.substring(3,6)),s.substring(7,11)
,s.substring(12,14),s.substring(15,17),s.substring(18))
}

val toTimestamp = udf[String, String](parse_clf_time(_))
val logs_df = cleaned_df.select($"*",to_timestamp(toTimestamp($"timestamp")).alias("time"))
.drop("timestamp")
logs_df.printSchema()
logs_df.show(2)
// We cache the dataset so the next action would be faster
logs_df.cache()

// ====< Analysis walk-through >====

/*
status column statistics
*/
logs_df.describe("status").show()
/*
HTTP status analysis
*/
logs_df.groupBy("status").count().sort("status").show()

/*
Frequent Hosts
*/
logs_df.groupBy("host").count().filter($"count" > 10).show()
/*
Visualizing Paths
*/
logs_df.groupBy("path").count().sort(desc("count")).show()
/*
Top Paths
*/
logs_df.groupBy("path").count().sort(desc("count")).show(10)

// ====< Analyzing Web Server Log File >====

/*
Top Ten Error Paths
*/
logs_df.filter($"status" =!= 200).groupBy("path").count().sort(desc("count"))
.show(10)
/*
Number of unique Hosts
*/
val unique_host_count = logs_df.select("host").distinct().count()
println("Unique hosts : %d".format(unique_host_count))
/*
Number of Unique Daily Hosts :
*/
val daily_hosts_df = logs_df.withColumn("day",dayofyear($"time")).withColumn("year",year($"time"))
.select("host","day","year").distinct().groupBy("day","year")
.count().sort("year","day").cache()
daily_hosts_df.show(5)
/*
Average Number of Daily Requests per Host
*/
val total_req_per_day_df = logs_df.withColumn("day", dayofyear($"time"))
.withColumn("year", year($"time"))
.groupBy("day", "year").count()
val avg_daily_request_per_host_df = total_req_per_day_df.join(daily_hosts_df,
total_req_per_day_df("day") === daily_hosts_df("day")
&& total_req_per_day_df("year") === daily_hosts_df("year"))
.select(daily_hosts_df("day"), daily_hosts_df("year"),
(total_req_per_day_df("count") / daily_hosts_df("count")).alias("avg_req_per_host_per_day")).cache()
avg_daily_request_per_host_df.show(5)

// ====< Exploring 404 status codes >====
/*
Let's drill down and explore the error 404 status records, We've all seen those "404 Not Found" web pages.
404 errors are returned when the server cannot find the resource (page or object) the browser client requested.
*/

// Counting 404 Response Codes
val not_found_df = logs_df.where($"status" === 404).cache()
println("found %d 404 Urls".format(not_found_df.count()))

// Listing 404 Status Code Records
not_found_df.select("path").distinct().show(40,false)

// Listing The Top Twenty 404 Response Code Paths :
not_found_df.groupBy("path").count().sort("count").show(20,false)
not_found_df.groupBy("path").agg("host" -> "collect_list","status" -> "count")
.sort("count(status)").show(20)
not_found_df.groupBy("path").agg("host" -> "collect_set","status" -> "count")
.sort("count(status)").show(20)

// Listing the Top Twenty-five 404 Response Code Hosts
not_found_df.groupBy("host").count().sort(desc("count")).show(truncate = false)

// Listing 404 Errors per Day
val errors_by_date_pair_df = not_found_df.withColumn("day", dayofyear($"time"))
.withColumn("year", year($"time")).groupBy("day","year").count()
errors_by_date_pair_df.sort($"year",$"day").show(10)

}

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Written by achraf el gdaouni

Spark and Scala Developer and a Data Science enthusiast
