PySpark Interview Question 03:
Let's talk about the problem statement:
Given an input file whose columns are stationId, timeOfTheReading, readingType, temperatureRecorded, and a few others, we need to find the minimum temperature for each station ID.
The solution below uses RDDs.
There are two ways we might write the filter transformation. Consider the first:
temperature_lines = lines.filter(lambda line: "TMIN" in line)
This line filters the RDD lines to keep only the lines that contain the string "TMIN". It checks whether the substring "TMIN" appears anywhere in each line of the RDD.
temperature_lines2 = lines.filter(lambda line: "TMIN")
This line, however, has an issue. The lambda body "TMIN" never looks at the line, and because non-empty strings are truthy in a boolean context in Python, it always evaluates as true. The lambda therefore returns a truthy value for every line in the RDD lines, so no filtering on the presence of "TMIN" takes place.
In summary, the first line correctly filters the RDD to keep lines containing the substring "TMIN", while the second line doesn't perform the intended filtering because the lambda function is written incorrectly. The correct approach is to use the in operator to check for the presence of "TMIN" in each line, as shown in the first line of code.
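The difference is easy to check without a Spark cluster. The sketch below uses Python's built-in filter on a plain list as a stand-in for the RDD, since it applies the same truthiness rule to the predicate's return value. The sample rows and station names are made up for illustration.

```python
# Plain-Python stand-in for the RDD; these rows are hypothetical sample data.
lines = [
    "STATION_A,18000101,TMAX,-75",
    "STATION_A,18000101,TMIN,-148",
    "STATION_B,18000101,TMIN,-135",
]

# The predicate inspects the line: only rows containing "TMIN" are kept.
kept_correct = list(filter(lambda line: "TMIN" in line, lines))

# The predicate ignores the line: "TMIN" is a non-empty string, so it is
# always truthy and every row is kept.
kept_wrong = list(filter(lambda line: "TMIN", lines))

print(len(kept_correct))  # 2
print(len(kept_wrong))    # 3
```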
The steps we followed here are:
- We read the input data from the file using sc.textFile().
- We keep only the records containing minimum-temperature readings (TMIN) using the filter() transformation.
- We split each line by comma and extract the station ID and temperature.
- We use the reduceByKey() transformation to group by station ID and calculate the minimum temperature.
- We collect and display the results.
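Put together, the steps could look roughly like the sketch below. It assumes the columns appear in the order given in the problem statement (station ID first, temperature fourth) and that the temperature parses as a number. The names parse_line and min_temp_by_station, the file name, and the app name are hypothetical, chosen only for this example.

```python
def parse_line(line):
    """Split a comma-separated line and return (station_id, temperature)."""
    fields = line.split(",")
    # Assumed layout: stationId, timeOfTheReading, readingType, temperatureRecorded, ...
    return fields[0], float(fields[3])


def min_temp_by_station(sc, path):
    """Return a list of (station_id, min_temperature) pairs.

    sc is an existing SparkContext; path points to the input file.
    """
    lines = sc.textFile(path)                               # read the input
    tmin_lines = lines.filter(lambda line: "TMIN" in line)  # keep TMIN records
    station_temps = tmin_lines.map(parse_line)              # (station, temp) pairs
    min_temps = station_temps.reduceByKey(lambda a, b: min(a, b))
    return min_temps.collect()                              # bring results to the driver


# Usage (assumes a local Spark installation and a hypothetical input file):
# from pyspark import SparkContext
# sc = SparkContext("local", "MinTemperatures")
# for station, temp in min_temp_by_station(sc, "temperatures.csv"):
#     print(station, temp)
```

Note that the substring check also matches any line where "TMIN" happens to appear in another column; comparing the readingType field directly would be stricter.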