PySpark Interview Question 03:

Monty Poddar
2 min read · Mar 2, 2024

Let's talk about the problem statement:

We are given an input file whose columns include stationId, timeOfTheReading, readingType, temperatureRecorded, and a few others. We need to find the minimum temperature for each station ID.

The solution below uses the RDD API.

The filter transformation can be written in the two ways shown below. Consider the first:

temperature_lines = lines.filter(lambda line: "TMIN" in line)

This line filters the RDD lines to keep only the lines where the string "TMIN" is present in the line. It checks if the substring "TMIN" exists anywhere in each line of the RDD.

temperature_lines2 = lines.filter(lambda line: "TMIN")

This line, however, has an issue. The lambda ignores its argument and simply returns the string "TMIN", and in Python any non-empty string is treated as True in a boolean context. The predicate is therefore truthy for every line in the RDD lines, so filter() keeps every record and no filtering on "TMIN" actually happens.

In summary, the first line correctly filters the RDD to keep lines containing the substring “TMIN”, while the second line doesn’t perform the intended filtering due to the incorrect usage of the lambda function. The correct usage should be to use the in operator to check for the presence of "TMIN" in each line, as shown in the first line of code.
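The difference is easy to check without a Spark cluster, because the predicate is ordinary Python. The sketch below uses Python's built-in filter() on a plain list as a stand-in for the RDD; the three sample lines and the variable names are made up for illustration and assume the column order given in the problem statement.

```python
# Hypothetical sample lines: stationId, timeOfTheReading, readingType, temperatureRecorded
lines = [
    "ST001,18000101,TMAX,-75",
    "ST001,18000101,TMIN,-148",
    "ST002,18000101,TMIN,-135",
]

# Correct predicate: a membership test that depends on the line.
keeps_tmin = list(filter(lambda line: "TMIN" in line, lines))

# Faulty predicate: returns the constant string "TMIN", which is always truthy.
keeps_all = list(filter(lambda line: "TMIN", lines))

print(len(keeps_tmin))  # 2 -> only the TMIN lines survive
print(len(keeps_all))   # 3 -> nothing was filtered out
```

The same predicates behave identically when passed to an RDD's filter(), since Spark only looks at the truthiness of the returned value.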

The steps we followed here are:

  1. We read the input data from the file using sc.textFile().
  2. We keep only the minimum-temperature records (reading type TMIN) using the filter() transformation; other readings such as TMAX are discarded.
  3. We split each line by comma and extract the station ID and temperature.
  4. We use the reduceByKey() transformation to combine the temperatures for each station ID, keeping the minimum.
  5. We collect and display the results.
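The steps above can be put together roughly as follows. This is a minimal sketch, not the original solution: it assumes a comma-separated file with no header row, with the station ID in the first column and the temperature in the fourth (as listed in the problem statement), and an already-created SparkContext passed in as sc. The helper names parse_line and min_temperature_by_station are hypothetical.

```python
def parse_line(line):
    """Split one comma-separated line into (station_id, temperature)."""
    fields = line.split(",")
    station_id = fields[0]           # assumed: column 1 is stationId
    temperature = float(fields[3])   # assumed: column 4 is temperatureRecorded
    return (station_id, temperature)


def min_temperature_by_station(sc, path):
    """Return a list of (station_id, min_temperature) pairs."""
    # Step 1: read the input file as an RDD of text lines.
    lines = sc.textFile(path)
    # Step 2: keep only the lines that contain a TMIN reading.
    tmin_lines = lines.filter(lambda line: "TMIN" in line)
    # Step 3: convert each line into a (station_id, temperature) pair.
    station_temps = tmin_lines.map(parse_line)
    # Step 4: for each station, keep the smaller of any two temperatures.
    min_temps = station_temps.reduceByKey(min)
    # Step 5: bring the results back to the driver.
    return min_temps.collect()
```

With a running SparkContext, calling min_temperature_by_station(sc, "temperatures.csv") (the file name is a placeholder) returns the pairs, which can then be printed in a loop. Passing the built-in min to reduceByKey() works because it accepts two values and returns the smaller one, which is exactly the pairwise merge that reduceByKey() expects.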
