PySpark Technical Interview questions for Data Engineers

Sanjay D
4 min read · Jul 15, 2023


Spark interview questions that you can expect at many product-based companies. Practice these before applying for any Data Engineer, Big Data Developer, or PySpark Developer role.

Solutions to the problems below are available at this link:
https://drive.google.com/drive/folders/1RMjGmbUTC9i8QUNDC9n5buRvY0rAcTWy?usp=sharing

1. Word count using RDD and DataFrame (practice both)

Input text file:

hi hello fine god hi hello ji ji ji ji
f fg df sw sw
fine god

Output RDD or DF:

RDD: [('fine', 2), ('fg', 1), ('sw', 2), ('hello', 2), ('f', 1), ('god', 2), ('df', 1), ('ji', 4), ('hi', 2)]

Output DataFrame
DataFrame
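
A minimal sketch of both routes, under a few assumptions: the input lines are inlined instead of read from a file, the function names `word_count_rdd` and `word_count_df` are made up for illustration, and the PySpark import sits inside the function so the plain-Python cross-check at the bottom runs even without Spark installed.

```python
from collections import Counter

# The three input lines from the question, inlined so the sketch is self-contained.
LINES = [
    "hi hello fine god hi hello ji ji ji ji",
    "f fg df sw sw",
    "fine god",
]


def word_count_rdd(sc, lines):
    """RDD route: split each line, pair every word with 1, sum per word."""
    return (
        sc.parallelize(lines)  # with a real file: sc.textFile(path)
        .flatMap(lambda line: line.split())
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
    )


def word_count_df(spark, lines):
    """DataFrame route: explode the split words, then group and count."""
    from pyspark.sql import functions as F  # local import: needs PySpark

    df = spark.createDataFrame([(line,) for line in lines], ["value"])
    words = df.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
    return words.groupBy("word").count()


# Plain-Python cross-check of the counts either route should produce.
expected = Counter(word for line in LINES for word in line.split())
```

With a SparkSession named `spark`, `word_count_rdd(spark.sparkContext, LINES).collect()` and `word_count_df(spark, LINES).show()` should agree with `expected`, though the row order is not guaranteed.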

2. Get the desired DF after cleaning

Input text file:

Name~|Age
Azarudeen, Shahul~|25
Michel, Clarke~|26
Virat, Kohli~|28
Andrew, Simond~|137
Geogre, Bush~|159
Flintoff, David~|12
Adam, James~|20

Output:

DataFrame
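
The target layout is not shown in the text, so the sketch below only covers the part that is certain: splitting each line on the two-character `~|` delimiter and casting the age. How ages such as 137 or 159 should be treated is left open. The names `parse_line` and `read_delimited` are hypothetical, only an excerpt of the rows is inlined, and the PySpark import is local so the plain-Python check runs without Spark.

```python
# Header plus two rows from the question, inlined for a self-contained sketch.
LINES = [
    "Name~|Age",
    "Azarudeen, Shahul~|25",
    "Andrew, Simond~|137",
]


def parse_line(line):
    """Plain-Python reference: split on the literal '~|' and cast the age."""
    name, age = line.split("~|")
    return name.strip(), int(age)


def read_delimited(spark, lines):
    """Spark route: treat each line as text, then split on the delimiter."""
    from pyspark.sql import functions as F  # local import: needs PySpark

    raw = spark.createDataFrame([(line,) for line in lines], ["value"])
    parts = F.split(F.col("value"), r"~\|")  # pattern is a regex, so escape the pipe
    return raw.filter(F.col("value") != "Name~|Age").select(  # drop the header row
        parts.getItem(0).alias("Name"),
        parts.getItem(1).cast("int").alias("Age"),
    )


# Cross-check values for the data rows (header skipped).
expected = [parse_line(line) for line in LINES[1:]]
```

With a real file, `spark.read.text(path)` yields the same single `value` column that `read_delimited` starts from.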

3. Get the desired DF

Input text file:

Name|Age|Education
Azar|25|MBA,BE,HSC
Hari|32|
Kumar|35|ME,BE,Diploma

Output DF:

Dataframe
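
Since the expected output is not reproduced in the text, the following is only one plausible reading: one row per qualification, with Hari kept as a row whose Education is null. The function names are illustrative, the empty Education field is assumed to arrive as null (as the CSV reader usually delivers it), and the PySpark import is local so the plain-Python check runs without Spark.

```python
# Rows from the question; the empty Education field is represented as None.
ROWS = [
    ("Azar", 25, "MBA,BE,HSC"),
    ("Hari", 32, None),
    ("Kumar", 35, "ME,BE,Diploma"),
]


def explode_education(spark, rows):
    """Split the comma-separated list and emit one row per entry."""
    from pyspark.sql import functions as F  # local import: needs PySpark

    df = spark.createDataFrame(rows, "Name string, Age int, Education string")
    return df.select(
        "Name",
        "Age",
        # explode_outer keeps rows whose array is null; explode would drop them
        F.explode_outer(F.split("Education", ",")).alias("Education"),
    )


def expected_rows(rows):
    """Plain-Python reference for the same reshaping."""
    out = []
    for name, age, education in rows:
        for item in education.split(",") if education else [None]:
            out.append((name, age, item))
    return out


expected = expected_rows(ROWS)
```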

4. Find the total account balance for each Customer_No. The balance is the sum of credits minus the sum of debits for that Customer_No.

Input file:

Customer_No,Card_type,Date,Category,Transaction Type,Amount
1000501,Platinum Card,1/1/2018,Shopping,debit,11.11
1000501,Checking,1/2/2018,Mortgage & Rent,debit,1247.44
1000501,Silver Card,1/2/2018,Restaurants,debit,24.22
1000501,Platinum Card,1/3/2018,Credit Card Payment,credit,2298.09
1000501,Platinum Card,1/4/2018,Movies & DVDs,debit,11.76
1000501,Silver Card,1/5/2018,Restaurants,debit,25.85
1000501,Silver Card,1/6/2018,Home Improvement,debit,18.45
1000501,Checking,1/8/2018,Utilities,debit,45
1000501,Silver Card,1/8/2018,Home Improvement,debit,15.38
1000501,Platinum Card,1/9/2018,Music,debit,10.69
1000501,Checking,1/10/2018,Mobile Phone,debit,89.46
1000501,Platinum Card,1/11/2018,Gas & Fuel,debit,34.87
1000501,Platinum Card,1/11/2018,Groceries,debit,43.54
1000501,Checking,1/12/2018,Paycheck,credit,2000
1000531,Platinum Card,1/13/2018,Fast Food,debit,32.91
1000531,Platinum Card,1/13/2018,Shopping,debit,39.05
1000531,Silver Card,1/15/2018,Groceries,debit,44.19
1000531,Silver Card,1/15/2018,Restaurants,debit,64.11
1000531,Checking,1/16/2018,Utilities,debit,35
1000531,Checking,1/16/2018,Utilities,debit,60
1000531,Checking,1/19/2018,Paycheck,credit,2000
1000531,Platinum Card,1/20/2018,Shopping,debit,50.21
1000531,Platinum Card,1/22/2018,Credit Card Payment,credit,554.99
1000531,Silver Card,1/22/2018,Credit Card Payment,credit,309.81
1000531,Checking,1/22/2018,Credit Card Payment,debit,554.99
1000531,Silver Card,1/22/2018,Home Improvement,debit,17.38
1000531,Checking,1/23/2018,Credit Card Payment,debit,309.81
1000654,Platinum Card,1/24/2018,Coffee Shops,debit,3
1000654,Checking,1/25/2018,Internet,debit,69.99
1000654,Silver Card,1/29/2018,Gas & Fuel,debit,30.42
1000654,Silver Card,1/29/2018,Restaurants,debit,25
1000654,Platinum Card,1/29/2018,Restaurants,debit,17.62
1000654,Platinum Card,2/1/2018,Groceries,debit,27.79
1000654,Platinum Card,2/1/2018,Shopping,debit,11.11
1000654,Checking,2/2/2018,Mortgage & Rent,debit,1247.44
1000654,Checking,2/2/2018,Paycheck,credit,2000
1000654,Platinum Card,2/3/2018,Restaurants,debit,57.02
1000654,Platinum Card,2/4/2018,Movies & DVDs,debit,11.76
1000654,Platinum Card,2/5/2018,Credit Card Payment,credit,145.14
1000654,Silver Card,2/6/2018,Credit Card Payment,credit,154.13
1001863,Checking,2/7/2018,Credit Card Payment,debit,154.13
1001863,Checking,2/7/2018,Utilities,debit,65
1001863,Platinum Card,2/9/2018,Haircut,debit,30
1001863,Platinum Card,2/9/2018,Music,debit,10.69
1001863,Platinum Card,2/10/2018,Fast Food,debit,10.66
1001863,Platinum Card,2/11/2018,Restaurants,debit,106.8
1001863,Silver Card,2/12/2018,Gas & Fuel,debit,36.47
1001863,Checking,2/12/2018,Mobile Phone,debit,89.52
1001863,Silver Card,2/14/2018,Alcohol & Bars,debit,14
1001863,Platinum Card,2/15/2018,Restaurants,debit,10
1001863,Checking,2/15/2018,Utilities,debit,60
1001863,Checking,2/16/2018,Paycheck,credit,2000
1001863,Silver Card,2/16/2018,Restaurants,debit,8
1001863,Checking,2/16/2018,Utilities,debit,35
1001863,Silver Card,2/20/2018,Groceries,debit,35.95
1001863,Silver Card,2/20/2018,Restaurants,debit,23.51
1001863,Platinum Card,2/21/2018,Coffee Shops,debit,2
1001863,Silver Card,2/22/2018,Coffee Shops,debit,4
1001863,Platinum Card,2/26/2018,Credit Card Payment,credit,765.37
1001368,Silver Card,2/26/2018,Credit Card Payment,credit,156.11
1001368,Checking,2/26/2018,Credit Card Payment,debit,765.37
1001368,Checking,2/26/2018,Internet,debit,74.99
1001368,Silver Card,2/26/2018,Restaurants,debit,85.52
1001368,Silver Card,2/26/2018,Gas & Fuel,debit,32.21
1001368,Checking,2/27/2018,Credit Card Payment,debit,156.11
1001368,Silver Card,3/1/2018,Groceries,debit,32.07
1001368,Platinum Card,3/1/2018,Shopping,debit,13.13
1001368,Checking,3/2/2018,Paycheck,credit,1247.44
1001368,Checking,3/2/2018,Paycheck,credit,2000
1001368,Silver Card,3/3/2018,Groceries,debit,23.74
1001368,Platinum Card,3/4/2018,Groceries,debit,10.69
1001368,Platinum Card,3/4/2018,Movies & DVDs,debit,11.76
1001368,Platinum Card,3/4/2018,Restaurants,debit,42.24
1002324,Platinum Card,3/5/2018,Coffee Shops,debit,3
1002324,Silver Card,3/5/2018,Credit Card Payment,credit,761.59
1002324,Checking,3/5/2018,Credit Card Payment,debit,761.59
1002324,Platinum Card,3/7/2018,Coffee Shops,debit,3.5
1002324,Platinum Card,3/8/2018,Gas & Fuel,debit,34.9
1002324,Checking,3/8/2018,Utilities,debit,52
1002324,Platinum Card,3/9/2018,Groceries,debit,20.72
1002324,Platinum Card,3/9/2018,Groceries,debit,5.09
1002324,Platinum Card,3/9/2018,Music,debit,10.69
1002324,Platinum Card,3/12/2018,Groceries,debit,19.35
1002324,Checking,3/12/2018,Mobile Phone,debit,89.52
1002324,Platinum Card,3/13/2018,Shopping,debit,45.75
1002324,Platinum Card,3/14/2018,Groceries,debit,22.5
1002324,Platinum Card,3/14/2018,Restaurants,debit,8.49
1002324,Platinum Card,3/15/2018,Coffee Shops,debit,3.5
1002324,Checking,3/15/2018,Utilities,debit,60
1002324,Checking,3/16/2018,Paycheck,credit,2000
1002324,Silver Card,3/17/2018,Alcohol & Bars,debit,19.5
1000210,Platinum Card,3/17/2018,Fast Food,debit,23.34
1000210,Silver Card,3/19/2018,Restaurants,debit,36.48
1000210,Checking,3/19/2018,Utilities,debit,35
1000210,Platinum Card,3/20/2018,Shopping,debit,14.97
1000210,Silver Card,3/22/2018,Gas & Fuel,debit,30.55
1000210,Platinum Card,3/23/2018,Credit Card Payment,credit,559.91
1000210,Checking,3/23/2018,Credit Card Payment,debit,559.91
1000210,Silver Card,3/23/2018,Groceries,debit,11.76
1000210,Checking,3/26/2018,Internet,debit,74.99
1000210,Silver Card,3/28/2018,Groceries,debit,16.06
1000210,Silver Card,3/28/2018,Restaurants,debit,24.98
1000210,Silver Card,3/29/2018,Restaurants,debit,17.64
1000210,Silver Card,3/30/2018,Groceries,debit,9.09
1000210,Checking,3/30/2018,Paycheck,credit,2000

Output DF:
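
One way to approach the credit-minus-debit rule is to sign each amount and then sum per customer. The sketch below assumes a four-row excerpt of the file, a column renamed to `Transaction_Type` (the file header has a space), and hypothetical function names; the PySpark import is local so the plain-Python check runs without Spark.

```python
from collections import defaultdict

# Excerpt of the input: (Customer_No, Transaction Type, Amount).
ROWS = [
    (1000501, "debit", 11.11),
    (1000501, "credit", 2298.09),
    (1000531, "debit", 32.91),
    (1000531, "credit", 2000.0),
]


def balance_df(spark, rows):
    """Credits count as positive, debits as negative, then sum per customer."""
    from pyspark.sql import functions as F  # local import: needs PySpark

    df = spark.createDataFrame(rows, ["Customer_No", "Transaction_Type", "Amount"])
    signed = F.when(F.col("Transaction_Type") == "credit", F.col("Amount")).otherwise(
        -F.col("Amount")
    )
    return df.groupBy("Customer_No").agg(F.round(F.sum(signed), 2).alias("Balance"))


def expected_balance(rows):
    """Plain-Python reference for the same rule."""
    totals = defaultdict(float)
    for customer, kind, amount in rows:
        totals[customer] += amount if kind == "credit" else -amount
    return {customer: round(total, 2) for customer, total in totals.items()}


expected = expected_balance(ROWS)
```

For the full file, `spark.read.csv(path, header=True, inferSchema=True)` followed by a rename of the `Transaction Type` column would feed the same aggregation.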

5. For each location, if the name property is 'state', return its value property.
E.g.: {"name": "state", "value": "IL"} returns 'IL' because the value of name is 'state'.

Input data dictionary:

dataDictionary = [
('12345',{"addressAttributes": [{"name": "houseNumber", "value": "718"}, {"name": "streetName", "value": "VIENNA"}, {"name": "streetSuffix", "value": "ST"}, {"name": "city", "value": "METROPOLIS"}, {"name": "state", "value": "IL"}, {"name": "zip5", "value": "62960"}, {"name": "zip4", "value": "1642"}, {"name": "country", "value": "USA"}]}),
('678910',{"addressAttributes": [{"name": "houseNumber", "value": "245"}, {"name": "streetName", "value": "LONGVIEW"}, {"name": "streetSuffix", "value": "DR"}, {"name": "city", "value": "PADUCAH"}, {"name": "state", "value": "KY"}, {"name": "zip5", "value": "42001"}, {"name": "zip4", "value": "5968"}, {"name": "country", "value": "USA"}]})
]

Output DF:

DF
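
A possible approach is to explode the `addressAttributes` array and keep only the entries whose name is 'state'. In this sketch the dictionary is trimmed to two attributes per location, the column names `id` and `location` and both function names are assumptions, and the schema is left to PySpark's inference (nested dicts become maps). The PySpark import is local so the plain-Python check runs without Spark.

```python
# Trimmed version of dataDictionary: two attributes per location.
DATA = [
    ("12345", {"addressAttributes": [
        {"name": "city", "value": "METROPOLIS"},
        {"name": "state", "value": "IL"},
    ]}),
    ("678910", {"addressAttributes": [
        {"name": "city", "value": "PADUCAH"},
        {"name": "state", "value": "KY"},
    ]}),
]


def state_df(spark, data):
    """Explode the attribute list, then filter on name == 'state'."""
    from pyspark.sql import functions as F  # local import: needs PySpark

    df = spark.createDataFrame(data, ["id", "location"])
    attrs = df.select(
        "id", F.explode(F.col("location")["addressAttributes"]).alias("attr")
    )
    return attrs.filter(F.col("attr")["name"] == "state").select(
        "id", F.col("attr")["value"].alias("state")
    )


def expected_states(data):
    """Plain-Python reference: (id, state) pairs."""
    return [
        (key, attr["value"])
        for key, location in data
        for attr in location["addressAttributes"]
        if attr["name"] == "state"
    ]


expected = expected_states(DATA)
```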

6. Find the students who got the highest mark for each subject

Input Df:

Sub|Name|Marks
Eng|John|85
Math|John|76
Science|John|89
Eng|Maria|91
Math|Maria|74
Science|Maria|82
Eng|Karthik|91
Math|Karthik|100
Science|Karthik|76

Output Df:
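
A common pattern for "top per group" is a window ranked by marks within each subject. The sketch below uses `rank` so that tied students (Maria and Karthik both have 91 in Eng) are all kept; whether ties should be kept is an assumption, as are the function names. The PySpark import is local so the plain-Python check runs without Spark.

```python
# Rows from the question: (Sub, Name, Marks).
ROWS = [
    ("Eng", "John", 85), ("Math", "John", 76), ("Science", "John", 89),
    ("Eng", "Maria", 91), ("Math", "Maria", 74), ("Science", "Maria", 82),
    ("Eng", "Karthik", 91), ("Math", "Karthik", 100), ("Science", "Karthik", 76),
]


def top_per_subject(spark, rows):
    """Rank students within each subject and keep rank 1 (ties included)."""
    from pyspark.sql import Window, functions as F  # local import: needs PySpark

    df = spark.createDataFrame(rows, ["Sub", "Name", "Marks"])
    by_subject = Window.partitionBy("Sub").orderBy(F.col("Marks").desc())
    return (
        df.withColumn("rnk", F.rank().over(by_subject))
        .filter(F.col("rnk") == 1)
        .drop("rnk")
    )


def expected_top(rows):
    """Plain-Python reference: rows whose mark equals the subject maximum."""
    best = {}
    for subject, _, marks in rows:
        best[subject] = max(best.get(subject, marks), marks)
    return sorted(row for row in rows if row[2] == best[row[0]])


expected = expected_top(ROWS)
```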

7. Find whether the current day's temperature is higher than the previous day's temperature

Input Data:

# Create a DataFrame with the provided data
data = [
(1, '2023-06-01', 10),
(2, '2023-06-02', 25),
(3, '2023-06-03', 20),
(4, '2023-06-04', 30)
]

Output DF:

DF
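
The comparison with the previous day can be sketched with `lag` over a date-ordered window. The column names `id`, `date`, and `temperature` and the output column `is_warmer` are assumptions, since the question gives only the tuples. The first day has no predecessor, so its flag is null. The PySpark import is local so the plain-Python check runs without Spark.

```python
# Data from the question: (id, date, temperature).
DATA = [
    (1, "2023-06-01", 10),
    (2, "2023-06-02", 25),
    (3, "2023-06-03", 20),
    (4, "2023-06-04", 30),
]


def warmer_than_previous(spark, data):
    """Add the previous day's temperature and a boolean comparison."""
    from pyspark.sql import Window, functions as F  # local import: needs PySpark

    df = spark.createDataFrame(data, ["id", "date", "temperature"])
    # No partitionBy: Spark moves all rows to one partition, fine for small data.
    by_date = Window.orderBy("date")
    return df.withColumn("prev_temp", F.lag("temperature").over(by_date)).withColumn(
        "is_warmer", F.col("temperature") > F.col("prev_temp")
    )


def expected_flags(data):
    """Plain-Python reference: None for the first day, then cur > prev."""
    temps = [temp for _, _, temp in sorted(data, key=lambda row: row[1])]
    return [None] + [cur > prev for prev, cur in zip(temps, temps[1:])]


expected = expected_flags(DATA)
```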

8. Unnest the JSON and create a DataFrame

Read the JSON file below.

{
  "sensorName": "snx001",
  "sensorDate": "2020-01-01",
  "sensorReadings": [
    {
      "sensorChannel": 1,
      "sensorReading": 3.7465084060850105,
      "datetime": "2020-01-01 00:00:00"
    },
    {
      "sensorChannel": 2,
      "sensorReading": 3.8465084060850105,
      "datetime": "2021-01-01 00:00:00"
    }
  ]
}

Output DF:

DF
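
Because the JSON document spans several lines, the reader needs the `multiLine` option; after that, exploding `sensorReadings` gives one row per reading. The sketch below assumes a flat output with one column per field (the expected layout is not shown in the text) and uses hypothetical function names. The PySpark import is local so the plain-Python check runs without Spark.

```python
import json

# The JSON document from the question, inlined for the plain-Python check.
DOC = """
{
  "sensorName": "snx001",
  "sensorDate": "2020-01-01",
  "sensorReadings": [
    {"sensorChannel": 1, "sensorReading": 3.7465084060850105,
     "datetime": "2020-01-01 00:00:00"},
    {"sensorChannel": 2, "sensorReading": 3.8465084060850105,
     "datetime": "2021-01-01 00:00:00"}
  ]
}
"""


def flatten_sensor(spark, path):
    """Read a multi-line JSON file and emit one row per sensor reading."""
    from pyspark.sql import functions as F  # local import: needs PySpark

    raw = spark.read.option("multiLine", True).json(path)
    exploded = raw.select(
        "sensorName", "sensorDate", F.explode("sensorReadings").alias("r")
    )
    return exploded.select(
        "sensorName", "sensorDate", "r.sensorChannel", "r.sensorReading", "r.datetime"
    )


def expected_rows(text):
    """Plain-Python reference for the flattened rows."""
    doc = json.loads(text)
    return [
        (doc["sensorName"], doc["sensorDate"],
         r["sensorChannel"], r["sensorReading"], r["datetime"])
        for r in doc["sensorReadings"]
    ]


expected = expected_rows(DOC)
```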
