Flipkart Data Engineer Interview

Gunjan Sahu · Published in Nerd For Tech · 6 min read · Jul 12, 2023

Organisation — Flipkart (https://www.flipkart.com/)
Role — Data Engineer
Experience — 2+ years

Q1. Find the year-on-year (YoY) growth percentage for each customer.

CREATE TABLE growth (
customer VARCHAR(50),
year INT,
sales INT
);
INSERT INTO growth VALUES ('cust1', 2014, 2000);
INSERT INTO growth VALUES ('cust1', 2015, 5000);
INSERT INTO growth VALUES ('cust2', 2013, 1000);
INSERT INTO growth VALUES ('cust2', 2014, 1500);
INSERT INTO growth VALUES ('cust2', 2015, 800);
select * from growth;
+----------+------+-------+
| customer | year | sales |
+----------+------+-------+
| cust1 | 2014 | 2000 |
| cust1 | 2015 | 5000 |
| cust2 | 2013 | 1000 |
| cust2 | 2014 | 1500 |
| cust2 | 2015 | 800 |
+----------+------+-------+

Solution:

SELECT
    g1.customer,
    g1.year,
    ROUND(100.0 * (g1.sales - g2.sales) / g2.sales, 2) AS YoY
FROM growth g1
JOIN growth g2
    ON g1.customer = g2.customer
    AND g1.year = g2.year + 1;
+----------+------+--------+
| customer | year | YoY    |
+----------+------+--------+
| cust1    | 2015 | 150.00 |
| cust2    | 2014 |  50.00 |
| cust2    | 2015 | -46.67 |
+----------+------+--------+

Alternatively, the LAG window function avoids the self-join and also keeps each customer's first year (with a NULL growth rate):

WITH SalesData AS (
    SELECT
        customer,
        year,
        sales,
        LAG(sales) OVER (PARTITION BY customer ORDER BY year) AS previous_year_sales
    FROM growth
)
SELECT
    customer,
    year AS current_year,
    sales AS current_year_sales,
    previous_year_sales,
    ((sales - previous_year_sales) / previous_year_sales) * 100 AS yoy_growth_rate
FROM SalesData
ORDER BY customer, year;
+----------+--------------+--------------------+---------------------+-----------------+
| customer | current_year | current_year_sales | previous_year_sales | yoy_growth_rate |
+----------+--------------+--------------------+---------------------+-----------------+
| cust1    |         2014 |               2000 |                NULL |            NULL |
| cust1    |         2015 |               5000 |                2000 |        150.0000 |
| cust2    |         2013 |               1000 |                NULL |            NULL |
| cust2    |         2014 |               1500 |                1000 |         50.0000 |
| cust2    |         2015 |                800 |                1500 |        -46.6667 |
+----------+--------------+--------------------+---------------------+-----------------+

Q2. Questions based on the employee and department tables below.

CREATE TABLE employee (
eid VARCHAR(10),
ename VARCHAR(50),
deptid VARCHAR(10),
salary INT
);
INSERT INTO employee VALUES ('E1', 'Rahul', 'D1', 20000);
INSERT INTO employee VALUES ('E2', 'Anjali', 'D2', 15000);
INSERT INTO employee VALUES ('E3', 'Rohan', 'D3', 25000);
INSERT INTO employee VALUES ('E4', 'Ramesh', 'D4', 5000);
INSERT INTO employee VALUES ('E5', 'Pooja', 'D1', 10000);
INSERT INTO employee VALUES ('E6', 'Divya', 'D2', 15000);
INSERT INTO employee VALUES ('E7', 'Amir', 'D3', 10000);
select * from employee;

CREATE TABLE department (
deptid VARCHAR(10),
deptname VARCHAR(50)
);
INSERT INTO department VALUES ('D1', 'Engineering');
INSERT INTO department VALUES ('D2', 'IT and Networking');
INSERT INTO department VALUES ('D3', 'Customer Support');
INSERT INTO department VALUES ('D4', 'Front Desk');
INSERT INTO department VALUES ('D5', 'Sales');
select * from department;

Table employee:
+------+--------+--------+--------+
| eid  | ename  | deptid | salary |
+------+--------+--------+--------+
| E1   | Rahul  | D1     |  20000 |
| E2   | Anjali | D2     |  15000 |
| E3   | Rohan  | D3     |  25000 |
| E4   | Ramesh | D4     |   5000 |
| E5   | Pooja  | D1     |  10000 |
| E6   | Divya  | D2     |  15000 |
| E7   | Amir   | D3     |  10000 |
+------+--------+--------+--------+

Table department:
+--------+-------------------+
| deptid | deptname          |
+--------+-------------------+
| D1     | Engineering       |
| D2     | IT and Networking |
| D3     | Customer Support  |
| D4     | Front Desk        |
| D5     | Sales             |
+--------+-------------------+

Questions:

1. Department with Highest Salary Expenditure:

SELECT
    d.deptid,
    d.deptname AS Department,
    SUM(e.salary) AS TotalSalary
FROM department d
JOIN employee e ON d.deptid = e.deptid
GROUP BY d.deptid, d.deptname
ORDER BY TotalSalary DESC
LIMIT 1;
+--------+------------------+-------------+
| deptid | Department       | TotalSalary |
+--------+------------------+-------------+
| D3     | Customer Support |       35000 |
+--------+------------------+-------------+

2. Employees with the Same Salary:

A simple self-join is enough here; the `e1.eid < e2.eid` condition pairs each duplicate salary once and avoids matching a row with itself:

SELECT
    e1.ename AS employee1,
    e2.ename AS employee2,
    e1.salary
FROM employee e1
JOIN employee e2
    ON e1.salary = e2.salary
    AND e1.eid < e2.eid;
+-----------+-----------+--------+
| employee1 | employee2 | salary |
+-----------+-----------+--------+
| Anjali    | Divya     |  15000 |
| Pooja     | Amir      |  10000 |
+-----------+-----------+--------+

3. Employee with the 4th Highest Salary:

WITH cte AS (
SELECT ename, DENSE_RANK() OVER (ORDER BY salary DESC) AS emp_rank
FROM employee
)
SELECT ename
FROM cte
WHERE emp_rank = 4;
+-------+
| ename |
+-------+
| Pooja |
| Amir  |
+-------+

4. Departments with More Than One Employee:

SELECT deptid AS Departments
FROM employee
GROUP BY deptid
HAVING COUNT(*) > 1;
+-------------+
| Departments |
+-------------+
| D1          |
| D2          |
| D3          |
+-------------+

5. Departments with No Employees:

SELECT deptname
FROM department
WHERE deptid NOT IN (SELECT DISTINCT deptid FROM employee);
+----------+
| deptname |
+----------+
| Sales    |
+----------+

6. Departments and Their Employees:

SELECT deptid, GROUP_CONCAT(ename) AS dept_employees
FROM employee
GROUP BY deptid;
+--------+----------------+
| deptid | dept_employees |
+--------+----------------+
| D1     | Rahul,Pooja    |
| D2     | Anjali,Divya   |
| D3     | Rohan,Amir     |
| D4     | Ramesh         |
+--------+----------------+

Q.) CSV vs ORC vs Parquet vs AVRO (Explain each file format in detail and which is suitable for what type of situation)
Q.) Hive Partitioning & Bucketing (Why/How/When)
Q.) Repartition & coalesce (Why/How/When)
Q.) Data Modeling of a Library Database (State dimensions and facts)

Q1. CSV vs ORC vs Parquet vs AVRO:

CSV (Comma-Separated Values):

  • CSV is a plain text format where each value is separated by a comma.
  • It is human-readable and widely supported, making it simple for data exchange.
  • However, CSV doesn’t provide built-in compression or type enforcement, which can lead to larger file sizes and potential data type issues.

ORC (Optimized Row Columnar):

  • ORC is a columnar storage format optimized for Hive.
  • It compresses data efficiently and supports predicate pushdown, column pruning, and lightweight indexing, improving query performance.
  • ORC is suitable for analytics workloads where fast query performance is crucial, especially for large-scale data.

Parquet:

  • Parquet is another columnar storage format designed for big data processing frameworks like Apache Hadoop and Apache Spark.
  • It supports nested data structures, predicate pushdown, and compression algorithms.
  • Parquet is great for analytics scenarios and complex data types.

Avro:

  • Avro is a row-based, schema-aware binary format.
  • It stores the schema with the data, making it self-descriptive and allowing for schema evolution.
  • Avro is suitable for use cases where data schema changes frequently and compatibility between data versions is essential.
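The row-versus-columnar distinction above is the core difference between these formats. A toy sketch in plain Python (the layout idea only, not a real file format) shows why columnar storage suits analytic queries:

```python
# Toy illustration: the same records laid out row-wise (CSV/Avro style)
# vs column-wise (Parquet/ORC style).
records = [
    {"customer": "cust1", "year": 2014, "sales": 2000},
    {"customer": "cust1", "year": 2015, "sales": 5000},
    {"customer": "cust2", "year": 2013, "sales": 1000},
]

# Row-oriented: whole records stored together; reading one field
# still touches every record.
row_layout = [(r["customer"], r["year"], r["sales"]) for r in records]

# Column-oriented: each column stored contiguously; an analytic query
# like SUM(sales) reads only the "sales" column and can skip the rest.
col_layout = {
    "customer": [r["customer"] for r in records],
    "year":     [r["year"] for r in records],
    "sales":    [r["sales"] for r in records],
}

total_sales = sum(col_layout["sales"])  # touches one column only
print(total_sales)  # → 8000
```

Contiguous columns are also why Parquet and ORC compress so well: similar values sit next to each other, which run-length and dictionary encodings exploit.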

Q2. Hive Partitioning & Bucketing:

Partitioning:

  • Hive partitioning involves dividing data into multiple partitions based on one or more columns.
  • It helps improve query performance by allowing the query engine to skip unnecessary partitions during data scanning.
  • Partitioning is useful when queries often filter on the partitioned columns.
  • It’s typically applied to columns with a manageable number of distinct values, like dates or regions (very high-cardinality keys create too many small partitions), and creates a directory structure that reflects the partitioning scheme.

Bucketing:

  • Bucketing (also known as clustering) involves dividing data into smaller, equally sized buckets.
  • It helps optimize join operations by ensuring that rows with the same bucketing key values are co-located, reducing data shuffling.
  • Bucketing is suitable when optimizing join-heavy workloads, but it may not be as effective for other types of queries.
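Hive's on-disk layout can be pictured with plain directories. The sketch below simulates it in Python: each partition key value maps to its own `deptid=<value>` directory (mirroring Hive's layout), and bucketing is a deterministic hash into a fixed number of files. This is an illustration of the idea, not the Hive implementation:

```python
import os
import tempfile
import zlib

rows = [
    ("E1", "Rahul", "D1"), ("E2", "Anjali", "D2"),
    ("E5", "Pooja", "D1"), ("E6", "Divya", "D2"),
]

# Partitioning: one directory per distinct key value.
base = tempfile.mkdtemp()
for eid, name, dept in rows:
    part_dir = os.path.join(base, f"deptid={dept}")  # partition directory
    os.makedirs(part_dir, exist_ok=True)
    with open(os.path.join(part_dir, "data.txt"), "a") as f:
        f.write(f"{eid},{name}\n")

# "Partition pruning": a query filtering deptid = 'D1' opens only
# that one directory and never scans the others.
with open(os.path.join(base, "deptid=D1", "data.txt")) as f:
    d1_rows = f.read().splitlines()
print(d1_rows)  # → ['E1,Rahul', 'E5,Pooja']

# Bucketing: hash the key into a fixed number of buckets, so equal
# keys always land in the same bucket (which is what makes
# bucket-to-bucket joins possible without a shuffle).
num_buckets = 2
buckets = {i: [] for i in range(num_buckets)}
for eid, name, dept in rows:
    buckets[zlib.crc32(eid.encode()) % num_buckets].append(eid)
```

The key contrast: partitions are driven by data values (one directory per value), while buckets are a fixed count chosen up front, regardless of how many distinct keys exist.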

Q3. Repartition & Coalesce:

Repartition:

  • Repartitioning performs a full shuffle of the data to create a new partitioning scheme.
  • It’s used to increase or decrease the number of partitions, or to redistribute data evenly across them.
  • Repartitioning is often used to fix skew and improve parallelism, at the cost of the shuffle.

Coalesce:

  • Coalescing reduces the number of partitions without a full shuffle: existing partitions are merged, which makes it much cheaper than repartitioning.
  • It’s commonly used to avoid writing many small output files after a job has filtered the data down.
  • Because it only merges what is already there, coalesce can leave partitions unevenly sized; prefer repartition when an even distribution matters.
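A conceptual model of the two operations, using plain Python lists as stand-ins for partitions (this mirrors the shuffle behaviour, not the Spark API):

```python
# Five uneven input "partitions".
partitions = [[1, 2], [3], [4, 5], [6], [7, 8]]

def repartition(parts, n):
    """Full shuffle: every element may move; data ends up spread
    evenly across n new partitions (round-robin here for simplicity)."""
    flat = [x for p in parts for x in p]
    return [flat[i::n] for i in range(n)]

def coalesce(parts, n):
    """No full shuffle: whole existing partitions are merged onto n
    of them; elements never get split up or rebalanced individually."""
    merged = [[] for _ in range(n)]
    for i, p in enumerate(parts):
        merged[i % n].extend(p)
    return merged

print(repartition(partitions, 2))  # → [[1, 3, 5, 7], [2, 4, 6, 8]]
print(coalesce(partitions, 2))     # → [[1, 2, 4, 5, 7, 8], [3, 6]]
```

Note how `coalesce` leaves uneven partitions ([1, 2, 4, 5, 7, 8] vs [3, 6]) because it only glues existing ones together; that trade-off is exactly why it is cheap.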

Q4. Data Modeling of a Library Database:

In a library database, you could have the following dimensions and facts:

Dimensions:

Books Dimension:

  • BookID (Primary Key)
  • Title
  • Author
  • Genre
  • Publication Year
  • ISBN

Customers Dimension:

  • CustomerID (Primary Key)
  • Name
  • Address
  • Phone
  • Email

Library Branch Dimension:

  • BranchID (Primary Key)
  • Name
  • Location
  • Manager

Facts:

Book Loans Fact:

  • LoanID (Primary Key)
  • BookID (Foreign Key)
  • CustomerID (Foreign Key)
  • BranchID (Foreign Key)
  • Loan Date
  • Due Date
  • Return Date

Book Reservations Fact:

  • ReservationID (Primary Key)
  • BookID (Foreign Key)
  • CustomerID (Foreign Key)
  • BranchID (Foreign Key)
  • Reservation Date

This data model allows you to track book loans, reservations, customers, and library branches. It’s suitable for scenarios where you want to analyze book borrowing patterns, customer preferences, and branch activities.
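As a sketch, the star schema above can be stood up in SQLite. Table and column names follow the dimension and fact lists; the column types and sample values are assumptions for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_book (
    BookID INTEGER PRIMARY KEY, Title TEXT, Author TEXT,
    Genre TEXT, PublicationYear INTEGER, ISBN TEXT);
CREATE TABLE dim_customer (
    CustomerID INTEGER PRIMARY KEY, Name TEXT, Address TEXT,
    Phone TEXT, Email TEXT);
CREATE TABLE dim_branch (
    BranchID INTEGER PRIMARY KEY, Name TEXT, Location TEXT, Manager TEXT);
CREATE TABLE fact_book_loan (
    LoanID INTEGER PRIMARY KEY,
    BookID INTEGER REFERENCES dim_book(BookID),
    CustomerID INTEGER REFERENCES dim_customer(CustomerID),
    BranchID INTEGER REFERENCES dim_branch(BranchID),
    LoanDate TEXT, DueDate TEXT, ReturnDate TEXT);
""")

# Made-up sample rows, just to exercise the schema.
con.execute("INSERT INTO dim_book VALUES (1, 'SQL Basics', 'A. Author', 'Tech', 2020, '978-0')")
con.execute("INSERT INTO dim_customer VALUES (1, 'Rahul', 'Some Street', '123', 'r@example.com')")
con.execute("INSERT INTO dim_branch VALUES (1, 'Central', 'City', 'Mgr')")
con.execute("INSERT INTO fact_book_loan VALUES (1, 1, 1, 1, '2023-07-01', '2023-07-15', NULL)")

# A typical analytic question this model answers: loans per branch.
row = con.execute("""
    SELECT b.Name, COUNT(*)
    FROM fact_book_loan f
    JOIN dim_branch b ON f.BranchID = b.BranchID
    GROUP BY b.Name""").fetchone()
print(row)  # → ('Central', 1)
```

The fact table holds only keys and loan dates; every descriptive attribute lives in a dimension, which is what keeps the facts narrow and the joins cheap.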

Keep in mind that the appropriate modeling approach can vary based on specific requirements, querying patterns, and performance considerations.

Hope you find it helpful! Please like and share.

Happy Learning! ❤️
