Thinking in Big Data Part 2 — Transpose Matrix for large datasets

Chandu Kavar
2 min read · Feb 12, 2018


I would suggest you read Part 1 first. In Part 1, I explained how to find the top N most-viewed products in a large e-commerce dataset.

The matrix transpose problem is very simple. Let’s say you have a matrix M with the following values:

R/C  0   1   2
0    1   2   3
1    4   5   6
2    7   8   9

You can find the transpose by turning every row of the given matrix into a column. Here is the transpose of M:

R/C  0   1   2
0    1   4   7
1    2   5   8
2    3   6   9

The format of the data is:

  • row number: an integer
  • list of values: integers separated by spaces

Based on the data size, you would take a different approach to find the transpose matrix.

For smaller datasets, we can easily implement this using any programming language. Here is the solution written in Scala:

// Transpose of matrix
val input = List(
  (0, "1 2 3"),
  (1, "4 5 6"),
  (2, "7 8 9")
)

def formatRow(row: String): List[Int] = row.split(" ").map(_.toInt).toList

def printRow(values: List[Int]): Unit = {
  values.foreach(value => print(value + " "))
  println()
}

input
  .map(row => formatRow(row._2))   // keep only the parsed values of each row
  .transpose                       // List.transpose turns rows into columns
  .foreach(printRow)
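For the sample input above, this prints the three transposed rows: 1 4 7, 2 5 8 and 3 6 9.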

Let’s try to understand how we can solve this problem using MapReduce. (Please try it yourself before moving on to the solution.)

Input:

0, 1 2 3
1, 4 5 6
2, 7 8 9

Assume that the data size is in terabytes and the data is partitioned into multiple splits.

Input/Output of Mapper and Reducer

Mapper Input:
key = row, value = list of values
Mapper Output:
key = column, value = (row, value)
Reducer Input:
key = column, value = list of (row, value) pairs
Reducer Output:
key = column (which becomes the row of the transposed matrix),
value = [value1, value2, ..., valueN] ordered by the original row

Here is the data flow diagram:

DFD for matrix transpose solution in MapReduce
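
To make the shuffle concrete, here is a hand-traced sketch of the intermediate records for the 3x3 example, assuming the mapper keys every value by its column index as described above:

Mapper output for (0, "1 2 3"):  (0, (0, 1)), (1, (0, 2)), (2, (0, 3))
Mapper output for (1, "4 5 6"):  (0, (1, 4)), (1, (1, 5)), (2, (1, 6))
Mapper output for (2, "7 8 9"):  (0, (2, 7)), (1, (2, 8)), (2, (2, 9))

Reducer input for key 0:  [(0, 1), (1, 4), (2, 7)]  ->  output: 0, 1 4 7
Reducer input for key 1:  [(0, 2), (1, 5), (2, 8)]  ->  output: 1, 2 5 8
Reducer input for key 2:  [(0, 3), (1, 6), (2, 9)]  ->  output: 2, 3 6 9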

Nowadays, developers prefer to write in Spark instead of MapReduce because of the latter’s higher overhead in container launches and file I/O. But I think an understanding of the MapReduce paradigm is still needed to understand how Big Data processing frameworks process data under the hood.

Here is the solution written in Spark:

val inputRDD = sc.parallelize(List(
  (0, "1 2 3"),
  (1, "4 5 6"),
  (2, "7 8 9")
))

def formatRow(row: String) = row.split(" ").map(_.toInt)

val transposedRDD = inputRDD
  // parse the space-separated values of each row
  .map { case (row, values) => (row, formatRow(values)) }
  // turn each row into (row, value) pairs
  .map { case (row, formattedValues) =>
    formattedValues.map(value => (row, value))
  }
  // key each pair by its column index: (column, (row, value))
  .flatMap { rowValues => rowValues.indices zip rowValues }
  // group all cells of the same column; each group is one row of the transpose
  .groupByKey()
  // order each group by the original row index and keep only the values
  .mapValues(_.toList.sortBy(_._1).map(_._2))
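
A quick way to check the result on the driver for this toy input (you would not collect a TB-scale RDD like this) is a minimal sketch such as:

transposedRDD
  .sortByKey()   // order the transposed rows by their index
  .collect()
  .foreach { case (row, values) => println(row + ", " + values.mkString(" ")) }

This prints the same rows as the transposed matrix above: "0, 1 4 7", "1, 2 5 8" and "2, 3 6 9".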

Stay tuned for Part 3 wherein I will explain how to calculate the cumulative sum of one billion numbers.

Thanks for reading. If you found this blog helpful, please recommend it and share it. Follow me for more articles on Big Data.

Thanks to Shakti Garg and Noah Pereira who reviewed this blog. I truly appreciate it.
