Member-only story
Why We Use Sparse Matrices for Recommender Systems
Introduction to SciPy’s Sparse Module
In recommender systems, we typically work with very sparse matrices as the item universe is very large while a single user typically interacts with a very small subset of the item universe. Take YouTube for example — a user typically watches hundreds if not thousands of videos, compared to the millions of videos YouTube has in its corpus, resulting in sparsity of >99%.
This means that when we represent the users (as rows) and items (as columns) in a matrix, the result is an extremely sparse matrix consisting of many zero values (see below).
In a real-life scenario, how do we best represent such a sparse user-item interaction matrix?
Why can’t we just use Numpy Arrays or Pandas DataFrames?
To understand this, we have to understand the two major constraints on computing — time and memory. The former is simply what we know as “how much time a program takes to run” whereas the latter is “how much RAM is being used by the program”. The former is quite straightforward but as for the latter, making sure our program doesn’t consume all our memory is important especially when we deal with large datasets, otherwise we would encounter the famous “out-of-memory” error.
Yes, every program and application on our PC uses some memory (see below image). When we run matrix computations and we want to store those sparse matrices as a Numpy array or Pandas DataFrame, they consume memory as well.
To formalize these two constraints, they are known as time and space complexity (memory).
Space Complexity
When dealing with sparse matrices, storing them as a full matrix (from this point…