Encoding methods

Introducing Numerical Matrix Encoding: problems and solutions

Introducing several new algorithms for the encoding of a Numerical Matrix

Michelangiolo Mazzeschi
Plain Simple Software

--

***In this article I am introducing several new algorithms. Because I have not been able to find any similar solutions to the problem I am facing anywhere on the web, I tried solving it myself. I am open to criticism and happy to remove this article if you can prove the solution already exists.
The source code is available on my GitHub repo.

Encoding methods are algorithms typically used to convert categorical data into a numerical (vector-based) format.
The two most common applications for encoded data are:

  • Training of neural networks
  • Search applications powered by the k-nearest neighbors (k-NN) algorithm

The definition of encoders, however, is not strictly limited to categorical data. Numerical data can be encoded as well when the problem requires it.

Example of Numerical Matrix Encoding

In this article, I will focus specifically on how to encode matrices composed only of numerical data: hence the name numerical matrix encoding. In addition, I will introduce algorithms crafted specifically for numerical matrix encoding:

  • M-sorted Encoding (Merged Sorted encoding)
  • T-sorted Encoding (Transposed Sorted encoding)
  • pre-count encoding
  • Spre-count encoding (summative pre-count encoding)

Vector Encoding vs. Matrix Encoding

Most known encoding techniques focus on encoding a vector in an n-dimensional space. Below, you can see the difference between the two:

Vector encoding

A vector is a single array (a list of elements). For example, we can encode the following categorical vector using label encoding (each unique label is mapped to an integer), resulting in the following output:

[ 'apple', 'banana', 'apple', 'apple', 'cake' ] -> [ 0, 1, 0, 0, 2 ]

If the original vector were made of numbers rather than categorical labels (apple, banana, cake), it would be easy to encode: we could either leave it as it is or sort the vector to make the order of the numbers irrelevant. This is the main reason why there is no need to expand the theory of numerical vector encoding (it is fairly simple). Most problems faced by data scientists have to do with categorical vector encoding (e.g., NLP applications). Both situations are sketched below.
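A minimal sketch of both situations in plain Python (the function name is mine, not an established API):

# label encoding: map each unique label to an integer, in order of appearance
def label_encode(vector):
    mapping = {}
    encoded = []
    for item in vector:
        if item not in mapping:
            mapping[item] = len(mapping)
        encoded.append(mapping[item])
    return encoded

print(label_encode([ 'apple', 'banana', 'apple', 'apple', 'cake' ]))
# -> [0, 1, 0, 0, 2]

# a numerical vector needs no mapping: leave it as-is (order relevant)
# or sort it (order irrelevant)
print(sorted([ 5, 7, 2, 8, 3, 5 ]))
# -> [2, 3, 5, 5, 7, 8]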

Matrix encoding

A matrix is a collection of vectors. When we have to encode a matrix, the problem becomes much more complex:

[ [ 5, 7 ], [ 2, 8 ], [ 3, 5 ] ]

When dealing with matrices, there are more constraints we need to take into account, such as:

  • Is the number of vectors fixed or dynamic?
  • Is the vector order relevant or irrelevant?
  • Is the order of the elements of each vector relevant or irrelevant?

***I have purposely omitted some unusual, yet possible, constraints to keep the article digestible. Other constraints we could think of are:

  • Is the length of each vector fixed or dynamic?
  • Do our elements have a fixed maximum?

We could find a real-world problem for each permutation of the constraints listed above. Our objective is to convert our matrix into a vector so that it can either be processed by a neural network or ingested into a vector database.

Practical example

If you are wondering about the purpose of this research, let me give you a practical example of where the following algorithm is needed:

The login data of every customer of a shopping website is recorded.
The following values are stored daily for every user:

  • Money spent on the day
  • Number of total clicks
  • Number of clicked products
# example of 2 customers
c1 = [ [ 25, 7, 8 ], [ 31, 8, 3 ], [ 84, 5, 12 ] ]
c2 = [ [ 0, 5, 14 ], [ 25, 6, 12 ], [ 36, 2, 6 ], [ 15, 8, 12 ] ]

We want to group users with the most similar track records. Given the way we have articulated the problem, these are the constraints we need to follow:

  • the number of vectors is dynamic
  • the vector order is irrelevant
  • the element order is relevant

Once we have understood how to convert the problem into a mathematical form, we can use one of the proposed encoding methods (next section) to solve it.

A study on Numerical Matrix Encoding

In this section, I am going to list various encoding solutions depending on the constraints of our problem:

Matrix Structure

Before listing the most common problem configurations of numerical matrix encoding, we have to define some basic terminology. In every problem, we have a set containing n samples; each sample is a numerical matrix, i.e. a collection of vectors, each one holding numerical elements:

# components of the Numerical Matrix
sample = [ vector_1, ..., vector_m ]
vector_i = [ element_1, ..., element_k ]

Fixed number of vectors

If the number of vectors is fixed, each of our samples has an equal number of vectors. That makes the encoding much easier, because we know from the start how many dimensions our encoded samples will occupy.

1. vector order is relevant, element order is relevant
This is the easiest possible configuration of our problem. We do not need to apply any transformation: we leave the order of the elements untouched and merge all values into a single vector. Because the order matters for both vectors and elements, we have to make sure every value keeps its original position, so that it always maps to the same feature (dimension) in our final vector. A sketch follows below.

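A minimal sketch in plain Python (the function name is mine):

# flatten the matrix row by row, preserving both vector and element order
def encode_order_relevant(matrix):
    return [ element for vector in matrix for element in vector ]

print(encode_order_relevant([ [ 8, 3 ], [ 5, 2 ], [ 2, 7 ] ]))
# -> [8, 3, 5, 2, 2, 7]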

2. vector order is relevant, element order is irrelevant
We are interested in maintaining the vector order, which means we cannot move values across vectors. However, because we want to ignore the order of the elements, we can sort them within each vector: by doing this, two samples containing the same elements in a different order produce the same encoding, and we can compare different samples using Euclidean distance. A sketch follows below.

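A minimal sketch (again, the function name is mine):

# keep the vector order, but sort inside each vector so that
# the element order no longer affects the comparison
def encode_sorted_elements(matrix):
    return [ element for vector in matrix for element in sorted(vector) ]

print(encode_sorted_elements([ [ 8, 3 ], [ 5, 2 ], [ 2, 7 ] ]))
# -> [3, 8, 2, 5, 2, 7]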

3. vector order is irrelevant, element order is relevant
I find this to be the most complex configuration of our problem.
I propose T-sorted Encoding (Transposed Sorted encoding) as a solution: by transposing the matrix, we maintain the order of the elements by assigning each element position to its own group of dimensions (e.g., all the elements in the first position, 8, 5, 2, are grouped in the first dimensions; the same applies to the elements in the second position, 3, 2, 7).
However, we need to ignore the vector order, which means we are free to sort the elements within each group. A sketch follows below.

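A minimal sketch of T-sorted Encoding (my own implementation of the steps above):

# transpose the matrix: elements are grouped by their position in each vector;
# then sort each group, since the vector order is irrelevant
def t_sorted_encode(matrix):
    transposed = zip(*matrix)   # (8, 5, 2), (3, 2, 7)
    return [ element for group in transposed for element in sorted(group) ]

print(t_sorted_encode([ [ 8, 3 ], [ 5, 2 ], [ 2, 7 ] ]))
# -> [2, 5, 8, 2, 3, 7]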

4. vector order is irrelevant, element order is irrelevant
I propose M-sorted Encoding (Merged Sorted encoding) as a solution: because we do not care about maintaining the positions of our raw data at all, we can first merge all the values into a single vector and then sort it entirely. A sketch follows below.

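A minimal sketch of M-sorted Encoding (my own implementation of the steps above):

# merge every value into a single vector, then sort the whole thing
def m_sorted_encode(matrix):
    return sorted(element for vector in matrix for element in vector)

print(m_sorted_encode([ [ 8, 3 ], [ 5, 2 ], [ 2, 7 ] ]))
# -> [2, 2, 3, 5, 7, 8]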

Dynamic number of vectors

Another constraint that I want to take into account is the possibility of having a non-fixed number of vectors to compare. This drastically changes the way we need to encode values. For example, consider two samples with a different number of vectors:

s1 = [ [ 5, 7 ], [ 2, 8 ], [ 3, 5 ] ]
s2 = [ [ 3, 5 ], [ 6, 6 ], [ 5, 2 ], [ 9, 8 ] ]

Any of the previous encoding methods would fail, because we would end up with two vectors of different lengths (and this is not acceptable for either neural networks or search engines).

To tackle this specific issue, we need an encoding similar to one-hot encoding, but one that works with numerical values. The solution I am introducing is called pre-count encoding. To make each number comparable with the others, I activate (set to 1) not only the index of the number itself but also all of the previous positions (we assume the maximum number to be 10, so that we can standardize the vector length).

# vector obtained after performing one of the aforelisted transformations
v1 = [ 5, 7, 2, 8, 3, 5 ]

5 = [ 1, 1, 1, 1, 1, 0, 0, 0, 0, 0 ]
7 = [ 1, 1, 1, 1, 1, 1, 1, 0, 0, 0 ]
2 = [ 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 ]
8 = [ 1, 1, 1, 1, 1, 1, 1, 1, 0, 0 ]
3 = [ 1, 1, 1, 0, 0, 0, 0, 0, 0, 0 ]
5 = [ 1, 1, 1, 1, 1, 0, 0, 0, 0, 0 ]
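The encodings above can be reproduced with a minimal sketch (the function name is mine; the maximum of 10 is the assumption stated earlier):

# activate the number's index and every position before it
def pre_count(number, maximum=10):
    return [ 1 if i < number else 0 for i in range(maximum) ]

for number in [ 5, 7, 2, 8, 3, 5 ]:
    print(number, '=', pre_count(number))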

We have to think in terms of separate dimensions. If we had only activated the indexes (index-based encoding), there would be no difference in distance between any of the pairs of (a, b, c). For example, the distance between 5 and 7 would be the same as the distance between 2 and 7, because we would be treating each number as a separate class (in its own dimension).

# index-based encoding
a = 5 = [ 0, 0, 0, 0, 1, 0, 0, 0, 0, 0 ]
b = 7 = [ 0, 0, 0, 0, 0, 0, 1, 0, 0, 0 ]
c = 2 = [ 0, 1, 0, 0, 0, 0, 0, 0, 0, 0 ]

euclidean_distance(a, b) = 1.41
euclidean_distance(a, c) = 1.41
euclidean_distance(b, c) = 1.41

However, by using pre-count encoding, the distance between 5 and 7 is smaller than the distance between 2 and 7, as it should be.

# pre-count encoding
a = 5 = [ 1, 1, 1, 1, 1, 0, 0, 0, 0, 0 ]
b = 7 = [ 1, 1, 1, 1, 1, 1, 1, 0, 0, 0 ]
c = 2 = [ 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 ]

euclidean_distance(a, b) = 1.41
euclidean_distance(a, c) = 1.73
euclidean_distance(b, c) = 2.24
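These distances are easy to verify with a few lines of NumPy:

import numpy as np

a = np.array([ 1, 1, 1, 1, 1, 0, 0, 0, 0, 0 ])  # pre-count of 5
b = np.array([ 1, 1, 1, 1, 1, 1, 1, 0, 0, 0 ])  # pre-count of 7
c = np.array([ 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 ])  # pre-count of 2

print(round(float(np.linalg.norm(a - b)), 2))   # 1.41 = sqrt(2)
print(round(float(np.linalg.norm(a - c)), 2))   # 1.73 = sqrt(3)
print(round(float(np.linalg.norm(b - c)), 2))   # 2.24 = sqrt(5)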

Consider that our final vector has multiple values, so we now need to sum their encodings, dimension by dimension. Hence the name of the final variation of the encoding algorithm: Spre-count encoding. To finalize the algorithm, because we want samples with a different number of elements to remain comparable on equal terms, we need to normalize the output (here, dividing by the number of elements, 6):

v1 = [ 5, 7, 2, 8, 3, 5 ]

# pre-count encoding
5 = [ 1, 1, 1, 1, 1, 0, 0, 0, 0, 0 ]
7 = [ 1, 1, 1, 1, 1, 1, 1, 0, 0, 0 ]
2 = [ 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 ]
8 = [ 1, 1, 1, 1, 1, 1, 1, 1, 0, 0 ]
3 = [ 1, 1, 1, 0, 0, 0, 0, 0, 0, 0 ]
5 = [ 1, 1, 1, 1, 1, 0, 0, 0, 0, 0 ]

# Spre-count encoding (sum of the activations per dimension)
-> [ 6, 6, 5, 4, 4, 2, 2, 1, 0, 0 ]
# normalized output (each sum divided by 6)
-> [ 1.0, 1.0, .83, .67, .67, .33, .33, .17, 0.0, 0.0 ]
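A minimal sketch of Spre-count encoding (my reading of the steps above: pre-count each value, sum the activations per dimension, then divide by the number of elements):

def spre_count(vector, maximum=10):
    summed = [ 0 ] * maximum
    for number in vector:
        # pre-count: the number activates its index and every position before it
        for i in range(min(number, maximum)):
            summed[i] += 1
    # normalize by the number of elements in the sample
    return [ round(s / len(vector), 2) for s in summed ]

print(spre_count([ 5, 7, 2, 8, 3, 5 ]))
# -> [1.0, 1.0, 0.83, 0.67, 0.67, 0.33, 0.33, 0.17, 0.0, 0.0]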

By applying Spre-count encoding to n numerical matrices, we can make a fair comparison between them using the k-nearest neighbors algorithm (k-NN) with Euclidean distance.
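As a closing sketch, here is how the two samples with a different number of vectors could be compared end to end (the function names and the normalization choice are mine):

import numpy as np

def spre_count(vector, maximum=10):
    summed = [ 0 ] * maximum
    for number in vector:
        for i in range(min(number, maximum)):
            summed[i] += 1
    return [ s / len(vector) for s in summed ]

s1 = [ [ 5, 7 ], [ 2, 8 ], [ 3, 5 ] ]
s2 = [ [ 3, 5 ], [ 6, 6 ], [ 5, 2 ], [ 9, 8 ] ]

# merge each sample into a single vector, then encode:
# both samples now live in the same 10-dimensional space
e1 = spre_count([ e for v in s1 for e in v ])
e2 = spre_count([ e for v in s2 for e in v ])

print(round(float(np.linalg.norm(np.array(e1) - np.array(e2))), 2))
# -> 0.27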

--
