Tile-Coding: An Efficient Sparse-Coding Method for Real-Valued Data

Hamid Maei
Criteo Tech Blog
Jul 22, 2019
2-dimensional tiles

Introduction:

Arguably the most important problem in machine learning (ML) is representing the raw input data in a form that serves as a good predictor for the ML model. Such a representation is typically a feature vector that can be used for training the model. With the advent of Deep Neural Networks (DNNs), such as Convolutional Neural Networks (CNNs), generating these features has become increasingly automated in the hidden layers, but the input data still has to be represented in the first layer, called the visible layer. The same story applies to Natural Language Processing (NLP). Unlike the data used in NLP and Computer Vision, in many cases the data is neither structured nor homogeneous; it is multi-modal. For example, in Ad-Tech the data logs store a combination of categorical and real-valued features for prediction: a feature might denote the click-through rate (CTR) since last week, the age of the user, or the user's geo-location. The same is true in a wide range of problems with multi-modal variables, from robotic tasks to weather forecasting.

The question we would like to address here is: how can such a diverse form of real-valued data be represented as input to parametrized ML models such as Logistic Regression (LR) or DNNs?

An immediate data-normalization approach, often used because of its simplicity, is to discretize the real-valued features, also called bucketization: a real-valued input is transformed into a binary feature vector whose elements are all zero except the one corresponding to the bucket into which the value falls. This sparse-coding method not only normalizes the input data but also projects it into a high-dimensional space (the total number of buckets), so that generalized linear models, such as LR, become effective at predicting highly non-linear functions (please see the non-linear function prediction example below, and the short bucketization sketch after the following list). However, bucketizing the input data has several drawbacks:

  1. It can have a large prediction error due to the limited resolution of the bucket size.
  2. The dimensionality of the feature vector, and thus the model size, grows quickly if a finer bucket resolution is chosen to reduce that error.
  3. It lacks generalization, as we explain in Figure 1.
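To make the bucketization step concrete, here is a minimal Python sketch; the interval and the number of buckets are illustrative choices, not values taken from any particular dataset:

```python
import numpy as np

def bucketize(x, a, b, num_buckets):
    """One-hot encode a real value x in [a, b] into `num_buckets` buckets."""
    # Index of the bucket that x falls into, clipped to the valid range.
    idx = int((x - a) / (b - a) * num_buckets)
    idx = min(max(idx, 0), num_buckets - 1)
    one_hot = np.zeros(num_buckets)
    one_hot[idx] = 1.0
    return one_hot

# Example: 4 buckets over [0, 1]; 0.1 and 0.3 land in different buckets.
print(bucketize(0.1, 0.0, 1.0, 4))  # [1. 0. 0. 0.]
print(bucketize(0.3, 0.0, 1.0, 4))  # [0. 1. 0. 0.]
```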

An alternative idea that addresses the above issues is what is called Tile-Coding: a sparse-coding method that generalizes predictions to neighboring points.

Figure 1. A schematic picture to demonstrate the tile-coding idea. The variable x, in the interval [a,b], is discretized into 4 buckets in Tiling 1, with indices ranging from 1 to 4. Tiling 2, and similarly Tiling 3, is obtained by shifting Tiling 1 by a small value, typically the bucket width divided by the number of tilings (3). Here, we observe that the two points p and q do not share any common feature under the naive bucketization approach, but after adding Tiling 3 they share a common feature through the bucket index 9.

Figure 1 shows how tile-coding works in one dimension. Consider a feature variable x that can take values in the interval [a,b]. The discretization step (called Tiling 1 in Figure 1) divides the interval into 4 buckets, so the data point p is represented by the binary vector [1,0,0,0] and, similarly, q is represented by [0,1,0,0]. For clarity, let us write the two vectors in sparse form as p=[1] and q=[2], where each element of the array is the index of the bucket into which the value falls. Using the tile-coding idea, we now shift all the buckets of Tiling 1 by a small amount (typically the bucket size divided by the total number of tilings, which is 3 in Figure 1), generating Tiling 2, and stack it on top of Tiling 1 as depicted in Figure 1. Now we get p=[1,5] and q=[2,6]. So far, p and q have no relationship through their representations. Then we shift Tiling 2 once more to generate Tiling 3 and stack it on Tiling 2. Their representations become p=[1,5,9] and q=[2,6,9]. Immediately, with Tiling 3, p and q become related through the tile index 9. This is the key to the generalization provided by tile-coding.
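The following is a minimal sketch of the 1-D tiling just described. The positions of p and q, and the offset and indexing conventions, are one reasonable choice made for illustration; they reproduce the indices of Figure 1 but are not taken from the tf-tile implementation.

```python
import numpy as np

def tile_code_1d(x, a, b, num_buckets, num_tilings):
    """Return the active tile index (1-based) in each tiling for a value x in [a, b]."""
    bucket_width = (b - a) / num_buckets
    shift = bucket_width / num_tilings       # offset between consecutive tilings
    indices = []
    for t in range(num_tilings):
        # Tiling t is shifted by t * shift relative to Tiling 1.
        idx = int(np.floor((x - a - t * shift) / bucket_width))
        idx = min(max(idx, 0), num_buckets - 1)       # clip to valid buckets
        indices.append(t * num_buckets + idx + 1)     # 1-based global tile index
    return indices

# Hypothetical positions of p and q inside [0, 4], chosen to match Figure 1.
p, q = 0.8, 1.4
print(tile_code_1d(p, 0.0, 4.0, num_buckets=4, num_tilings=3))  # [1, 5, 9]
print(tile_code_1d(q, 0.0, 4.0, num_buckets=4, num_tilings=3))  # [2, 6, 9]
```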

As we can see, the resulting feature vector is now 3 times the size of the original bucketized one, but as we will see later, tile-coding can still provide better performance even if we reduce the bucket resolution by a factor of 3.

While we have only used 1-dimensional input data so far, the tile-coding idea extends to data with multiple feature variables. For example, Figure 2 shows how to do tile-coding in the 2-dimensional case.

Figure 2. Tile-Coding in 2-dimension (Courtesy of Sutton & Barto, 2018)
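As a rough sketch of the 2-D case, each tiling simply becomes a shifted grid. In practice the offsets are often made asymmetric across dimensions (see Sutton & Barto, 2018) to avoid artifacts along the diagonal; a symmetric diagonal shift is used here only to keep the example short, and all sizes are illustrative.

```python
import numpy as np

def tile_code_2d(x, y, low, high, num_buckets, num_tilings):
    """Active tile index per tiling for a 2-D point, one shifted grid per tiling."""
    width = (high - low) / num_buckets       # same bucket width in both dimensions
    shift = width / num_tilings
    active = []
    for t in range(num_tilings):
        ix = int(np.floor((x - low - t * shift) / width))
        iy = int(np.floor((y - low - t * shift) / width))
        ix = min(max(ix, 0), num_buckets - 1)
        iy = min(max(iy, 0), num_buckets - 1)
        # Flatten (tiling, row, col) into one global tile index.
        active.append(t * num_buckets * num_buckets + iy * num_buckets + ix)
    return active

# Two nearby points share tiles in some (but not all) tilings.
print(tile_code_2d(0.82, 0.35, 0.0, 1.0, num_buckets=8, num_tilings=4))
print(tile_code_2d(0.85, 0.33, 0.0, 1.0, num_buckets=8, num_tilings=4))
```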

The idea of tile-coding historically goes back to Albus's Cerebellar Model Arithmetic Computer (CMAC) [1975], a model designed for learning in neural networks (NNs). Figure 3 shows how similar patterns (S1 and S2) share similar representations.

Figure 3. The left panel shows unrelated inputs (S1 and S2), while the right panel shows the inputs related through tile-coding. (Courtesy of Albus's CMAC, 1975.)

As we add more and more feature variables, we face the curse of dimensionality: the exponential growth of the feature-vector space as the number of feature variables increases.

Dealing with the Curse-Of-Dimensionality:

There are two effective methods to deal with this problem:

  1. The hashing trick, to limit the memory allocated to the feature table.
  2. Tiling individual variables and (some of the) pairs of variables (known as feature crossing, which captures second-order non-linearities); a minimal sketch of both ideas follows this list.
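Here is a minimal sketch of the hashing trick and of crossing two tile-coded variables. The table size, the feature names and the tile indices are all illustrative assumptions, not the settings used in our experiments.

```python
import zlib

def hashed_index(name, idx, hash_size=2**20):
    """Hashing trick: map a (feature name, tile index) pair into a fixed-size table."""
    return zlib.crc32(f"{name}:{idx}".encode()) % hash_size

def crossed_index(name_a, idx_a, name_b, idx_b, hash_size=2**20):
    """Second-order feature cross between the active tiles of two variables."""
    return zlib.crc32(f"{name_a}:{idx_a}x{name_b}:{idx_b}".encode()) % hash_size

# Example: cross the active tiles of two tile-coded variables, tiling by tiling.
ctr_tiles = [1, 5, 9]    # hypothetical active tiles for a CTR-like feature
age_tiles = [2, 6, 9]    # hypothetical active tiles for an age-like feature
crossed = [crossed_index("ctr", i, "age", j) for i, j in zip(ctr_tiles, age_tiles)]
print(crossed)
```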

Due to the sparse representations, we have empirically observed that such approximations can lift the curse to a good extent. Figure 4 demonstrates expansion and then projection (embedding) of the input layer, using tile-coding in conjunction with deep neural networks to capture non-linearities we might have lost through the first- or second-order tiling mechanism.

Figure 4. With a simplified tiling mechanism (on individual variables or pairs of variables), non-linear dependencies may be captured by adding hidden layers in a deep neural network architecture.

This method dramatically reduces the size of the feature vectors and thus the number of learning parameters. Of course, it may lose some non-linear dependencies in the input data, but we can hope that the hidden layers of a DNN capture those non-linearities to a good extent.
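Below is a minimal Keras sketch of the kind of architecture in Figure 4: a multi-hot, tile-coded input is projected to a small dense embedding and followed by hidden layers. All layer sizes and the binary target are illustrative assumptions; this is not the exact model used in our experiments.

```python
import tensorflow as tf

NUM_TILE_FEATURES = 4096   # total number of (hashed) tile indices; illustrative
EMBEDDING_DIM = 32         # projection ("embedding") size after the sparse expansion

# Multi-hot tile-coded input -> dense projection -> hidden layers -> prediction.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(NUM_TILE_FEATURES,)),
    tf.keras.layers.Dense(EMBEDDING_DIM, use_bias=False, name="tile_projection"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adagrad", loss="binary_crossentropy")
model.summary()
```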

A Simple Example of Tile-Coding: Predicting a Non-Linear Function with Linear Regression

Let us start with a very simple example, before moving to a real dataset, to show the effectiveness of tile-coding compared to the bucketization technique. Here, we use tile-coding in conjunction with linear regression to estimate a highly non-linear function!

As an example, we aim to learn the function y = sin(x). We treat the problem as supervised linear regression and generate i.i.d. samples of the form (x, y) from the function to build training and testing datasets. We use a simple stochastic gradient-descent approach to update the learning parameters (also known as LMS or the delta rule). Finally, we use the root-mean-square error (RMSE) as our metric.
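A minimal sketch of such an experiment is shown below. The interval, the number of tilings and buckets, the step size and the sample counts are illustrative assumptions rather than the exact settings behind Figure 5.

```python
import numpy as np

def active_tiles(x, a, b, num_buckets, num_tilings):
    """Indices of the active tile in each tiling for a scalar x in [a, b]."""
    width = (b - a) / num_buckets
    shift = width / num_tilings
    idxs = []
    for t in range(num_tilings):
        k = int(np.floor((x - a - t * shift) / width))
        k = min(max(k, 0), num_buckets - 1)
        idxs.append(t * num_buckets + k)
    return idxs

num_buckets, num_tilings = 16, 20           # illustrative sizes (20 x 16 model)
w = np.zeros(num_buckets * num_tilings)     # linear-regression weights
alpha = 0.1 / num_tilings                   # step size per active tile

rng = np.random.default_rng(0)
for _ in range(20000):                      # SGD / LMS ("delta rule") updates
    x = rng.uniform(0.0, 2 * np.pi)
    y = np.sin(x)
    tiles = active_tiles(x, 0.0, 2 * np.pi, num_buckets, num_tilings)
    y_hat = w[tiles].sum()                  # prediction = sum of active weights
    w[tiles] += alpha * (y - y_hat)         # delta-rule update on active tiles only

# Evaluate RMSE on held-out samples.
xs = rng.uniform(0.0, 2 * np.pi, size=2000)
preds = np.array([w[active_tiles(x, 0.0, 2 * np.pi, num_buckets, num_tilings)].sum()
                  for x in xs])
print("test RMSE:", np.sqrt(np.mean((np.sin(xs) - preds) ** 2)))
```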

Figure 5 shows that the tile-coding approach is much more efficient than the bucketization method, even when the size of the feature vector (the model size) is the same for both (20 × 16).

Figure 5. Comparison between the bucketization and tile-coding methods in a simple supervised setting for predicting a highly non-linear function using linear regression.

Empirical Results on a Real Dataset:

Here we use the UCI wine-quality dataset as a second example, this time a classification problem. The dataset consists of 11 real-valued feature variables labeled with a quality score between 0 and 10. The feature variables are multi-modal, with different units, so for good performance we need to normalize them before using a simple logistic regression for this classification problem. Because the labels are scores given to the wines, the classes overlap in the dataset and the classification problem is fairly hard. (One way to enhance prediction, for identifying high- vs. low-quality wines, is to bucketize the scores into three buckets of high, medium and low quality, remove the medium-quality samples, and turn the problem into a binary classification task. But our purpose here is to show how tile-coding improves on the results we get from the bucketization method.) So we keep the 10 score classes as labels and aim to show the value of tile-coding vs. bucketization in this complex case. We use the classical cross-entropy loss function as our metric.

After splitting the dataset 70%-30% into train and test, we applied both bucketization and tile-coding to each feature variable separately, using 10 tilings. We trained logistic regression with the AdagradOptimizer in TensorFlow, using the tile-coding mechanism we developed (code). The final evaluation result (the end points in Figure 6), obtained on the entire test dataset, demonstrates not only that we get a significant improvement using tile-coding (+10% without any effort spent on tuning parameters), but also that the speed of convergence in the early iterations looks better. We think this might be due to the better generalization ability of tile-coding.
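For readers who want to reproduce the flavor of this experiment, here is a minimal sketch. It does not use the tf-tile API (linked below); instead it re-implements a simple per-feature tiling in NumPy and trains a softmax (multinomial logistic regression) layer with Adagrad in Keras. The file name, hyper-parameters and sizes are illustrative assumptions, and the exact numbers will not match Figure 6.

```python
import numpy as np
import pandas as pd
import tensorflow as tf

def tile_features(X, num_buckets=10, num_tilings=10):
    """Tile-code each column of X independently and concatenate the multi-hot vectors."""
    n, d = X.shape
    lo, hi = X.min(axis=0), X.max(axis=0)
    width = np.maximum((hi - lo) / num_buckets, 1e-12)  # guard against constant columns
    shift = width / num_tilings
    out = np.zeros((n, d * num_tilings * num_buckets), dtype=np.float32)
    for j in range(d):
        for t in range(num_tilings):
            k = np.floor((X[:, j] - lo[j] - t * shift[j]) / width[j]).astype(int)
            k = np.clip(k, 0, num_buckets - 1)
            cols = (j * num_tilings + t) * num_buckets + k
            out[np.arange(n), cols] = 1.0
    return out

# Assumes winequality-red.csv has been downloaded from the UCI repository.
df = pd.read_csv("winequality-red.csv", sep=";")
X, y = df.drop(columns="quality").values, df["quality"].values
Xt = tile_features(X)

# 70%-30% train/test split.
rng = np.random.default_rng(0)
perm = rng.permutation(len(X))
split = int(0.7 * len(X))
train, test = perm[:split], perm[split:]

# Multinomial logistic regression (a single softmax layer) trained with Adagrad.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(Xt.shape[1],)),
    tf.keras.layers.Dense(11, activation="softmax"),   # quality scores 0..10
])
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1),
              loss="sparse_categorical_crossentropy")
model.fit(Xt[train], y[train], epochs=20, batch_size=64,
          validation_data=(Xt[test], y[test]))
```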

Figure 6. Comparison of the tile-coding and bucketization methods on the red wine-quality dataset. Tile-coding performs significantly better than the bucketization approach. The blue line represents the bucketization method and the green line represents tile-coding. The final evaluation result (on the test dataset) is shown at the end of the iterations in the graph.

Implementation of the code in TensorFlow:

Tile-coding is now open-sourced at the following link: https://github.com/criteo-research/tf-tile

Acknowledgement:

The author thanks Suju Rajan, Sathiya (Keerthi) Selvaraj, Christophe Renaudin, Piyush Narang, and Thomas Ricatte for their comments and discussions.

References:

  1. Albus, J. S. (1975). A new approach to manipulator control: The cerebellar model articulation controller (CMAC). Trans. ASME, J. Dyn. Syst., Meas., Contr. 97:220–227.
  2. Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
