Data Representation Design Patterns

Manoj Kumar Patra
Geek Culture

--

Data representation means transforming the real-world data fed to the model (the input) into the format the model actually operates on (the feature).

Some simple data representation techniques include:

#1. Scaling numerical inputs in the range [-1, 1]:

This helps in two ways:

  1. faster convergence, so models are faster and cheaper to train
  2. weights of different features end up with comparable magnitudes, so L1/L2 regularisation affects all features roughly equally

There are four ways to linearly scale numerical inputs (illustrated in the sketch after this list):

  1. Min-max scaling:
    x1_scaled = (2*x1 - max_x1 - min_x1)/(max_x1 - min_x1)
    😞 Downside: The min and max values are taken directly from the training set, so they are easily skewed by outliers and by values unseen at training time.
  2. Clipping (in conjunction with min-max scaling):
    Unlike min-max scaling, the min and max values here are reasonable estimates rather than data-derived extremes. The data is linearly scaled between these values and then clipped to [-1, 1].
    Values of exactly -1 or 1 generally correspond to outliers. 😃
    Clipping works well on uniformly distributed data.
  3. Z-score normalisation:
    x1_scaled = (x1 - mean_x1)/stddev_x1
    This type of scaling results in zero mean and unit variance over the training dataset.
    The scaled value is unbounded, but lies in [-1, 1] about 67% of the time if the data follows a normal distribution.
  4. Winsorizing:
    Clips the dataset in terms of percentiles such as between 10th and 90th percentile or 5th and 95th percentile, etc.
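
As a rough sketch, the four techniques might look like this in NumPy (the sample values and the estimated bounds below are made up for illustration):

```python
import numpy as np

x = np.array([1.0, 3.0, 5.0, 7.0, 100.0])  # 100.0 is an outlier

# 1. Min-max scaling to [-1, 1] (min/max come straight from the data)
min_x, max_x = x.min(), x.max()
minmax_scaled = (2 * x - max_x - min_x) / (max_x - min_x)

# 2. Clipping: scale with *estimated* bounds, then clip to [-1, 1]
est_min, est_max = 0.0, 10.0  # reasonable estimates, not data-derived
clipped = np.clip((2 * x - est_max - est_min) / (est_max - est_min), -1.0, 1.0)

# 3. Z-score normalisation: zero mean, unit variance on the training set
zscore_scaled = (x - x.mean()) / x.std()

# 4. Winsorizing: clip to the 5th and 95th percentiles before scaling
p5, p95 = np.percentile(x, [5, 95])
winsorized = np.clip(x, p5, p95)
```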

For skewed data, we need to transform the data before scaling. Commonly used transformations include the log transformation, bucketizing the inputs, or a parametric technique such as the Box-Cox transformation (a brief sketch follows).
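
A minimal sketch of these transformations, assuming synthetic log-normally distributed data and illustrative bucket edges:

```python
import numpy as np
from scipy import stats

views = np.random.lognormal(mean=3.0, sigma=1.0, size=1000)  # skewed, strictly positive data

log_views = np.log1p(views)                          # log transformation
bucketed = np.digitize(views, bins=[10, 50, 200])    # bucketize into 4 bins
boxcox_views, fitted_lambda = stats.boxcox(views)    # parametric Box-Cox transformation
```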

#2. Representing an array of numbers as input (sketched after the list):

  1. by its statistics (such as mean, median, etc.),
  2. by its empirical distribution
  3. by a fixed number of items in the array if the array is ordered in a certain way.
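
For instance, a hypothetical ordered array of previous hospital stays could be reduced to fixed-size features along these lines (values are made up):

```python
import numpy as np

stays = np.array([2, 5, 1, 9, 3, 4])  # hypothetical ordered array of numbers

by_stats = [stays.mean(), np.median(stays), stays.max()]       # 1. statistics
by_distribution = np.percentile(stays, [25, 50, 75]).tolist()  # 2. empirical distribution
by_last_k = stays[-3:].tolist()                                # 3. fixed number of (most recent) items
```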

#3. One-hot encoding or dummy encoding categorical inputs

Dummy encoding is preferred when the inputs are linearly independent.
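
A minimal sketch of the two encodings with pandas (the plurality column is a made-up example):

```python
import pandas as pd

df = pd.DataFrame({"plurality": ["single", "twins", "triplets", "single"]})

one_hot = pd.get_dummies(df["plurality"])                 # one column per category
dummy = pd.get_dummies(df["plurality"], drop_first=True)  # dummy coding: drops one column
```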

#4. Treating numerical inputs as categorical and mapping them to a one-hot encoded column (see the sketch after this list):

  1. When the numerical input is just an index, e.g., days of a week
  2. When the relationship between input and label is not continuous, e.g., traffic levels on Friday are not the same as traffic levels on Monday
  3. Bucketing the numerical variables, e.g., Monday through Friday are weekdays, and Saturday and Sunday are the weekend.
    ❗️ This does lead to some loss in the ordinal nature of the inputs.
  4. When different values of the numeric input have different effects on the label, it makes sense to categorise the numerical input, e.g., when trying to determine whether a baby is born healthy, a baby with weight x born as a triplet is considered healthier than a baby with the same weight born as a twin. Here, plurality can be categorised because it has a direct effect on the label.
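
A rough sketch of points 1 and 3, using a made-up day-of-week column:

```python
import pandas as pd

df = pd.DataFrame({"day_of_week": [1, 2, 5, 6, 7]})  # 1 = Monday ... 7 = Sunday

# Treat the index-like number as a category and one-hot encode it
day_one_hot = pd.get_dummies(df["day_of_week"].astype("category"))

# Bucket the days into weekday/weekend (loses the ordinal nature)
df["day_type"] = pd.cut(df["day_of_week"], bins=[0, 5, 7],
                        labels=["weekday", "weekend"])
```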

#5. An array of categorical variables can be represented as follows (sketched below the list):

  1. Counting the number of occurrences of each term in the array
  2. Using relative frequencies instead of counts to avoid large numbers
  3. Representing the input array by a fixed number of items, if the array is ordered
  4. Representing the array by its statistics
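
A minimal sketch of these four representations, assuming a made-up array of hospital admission types:

```python
from collections import Counter

admissions = ["emergency", "routine", "emergency", "emergency", "routine"]

counts = Counter(admissions)                                    # 1. counts per term
rel_freq = {k: v / len(admissions) for k, v in counts.items()}  # 2. relative frequencies
last_three = admissions[-3:]                                    # 3. fixed number of (most recent) items
most_common = counts.most_common(1)[0][0]                       # 4. a summary statistic (the mode)
```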

Hashed Feature Design Pattern

Bucketize the inputs using a hashing algorithm like Farm Fingerprint, for example in BigQuery SQL:

ABS(MOD(FARM_FINGERPRINT(INPUT), NUM_BUCKETS))
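
A rough Python equivalent of the same idea; zlib.crc32 is used here only as a deterministic, non-cryptographic stand-in for Farm Fingerprint, not as the book's exact implementation:

```python
import zlib

def hash_bucket(value: str, num_buckets: int) -> int:
    """Deterministically map a categorical value to one of num_buckets buckets."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets

print(hash_bucket("ORD", 10))  # e.g., a departure airport code mapped to a bucket id
```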

This design pattern should be used when

  1. Categorical features have incomplete vocabulary
  2. Model size is large due to cardinality (💡cardinality refers to the number of unique values contained in a particular column)
  3. Cold start — 💡 concerns the issue that the system cannot draw any inferences for inputs about which it has not yet gathered sufficient information; a model in production is unable to make predictions on new data.

❌Cryptographic hash algorithms should be avoided. Why ❓

We need the hashes to be deterministic and unique.

❓How do we choose the number of buckets?

✔️ In general, choose the number of hash buckets such that each bucket gets about five entries. This will give good results initially; however, periodic retraining is required to improve the results over time.

A better approach is to treat the number of buckets as a hyperparameter and tune it to find the value that works best for the problem at hand. 💯

😞 However, there are trade-offs like:

  1. Loss in model accuracy, more pronounced when categorical input distribution is highly skewed
  2. Bucket collision when the number of buckets is small
  3. Empty hash buckets

To deal with trade-offs 1 and 2, we can add an aggregate feature as an input to the model to avoid losing information about the individual inputs.

To overcome trade-off 3, when using hashed feature columns, we can use L2 regularisation to bring the weights associated with the empty buckets to near zero.

Embeddings Design Pattern

Embeddings are a learnable data representation that maps high-cardinality data to a low-dimensional space in a way that preserves the information relevant to the learning problem.

❓Why not one-hot encoding?

  1. One-hot encoding high-cardinality categorical features leads to a sparse matrix which doesn’t work well with ML algorithms.
  2. One-hot encoding treats categorical variables as being independent. So, we can’t capture the relationship between different variables using one-hot encoding.

Embedding solves this problem by capturing closeness relationships between the variables in a lower-dimensional space.

❓Can embedding be a replacement for clustering or PCA?

Yes, in fact, embedding weights can be determined in the main model training loop, unlike clustering or PCA, which need to be done beforehand.

Trade-off: Loss in information to a certain extent

Hyperparameter: Embedding dimension

Rule of thumb: set the embedding dimension to the fourth root of the total number of unique categorical elements, or to approximately 1.6 times the square root of the number of unique elements in the category (and no more than 600). A short sketch follows.
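
A minimal sketch of an embedding layer in Keras, with a made-up vocabulary size and the heuristic above used to pick the dimension:

```python
import tensorflow as tf

num_categories = 5000                                      # e.g., unique values of a categorical feature
embed_dim = min(600, round(1.6 * num_categories ** 0.5))   # one of the rules of thumb above

inputs = tf.keras.Input(shape=(1,), dtype="int32")
x = tf.keras.layers.Embedding(input_dim=num_categories, output_dim=embed_dim)(inputs)
x = tf.keras.layers.Flatten()(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
```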

This design pattern can be used in

  1. text embedding in a classification problem based on text inputs,
  2. image embedding
  3. training an auto-encoder for image embedding where the feature and the label are the same and the loss is the reconstruction error. This allows the auto-encoder to achieve nonlinear dimension reduction.
    An auto-encoder maps a high-dimensional input down to a low-dimensional embedding (the bottleneck, which captures the closeness relationships among the inputs) and then maps it back to the high-dimensional space. Because the feature and the label are the same, the loss is the reconstruction error. Once the auto-encoder is trained, the main model only has to learn a mapping from the low-dimensional embedding to the label, which is an easier task, so the model performs better (see the sketch after this list).
  4. Context language models such as Word2Vec (where the embedding of a word is the same regardless of how it is used) and BERT (where embeddings are contextual) do the same as auto-encoders, but for text.
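
A rough sketch of point 3, an auto-encoder whose bottleneck layer serves as the image embedding (the input size and layer widths are illustrative):

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(28 * 28,))                                            # flattened image
embedding = tf.keras.layers.Dense(32, activation="relu", name="embedding")(inputs)   # bottleneck
reconstruction = tf.keras.layers.Dense(28 * 28, activation="sigmoid")(embedding)

autoencoder = tf.keras.Model(inputs, reconstruction)
autoencoder.compile(optimizer="adam", loss="mse")  # feature == label, loss is the reconstruction error

# After training, the encoder alone provides the image embedding:
encoder = tf.keras.Model(inputs, embedding)
```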

Feature Cross Design Pattern

This pattern helps models learn relationships between inputs faster by explicitly making each combination of input values a separate feature.

Using feature crosses can

  1. help solve certain problems using just a linear model
  2. speed up training (much faster than a DNN, provided the right feature crosses are chosen)
  3. get by with less training data and a less complex model

❓Can it be applied to numerical features?

Yes. To apply a feature cross to numerical features, we can bucketize the feature values first (see the sketch below).
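
A rough sketch of a feature cross: bucketize a numerical feature, then cross it with a categorical one by concatenating the values (the columns and bin edges are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "hour": [7, 9, 13, 18, 23],
    "day_of_week": ["Mon", "Tue", "Wed", "Fri", "Sat"],
})

# Bucketize the numerical feature
df["hour_bucket"] = pd.cut(df["hour"], bins=[0, 6, 12, 18, 24],
                           labels=["night", "morning", "afternoon", "evening"])

# Cross the bucketized feature with the categorical one
df["hour_x_day"] = df["hour_bucket"].astype(str) + "_" + df["day_of_week"]

crossed_one_hot = pd.get_dummies(df["hour_x_day"])  # each combination becomes its own feature
```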

Trade-offs:

  1. Feature crosses result in sparse vectors.
    SOLUTION 😃 We can use an embedding layer after the feature cross to reduce it to a lower dimension and yet, maintain the closeness relationship.
  2. When feature crossing two categorical features with high cardinality, the cardinality of the crossed feature increases dramatically. To avoid this, we can apply either L1 regularisation (which encourages sparsity of features) or L2 regularisation (which limits overfitting).

Multimodal Input Design Pattern

This design pattern addresses the problem of representing different types of data or data that can be expressed in complex ways by concatenating all the available data representations.

For example, predicting a traffic violation based on an image and the time of day, or predicting a user's rating of a restaurant based on the review text, the amount paid, and the time of day: both deal with representing different data types together.

In the restaurant review example, we can also represent the user rating in a way that contains both the numeric rating and an indication of whether it is good or bad, an example of representing the same data in a complex way.

Representing text data in multiple ways

  1. BOW encoding
  2. Embedding — identify relationships between words by taking into account order or meaning of words in a text document
  3. Extracting tabular data from text and concatenating with BOW encoding/embedding

Representing image data in multiple ways

  1. Pixel values — an example would be representing a 28×28-pixel black-and-white image in a model as a 28×28 array with integer values ranging from 0 to 255
  2. Tiled structures — useful in extracting meaningful details and underlying patterns
    One way to achieve this would be using a CNN with max-pooling, with/without overlap.
  3. Combination of both pixel values and tiled structures
  4. Using images together with metadata (for example, the time of day is metadata in the traffic violation prediction task); see the sketch below
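
A minimal sketch of point 4, concatenating an image branch with tabular metadata in Keras (the shapes and layer sizes are illustrative):

```python
import tensorflow as tf

image_input = tf.keras.Input(shape=(28, 28, 1), name="image")
metadata_input = tf.keras.Input(shape=(3,), name="metadata")  # e.g., scaled time-of-day features

# Image branch: tiled structures via convolution and max-pooling
x = tf.keras.layers.Conv2D(16, 3, activation="relu")(image_input)
x = tf.keras.layers.MaxPooling2D()(x)
x = tf.keras.layers.Flatten()(x)

# Concatenate the image representation with the metadata
combined = tf.keras.layers.Concatenate()([x, metadata_input])
output = tf.keras.layers.Dense(1, activation="sigmoid")(combined)

model = tf.keras.Model(inputs=[image_input, metadata_input], outputs=output)
```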

These are my notes from the book Machine Learning Design Patterns. For an in-depth understanding, I would recommend reading the book. The link is available below.

Next we will look at Problem Representation Design Patterns.
