Scikit-learn’s FeatureHasher

Chituyi
4 min read · Feb 29, 2024


What is FeatureHasher?

In a world where data often explodes in dimensionality, feature hashing has emerged as a powerful technique for tackling challenges related to storage, computational efficiency, and model complexity. Scikit-learn, a renowned machine learning library, offers a convenient implementation called FeatureHasher.

FeatureHasher implements feature hashing, also known as the “hashing trick.” It transforms sequences of symbolic feature names (strings) into sparse matrices by applying a hash function to each feature name to compute its column index in the resulting matrix. The hash function employed is the signed 32-bit version of MurmurHash3. Feature names of type byte string are used as-is; Unicode strings are converted to UTF-8, but not normalized. Feature values must be finite numbers.

FYI: “MurmurHash3 is a non-cryptographic hash function designed for general hash-based lookup. Created by Austin Appleby in 2008, it’s suitable for tasks like hash tables and Bloom filters. The name comes from the basic operations used in its inner loop: multiply (MU) and rotate (R). Unlike cryptographic hashes, it’s not meant to be resistant to reverse engineering. The current version yields either a 32-bit or 128-bit hash value, and it’s available in various programming languages.”
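To make this concrete, here is a minimal sketch of how a feature name could end up in a particular column, using scikit-learn’s murmurhash3_32 utility. The modulo-and-sign scheme below mirrors the idea; the exact bucketing inside FeatureHasher is an implementation detail.

from sklearn.utils import murmurhash3_32

# Illustrative sketch: map feature names to one of n_features buckets
# using the signed 32-bit MurmurHash3 (the same hash family FeatureHasher uses).
n_features = 10

for name in ["dog", "cat", "elephant", "run"]:
    h = murmurhash3_32(name, seed=0)   # signed 32-bit hash of the feature name
    index = abs(h) % n_features        # column index in the hashed space
    sign = 1 if h >= 0 else -1         # sign used when alternate_sign=True
    print(f"{name!r} -> column {index}, sign {sign:+d}")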

The Rationale Behind Feature Hashing

High-dimensional data, characterized by an extensive number of features, can lead to increased computational costs, longer training times, and potential overfitting of machine learning models. Feature hashing tackles these issues by employing a hash function to map features to a fixed-size array of integers (buckets). This process drastically reduces the dimensionality of the feature space without explicitly storing each feature, making it a valuable tool for scenarios with memory limitations.

The primary motivation behind FeatureHasher is to handle large-scale (online) learning scenarios and situations with memory constraints. It’s a memory-efficient alternative to DictVectorizer and CountVectorizer.
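Because it is stateless, FeatureHasher pairs naturally with incremental learners. Here is a minimal sketch of such an online setup, assuming a made-up stream of mini-batches and an SGDClassifier:

from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

hasher = FeatureHasher(n_features=2**18)   # fixed output width, no fit step needed
clf = SGDClassifier()                      # supports incremental learning via partial_fit

# Hypothetical stream of (feature dicts, labels) mini-batches
batches = [
    ([{"dog": 1, "cat": 2}, {"run": 5}], [0, 1]),
    ([{"dog": 2, "elephant": 4}, {"cat": 1}], [0, 1]),
]

for X_dicts, y in batches:
    X = hasher.transform(X_dicts)              # hash each batch independently
    clf.partial_fit(X, y, classes=[0, 1])      # incremental update, constant memory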

How Does It Work?

· Feature Names and Hashing. Given feature names (e.g., words or tokens), FeatureHasher applies a hash function to determine the column index in the resulting matrix. This hash-based approach avoids maintaining a dictionary of encountered features during training.

· Sparse Matrices. The output is a sparse matrix, where each row corresponds to an input sample, and columns represent hashed features. The hash values directly serve as feature indices.

· Alternate Sign (Optional). An alternating sign can be added to the hashed features to approximately conserve the inner product in the hashed space, even for small feature dimensions (this is the alternate_sign parameter, which is enabled by default).

from sklearn.feature_extraction import FeatureHasher

# 10 output columns; input_type defaults to 'dict'
h = FeatureHasher(n_features=10)

D = [{'dog': 1, 'cat': 2, 'elephant': 4}, {'dog': 2, 'run': 5}]

# Transform the data (no fitting step is required)
f = h.transform(D)

# Convert the sparse matrix to a dense array for inspection
print(f.toarray())
# Expected output (per the scikit-learn docs):
# [[ 0.  0. -4. -1.  0.  0.  1.  0.  0.  2.]
#  [ 0.  0.  0. -1.  0. -1.  0.  0.  0.  2.]]
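Dicts are not the only accepted input. With input_type='string', each string in a sample is treated as a feature with an implicit value of 1. A small sketch, assuming naive whitespace tokenization:

from sklearn.feature_extraction import FeatureHasher

# input_type='string' treats every string in a sample as a feature counted once
h = FeatureHasher(n_features=10, input_type='string')

docs = ["dog cat dog", "elephant runs fast"]
tokens = [doc.split() for doc in docs]   # naive whitespace tokenization, for illustration only

X = h.transform(tokens)
print(X.toarray())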

Advantages.

· Memory Efficiency. Ideal for situations with limited memory resources.

· Scalability. Handles large-scale data efficiently.

· Stateless. No need for fitting; it’s ready for use.

Shortcomings.

· Loss of Inspectability. Unlike the vectorizers, FeatureHasher does not remember the original feature names, so the transformation cannot be inverted and the results are harder to interpret.

· Hash Collisions. A small feature space (low n_features) increases the chance that distinct features are hashed to the same column, so their values get mixed together.
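To get a feel for collisions, here is a rough sketch that hashes 100 distinct feature names into a deliberately tiny space of 8 buckets (both numbers are arbitrary choices for illustration) and counts how crowded each bucket gets:

from collections import Counter
from sklearn.utils import murmurhash3_32

n_features = 8                                   # deliberately tiny hashed space
tokens = [f"feature_{i}" for i in range(100)]    # 100 distinct feature names

# Count how many distinct names land in each bucket
buckets = Counter(abs(murmurhash3_32(t)) % n_features for t in tokens)
print(buckets)   # with 100 names and only 8 buckets, collisions are unavoidable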

Before we go any further, let’s understand sparse matrices!

  • Dense Matrices. In a dense matrix, every element is stored explicitly, even if it’s a zero. Imagine a massive table where each cell holds a value, regardless of whether it’s relevant or not. This can be extremely wasteful, especially when most elements are zeros (as is often the case after feature hashing).
  • Sparse Matrices. Sparse matrices take a different approach. They only store the non-zero elements and their locations (row and column indices) within the matrix. This is like having a list of only the occupied cells in our giant table, along with their coordinates. This clever storage mechanism brings about substantial benefits.

Advantages for Storage.

  • Memory Savings. For datasets with a high degree of sparsity (lots of zeros), sparse matrices require significantly less memory than dense matrices. This is because we’re not wasting space storing countless zeros.
  • Reduced Disk Space. When storing datasets on disk, sparse matrix formats are much more compact than their dense counterparts, leading to reduced storage costs and faster data loading/saving.

For instance:

Consider a text dataset with 10,000 unique words (features) where each document typically uses only a few hundred words. If we one-hot encode this dataset, we’d end up with a massive dense matrix filled mostly with zeros. In contrast, feature hashing would produce a sparse matrix where only the active features (hashed word indices) and their corresponding values would be stored.
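Here is a rough back-of-the-envelope sketch of that example, with made-up document counts and sizes, comparing the dense footprint with the CSR matrix that FeatureHasher actually returns:

import numpy as np
from sklearn.feature_extraction import FeatureHasher

rng = np.random.default_rng(0)
vocab = [f"word_{i}" for i in range(10_000)]

# 1,000 "documents", each using only 300 of the 10,000 words (illustrative numbers)
docs = [{str(w): 1 for w in rng.choice(vocab, size=300, replace=False)}
        for _ in range(1_000)]

hasher = FeatureHasher(n_features=10_000)
X_sparse = hasher.transform(docs)    # scipy.sparse CSR matrix
X_dense = X_sparse.toarray()         # equivalent dense representation

sparse_bytes = X_sparse.data.nbytes + X_sparse.indices.nbytes + X_sparse.indptr.nbytes
print(f"dense:  {X_dense.nbytes / 1e6:.1f} MB")
print(f"sparse: {sparse_bytes / 1e6:.1f} MB")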

Advantages for Computation.

  • Faster Operations. Many mathematical operations, such as matrix multiplication and linear algebra calculations, can be optimized for sparse matrices. Specialized algorithms can focus computations only on the non-zero elements, leading to significant speedups compared to operations on dense matrices.
  • Avoidance of Unnecessary Calculations. In dense matrices, operations involve calculations with zeros, which are essentially wasted effort. Sparse matrices skip these unnecessary calculations, further improving efficiency.
  • Efficient Data Structures. Sparse matrices are often implemented using specialized data structures like compressed sparse row (CSR) or compressed sparse column (CSC). These data structures enable fast access to non-zero elements and efficient operations.
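As a quick illustration of the CSR layout mentioned above, the sketch below builds a tiny matrix with SciPy and prints the three arrays it is actually stored as:

import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[0, 0, 3],
                  [4, 0, 0],
                  [0, 5, 6]])

sparse = csr_matrix(dense)
print(sparse.data)     # non-zero values:          [3 4 5 6]
print(sparse.indices)  # their column indices:     [2 0 1 2]
print(sparse.indptr)   # row boundaries into data: [0 1 2 4]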

Consider using Feature Hashing in the following situations.

  • High-Dimensional Data. When your dataset boasts an overwhelming number of features, such as in text analysis with massive vocabularies or categorical data with numerous unique values.
  • Real-Time Applications. If your application demands rapid feature extraction and model training, feature hashing’s efficiency becomes crucial.
  • Memory Constraints. When memory is a bottleneck, the compact representation offered by feature hashing can be a lifesaver.

In conclusion, Scikit-learn’s Feature Hasher provides an elegant and efficient solution for handling high-dimensional data. Its versatility extends beyond text data, even encompassing audio encoding for neural networks. By adopting this tool, we open doors to new opportunities in machine learning and data analysis.

Check out free ML projects with Code to get started here!

https://dallo7.github.io/

#MLDemocratizer!
