One Hot Encoding with Ibis: An Explanation & Tutorial

Master converting categorical data to binary vectors using Ibis and SQL for real-world data analysis and machine learning applications.

Tyler White
Learning The Computers
7 min read · Aug 7, 2023


Image: Midjourney mixed with DALL·E 2 (Ibis, one hot encoding, and machine learning)

Let’s delve into the fascinating world of converting categorical variables into numerical values — a critical process in data analysis and machine learning. One hot encoding, or dummy variable creation, is a popular and effective method to achieve this transformation. In this article, we’ll explore the concept of one hot encoding, taking you through its applications in different scenarios.

This post will focus on performing one hot encoding using Ibis, which compiles your expressions to SQL and pushes execution down to the backend of your choice. Before we get to the Ibis part, we will walk through several other examples of performing one hot encoding with Pandas, scikit-learn, and SQL.

By the end of our exploration, you’ll gain a comprehensive understanding of this technique, empowering you to apply it in your data projects confidently.

Let’s embark on this one hot encoding journey and uncover its significance in data analysis and machine learning.

If you’d like to skip ahead and check out the code to make this work, you can find a Hex application published at the bottom of the article.

One Hot Encoding — A Brief Explanation

One hot encoding is a method used to transform nominal categorical variables into numerical values suitable for machine learning models. Nominal categorical variables represent labels or categories with no inherent numerical relationship, such as colors (red, blue, and green) or car types (sedans, SUVs, and trucks). However, it’s essential to note that not all categorical variables should be treated the same way. Some, such as diamond cut quality, are ordinal: there is a meaningful order or ranking among the categories (ranging from fair to ideal).
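To make this concrete, here is a minimal, library-free sketch of the idea: each category gets its own position in a binary vector, and exactly one position is “hot” per value.

```python
# A minimal sketch of one hot encoding, independent of any library.
categories = ["red", "blue", "green"]

def one_hot(value, categories):
    """Return a binary vector with a 1 in the matching category's position."""
    return [1 if value == category else 0 for category in categories]

print(one_hot("blue", categories))   # [0, 1, 0]
print(one_hot("green", categories))  # [0, 0, 1]
```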

There are multiple encoding options available beyond this one. The appropriate encoding method depends on various factors, such as the data's nature, the categorical variable type, the problem context, and potential interactions with other variables. It is crucial to comprehend the effects of each method and how they could impact the downstream analysis or machine learning model.

The 🛥️ Dataset

In this article, we will focus on analyzing the well-known Titanic dataset. This dataset includes details about the passengers aboard the RMS Titanic during its inaugural voyage. Unfortunately, that voyage came to a devastating conclusion when the ship struck an iceberg on April 14, 1912, and sank early the next morning.

If you wish to follow along, it’s easy to get started as this dataset is widely available.

Loading and repr’ing the table.
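The code behind that step might look something like the following sketch. It assumes a local titanic.csv and Ibis’ DuckDB backend; any supported backend works the same way.

```python
import ibis

ibis.options.interactive = True  # render expressions as result tables

# Assumes a local titanic.csv; swap in any Ibis-supported backend.
con = ibis.duckdb.connect()
t = con.read_csv("titanic.csv", table_name="titanic")
t.head()
```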

From the dataset, we can gather information on the passengers on the RMS Titanic. This includes their unique PassengerId, whether they survived or not (Survived: 0 = did not survive, 1 = survived), their ticket class (Pclass: 1st, 2nd, 3rd), name, gender, age, number of siblings/spouses (SibSp) and parents/children (Parch) aboard, ticket number (Ticket), fare paid (Fare), cabin number (Cabin), and port of embarkation (Embarked: C = Cherbourg, Q = Queenstown, S = Southampton). With these features, we can analyze the factors influencing survival rates and use them to create predictive models to determine passengers' survival based on their characteristics.

In this example, we will convert the "Embarked" column into numerical values using one hot encoding so we can use the column in a machine learning model.

All The Ways…

We could perform this encoding using Pandas, scikit-learn, or SQL with many case expressions.

Let’s see what that looks like.

Pandas

Getting dummies with Pandas.
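A sketch of that call, assuming the Ibis table expression t from earlier (to_pandas() materializes it as a pandas DataFrame):

```python
import pandas as pd

# Materialize the Ibis table expression as a pandas DataFrame.
df = t.to_pandas()

encoded = pd.get_dummies(df, dummy_na=True, columns=["Embarked"], drop_first=True)
encoded.filter(like="Embarked").head()  # inspect just the new columns
```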
  • pd.get_dummies(): This method is utilized to conduct one-hot encoding on categorical data in the DataFrame. It transforms categorical data into binary columns, where each column denotes a distinct category, and a binary value of 0 or 1 signifies whether that category is present for a given row.
  • df: This is the DataFrame holding the Titanic dataset, retrieved from Ibis and converted to Pandas.
  • dummy_na=True: Creates a dummy variable for missing values (NaN or None) in the "Embarked" column.
  • columns=["Embarked"]: The argument provided indicates which columns in the DataFrame should undergo one-hot encoding. This particular scenario pertains to the "Embarked" column in the DataFrame.
  • drop_first=True: This parameter helps prevent multicollinearity in analyses and machine learning. With k-1 dummy columns, we avoid creating a column that is a perfect linear combination of the others: the first dummy column is dropped, and the remaining k-1 columns capture all the information. This matters for algorithms that assume linear independence between features. For example, if a passenger embarked from Cherbourg, zeros in both the Q and S columns indicate it.

The result of pd.get_dummies(df, dummy_na=True, columns=["Embarked"], drop_first=True) will be a new DataFrame with additional columns representing the one-hot encoded version of the "Embarked" column while dropping the original column. The number of extra columns will vary based on how many distinct categories are in the original "Embarked" column. Every row in the new DataFrame will display a 1 in the appropriate category column and a 0 in all other columns, indicating the category for that particular entry.

scikit-learn

One hot encoding with scikit-learn.
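A sketch of those steps, assuming the df DataFrame from the Pandas section. The post passes the unique values explicitly via the categories argument; for brevity, this version lets the encoder infer them, which yields the same columns on this data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# scikit-learn expects 2D array-like input, hence the double brackets.
categories = df[["Embarked"]]

encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(categories)  # sparse matrix

# Densify the sparse matrix and label the new feature columns.
encoded_df = pd.DataFrame(
    encoded.toarray(),
    columns=encoder.get_feature_names_out(),
    index=df.index,
)

# Attach the new columns and drop the original one.
result = pd.concat([df, encoded_df], axis=1).drop("Embarked", axis=1)
```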
  • To encode the column, we need to extract the "Embarked" column from the DataFrame and assign it to a variable, categories. This is necessary because scikit-learn's algorithm requires a 2D array-like input.
  • We create an instance of OneHotEncoder with specified categories and drop parameters. The categories parameter specifies the unique values for each category (in this case, the unique values of “Embarked”), and the drop parameter is set to “first” to drop the first category in each feature. This is done to avoid the multicollinearity issue mentioned in the previous section.
  • Using the fit_transform method, we fit the encoder on the “Embarked” column, which returns a sparse matrix representation of the one-hot encoded data.
  • We convert the sparse matrix obtained from the encoder into a dense DataFrame using toarray(), and use the get_feature_names_out() method to get the names of the new feature columns.

We concatenate the original DataFrame df with the newly created encoded_df DataFrame containing the one-hot encoded columns. We can remove the original “Embarked” column from the DataFrame by using the drop method with axis=1 to drop the column.

The final DataFrame will have the one-hot encoded representation of the "Embarked" column and the rest of the original columns from the dataset.

SQL — Case

One hot encoding with SQL CASE expressions.

With SQL, we can perform one hot encoding using CASE expressions. We can use a CASE expression to create a new column for each unique category in the “Embarked” column, assigning a value of 1 if the row belongs to that category and 0 otherwise. As you may expect, the query grows with the number of unique values that need to be encoded.
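A sketch of that query, run here through DuckDB directly against the same titanic.csv; the hard-coded column list is exactly the pain point.

```python
import duckdb

# One CASE expression per category; the query grows with the number
# of distinct values to encode.
query = """
SELECT
    *,
    CASE WHEN Embarked = 'C' THEN 1 ELSE 0 END AS Embarked_C,
    CASE WHEN Embarked = 'Q' THEN 1 ELSE 0 END AS Embarked_Q,
    CASE WHEN Embarked = 'S' THEN 1 ELSE 0 END AS Embarked_S
FROM 'titanic.csv'
"""
duckdb.sql(query).df()
```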

Ibis

Finally… Ibis. I hope it’s been worth the wait to get to this part. Unless you just scrolled past the rest. But still, thank you for being here.

Ibis offers a plethora of useful generic expression APIs to push down these operations.

We can add columns to the table expression using mutate. As you might imagine, the case API will be useful here.

Ibis on the “case.”
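A sketch of that expression, hard-coding one case per category for now (t is the Ibis table from earlier; this uses the fluent ibis.case() builder available in Ibis releases of this era):

```python
import ibis

# One case expression per category, each becoming a new 0/1 column.
expr = t.mutate(
    Embarked_C=ibis.case().when(t.Embarked == "C", 1).else_(0).end(),
    Embarked_Q=ibis.case().when(t.Embarked == "Q", 1).else_(0).end(),
    Embarked_S=ibis.case().when(t.Embarked == "S", 1).else_(0).end(),
)
expr.head()
```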

That’s still not very dynamic. We still need to know which values to consider for the case expression. This seemed like a nice time to apply a selector.

We can store the unique categories in a list and iterate over it to build the case expressions dynamically. I am emulating the behavior of Pandas and scikit-learn from the previous steps by slicing the list. Using across, we can apply the function specified in the second argument to the selector specified in the first. In this case, it targets just the “Embarked” column, but we apply our case expression against it multiple times to create a new column for each value in the list.

Ibis on the “case,” now with a selector expression.
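A sketch of that approach; slicing categories[1:] drops the first category to mirror drop_first=True, as described above:

```python
import ibis
from ibis import selectors as s

# Distinct, non-null categories, sorted the way pandas sorts its dummies.
categories = sorted(t.Embarked.execute().dropna().unique())  # ['C', 'Q', 'S']

# across() applies every function in the dict to the selected column,
# naming each result "{column}_{key}" (here: Embarked_Q, Embarked_S).
expr = t.mutate(
    s.across(
        s.c("Embarked"),
        {
            cat: (lambda col, c=cat: ibis.case().when(col == c, 1).else_(0).end())
            for cat in categories[1:]  # emulate drop_first=True
        },
    )
)
expr.head()
```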

We can even check out the SQL that it generated and executed.

ibis.show_sql on the previous expression
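Something like the following; ibis.to_sql(expr) returns the same SQL as a string if you would rather capture it than print it.

```python
import ibis

# Print the SQL that the expression above compiles to.
ibis.show_sql(expr)
```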

Conclusion

One hot encoding is an essential technique in data analysis and machine learning. We covered four ways to perform it:

  1. pd.get_dummies: Using Pandas’ pd.get_dummies function to directly encode the "Embarked" column of the DataFrame into binary columns.
  2. scikit-learn OneHotEncoder: Utilizing scikit-learn’s OneHotEncoder to encode the "Embarked" column into binary columns.
  3. SQL CASE expressions: Writing one CASE expression per category to encode "Embarked" categories into additional columns.
  4. Ibis: Combining mutate, case, across, and selectors to express the same relatively complex SQL.

All techniques converted the categorical variable “Embarked” into binary columns representing unique categories. The choice depends on tools, dataset size, and specific use cases.

With Ibis, you can adopt standard Python workflows regardless of your backend. It’s easy to try for yourself: install Ibis and get started.

Here’s the Hex Project if you would like to check it out for yourself and give it a try!

Tyler White
Learning The Computers

I constantly seek new ways to learn, improve, and share my knowledge and experiences. Solutions Architect @ Snowflake.