Feature Engineering on the Modern Data Stack

Jordan Volz
17 min read · Mar 3, 2022


Feature engineering is a crucial part of any ML workflow. At Continual, we believe that it is actually the most impactful part of the ML process and the one that should have the most human intervention applied to it. However, in ML literature, the term is often overloaded among several different topics, and we wanted to provide a bit of guidance for users of Continual in navigating this concept. In this article, we’ll break down feature engineering into several different concepts and provide our guidance on each.

Overview

Feature engineering is often ambiguously defined, which can lead to confusion for data workers. What exactly is feature engineering? Simply put, feature engineering is the process of taking raw data and constructing inputs to machine learning models. In reality, this is not just one topic, but many techniques coming together to form a utility-belt of sorts for users. In particular, we believe that the main components of feature engineering are:

  • Feature Creation
  • Feature Encoding
  • Feature Extraction
  • Feature Selection
  • Prediction Engineering
  • Automated Feature Engineering

Historically, many ML platforms have not done a good job of differentiating between these tasks. The status quo for Generation 1 and Generation 2 platforms is something like the following:

  1. Connect to and pull data from a source system.
  2. Write a bunch of Python code to transform data.
  3. Build an ML Model.
  4. Throw a notebook “over the fence” for someone else to operationalize.

All the various parts of feature engineering are contained within step #2, and making sense of it depends on the reader's ability to interpret code and the author's ability and desire to communicate effectively in code. The result is something like a single DataFrame, which is unique to each use case and comes with a unique pipeline. Furthermore, operationalizing each use case has its own obstacles that inevitably delay projects and cause maintenance headaches down the road. When we take a step back and look at the ML process as a whole, it's not difficult to see why so many companies struggle to get value out of ML with Gen 1 and Gen 2 platforms.

Data-first ML platforms like Continual are on the rise in Generation 3. A Gen 3 platform puts data at the center of the ML workflow and builds a coherent production process around it. As previously discussed, feature stores are a crucial component of a Gen 3 architecture, and the feature store is the key piece of technology that allows us to break feature engineering down into discrete parts and build robust, automated processes around each. Without a feature store, a lot of this would not be possible.

Below, we’ll dive into each topic in much more detail, paying particularly close attention to how each should be dealt with in an ideal Gen 3 platform. Are these all steps that require manual intervention and coding? Or, can we learn from best practices cultivated over the years and start to enable users to take the “fast path” when interested? As a spoiler, our thesis is that a well-designed modern ML platform should have users focus on creating features (via feature creation) and targets (via prediction engineering), and a lot of the rest should be automated away with escape hatches made available for advanced users.

Feature Creation

Feature creation is the process of creating new features from existing data. This typically includes operations like joining data together, creating window functions, aggregating data across groups, etc. These all go beyond a simple shift in the representation of a feature; i.e., we're not simply looking at existing data in a different way, we're actually creating something new.

For example, I may have a table that contains all the raw data around how users interact with my website. One of my first tasks may be to sessionize this data to provide better context around what users accomplished during their time on the website: how much time was spent in total, the number of clicks, the percentage of the page scrolled, etc. It's easy to dig into this data and start thinking of useful features to construct. As I start building ML models, I may want to execute additional operations, like summarizing user engagement over a rolling 3-month period. In this workflow I am explicitly creating many new pieces of data from existing data: this is feature creation.

In the above example, we can turn web events into sessions and build features, like links_clicked, on the lifespan of the session.
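To make this concrete, here's a minimal pandas sketch of that sessionization, assuming a hypothetical `events` table and a 30-minute inactivity cutoff between sessions (both are illustrative choices, not Continual's implementation):

```python
import pandas as pd

# Hypothetical raw web events: one row per user interaction.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "event_ts": pd.to_datetime([
        "2022-03-01 10:00", "2022-03-01 10:05", "2022-03-01 11:30",
        "2022-03-01 09:00", "2022-03-01 09:02",
    ]),
    "is_click": [1, 0, 1, 1, 1],
}).sort_values(["user_id", "event_ts"])

# Start a new session whenever a user is idle for more than 30 minutes.
gap = events.groupby("user_id")["event_ts"].diff() > pd.Timedelta(minutes=30)
events["session_id"] = gap.groupby(events["user_id"]).cumsum()

# Session-level features: start time, duration, and links clicked.
sessions = events.groupby(["user_id", "session_id"]).agg(
    session_start=("event_ts", "min"),
    session_minutes=("event_ts", lambda ts: (ts.max() - ts.min()).seconds / 60),
    links_clicked=("is_click", "sum"),
).reset_index()
```

The same logic translates naturally to SQL window functions if the work lives in dbt instead.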

A colleague of mine once joked that “feature engineering is basically data engineering that your data engineering team doesn’t want to do.” Indeed, feature creation is very similar to data engineering, which is focused on transforming raw data into usable data for analytical and operational use cases. Feature creation can be viewed as a “last-mile” problem for the data engineer or a “first-mile” problem for the data scientist, and this blurs the responsibilities between the two. Although we might expect this overlap to be a great opportunity for collaboration, it has curiously manifested itself as a point of great contention for data teams.

You may have been fortunate enough to hear the cries from the gallows in years past concerning how “data scientists spend X%+ of their time on feature engineering tasks”. Yes, it’s sad that they don’t get to spend more time doing data science, but this really highlights a misalignment of workflows. Data scientists primarily work in Python because, in a Gen 1 or Gen 2 workflow, it’s inevitable that they’ll have to use Python to build a model. Thus, it doesn’t make a lot of sense to force them to use SQL or Java/Scala just to align with the data engineering team, only to turn back to Python in the next step of the process. What arises out of this conflict is that data science teams end up performing their own feature engineering, duplicating much of the data engineering team’s work, duplicating a lot of data as well, and maintaining their own pipelines. After a while, we have a real “pipeline jungle” on our hands. These pipelines are rarely robust or easy to productionize and maintain, so ML use cases stall under these constraints.

This conflict is often erroneously cast as a “SQL vs Python” debate, but we think these two are merely innocent bystanders; the real culprit is the adoption of ML platforms that force fissures between different parts of the data team. Gen 3 platforms try to fix this. A feature store is the very thing that resolves this conflict: it acts as an abstraction layer that separates the creation of features from the consumption of features (which previously were tied together in Python scripts), so it’s now rather immaterial how the features are created. The data engineering team should be populating the feature store with a lot of data, and data scientists can do so as well when they have advanced operations to register and share.

The next step is to unite both teams under a common tool for feature creation. At Continual, we believe that dbt can be that tool. Many Gen 1 and Gen 2 platforms have inevitably gone down the path of building out their own “data wrangling” capabilities and pitched them as an awesome feature, but we’ve always thought this just creates more friction in data teams. These tools are typically not robust enough for pure data engineering teams, lack things like CI/CD integration, and inevitably fail to offer real value. dbt has already solidified itself as a premier tool for data engineers and is one of the hottest open-source tools currently on the market. For data scientists, what’s missing is a quick hook into a feature store and the ML workflow; this is exactly what Continual offers with its dbt integration. This is our vision: uniting data teams under a common workflow.

Pairing up Continual and dbt provides full coverage along the ML workflow.

The main objection to adopting dbt as the primary feature creation tool for data science teams is that they don’t wish to be limited to ‘only’ using SQL. As we’ll explore in the rest of this blog, we believe that many of the Python-esque tasks a user may wish to do in the ML workflow can be handled by a well-designed declarative system that morphs it from a coding task to a configuration decision — this is the insight that teams at Apple, Uber, and others have had. However, it’s also possible that dbt may not always have this restriction; recent comments by dbt Labs CEO Tristan Handy indicate that we may start seeing non-SQL support in dbt. Imagine being able to push down Pyspark to Databricks or pure Python into Coiled, etc.

As users start working with Continual, they’ll notice a shift in the ML workflow’s focus from model building, evaluation, and tracking to feature construction. We think this is where users should spend most of their time. In doing so, companies build up a wealth of data knowledge that can quickly be applied across use cases. Continual focuses on automating the workflow after this point, making it very easy to build and iterate on models in minutes. As we start working with customers on their use cases, it’s not uncommon that we build dozens of models to try out different scenarios, all within a matter of minutes. We think this workflow represents a huge unlock for companies and allows them to adopt ML at a record pace.

Feature Encoding

Next on the list, feature encoding is the process of transforming raw features into new features that are better suited for model building. If that sounds similar to feature creation above, the main difference is that we’re not adding any new information in feature encoding; rather, we’re just changing the representation of a feature into something that the ML algorithm prefers to handle.

For example, as I collect data to use as inputs to a machine learning model, it’s common that different numerical fields will have different scales. Perhaps we have something like “age of account,” which is in single digits, “age of customer,” which is in double digits, and “account balance,” which is anywhere from three digits to seven digits. It’s common practice to apply scaling and normalization to numerical inputs so that they are all mapped to values on the same scale (typically (-1,1)). Other types of encoding apply to other types of input: categorical variables are typically hashed or one-hot encoded, strings are often tokenized or indexed, etc. Why we do this is outside the scope of this article, but suffice it to say that most ML algorithms expect inputs to be provided in this way. Without these encodings, models can produce poor results.

In the above example, we encode the state column with a super simple hash function.
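As a rough illustration of encoding in practice, here's a minimal scikit-learn sketch with hypothetical column names; it uses min-max scaling and one-hot encoding (a hash encoder, as in the caption above, would be an alternative for the categorical column):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical model inputs on very different scales.
df = pd.DataFrame({
    "age_of_account": [2, 7, 4],
    "account_balance": [1_500, 250_000, 9_800_000],
    "state": ["CA", "NY", "CA"],
})

# Scale numeric columns onto (-1, 1) and one-hot encode categoricals.
encoder = ColumnTransformer([
    ("scale", MinMaxScaler(feature_range=(-1, 1)),
     ["age_of_account", "account_balance"]),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["state"]),
])
X = encoder.fit_transform(df)  # ready for most ML algorithms
```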

Feature encoding has largely been a task for the data scientist. There are a lot of nuances here, and different ML frameworks may expect specific treatments of the data before it’s plugged in. Many ML libraries have built-in functions that help with these operations, but applying the right encoding to the right feature is squarely in the data scientist’s wheelhouse.

If one is trying to “democratize” their ML system, feature encoding is the first bump in the road. We wouldn’t expect non-DS users to know how to perform these operations, nor to know which encodings each framework expects. However, this is one area where an AutoML framework can be super useful. A good AutoML framework will perform all required encodings before plugging features into models. Given that an AutoML framework is likely working with many different algorithms and frameworks, ensuring that encodings are done properly is a non-trivial task, but it is a huge value add and allows non-DS users to get further into the ML workflow.

An ideal Gen 3 system will automate feature encodings based on input types from the user. The platform developers should be able to understand which encodings are needed for algorithms that will be used during the experimentation phase, and encodings should be done automatically based on the input types provided. For advanced use cases, it may be necessary to provide escape hatches to users to change the system’s behavior as well. For users of Continual, encodings are automatically selected based on the input types of features. Types are automatically inferred from the data types of the underlying data warehouse, but users can always override type association via their feature set and model definitions.

Feature Extraction

Feature extraction is the process of representing data with a smaller amount of features while still retaining the core information of the data set. This is closely related to dimensionality reduction and is largely applied to complex data types like text, images, video, and audio. Popular extraction techniques include Principal Component Analysis, Singular Value Decomposition, Word2Vec, BERT, fastText, and others; many of these are known as embeddings or transformers.

Whereas feature encoding works on each feature value individually, feature extraction techniques work on the data set as a whole, with the goal of reducing its complexity for use in a subsequent ML model. Embeddings and transformers are often pre-trained, meaning that they have already been trained on a particular task like facial recognition (FaceNet) or natural language processing (BERT). Users can then leverage these pre-trained models in their own ML pipelines via transfer learning. This is not only a huge productivity boon but also a nice economic one, as many pre-trained models are not cheap to recreate.

For example, let’s say we are trying to build a model that uses images as input. Our images are 2048 x 1536 pixels, and each pixel is stored in RGB color format, meaning that each image is represented digitally as over 9 million integers(!). Clearly, this is not going to be an efficient way to work with the data. However, if we gave this task to a human and told them we were trying to build a facial recognition system, they might decide to represent each image as a collection of features: hair color, the distance between the eyes, the distance between the nose and chin, the thickness of hair, etc. After building a few dozen features, we would probably have a fairly good system that we could use to classify faces, and we would have done so using far fewer than 9 million features. At its core, this is what feature extraction does: it takes a highly complex piece of data, like text or an image, and extracts a smaller set of features that still accurately represents the data.
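Here's a minimal sketch of that idea using PCA in scikit-learn; the array shape and component count are arbitrary stand-ins for real image data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for 100 flattened images (real pixel vectors would be ~9M long).
pixels = np.random.rand(100, 4096)

# Project each image onto 50 components that capture most of the variance.
pca = PCA(n_components=50)
embeddings = pca.fit_transform(pixels)  # shape: (100, 50)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```

For images or text in practice, a pre-trained embedding model (FaceNet, BERT, etc.) typically replaces PCA, but the shape of the operation is the same: many raw values in, a compact feature vector out.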

Feature Extraction can be used for Machine Learning and Deep Learning problems. Image via ResearchGate.

Feature extraction gets complicated very quickly. These are topics that even data scientists struggle with, and, although they can be absolutely essential to get good results when working with complex data, they can still trip up experts when it comes to execution. We believe a good Gen 3 platform will approach feature extraction similarly to encoding: the implementation should be automated, and users should be able to switch it on and off per use case via configuration. Note also that feature extraction is generally associated with complex data types; for users executing ML on top of tabular data, it is likely unnecessary.

Feature Selection

Feature Selection is the process of choosing a subset of features to use in ML model construction. For example, I may initially choose to include 100 features in a model, but then narrow the scope to 20 for production use. Why would we reduce the number of features we are using in a model? Aren’t more features always better? There are various arguments for using feature selection, including simplifying the model, reducing training times, and trying to prevent overfitting by reducing the dimensionality of the feature space.

There are many different techniques for feature selection. Some are built into the algorithm, such as random forest, which has a parameter for the maximum number of features to use, and LASSO regression, which adds an L1 penalty that begins to reduce the contribution of unimportant features down to 0. Others are independent of the model. For example, we could employ a method to score subsets of features until an optimal combination is discovered, as is done in stepwise regression, or just run a statistical test between features and the target and select a limited number of features based on the result, as is sometimes done with the chi-squared test.
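As an illustrative sketch of both flavors, the snippet below pairs a chi-squared filter with an L1-penalized model on synthetic data; the dataset, `k`, and penalty strength are arbitrary choices, not recommendations:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

# Toy data: 100 candidate features, only a few of which are informative.
X, y = make_classification(n_samples=500, n_features=100,
                           n_informative=10, random_state=0)

# Model-independent: keep the 20 features scoring highest on a
# chi-squared test against the target (chi2 requires non-negative inputs).
X_top20 = SelectKBest(chi2, k=20).fit_transform(X - X.min(axis=0), y)

# Built into the model: an L1 (LASSO-style) penalty shrinks the
# coefficients of unhelpful features to exactly 0.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
kept = np.flatnonzero(clf.coef_[0])  # indices of surviving features
```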

It’s important to note that model performance is not one of the known benefits of feature selection. This is a somewhat controversial topic, but suffice it to say that there isn’t general consensus that feature selection is needed to get better models, and several high-tech companies appear to actively avoid it. Additionally, feature selection can be really detrimental to model performance if it is not done correctly. At this time, a good strategy is to utilize any feature selection built into ML algorithms and to manually inspect model results to see if any additional intervention is needed.

With Continual, we generally recommend the following:

  1. It’s safe to remove features that have negative feature importance. These are known to be harming model performance.
  2. It’s also safe to remove features that are strongly correlated with other features. Such features are likely not contributing much to model performance, but they may be reducing the quality of the feature importance analysis (see the sketch after this list for one way to spot them).
  3. Investigate the results of data profiling and data checks for any model version and use this information to take appropriate action. For example, if a column is largely null values, or if a feature set’s time index has little overlap with the model definition, the feature can likely be dropped from the model.
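For recommendation #2, a simple helper like the following can surface candidate columns to drop; this is a sketch, and the 0.95 cutoff is an assumed threshold rather than a Continual default:

```python
import pandas as pd

def strongly_correlated(features: pd.DataFrame, threshold: float = 0.95) -> list:
    """Return one column from each pair whose absolute correlation exceeds threshold."""
    corr = features.corr().abs()
    cols = corr.columns
    to_drop = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            # Keep the first column of each highly correlated pair.
            if corr.iloc[i, j] > threshold and cols[i] not in to_drop:
                to_drop.add(cols[j])
    return sorted(to_drop)
```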

Prediction Engineering

Feature creation focuses on constructing the inputs to machine learning models; prediction engineering focuses on creating the labels. This is a newer trend in machine learning that has evolved primarily out of the creation of feature stores. Before feature stores, there was never really a need to create features separately from labels: for any use case, both features and labels are needed, so they would all be created together in the same workflow. With a feature store, however, features can be generated independently of the models that use them, which opens up a new field of engineering the model labels independently of features. It might seem a little odd to include prediction engineering as part of “feature engineering,” but the workflow is very similar, and it is a distinct task, separate from model training itself.

In truth, prediction engineering is the defining quality of an ML use case. Whereas features can be shared across different ML models, the label itself will be unique per model. For some use cases, producing labels will be as easy as a column look-up, whereas others require more deliberate consideration. Recently we wrote about constructing labels for time-based customer churn use cases, with the process summarized below:

Defining churn on a dataset requires business context and the ability to understand and reason about compelling events in the sales process.

Prediction engineering and labeling solutions have also been on the rise in recent years, with open-source projects and commercial solutions, like Labelbox (images) and Snorkel (text), providing great functionality to assist users in this task. For Continual users who are focused on tabular data in their cloud data warehouse, prediction engineering generally amounts to constructing the right SQL query to provide labels for a model definition. Like feature creation, we think this is something that should largely be tackled via human grit, as significant business context is often needed to set labels correctly.
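As a rough sketch of what time-based label construction looks like in code, here's a pandas version of a churn label; the 90-day inactivity window, table, and prediction date are all illustrative assumptions, not a prescribed churn definition:

```python
import pandas as pd

# Hypothetical transaction history.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "purchase_date": pd.to_datetime(
        ["2021-11-15", "2022-02-20", "2021-10-01", "2022-02-27"]
    ),
})

# Label a customer as churned if their last purchase falls 90+ days
# before the prediction date (window and date are assumed for the sketch).
prediction_date = pd.Timestamp("2022-03-01")
last_purchase = transactions.groupby("customer_id")["purchase_date"].max()
labels = ((prediction_date - last_purchase).dt.days >= 90).rename("churned")
```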

Automated Feature Engineering

Our last section here is definitely the least explored and most tantalizing: automated feature engineering. Automated feature engineering involves the automation of feature creation, and can additionally include feature selection and feature encoding.

While many features may require human interaction to construct, some can be automated away. For example, any time we wish to use a date or timestamp as a feature, we could automatically break it into its component parts: year, month, day, day of week, hour, “is the day a holiday?”, etc. Many use cases require calculating window functions on top of existing features. Constructing oodles of rolling windows can be time-consuming and error-prone, but it’s not hard to imagine that users could simply specify a partition, window frame, and aggregation function over an existing feature to define a new one (e.g., apply avg to order_total over the last month of data, grouped by user, to produce customer_one_month_order_average). Depending on how one’s data is stored and how well the underlying data warehouse supports window operations, these operations can become non-trivial, and automating their construction can save a lot of time and effort.
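A sketch of what that configuration-to-feature translation might look like in pandas, with hypothetical table and column names:

```python
import pandas as pd

def window_feature(df, partition, time_index, frame, agg, column, name):
    """Turn a (partition, window frame, aggregation) config into a new feature."""
    df = df.sort_values([partition, time_index]).reset_index(drop=True)
    rolled = (
        df.set_index(time_index)
          .groupby(partition)[column]
          .rolling(frame)
          .agg(agg)
    )
    df[name] = rolled.to_numpy()  # row order matches the sorted frame
    return df

# Hypothetical orders table.
orders = pd.DataFrame({
    "user_id": [1, 1, 2],
    "order_date": pd.to_datetime(["2022-01-05", "2022-01-20", "2022-02-01"]),
    "order_total": [20.0, 40.0, 15.0],
})

# "Apply avg to order_total over the last month, grouped by user."
orders = window_feature(orders, "user_id", "order_date", "30D", "mean",
                        "order_total", "customer_one_month_order_average")

# Timestamp decomposition can be automated the same way.
orders["order_month"] = orders["order_date"].dt.month
orders["order_day_of_week"] = orders["order_date"].dt.dayofweek
```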

Of course, more complex systems for automated feature engineering exist. Two popular ones are FeatureTools and tsFresh: FeatureTools was designed to work on tabular data, tsFresh on time series data. Both are able to easily generate many features from raw data with a little input from the user. FeatureTools, for example, utilizes a system of feature creation known as Deep Feature Synthesis (DFS). DFS operates under the principle of stacking operations on top of each other to create new features. For example, if one of my feature sets is sales data, for each order I may calculate a new feature, days_since_last_transaction, which gives me the number of days between the current transaction and the previous one. I could then calculate a rolling average on my new feature to give me yet another feature, average_days_since_last_transactions_over_one_year, and so on. With each additional operation, we are “stacking” a new operation on the feature to create a brand new feature. For any given use case, we may wish to do this with many different features, and the insight of DFS is that these operations are pretty common and can be quickly automated with a little configuration.

In the example above, we can continue to build additional features by stacking operations on top of each other.
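To show the stacking idea without pulling in FeatureTools itself, here's a hand-rolled pandas sketch of the two operations described above; this is illustrative of DFS-style stacking, not the DFS API:

```python
import pandas as pd

# Hypothetical sales data: one row per transaction.
sales = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_date": pd.to_datetime(
        ["2021-01-03", "2021-01-10", "2021-03-01", "2021-01-05", "2021-06-01"]
    ),
}).sort_values(["user_id", "order_date"]).reset_index(drop=True)

# Operation 1: days between each transaction and the user's previous one.
sales["days_since_last_transaction"] = (
    sales.groupby("user_id")["order_date"].diff().dt.days
)

# Operation 2: stack a one-year rolling average on top of operation 1.
sales["average_days_since_last_transactions_over_one_year"] = (
    sales.set_index("order_date")
         .groupby("user_id")["days_since_last_transaction"]
         .rolling("365D")
         .mean()
         .to_numpy()
)
```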

The field of automated feature engineering is still very young. We have high hopes for the development of automated feature engineering and believe that advancements will continue to bring it closer to the mainstream. Similar to feature encoding and feature extraction, automated feature engineering will be best implemented in a Gen 3 system via a declarative system. It’s ideal to abstract the messy details away from the end-user and allow them to simply define the desired operations via configuration. While Continual offers some light automated feature engineering today, we’re currently evaluating more complex frameworks like DFS for future inclusion in the tool.

Summary

We’ve provided a lot of information in this article about how the approach to feature engineering changes in a Gen 3 ML platform like Continual. The following diagram summarizes where we envision different roles coming into play in the traditional ML workflow, with feature engineering expanded upon per the context provided above. We’ve also contrasted this with earlier iterations of ML platforms in Gen 1 and 2. Use this as a guide as you chart out your own organization’s journey with Machine Learning and AI.

Ready to sink your teeth into data-first AI? Sign up for your free trial of Continual today.
