The Art of Feature Engineering: A Distributed Approach

DISH Wireless DevEx
8 min read · Sep 11, 2023


By Sindhu Chowdary Chirumamilla, Evgeniya Dontsova, and Ankita Patil, Members of Scientific Staff, DISH Wireless

The foundation of any data-driven project lies in data discovery and collection. A clean, complete, and digitized dataset is a prerequisite for developing AI/ML models that extract valuable business insights from the available data. However, challenges emerge because data is often scattered across various sources, handled by different teams, and analyzed by multiple parties. Projects involving a large team of developers raise a further question: how do you efficiently enable simultaneous contributions? Beyond data discovery and collection, feature engineering is a critical aspect of every data-driven project. It plays a pivotal role in extracting meaningful insights from raw data, influencing model performance, and guiding data-driven decision-making. But how can feature engineering be effectively managed in a collaborative environment when dealing with diverse data sources?

DISH Wireless developed a mechanism to manage feature engineering across multiple data sources by building a feature pipeline that simplifies working with distributed data sources and enhances the feature-building process, converting the traditional monolithic SQL approach into a distributed code architecture. For our use case, this feature pipeline generates telecom subscriber attributes from different domains, including call detail records, account management, and customer experience/care. These data points help build subscriber profiles that we then used to perform machine learning-based subscriber segmentation analysis and persona discovery.

Feature Engineering for Machine Learning

A paradigm shift in the AI/ML community — from model-centric to data-centric development — has increased focus on data management by designing well-structured, reusable and shareable solutions across different teams. This led to the introduction of the feature store concept, pioneered by Uber in 2017. (See the milestones summary here: Feature Store Milestones.)

What is a feature store and how does it differ from a data warehouse? In short, a feature store is a data warehouse of features for machine learning, with offline (for training) and online (for inference) storage capabilities (see, e.g., What Is a Feature Store? | Tecton and Hopsworks — What is a Feature Store?). Typically, a feature store consists of three components: transform, store, and serve. Numerous solutions are available that differ both in implementation and in which components they focus on. For example, Feast is an open source feature store, managed by the Linux Foundation, that provides the store and serve components.
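
For illustration, here is a minimal sketch of the store and serve sides using Feast's Python SDK; the feature and entity names are hypothetical, a Feast feature repository is assumed to already exist, and the exact calls depend on the Feast version:

```python
import pandas as pd
from feast import FeatureStore

# Entities we want features for (hypothetical subscriber IDs).
entity_df = pd.DataFrame(
    {
        "subscriber_id": ["sub-001", "sub-002"],
        "event_timestamp": pd.to_datetime(["2023-08-01", "2023-08-01"]),
    }
)

# Assumes an existing Feast feature repository at this path.
store = FeatureStore(repo_path="feature_repo/")

# Store side (offline): assemble a point-in-time correct training dataframe.
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "subscriber_usage:avg_call_duration",
        "subscriber_usage:data_volume_gb",
    ],
).to_df()

# Serve side (online): fetch the latest feature values for inference.
online = store.get_online_features(
    features=["subscriber_usage:avg_call_duration"],
    entity_rows=[{"subscriber_id": "sub-001"}],
).to_dict()
```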

Several organizations, like Twitter and Salesforce, have customized Feast to build their own feature store products.

The feature store is undoubtedly an infrastructure investment, and many companies opt to develop their own solution. This is particularly true for heavily regulated industries, such as finance, that deal with sensitive data. In situations involving complex data setups, strict compliance needs, unique workflows, or challenging data migration, the time required to onboard an existing solution becomes comparable to the time needed to build and establish an in-house feature store.

Our team focused on the transform aspect of the feature store, or the feature engineering pipeline. Specifically, our feature engineering pipeline is capable of facilitating efficient data preprocessing, feature extraction, and transformation. Additionally, our pipeline can be extended by integrating it with feature store solutions that specialize in store and serve components, supporting collaboration and reusability across different domains.

Feature Engineering Pipeline

Our feature engineering pipeline uses a distributed code architecture in Python, giving data scientists the means to effectively address the constraints of a monolithic SQL-based approach. This architecture is particularly valuable for incorporating Python processing functionality and handling diverse data sources efficiently, making it a valuable asset in any data science ecosystem. The feature engineering component is composed of two layers (shown in Fig. 1): the data ingestion SQL layer retrieves raw data, while the data processing Python layer enables the engineering of complex features and the integration of machine learning techniques within feature generation. This shift enhances flexibility and efficiency in creating advanced features, allowing us to build richer datasets that boost analytics capabilities and improve modeling outcomes.

Figure 1: Feature Engineering Pipeline schematics.
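
To make the two-layer split concrete, here is a minimal sketch of a single feature unit; the table, column, and function names are hypothetical, not the production pipeline's:

```python
import pandas as pd

# --- SQL layer: a narrow ingestion query that only retrieves raw data. ---
CALL_RECORDS_SQL = """
SELECT subscriber_id,
       call_start_ts,
       call_duration_sec
FROM   cdr.call_detail_records
WHERE  call_start_ts >= DATE '2023-08-01'
"""

# --- Python layer: the feature logic lives in code, not in the query. ---
def avg_daily_call_minutes(raw: pd.DataFrame) -> pd.DataFrame:
    """Engineer one feature: average daily call minutes per subscriber."""
    raw = raw.assign(
        call_date=raw["call_start_ts"].dt.date,
        call_minutes=raw["call_duration_sec"] / 60.0,
    )
    daily = (raw.groupby(["subscriber_id", "call_date"])["call_minutes"]
                .sum()
                .reset_index())
    return (daily.groupby("subscriber_id")["call_minutes"]
                 .mean()
                 .rename("avg_daily_call_minutes")
                 .reset_index())

# In the pipeline, the SQL layer would execute CALL_RECORDS_SQL against the
# warehouse and hand the resulting DataFrame to the Python layer, e.g.:
# features = avg_daily_call_minutes(run_query(CALL_RECORDS_SQL))
```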

Our feature engineering pipeline operates by following a sequence of well-defined steps, ensuring that the most pertinent and informative features are extracted from the data. As shown in Fig. 1, the pipeline schematic visualizes the interactions among the data sampling, feature inputs, utilities, and generate-subscriber-attributes modules. Here’s how it works:

The process begins on the right side of Fig. 1, where features are extracted from distributed data sources using the SQL layer and then transformed into a format suitable for additional processing using the Python layer. These transformed features are then fed as inputs to the Feature Inputs module. Since we are handling vast volumes of data originating from various distributed sources, we implemented a sampling technique to obtain a representative subset of the entire dataset. The selected samples are collected in JSON format and supplied to the Feature Inputs module through the sampling configuration file. This provides a convenient way of testing various samples, as sketched below.
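
As an illustration of the sampling step (the file layout, keys, and helper names are hypothetical), a JSON sampling configuration might be loaded and applied to the ingestion queries like this:

```python
import json

# sampling_config.json (hypothetical layout):
# {
#   "sample_name": "urban_prepaid_1pct",
#   "subscriber_ids": ["sub-001", "sub-002", "sub-003"]
# }

def load_sample(path: str) -> list[str]:
    """Read the sampled subscriber IDs collected in JSON format."""
    with open(path) as f:
        return json.load(f)["subscriber_ids"]

def apply_sample(base_sql: str, subscriber_ids: list[str]) -> str:
    """Restrict an ingestion query to the sampled subscribers."""
    id_list = ", ".join(f"'{s}'" for s in subscriber_ids)
    return f"{base_sql}\n  AND subscriber_id IN ({id_list})"

# Swapping the configuration file is enough to test the pipeline on a
# different sample without touching any feature code.
```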

The Feature Inputs module is organized into distinct input scripts, broken down by the telecom categories mentioned earlier, which can be modified based on your use case. The Utilities module handles the core functionality of the transformation process; it consists of reusable functions that significantly enhance the efficiency of feature transformation. All features from the various categories are registered in the Feature Inputs configuration file, along with a SQL query that creates a base ID table, which is used to merge all the transformed features at a later stage. The feature inputs and the base ID table retrieved from the configuration are then employed to construct features. Thanks to the modular structure of the feature-building units, features are generated in parallel and then merged with the base ID table sequentially. The final outcome of this pipeline is a complete, digitized features table suitable for further machine learning use cases, as shown in Fig. 2.

Figure 2: Schematic representation of inputs and outputs required by the presented Feature Engineering Pipeline.
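
The following simplified sketch (with hypothetical builders, categories, and IDs) illustrates the general idea behind the Feature Inputs registry, the base ID table, and the sequential merge that produces the final features table:

```python
import pandas as pd

# Base ID table: one row per subscriber, produced by the SQL query declared
# in the Feature Inputs configuration (stubbed here with a literal DataFrame).
base_ids = pd.DataFrame({"subscriber_id": ["sub-001", "sub-002", "sub-003"]})

# Feature Inputs: each telecom category contributes feature-building units
# that return a DataFrame keyed by subscriber_id.
def cdr_features(ids: pd.DataFrame) -> pd.DataFrame:
    return ids.assign(avg_daily_call_minutes=[12.4, 3.1, 45.0])

def account_features(ids: pd.DataFrame) -> pd.DataFrame:
    return ids.assign(months_on_plan=[6, 18, 2])

FEATURE_INPUTS = {
    "call_detail_records": [cdr_features],
    "account_management": [account_features],
}

# Build each unit (in practice, in parallel), then merge sequentially
# onto the base ID table to produce the final features table.
features = base_ids.copy()
for builders in FEATURE_INPUTS.values():
    for build in builders:
        features = features.merge(build(base_ids), on="subscriber_id", how="left")

print(features)
```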

Our pipeline incorporates several key components that contribute to its efficiency and success. Alongside the central feature inputs, the utilities folder streamlines the feature engineering process by providing essential tools and functions, reducing redundant code and accelerating development. Data sampling techniques enable efficient and insightful development by focusing on specific groups and targeted data subsets. Furthermore, a distinctive aspect of the pipeline is its capability to engineer features concurrently, making use of modular feature building units and parallel computing to decrease feature generation time and improve overall development performance.

As we dig deeper into the core of our feature engineering pipeline, it becomes apparent that the Feature Inputs module plays a central role in its effectiveness by systematically organizing diverse domains. This well-structured arrangement helps developers collaborate, share, update, and reuse feature engineering efforts across multiple projects.

Design Principles

Following are the key principles of our feature engineering pipeline:

Modularity — One Feature at a Time

Modularity is at the core of our pipeline’s design. We achieved it by adopting a feature engineering approach that builds one feature at a time and splits the processing into two distinct layers. The data ingestion layer is optimized with SQL queries, ensuring that any standardization or sampling rules are applied automatically, while the Python-based data processing layer offers greater functionality.
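
One way to picture the automatic application of such rules (a hedged sketch; the helper, column, and table names are illustrative, not the production code) is a shared query builder that stamps the same standardization and sampling clauses onto every feature's ingestion query:

```python
def build_ingestion_query(select_sql: str,
                          start_date: str,
                          sample_clause: str | None = None) -> str:
    """Wrap a feature's raw SELECT with standard date and sampling rules."""
    query = f"{select_sql}\nWHERE event_date >= DATE '{start_date}'"
    if sample_clause:
        query += f"\n  AND {sample_clause}"
    return query

# Every feature unit supplies only its SELECT; the shared rules are applied once here.
q = build_ingestion_query(
    "SELECT subscriber_id, data_volume_gb FROM usage.daily_data_usage",
    start_date="2023-08-01",
    sample_clause="subscriber_id IN (SELECT subscriber_id FROM tmp.sample_ids)",
)
print(q)
```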

Simplicity — Divide and Conquer Paradigm

Separating the ingestion (SQL) and processing (Python) layers allows for simpler, more readable SQL queries. By decomposing the Python processing layer into smaller problems and distributing it across different parts of the codebase, such as the feature inputs and utilities modules, we made the scripts easier to read, maintain, and upgrade.
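
For example, a small helper shared through the utilities module keeps each feature input script short; the function below is a hypothetical illustration rather than an actual utility from our codebase:

```python
import pandas as pd

# utilities module: a generic helper reused by many feature input scripts.
def rolling_mean_per_subscriber(df: pd.DataFrame, value_col: str,
                                date_col: str, window_days: int) -> pd.DataFrame:
    """Rolling mean of a daily metric per subscriber (assumes one row per day)."""
    df = df.sort_values(["subscriber_id", date_col])
    df[f"{value_col}_{window_days}d_mean"] = (
        df.groupby("subscriber_id")[value_col]
          .transform(lambda s: s.rolling(window_days, min_periods=1).mean())
    )
    return df

# feature input script: a one-liner instead of repeating the window logic.
# usage = rolling_mean_per_subscriber(daily_usage, "data_volume_gb", "usage_date", 7)
```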

Parallelism — Power of Parallel Computing

Parallelism is a critical aspect of our feature engineering approach. The modular nature of feature building units allows for the parallel construction of features. This capability reduces processing time significantly, making our pipeline an ideal choice for large-scale feature engineering tasks.
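
A minimal sketch of this idea using Python's standard library (our production scheduler may differ):

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import repeat

import pandas as pd

def run_builder(builder, base_ids: pd.DataFrame) -> pd.DataFrame:
    """Execute one feature-building unit against the base ID table."""
    return builder(base_ids)

def build_features_in_parallel(builders, base_ids: pd.DataFrame) -> pd.DataFrame:
    """Build independent feature units in parallel, then merge sequentially.

    Builders must be module-level functions so they can be pickled and
    shipped to worker processes.
    """
    with ProcessPoolExecutor() as pool:
        frames = list(pool.map(run_builder, builders, repeat(base_ids)))
    merged = base_ids.copy()
    for frame in frames:
        merged = merged.merge(frame, on="subscriber_id", how="left")
    return merged
```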

Flexibility — Paving the Way for Sophistication

Python’s inherent flexibility empowers data scientists to engineer more complex features, even integrating machine learning techniques during feature generation. This adaptability widens the scope of feature engineering possibilities, leading to more innovative and impactful models.
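
For instance, a clustering step can turn raw usage metrics into a categorical attribute during feature generation; the sketch below uses scikit-learn's KMeans with hypothetical column names, not one of our actual subscriber attributes:

```python
import pandas as pd
from sklearn.cluster import KMeans

def usage_cluster_feature(usage: pd.DataFrame, n_clusters: int = 4) -> pd.DataFrame:
    """Assign each subscriber a usage-behavior cluster as an engineered feature."""
    cols = ["avg_daily_call_minutes", "data_volume_gb", "sms_count"]
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    labels = model.fit_predict(usage[cols])
    return usage[["subscriber_id"]].assign(usage_cluster=labels)
```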

Templatization — Standardized Feature Inputs

Our pipeline adopts a standardized representation of engineered features, ensuring consistency and compatibility with modern feature stores. The features generated by our pipeline are stored in a format that is easily consumable by various feature store platforms.
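
One possible shape for such a standardized representation (the fields are hypothetical, chosen to map cleanly onto typical feature store schemas):

```python
from dataclasses import dataclass

@dataclass
class FeatureSpec:
    """Standardized description of one engineered feature."""
    name: str         # e.g. "avg_daily_call_minutes"
    entity: str       # join key, e.g. "subscriber_id"
    dtype: str        # e.g. "float64"
    category: str     # e.g. "call_detail_records"
    description: str  # human-readable definition for reuse and discovery

AVG_CALL_MINUTES = FeatureSpec(
    name="avg_daily_call_minutes",
    entity="subscriber_id",
    dtype="float64",
    category="call_detail_records",
    description="Mean of daily total call minutes over the ingestion window.",
)
```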

Benefits of a Distributed Development Approach

By incorporating the key design principles described above, we achieved the following benefits with the developed feature pipeline:

Feature Sharing and Collaboration

By integrating with feature inputs, data scientists can easily share engineered features with other team members. Our pipeline enables feature sharing with proper access controls, allowing multiple users to collaborate on feature engineering tasks without duplicating efforts.

Cross-Project Reusability

One of the key advantages of integrating our transform component with modern feature stores is cross-project reusability. Once features are engineered and stored in the Feature Inputs, they can be reused across different projects and machine learning models. Not only does this save time, but it also ensures consistency in feature engineering across various use cases.

Incremental Feature Updates

As data changes over time, the pipeline supports incremental updates to existing features stored in the Feature Inputs. This allows data scientists to keep features up-to-date without re-engineering them entirely, saving computational resources and ensuring accurate insights.
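
A simplified sketch of the idea (the upsert logic and column names are illustrative):

```python
import pandas as pd

def incremental_update(existing: pd.DataFrame,
                       fresh: pd.DataFrame,
                       key: str = "subscriber_id") -> pd.DataFrame:
    """Upsert newly computed feature rows into the stored feature table."""
    # Keep rows whose key is not being refreshed, then append the fresh ones.
    unchanged = existing[~existing[key].isin(fresh[key])]
    return pd.concat([unchanged, fresh], ignore_index=True)

# Only subscribers with activity after the last watermark are recomputed,
# so the bulk of the feature table is reused rather than re-engineered.
```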

Conclusion

In the DISH Wireless team’s experience, following the design principles described above while developing a feature engineering layer made collaboration on feature-building processes more straightforward and efficient. It allowed us to expand our subscriber attribute base and extract valuable insights in a timely manner. Although the feature engineering process is typically specific to each industry, the main principles and guidelines outlined above can help any data scientist implement a similar feature pipeline layer in their own setting.

About the Authors

Evgeniya Dontsova is a Staff Data Scientist at DISH Wireless. She is a part of the R&D team whose primary focus is on solving emerging problems caused by the new era of connectivity facilitated by 5G network technology. Network optimization and network user experience assessment are her primary interests. Her prior experiences include solving optimization problems and applying machine learning modeling in the oil and gas industry, and academic research in the area of computational modeling of materials.

Sindhu Chowdary Chirumamilla is a Data Scientist at DISH Wireless. She has a strong passion for machine learning, extracting insights from data, and optimizing cloud technology solutions for efficient computing infrastructure. With over four years of IT experience, she applies her expertise to develop innovative solutions for building marketplace applications and refining their connectivity using 5G network technology. She also has prior teaching experience in academia, particularly in computer science.

Ankita Patil is a Data Scientist at DISH Wireless and a former Electrical Design Engineer. Her passion lies at the intersection of Analytics, Data Science, and Machine Learning. She has contributed significantly to building enterprise-level data products for DISH’s 5G Network Technology. Through this blog post, she intends to share insights about feature engineering pipelines, offering readers a valuable perspective on transforming raw data into meaningful information.


DISH Wireless DevEx

We are a community of software developers, data scientists and connectivity enthusiasts building a one-of-a-kind developer platform.