Beyond Big Data Gathering: Creating Value from Feature Engineering
Gone are the days when gathering and organizing vast lakes of data was enough to create value for organizations. Financial institutions in particular have captured significant amounts of data for years but are failing to use this asset in ways that deliver value for their organizations and customers.
by Pejman Makhfi, CTO, Credit Sesame
Some financial services companies are progressing with better organization and architecture of that data, but the real edge will come from a firm’s ability to transform this information into proprietary insights through increasingly sophisticated feature engineering techniques.
Looking Past the Algorithm
A Gartner research executive as early as 2015 commented on the importance of data transformation. He identified a coming “algorithmic economy” and said, “Organizations will be valued not just on their big data but by the algorithms that turn data into actions.”
Subsequent research from Gartner notes that patents mentioning “algorithm” in the title increased thirtyfold between 2000 and 2015, from 570 to 17,000. U.S. financial institutions hold enormous data assets with the potential to create value, yet none were among the top 40 algorithm patent applicants in the five-year period from 2011 to 2016. Most of the top applicants are Chinese companies and universities. The Western companies in the top 40 are all technology organizations, such as IBM, Microsoft, Google and Apple.
While these unique algorithms are certainly important, they are just one element of the overall transformation of the data. Teams need to place more emphasis on the underlying engineering of the features that feed these models, so that the algorithms yield insights and perform accurately and reliably.
The idea of feature engineering is as old as data science itself, but teams have focused less on this area in recent years. The high demand for machine learning and the backlog of projects require teams to work quickly and take advantage of generic tools and platforms. Now, even a company’s least experienced engineers can run data models and get predictions quickly thanks to off-the-shelf algorithms in cloud-based Machine Learning as a Service (MLaaS) offerings like Amazon ML and Google AutoML. With these support systems in place, it is time to look more closely at techniques for feature engineering.
Best Practices for Feature Engineering
Data scientists have developed a range of techniques to power feature engineering and help ensure the reliability and consistency of machine learning results. It is important for an organization’s leaders to understand and make sure their teams are employing these methods.
Three categories of techniques are critical for a team to keep in mind: arithmetic formulas, transformations for context, and more complex algorithmic techniques.
An important approach to feature engineering involves using arithmetic formulas across a set of existing data points. The formulas produce derived features based on interactions between features, ratios and other relationships.
As long as the data team has a strong understanding of the goals of the model and the overall subject matter, use of arithmetic formulas to engineer features can often deliver the biggest impact from among the various techniques.
Examples include using formulas to:
- Calculate “Estimated Value” for a home using an average of “Comparable Sales” by “Square Footage”
- Produce a debt-to-income (DTI) ratio by dividing “Credit Payments” by “Current Income”
- Derive a “Retirement Gap” by calculating the “Future Value” of existing assets and comparing to “Current Income”
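The three formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a production valuation model: the function names, the tuple layout for comparable sales, and the compound-growth formula for future value are assumptions made for the example. In particular, the article does not say how future value is compared to current income, so the `income_multiple` target below is purely hypothetical.

```python
from statistics import mean


def estimated_value(comparable_sales, square_footage):
    """Estimate a home's value from the average price per square foot of comparables.

    comparable_sales: list of (sale_price, square_feet) tuples (hypothetical layout).
    """
    price_per_sqft = mean(price / sqft for price, sqft in comparable_sales)
    return price_per_sqft * square_footage


def debt_to_income(credit_payments, current_income):
    """DTI: credit payments divided by income, both measured over the same period."""
    if current_income <= 0:
        return None  # the ratio is undefined without positive income
    return credit_payments / current_income


def future_value(present_value, annual_rate, years):
    """Compound growth: FV = PV * (1 + r) ** n."""
    return present_value * (1 + annual_rate) ** years


def retirement_gap(assets, annual_rate, years, current_income, income_multiple=10):
    """ASSUMPTION: the target nest egg is income_multiple * current_income.

    The gap is the target minus the future value of existing assets.
    """
    target = income_multiple * current_income
    return target - future_value(assets, annual_rate, years)


# Illustrative inputs only
home_value = estimated_value([(300_000, 1_500), (420_000, 2_000)], 1_800)
dti = debt_to_income(1_500, 6_000)
gap = retirement_gap(100_000, 0.05, 2, 80_000)
```

Each function takes raw data points and returns a single derived feature, which is why domain knowledge matters here: the value lies in choosing which ratios and comparisons mean something for the model's goal.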
Transforming for Context
In this group of techniques, data engineers convert individual features from the original set into more meaningful information based on the context of each specific model.
For example, in a categorical feature, the value “unknown” may carry a specific meaning in a particular situation, yet to a model it appears to be just another category value. To correct for this, a team might add a binary has_value feature that differentiates “unknown” from the other options. The “product” feature, for instance, becomes two features: product and has_product.
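A minimal sketch of the has_product idea follows. The record layout and the set of values treated as unknown are assumptions for the example, and a real dataset would need its own definition of “missing.”

```python
def add_has_value_flag(records, feature, unknown_values=(None, "", "unknown")):
    """Return copies of the records with a binary has_<feature> column added.

    unknown_values is a hypothetical list of markers that mean "no value".
    """
    flagged = []
    for record in records:
        row = dict(record)  # copy so the input records are left untouched
        row[f"has_{feature}"] = 0 if record.get(feature) in unknown_values else 1
        flagged.append(row)
    return flagged


customers = [
    {"id": 1, "product": "Saving"},
    {"id": 2, "product": "unknown"},
    {"id": 3, "product": None},
]
flagged_customers = add_has_value_flag(customers, "product")
```

The original product column is kept, so the model sees both the category and an explicit signal that the value was absent.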
A similar technique would be to convert a categorical feature into a set of variables using one-hot encoding. In the above example, turning the “product” category into three features rather than just a single “Saving,” “Checking,” or “Investment” entry may improve the learning process depending on the goals of the model.
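One-hot encoding of the “product” category could look like the sketch below. The column naming scheme (product_saving and so on) is an assumption. Libraries such as pandas and scikit-learn provide their own encoders, which a team would normally use instead of hand-rolled code.

```python
def one_hot(records, feature, categories):
    """Replace a categorical feature with one 0/1 column per category."""
    encoded = []
    for record in records:
        # Keep every column except the categorical one being encoded
        row = {key: value for key, value in record.items() if key != feature}
        for category in categories:
            column = f"{feature}_{category.lower()}"
            row[column] = 1 if record.get(feature) == category else 0
        encoded.append(row)
    return encoded


PRODUCT_CATEGORIES = ["Saving", "Checking", "Investment"]
accounts = [
    {"id": 1, "product": "Saving"},
    {"id": 2, "product": "Investment"},
]
encoded_accounts = one_hot(accounts, "product", PRODUCT_CATEGORIES)
```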
Another commonly used tool is binning, which groups the values of a single feature into a smaller number of buckets. For example, “location” information can be grouped into regions, such as east for addresses in states in the Eastern U.S. time zone, central for those in the Central U.S. time zone and so on.
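The time zone binning described above might be sketched as a simple lookup. The mapping below is a small illustrative subset of states that each sit entirely within one time zone, and the “other” fallback is an assumption. A real mapping would need to cover every state and handle states that span two zones.

```python
# Illustrative subset only: each of these states lies in a single time zone
STATE_TO_REGION = {
    "NY": "east", "MA": "east", "GA": "east",
    "IL": "central", "MN": "central",
    "CO": "mountain", "UT": "mountain",
    "CA": "pacific", "WA": "pacific",
}


def bin_location(state_code, default="other"):
    """Map a two-letter state code to a coarse time zone region."""
    return STATE_TO_REGION.get(state_code.strip().upper(), default)


regions = [bin_location(code) for code in ["ny", " IL ", "CA", "ZZ"]]
```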
Some other examples of transformations are:
- Scaling the values of a variable such as age from its observed minimum and maximum in the dataset into the range [0, 1]
- Examining the number of purchases in particular types of retail stores as an indicator of “interest” in certain consumer goods
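Both transformations in the list above are short enough to sketch directly. The field name store_type and the sample values are hypothetical, and treating a constant column as all zeros is one possible convention rather than a rule.

```python
from collections import Counter


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to scale
    return [(value - lo) / (hi - lo) for value in values]


def purchase_counts(purchases):
    """Count purchases per store type as a rough indicator of consumer interest."""
    return Counter(purchase["store_type"] for purchase in purchases)


ages = [18, 30, 42, 66]
scaled_ages = min_max_scale(ages)

purchases = [
    {"store_type": "sporting_goods"},
    {"store_type": "electronics"},
    {"store_type": "sporting_goods"},
]
interest = purchase_counts(purchases)
```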
More Complex Techniques
Another option would be for teams to use more advanced algorithmic methods to create new and substitute features based on existing ones.
- Principal Component Analysis (PCA) and Independent Component Analysis (ICA) map existing data to another feature space
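- Deep Feature Synthesis (DFS) automatically generates new features by stacking aggregation and transformation operations across related tables
- Intermediate representations learned in the middle layers of neural networks can be transferred and reused as features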
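To make the PCA item in the list above concrete, here is a minimal sketch using NumPy's eigendecomposition of the covariance matrix. The function name and the toy data are assumptions. In practice a team would more likely reach for a library implementation, which also handles scaling and numerical edge cases.

```python
import numpy as np


def pca_transform(X, n_components):
    """Project X onto its top principal components.

    Returns the projected data and the share of variance each component explains.
    """
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    covariance = np.cov(X_centered, rowvar=False)
    # eigh is suited to symmetric matrices and returns eigenvalues in ascending order
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order]
    explained = eigenvalues[order] / eigenvalues.sum()
    return X_centered @ components, explained


# Toy data: the second feature is exactly twice the first
X = [[1, 2], [2, 4], [3, 6], [4, 8]]
projected, explained = pca_transform(X, n_components=1)
```

Because the two toy features are perfectly correlated, a single component captures essentially all of the variance, which is the situation where PCA lets a team substitute one new feature for several redundant ones.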
Key Steps for Success
The best data engineers are always experimenting with new feature engineering techniques and trying new approaches. The key to success in this environment is having a methodical and repeatable framework in place to guide these efforts. I see six critical steps for ensuring success as a team works to keep its feature engineering fresh:
Step 1: Know the objectives
Any feature engineering effort must begin with the entire team being clear on the primary objectives and use cases of the model. This creates a singularity of purpose that ensures efforts will not be diluted and resources will be used effectively.
Step 2: Establish guardrails
Real world barriers such as cost, accessibility, computational limits, storage constraints or other requirements are a reality and must be considered when creating a featurization plan. Before launching into training, the team must carefully explore available data, identify the issues and agree on how they will accommodate these types of constraints.
Step 3: Brainstorm
The third step requires the team to open their minds to think creatively. They must identify and consider new ways to create data and better solve the problem at hand. At this point, significant value can be added by having partners with deep domain knowledge and subject matter expertise involved.
Step 4: Select the best technique
Next, the team must pick the most effective technique for constructing their new feature concepts. This step is pivotal for making sure the new features deliver the reliability, accuracy and value envisioned.
Step 5: Assess the outcome
With the new features in place, the team must now consider how they affect model performance. To appropriately determine model efficacy, performance must be evaluated using criteria that are meaningful to the business’s objectives. Teams can consider metrics beyond basic accuracy, including precision, recall, F1 score and the receiver operating characteristic (ROC) curve. Selecting the best measure for assessing performance should be done in partnership with business leaders and domain experts.
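The metrics named above follow directly from counts of true positives, false positives and false negatives. The sketch below assumes binary labels encoded as 0 and 1, and the sample labels are made up for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for truth, pred in pairs if truth == 1 and pred == 1)
    fp = sum(1 for truth, pred in pairs if truth == 0 and pred == 1)
    fn = sum(1 for truth, pred in pairs if truth == 1 and pred == 0)
    correct = sum(1 for truth, pred in pairs if truth == pred)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

In this toy case accuracy is 0.625 while recall is only 0.5, which shows how a single headline number can hide the kind of error that matters most to the business.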
Step 6: Evolve the features
Teams must follow an iterative process in their feature engineering work. This means optimizing the feature set through continuous testing, adjusting and refining. Through this effort, the highest-impact features are identified while low-performing features are replaced with close variants or eliminated.
We’re moving into a post-big data world in which companies create value through transformative work, not just the collection and organization of information.
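One way to picture the elimination part of this loop is a greedy backward pass over the feature set. This is a sketch of one possible approach, not a prescribed method: `score_fn` stands in for whatever evaluation the team has chosen (for example, a cross-validated metric), and the feature names and contribution values below are invented for the example.

```python
def prune_features(features, score_fn, tolerance=0.0):
    """Greedy backward elimination.

    Repeatedly drops a feature when the score without it is no worse than
    the current best score minus the tolerance.
    """
    selected = list(features)
    best_score = score_fn(selected)
    changed = True
    while changed and len(selected) > 1:
        changed = False
        for feature in list(selected):
            candidate = [name for name in selected if name != feature]
            candidate_score = score_fn(candidate)
            if candidate_score >= best_score - tolerance:
                selected, best_score = candidate, candidate_score
                changed = True
                break  # restart the scan with the smaller feature set
    return selected, best_score


# Hypothetical stand-in for a real model evaluation
TOY_CONTRIBUTIONS = {"dti": 0.10, "retirement_gap": 0.05, "noisy_feature": -0.02}


def toy_score(feature_names):
    return 0.70 + sum(TOY_CONTRIBUTIONS[name] for name in feature_names)


kept, final_score = prune_features(list(TOY_CONTRIBUTIONS), toy_score)
```

Here the feature that lowers the toy score is removed and the two useful ones are kept. Replacing a weak feature with a close variant, as described above, would be a natural extension of the same loop.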
FinTechs have long led the way in these efforts in the financial services sector. Now traditional institutions must also shift their focus to putting data to use. Generic, third-party machine learning platforms can serve as foundational resources, but teams must keep feature engineering efforts as a core competency to ensure their work is truly creating value.
Pejman Makhfi (@pmakhfi), CTO of Credit Sesame, is a Silicon Valley tech veteran with more than 20 years of experience and has been involved in fintech, AI and machine learning projects since 2001. Prior to Credit Sesame, Pejman was VP of Solutions for Savvion, CTO for FinancialCrossing, and COO at The Enterprise Network. He has held positions on boards such as the Machine Intelligence Research Institute (MIRI), formerly the Singularity Institute, and the Lifeboat Foundation, a futuristics/AI governance organization. He holds a B.S./M.S. degree in Computer Science from Dortmund University.