Fidelity Optimizes Feature Engineering With Snowpark ML

For the past few years, Fidelity Investments has been moving a significant percentage of its applications to a cloud-based infrastructure. As part of that transition, Fidelity has consolidated its analytics data into its Enterprise Analytics Platform, which is engineered using the Snowflake Data Cloud, making it easier for teams and departments across the company to access the data they need.

Fidelity’s data scientists use the AI/ML-enabled Enterprise Analytics Platform to process a large volume of structured and semi-structured data for deeper insights and better decision-making. Historically, the platform was housed on physical servers. In 2020, Fidelity kicked off its digital transformation and established an Enterprise Data Lake (EDL) along with Data Labs. More recently, Fidelity wanted to conduct parallel analytics in the cloud and decided to use Snowpark and Snowpark ML.

Fidelity has two main enterprise data architecture guiding principles for its data scientists and data engineers:

For data storage, Snowflake is the platform for storing all of the company’s structured and semi-structured analytical data in its Enterprise Data Lake and Data Labs. All of Fidelity’s storage abides by its data security framework and data governance policies, which provide a holistic approach to metadata management.

For compute, Fidelity’s principles are to minimize the transfer of data across networks, avoid duplication of data, and process the data in the database — bringing the compute to the data where possible.

Feature engineering in focus

Fidelity creates and transforms features to improve the performance of its ML models. Some common feature engineering techniques include encoding, data scaling and correlation analysis.

Fidelity’s data scientists were running into computation pain points, especially around feature engineering. Feature engineering is the stage in the data science process, after expansion and encoding but before refinement, where the data can be at its peak size. Pandas DataFrames offer a flexible data structure for manipulating various types of data and applying a wealth of computations. The trade-off, however, is memory: both the size of the DataFrame held in memory and the additional memory consumed by the space complexity of the computations applied to the data. This was exacerbated by single-node processing, where memory contention and the limited ability to distribute work constrained resources. The team also considered Spark ML for the flexibility of distributed processing, but Spark involves complex configuration and tuning and carries maintenance overhead for both hardware and software. Fidelity wanted capabilities like parallel processing without the complexity of Spark, so the team turned to Snowpark ML.

Feature engineering using Snowpark ML

Snowpark ML includes the Python library and underlying infrastructure for end-to-end ML workflows in Snowflake. Fidelity decided to use the Snowpark ML Modeling API for feature engineering because of its improved performance and scalability, with distributed execution for common scikit-learn-style preprocessing functions. This approach brings a number of benefits (illustrated in the sketch after this list):

● All the computation is done within Snowflake, enabling in-database processing

● Handles large data volumes

● Scales both vertically and horizontally

● Correlation and preprocessing computation scales roughly linearly with the size of the data on a standard Snowflake warehouse size

● Data is not duplicated nor transferred across the network

● Leverages Snowflake’s extensive role-based access control (RBAC), enabling tightly managed security

● Lazy evaluation avoids unnecessary computation and data transfer, and improves memory management

● Simple to use
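
As a small illustration of the in-database and lazy-evaluation points above, here is a minimal sketch of a Snowpark DataFrame pipeline; the connection parameters, table, and column names are placeholders, not Fidelity’s actual objects.

```python
# Illustrative sketch only: placeholder connection settings, table, and columns.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

connection_parameters = {
    "account": "<account>", "user": "<user>", "password": "<password>",
    "role": "<role>", "warehouse": "<warehouse>",
    "database": "<database>", "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# A Snowpark DataFrame is a lazy reference to a table in Snowflake; defining it
# pulls no rows into client memory.
df = session.table("ANALYTICS.ML_FEATURES")

# Transformations are lazy too: they extend a query plan that will run inside
# the Snowflake warehouse, under the same RBAC controls as the underlying data.
prepared = df.filter(col("ACCOUNT_AGE_DAYS") > 30).select("CUSTOMER_ID", "BALANCE", "SEGMENT")

# Only an action (count, show, write) triggers execution in the warehouse.
print(prepared.count())
```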

Benefits with Snowpark ML

Fidelity compared Snowpark ML against in-memory processing in three scenarios: MinMax scaling, one-hot encoding, and Pearson correlation.

MinMax scaling is a critical preprocessing step to get our data ready for modeling. For numerical values, we want to scale the data into a fixed range between 0 and 1. With Pandas, performance is fine for small datasets, but it does not scale to large datasets with thousands or millions of rows. Snowpark ML eliminates all data movement and scales out execution for much better performance.
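
A minimal sketch of this pattern, assuming the MinMaxScaler from Snowpark ML’s preprocessing module and hypothetical table and column names, might look like this:

```python
# Hypothetical sketch: the table and column names are illustrative, not Fidelity's.
from snowflake.snowpark import Session
from snowflake.ml.modeling.preprocessing import MinMaxScaler

connection_parameters = {
    "account": "<account>", "user": "<user>", "password": "<password>",
    "role": "<role>", "warehouse": "<warehouse>",
    "database": "<database>", "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()
df = session.table("ANALYTICS.ML_FEATURES")  # lazy reference, no data pulled locally

# Scale the numeric columns into the default [0, 1] range; fit and transform
# both execute inside the Snowflake warehouse, so the rows never leave the database.
scaler = MinMaxScaler(
    input_cols=["BALANCE", "TENURE_DAYS"],
    output_cols=["BALANCE_SCALED", "TENURE_DAYS_SCALED"],
)
scaled_df = scaler.fit(df).transform(df)
scaled_df.show()
```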

Figure 1. Performance improvement of 77x with Snowpark ML, compared to in-memory processing for MinMax scaling.

One-hot encoding is a feature transformation technique for categorical values. With Snowpark ML, execution was much faster because the transformation leveraged distributed parallel processing and eliminated the data read and write times.
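
Following the same pattern, a sketch of one-hot encoding with Snowpark ML’s OneHotEncoder (again with hypothetical table and column names) could look like the following:

```python
# Hypothetical sketch: table and column names are illustrative only.
from snowflake.snowpark import Session
from snowflake.ml.modeling.preprocessing import OneHotEncoder

connection_parameters = {
    "account": "<account>", "user": "<user>", "password": "<password>",
    "role": "<role>", "warehouse": "<warehouse>",
    "database": "<database>", "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()
df = session.table("ANALYTICS.ML_FEATURES")

# Expand the categorical columns into indicator columns inside Snowflake,
# avoiding a round trip of the raw data to client memory.
encoder = OneHotEncoder(
    input_cols=["SEGMENT", "REGION"],
    output_cols=["SEGMENT_OHE", "REGION_OHE"],
)
encoded_df = encoder.fit(df).transform(df)
encoded_df.show()
```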

Figure 2. Performance improvement of 50x with Snowpark ML, compared to in-memory processing for one-hot encoding.

By using Snowpark ML to derive the Pearson product-moment correlation matrix, Fidelity achieved an order-of-magnitude performance improvement by scaling the computation both vertically and horizontally. This is especially useful for use cases with large, wide datasets in which there are, for example, 29 million rows and over 4,000 columns.
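
The sketch below shows how such a correlation matrix might be computed in-database, assuming the correlation helper in snowflake.ml.modeling.metrics and a hypothetical wide feature table:

```python
# Hypothetical sketch: the wide feature table is illustrative only.
from snowflake.snowpark import Session
from snowflake.ml.modeling.metrics import correlation

connection_parameters = {
    "account": "<account>", "user": "<user>", "password": "<password>",
    "role": "<role>", "warehouse": "<warehouse>",
    "database": "<database>", "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# Wide feature table; the heavy pairwise computation runs in the warehouse.
df = session.table("ANALYTICS.WIDE_FEATURES")

# Returns the Pearson correlation matrix over the numeric columns as a small
# pandas DataFrame; only the matrix, not the underlying rows, returns to the client.
corr_matrix = correlation(df=df)
print(corr_matrix.shape)
```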

Figure 3. Performance improvement of 17x with Snowpark ML, compared to in-memory processing for Pearson correlation.

Fidelity has achieved significant time, performance and cost benefits by bringing the compute closer to the data and increasing the capacity to handle more load. With faster computations, our data scientists now iterate on features more quickly. Those time savings have allowed the team to become more innovative with feature engineering, explore new and different algorithms, and improve model performance.

I would like to express my gratitude to my co-authors Linda Devenney and Melinda Hamel-Graziano. Their invaluable insights and tireless efforts have greatly contributed to the completion of this work.


Ray Zhang

Ray Zhang serves as the VP of Machine Learning Engineering Practice Lead at the Fidelity Institutional AI Center of Excellence.