Feature Benchmarking using AWS Sagemaker Pipeline and Sagemaker Feature Store

Onur Yigit Arpali
Picus Security Engineering
6 min read · Jun 14, 2023

The intricate process of crafting machine learning models typically comprises two central stages: feature engineering and hyper-parameter optimization. Both stages play a pivotal role in shaping high-performing models that can effectively harness the value embedded within the data. At Picus Security, to enrich our complete security platform with ML solutions, we have conceived a two-stage pipeline that implements these processes, choosing AWS SageMaker Pipelines and SageMaker Feature Store as the infrastructure.

In this blog post, we will delve into the first stage, the feature benchmarking pipeline, and discuss the structure we have devised to bolster our model development. Feature benchmarking is a novel approach that we operate alongside our feature engineering processes at Picus Security.

Feature engineering, an essential component of machine learning, entails selecting, transforming, or creating features (input variables) to be employed by algorithms for making predictions. The quality and relevance of these features are instrumental in determining a model’s overall performance. Regardless of the level of sophistication inherent in a machine learning algorithm, its effectiveness is fundamentally contingent upon the features it relies upon. As a result, allocating time and resources to feature engineering is of paramount importance in developing models.

On the other hand, hyper-parameter optimization is a systematic approach to searching for and selecting the best combination of parameters that govern the behavior of a machine learning algorithm. Hyper-parameters are not learned from the data during training but are instead predefined or adjusted manually. This process aims to maximize the model’s performance on a given dataset by fine-tuning these parameters, ultimately resulting in better generalization and predictive accuracy.

Why and How to Benchmark the Features?

The process of hyper-parameter optimization is often resource-intensive and time-consuming, making it costly to develop models using every extracted feature. Besides, different team members may extract different features, which further increases the cost and complexity of feeding every extracted feature into the model development process. To mitigate these issues, it is essential to identify and prioritize the most promising features for the subsequent model development stage.

In order to achieve this, we first choose an algorithm capable of addressing the problem’s objective (such as classification), together with its relevant parameters, and select parameter values that are likely to result in overfitting. The idea behind using overfitting as a proxy for feature quality is simple: if a model can fit the training data well, even to the point of overfitting, the features likely capture some valuable information. By training a model on each set of features separately and evaluating the performance on a validation set, we can identify which features allow the model to fit the data well, potentially indicating more promising features for further model development.

However, it’s important to note that using an overfitted model to evaluate features can introduce some risks. Features that lead to overfitting might capture noise rather than the true signal in the data. Therefore, it’s essential to perform additional feature evaluation and model validation techniques to ensure that the selected features can generalize well to unseen data.
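
To make the idea concrete, here is a minimal sketch of this benchmarking logic using synthetic data and illustrative settings (not the values or data we use in our pipeline): a deliberately flexible classifier is fit on one candidate feature set, and the gap between training and validation accuracy hints at whether the features carry signal or mostly noise.

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# stand-in for one candidate feature set and its labels
X, y = make_classification(n_samples=1000, n_features=50, n_informative=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# deliberately flexible settings so the model is free to (over)fit the training data
model = xgb.XGBClassifier(max_depth=12, n_estimators=300, learning_rate=0.3)
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
val_acc = accuracy_score(y_val, model.predict(X_val))
# a high validation score suggests the feature set is worth carrying forward;
# a large train/validation gap suggests the features mostly encode noise
print(f"train accuracy: {train_acc:.3f}, validation accuracy: {val_acc:.3f}")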

Feature Extraction and Benchmarking Flow

Feature Benchmarking Pipeline

SageMaker Feature Store and SageMaker Pipelines are the two main components of this flow.

SageMaker Feature Store

SageMaker Feature Store is a capable product, though it comes with some limitations. Thanks to its seamless integration with our tech stack at Picus Security and its cost-effectiveness, we opted for a product within the AWS ecosystem.

Each feature group can contain at most 2,500 feature definitions. Features that are transformer outputs of pre-trained models, as in most of our cases, may hit this limit. Every feature group must also have an event time feature that follows one of the patterns yyyy-MM-dd’T’HH:mm:ssZ or yyyy-MM-dd’T’HH:mm:ss.SSSZ.

To efficiently manage these extracted features, we utilize a feature store as a centralized repository. Each developer extracts features from raw data, which is sourced from various channels depending on the project’s specific requirements. Feature stores are specifically designed to handle the unique demands of machine learning workflows, offering key advantages over traditional databases, such as improved performance, versioning capabilities, and seamless integration with machine learning platforms. By employing a feature store, we can effectively store, manage, and share features across different machine learning models, allowing our team to efficiently access and compare multiple features without manual intervention.
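
As a rough illustration of this workflow, the sketch below creates a feature group and ingests a toy DataFrame of extracted features; the group name, bucket, and column names are placeholders rather than the ones we use in production.

import time
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

sess = sagemaker.Session()
sagemaker_role = sagemaker.get_execution_role()

# toy set of extracted features; real features come from pre-trained transformers
features_df = pd.DataFrame({
    "record_id": ["a-1", "a-2"],
    "feature_0": [0.12, 0.87],
    "feature_1": [1.4, -0.3],
    "label": [0, 1],
    "event_time": [time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())] * 2,
})
# Feature Store infers feature types from dtypes, so text columns need the pandas "string" dtype
features_df["record_id"] = features_df["record_id"].astype("string")
features_df["event_time"] = features_df["event_time"].astype("string")

feature_group = FeatureGroup(name="example-feature-group", sagemaker_session=sess)
feature_group.load_feature_definitions(data_frame=features_df)
feature_group.create(
    s3_uri="s3://example-bucket/feature-store",  # offline store location
    record_identifier_name="record_id",
    event_time_feature_name="event_time",
    role_arn=sagemaker_role,
    enable_online_store=False,
)

# wait until the feature group is created before ingesting records
while feature_group.describe()["FeatureGroupStatus"] == "Creating":
    time.sleep(5)
feature_group.ingest(data_frame=features_df, max_workers=3, wait=True)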

SageMaker Pipeline

An Amazon SageMaker Model Building Pipelines pipeline comprises a sequence of interconnected steps, defined using the Pipelines SDK. This pipeline definition represents a directed acyclic graph (DAG). The DAG provides insights into the prerequisites and relationships among each step within the pipeline. The structure of a pipeline’s DAG is shaped by the data dependencies between the steps.

Using XGBoost as a Baseline Model

Using XGBoost as a baseline model can provide evidence that the input features are separable in a high-dimensional space. XGBoost, as a tree-based model, is capable of modeling non-linear relationships between features and the target variable. By constructing decision trees, the algorithm attempts to split the data into distinct groups to minimize the loss function iteratively.

If an XGBoost model achieves good performance on your dataset, it demonstrates that the model can effectively separate the data points in the high-dimensional feature space. This result can be a useful indicator that the features generated by the pre-trained transformers are informative and capture meaningful patterns.

Each feature group ingested into the feature store is benchmarked in the scoring pipeline, and the results are served via the AWS Athena query engine. The image above shows the feature benchmarking pipeline DAG, which consists of three interconnected steps. Step relations are determined by data dependencies, which indicate the data required before the related step can be executed.

Each step in the pipeline executes a different script. The processor type of each step varies based on its respective operation: in the pipeline above, the XGBModelTraining step is a TrainingStep, and the other two are ProcessingSteps. The three-step pipeline is triggered with four parameters: the feature group name, the feature group’s table name in Athena, the target column to use in XGBoost, and the AWS region name.
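
These four values are defined as SageMaker pipeline parameters. A minimal sketch of how they could be declared with the Pipelines SDK (the display names and default value are illustrative):

from sagemaker.workflow.parameters import ParameterString

# pipeline parameters referenced in the pipeline definition below
feature_group_name_param = ParameterString(name="FeatureGroupName")
feature_store_table_name_param = ParameterString(name="FeatureStoreTableName")
feature_store_table_target_column_param = ParameterString(name="TargetColumn")
region_name_param = ParameterString(name="RegionName", default_value="eu-west-1")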

from sagemaker.workflow.pipeline import Pipeline

# assemble the three steps and four parameters into a single pipeline definition
pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        feature_store_table_name_param,
        feature_store_table_target_column_param,
        region_name_param,
        feature_group_name_param,
    ],
    steps=[
        process_step,
        train_step,
        cleaning_step,
    ],
    sagemaker_session=sess,
)
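
Once the pipeline object is defined, it can be registered in SageMaker and started once per feature group to benchmark. A minimal sketch, assuming the parameter names from the snippet above and using placeholder values:

# create or update the pipeline definition in SageMaker, then start a run for one feature group
pipeline.upsert(role_arn=sagemaker_role)
execution = pipeline.start(
    parameters={
        "FeatureGroupName": "example-feature-group",
        "FeatureStoreTableName": "example_feature_group_table",  # Athena table of the offline store
        "TargetColumn": "label",
        "RegionName": "eu-west-1",
    }
)
execution.wait()  # block until the three steps finish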

The GetFeatures step retrieves the features from the specified feature group table. Preprocessing operations, such as the train/validation split, are executed in "get_features.py". The processed data is stored at a defined location to be used later by the next steps. Each step runs on a different instance type depending on its performance requirements.

from sagemaker.processing import FrameworkProcessor, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep

script_processor = FrameworkProcessor(
    role=sagemaker_role,
    instance_count=process_instance_count,
    instance_type=process_instance_type,
    estimator_cls=est_cls,
    framework_version=framework_version_str,
)
step_args = script_processor.get_run_args(
    code="get_features.py",
    source_dir="./get-features",
    outputs=[
        ProcessingOutput(destination=output_train, output_name="train_data",
                         source="/opt/ml/processing/train_data"),
        ProcessingOutput(destination=output_validation, output_name="validation_data",
                         source="/opt/ml/processing/validation_data"),
    ],
    arguments=[
        "--table-name", feature_store_table_name_param,
        "--target-column", feature_store_table_target_column_param,
        "--region-name", region_name_param,
    ],
)
# Define pipeline processing step
process_step = ProcessingStep(
    name="GetFeatures",
    processor=script_processor,
    inputs=step_args.inputs,
    outputs=step_args.outputs,
    job_arguments=step_args.arguments,
    code=step_args.code,
)
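
The "get_features.py" script itself is not part of the pipeline definition above. A simplified sketch of what it could look like, assuming the processing container has awswrangler available and that the feature group's offline store table is registered in the default sagemaker_featurestore Glue database:

# get_features.py -- simplified sketch, not the original script
import argparse

import awswrangler as wr  # assumes the container has awswrangler installed
import boto3
from sklearn.model_selection import train_test_split

parser = argparse.ArgumentParser()
parser.add_argument("--table-name", type=str)
parser.add_argument("--target-column", type=str)
parser.add_argument("--region-name", type=str)
args = parser.parse_args()

# read the feature group's offline store table through Athena
df = wr.athena.read_sql_query(
    f'SELECT * FROM "{args.table_name}"',
    database="sagemaker_featurestore",  # default offline store database (assumption)
    boto3_session=boto3.Session(region_name=args.region_name),
)

# drop Feature Store bookkeeping columns and move the target column to the front
df = df.drop(columns=["write_time", "api_invocation_time", "is_deleted"], errors="ignore")
target = df.pop(args.target_column)
df.insert(0, args.target_column, target)

# split and write to the locations declared as ProcessingOutput sources above
train_df, validation_df = train_test_split(df, test_size=0.2, random_state=42)
train_df.to_csv("/opt/ml/processing/train_data/train.csv", index=False, header=False)
validation_df.to_csv("/opt/ml/processing/validation_data/validation.csv", index=False, header=False)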

In the training step, an XGBoost estimator is used with a fixed set of parameters. To encourage overfitting, the max_depth, eta, objective, and num_class parameters are set to specific values. The TrainingStep inputs are defined as the previous step’s outputs to establish the dependency between the steps. The TrainingStep runs "xgboost-model.py" inside the instance. Basically, the script parses the arguments, trains the XGBoost model, stores the results, and finalizes the step. In line with the project objectives, accuracy and precision metrics are used as indicators to benchmark the features.
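
For illustration, a hyperparameters dictionary along these lines could be passed to the estimator; the concrete values below are examples rather than the ones we use:

hyperparameters = {
    "max_depth": 12,              # deep trees so the model is free to (over)fit the training data
    "eta": 0.3,
    "objective": "multi:softmax",
    "num_class": 5,               # set to the number of classes in the target column
    "num_round": 300,
}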

from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep
from sagemaker.xgboost.estimator import XGBoost

# construct a SageMaker XGBoost estimator
# specify the entry_point to your xgboost training script
xgb_estimator = XGBoost(
    entry_point="xgboost-model.py",
    source_dir="./xgboost-model",
    framework_version=xgboost_version,
    hyperparameters=hyperparameters,
    role=sagemaker_role,
    instance_count=xgboost_instance_count,
    instance_type=xgboost_instance_type,
)
# wire the processed data produced by the GetFeatures step into the training channels
s3_input_train = TrainingInput(
    process_step.properties.ProcessingOutputConfig.Outputs["train_data"].S3Output.S3Uri)
s3_input_validation = TrainingInput(
    process_step.properties.ProcessingOutputConfig.Outputs["validation_data"].S3Output.S3Uri)
# Set pipeline training step
train_step = TrainingStep(
    name="XGBModelTraining",
    estimator=xgb_estimator,
    inputs={
        "train": s3_input_train,  # Train channel
        "validation": s3_input_validation,  # Validation channel
    },
)
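
A simplified sketch of what "xgboost-model.py" could contain; it assumes the CSV convention from the "get_features.py" sketch above (target in the first column, no header) and relies on SageMaker’s standard environment variables for the data channels and model directory:

# xgboost-model.py -- simplified sketch, not the original script
import argparse
import os

import pandas as pd
import xgboost as xgb
from sklearn.metrics import accuracy_score, precision_score

parser = argparse.ArgumentParser()
# hyperparameters are forwarded to the script as command-line arguments by the XGBoost container
parser.add_argument("--max_depth", type=int, default=12)
parser.add_argument("--eta", type=float, default=0.3)
parser.add_argument("--objective", type=str, default="multi:softmax")
parser.add_argument("--num_class", type=int, default=5)
parser.add_argument("--num_round", type=int, default=300)
# SageMaker exposes the channel and model directories through environment variables
parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
parser.add_argument("--validation", type=str, default=os.environ.get("SM_CHANNEL_VALIDATION"))
parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
args = parser.parse_args()

def load_channel(path):
    # first column is the target, the rest are features (see the get_features.py sketch)
    csv_file = next(f for f in os.listdir(path) if f.endswith(".csv"))
    df = pd.read_csv(os.path.join(path, csv_file), header=None)
    return df.iloc[:, 1:], df.iloc[:, 0]

X_train, y_train = load_channel(args.train)
X_val, y_val = load_channel(args.validation)

params = {"max_depth": args.max_depth, "eta": args.eta,
          "objective": args.objective, "num_class": args.num_class}
booster = xgb.train(params, xgb.DMatrix(X_train, label=y_train), num_boost_round=args.num_round)

# benchmark metrics for this feature group: how well the flexible model fits the features
preds = booster.predict(xgb.DMatrix(X_val))
print(f"validation accuracy: {accuracy_score(y_val, preds):.3f}")
print(f"validation precision: {precision_score(y_val, preds, average='macro'):.3f}")

booster.save_model(os.path.join(args.model_dir, "xgboost-model"))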

The final step in the pipeline deletes the processed features used by the XGBoost estimator. This step has two dependencies. The data dependency is defined via the input arguments, which reference the first step’s outputs. Since it cleans up the processed features, it also needs to wait for the training step, which is expressed explicitly with add_depends_on.

delete_train_and_validation_data_processor = FrameworkProcessor(
    role=sagemaker_role,
    instance_count=process_instance_count,
    instance_type=process_instance_type,
    estimator_cls=est_cls,
    framework_version=framework_version_str,
)
cleaning_step_args = delete_train_and_validation_data_processor.get_run_args(
    code="delete_train_val_data.py",
    source_dir="./delete-train-val-data",
    arguments=[
        "--train_dataset", process_step.properties.ProcessingOutputConfig.Outputs["train_data"].S3Output.S3Uri,
        "--validation_dataset", process_step.properties.ProcessingOutputConfig.Outputs["validation_data"].S3Output.S3Uri,
    ],
)
# Define pipeline cleaning step
cleaning_step = ProcessingStep(
    name="DeleteTrainValData",
    processor=delete_train_and_validation_data_processor,
    inputs=cleaning_step_args.inputs,
    outputs=cleaning_step_args.outputs,
    job_arguments=cleaning_step_args.arguments,
    code=cleaning_step_args.code,
)
# the cleaning step must also wait for training to finish before deleting the data
cleaning_step.add_depends_on([train_step])
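
A minimal sketch of what "delete_train_val_data.py" could look like; it simply removes every object under the two S3 prefixes passed as arguments:

# delete_train_val_data.py -- simplified sketch, not the original script
import argparse
from urllib.parse import urlparse

import boto3

parser = argparse.ArgumentParser()
parser.add_argument("--train_dataset", type=str)
parser.add_argument("--validation_dataset", type=str)
args = parser.parse_args()

s3 = boto3.resource("s3")

def delete_s3_prefix(s3_uri):
    # delete every object stored under the given S3 URI
    parsed = urlparse(s3_uri)
    s3.Bucket(parsed.netloc).objects.filter(Prefix=parsed.path.lstrip("/")).delete()

for uri in (args.train_dataset, args.validation_dataset):
    delete_s3_prefix(uri)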

In conclusion, we have discussed the importance of feature engineering and the novel benchmarking approach we employ to prioritize and evaluate features efficiently. At Picus Security, we place great emphasis on intelligent solutions in cybersecurity, striving to push the boundaries of what’s possible in order to create more effective and robust solutions. As part of our commitment to continuous improvement, we have developed a two-stage pipeline that enhances our model development process. In the next blog post, we will explore the second pipeline and how it contributes to the overall success of our project development processes at Picus Security. Stay tuned for more insights and discoveries in our upcoming discussions!

Would you like to get in touch? Reach me on LinkedIn:

https://www.linkedin.com/in/arpali/
