Implementation of Asset Exposure Score Algorithms at Scale with Snowflake

Angela Kruger
6 min read · Aug 8, 2024


In cybersecurity, understanding the concept of an asset is fundamental. Whether it’s a tangible computing device or an abstract identity, assets form the backbone of risk assessment. Tenable’s approach to Asset Criticality Rating (ACR) delves into the intrinsic properties of assets, particularly focusing on identity assets, to determine their importance within an organization.

Part 1 of this blog series dissected the components of ACR for identity assets, shedding light on how hierarchy and entitlement factors play a crucial role in defining an individual’s significance from a cybersecurity standpoint.

In Part 2, I will cover how the AES algorithms are run on a scheduled basis using Snowflake Tasks and Stored Procedures.

  • For context on Tenable’s journey in modernizing their data stack, read this blog. It details why we chose to build on Snowflake to support Tenable’s expanding data infrastructure needs while maintaining focus on delivering customer value.

This next section will detail how we implement the scoring algorithms in our automated workflow.

Implementation

Storing and accessing custom trained model files

Because our ML models are custom models pre-trained by our data science team, we need to host these externally and load them whenever required. We do this by creating an external stage in Snowflake which points to a folder within an S3 bucket in the same AWS account.
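
The stage itself only needs to be created once. As a minimal sketch of what that can look like (the bucket path and storage integration name below are placeholders, not our actual configuration):

CREATE STAGE ingest.models
  URL = 's3://my-model-bucket/models/'
  STORAGE_INTEGRATION = my_s3_integration;

The stages available in a database can then be inspected: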

SHOW STAGES IN DATABASE snowflake_dev;
SELECT "created_on", "schema_name", "url", "region", "type", "cloud"
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

Once the external stage has been created, we can use it to store custom-trained models as well as third-party frameworks and packages.
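
For example, once files have been uploaded, the contents of the stage can be checked with a simple LIST command:

LIST @ingest.models;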

In addition to the ability to store custom files, Snowflake has a number of Python packages pre-loaded through Snowpark. A simple query can tell you whether the packages you require (and the versions thereof) are already available in Snowflake or would need to be accessed elsewhere (such as an S3 external stage):

SELECT * FROM information_schema.packages
WHERE language = 'python' AND runtime_version = '3.8'
QUALIFY ROW_NUMBER() OVER(PARTITION BY package_name, language, runtime_version ORDER BY version DESC) = 1 -- latest
ORDER BY package_name, runtime_version, version;

All of these can be imported into Python UDFs and stored procedures.

Python UDF

Now let's take a look at our Python UDF. First, we need to import our custom models as well as some built-in packages:

CREATE OR REPLACE FUNCTION base_transform.score_job_title_demo(job_title VARCHAR)
RETURNS ARRAY
LANGUAGE python
RUNTIME_VERSION = '3.8'
IMPORTS = (
'@ingest.models/sentence_transformers.zip',
'@ingest.models/all-MiniLM-L6-v2.zip',
'@ingest.models/job_title_models/job_title_model.pkl')
PACKAGES = ('numpy', 'pytorch', 'nltk', 'requests', 'cloudpickle', 'transformers', 'scikit-learn', 'Pillow')
HANDLER = 'ScoreJobTitlesDemo'
AS
$$
import sys, threading, fcntl, zipfile, os
import cloudpickle, torch

import numpy as np
from sklearn.metrics.pairwise import cosine_distances

# Now we go ahead and load the model files
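# (The actual loading code is elided in this post. The sketch below shows one
#  way it could work, assuming the zipped package and model in IMPORTS unpack
#  cleanly from the UDF's import directory and that job_title_model.pkl holds
#  the reference embeddings and scaled scores used by the handler; the
#  get_mean_cosine_scores helper is likewise elided.)

import_dir = sys._xoptions.get("snowflake_import_directory")
extract_dir = "/tmp/aes_models"

# Extract the zipped archives once per Python process, guarded by a file lock
# so concurrent invocations don't race on the extraction
with open("/tmp/aes_models.lock", "w") as lock_file:
    fcntl.flock(lock_file, fcntl.LOCK_EX)
    if not os.path.isdir(extract_dir):
        for archive in ("sentence_transformers.zip", "all-MiniLM-L6-v2.zip"):
            with zipfile.ZipFile(os.path.join(import_dir, archive)) as zf:
                zf.extractall(extract_dir)
    fcntl.flock(lock_file, fcntl.LOCK_UN)

sys.path.insert(0, extract_dir)
from sentence_transformers import SentenceTransformer

# Load the sentence embedding model and the pickled job-title reference data
model = SentenceTransformer(os.path.join(extract_dir, "all-MiniLM-L6-v2"))
with open(os.path.join(import_dir, "job_title_model.pkl"), "rb") as fh:
    jt_embeddings, scaled_jt_scores, js_embeddings, scaled_js_scores = cloudpickle.load(fh)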

def ScoreJobTitlesDemo(title: str):
    if not title:
        return None

    try:
        # Embed the incoming job title and score it against the reference embeddings
        encoded = model.encode(title).reshape(1, -1)
        score = get_mean_cosine_scores(encoded, jt_embeddings, scaled_jt_scores)
        js_score = get_mean_cosine_scores(encoded, js_embeddings, scaled_js_scores)
        score[:, 0] = score[:, 0] * js_score
        return score.squeeze(axis=0)

    except Exception as err:
        # Surface the error in the result rather than failing the whole query
        score = ["error: {0}".format(err)]

    return score
$$;

Now let’s pass in some data to our function!

CREATE OR REPLACE TEMPORARY TABLE tmp_job_titles (job_title VARCHAR, rowid INT AUTOINCREMENT);
INSERT OVERWRITE INTO tmp_job_titles(job_title)
SELECT 'Junior SQL Developer'
UNION ALL SELECT 'Senior SQL Developer'
UNION ALL SELECT 'DBA'
UNION ALL SELECT 'Database Administrator'
UNION ALL SELECT 'CEO'
UNION ALL SELECT 'Astronaut'
UNION ALL SELECT 'Circus Clown'
UNION ALL SELECT 'Circus Ring Master';

SELECT job_title, base_transform.score_job_title_demo(job_title) AS job_title_score
FROM tmp_job_titles
ORDER BY rowid;

And the results:

We can see here that the predictions aren't always 100% spot-on, but we can also see that when you throw wild titles at it, it does make an attempt to assign a risk score (circus clowns aren't a big risk to our data, it turns out!).

And this rather simple query took 15 seconds to run, because the large model files have to be loaded into memory.

Vectorized UDFs

The next step will be to make this a vectorized UDF. This allows the UDF to receive batches of input rows in the form of pandas DataFrames, instead of processing row by row (RBAR), and results are returned as pandas arrays or Series.

For a simple UDF (e.g. integer inputs, perform a calculation, integer output), vectorizing is as simple as adding a decorator or a function attribute to the function.
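
As an illustration (this trivial function is hypothetical, not one of our scoring UDFs), a vectorized version of a simple UDF using the Snowpark decorator might look something like this:

CREATE OR REPLACE FUNCTION base_transform.add_one_demo(val INT)
RETURNS INT
LANGUAGE python
RUNTIME_VERSION = '3.8'
PACKAGES = ('pandas')
HANDLER = 'add_one'
AS
$$
import pandas as pd
from _snowflake import vectorized

@vectorized(input=pd.DataFrame)
def add_one(df: pd.DataFrame) -> pd.Series:
    # The whole batch arrives as a single DataFrame; column 0 holds the first argument
    return df[0] + 1
$$;

Equivalently, instead of the decorator you can set the function attribute add_one._sf_vectorized_input = pd.DataFrame.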

BUT!

There are catches for more complicated functions like ours. Data types, the number of inputs and outputs, and conversions between SQL data types and pandas can all prove to be problems that need to be overcome. This is something we are currently working on at Tenable to improve the performance of our Asset Criticality scoring algorithms (watch this space for updates on how we solved this in our own implementation, which is very much NOT a simple UDF).

Snowpark Tasks and SQL Stored Procedures

As for the final implementation, that's the easy bit. Creating the Python UDF is the tricky part; from here on out we can call the UDF as we would any other UDF or function: inline in our code (see the demo above).

Looking back at the Hierarchy Component, you'll see that the final score is made up of a number of different elements, of which job title ratings are just one part. To keep things clean and simple, we've broken each element out into a separate SQL stored procedure, wrapped each procedure in a task, and strung these tasks together in a task graph (a DAG), which ensures that each step of the process is executed and completed in order before moving on to the next.

And because every stored procedure is wrapped in a separate task, we can monitor each task individually for execution time and stats, and check for errors.
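
To give a feel for the shape of this, here is a simplified fragment of such a task graph (the warehouse, schedule, task and procedure names are placeholders, not our production objects):

CREATE OR REPLACE TASK score_job_titles_task
  WAREHOUSE = transform_wh
  SCHEDULE = 'USING CRON 0 2 * * * UTC'
AS
  CALL base_transform.score_job_titles();

CREATE OR REPLACE TASK score_hierarchy_task
  WAREHOUSE = transform_wh
  AFTER score_job_titles_task
AS
  CALL base_transform.score_hierarchy_component();

-- Child tasks must be resumed before the root task
ALTER TASK score_hierarchy_task RESUME;
ALTER TASK score_job_titles_task RESUME;

-- Each task run can then be monitored individually
SELECT name, state, error_message, scheduled_time, completed_time
FROM TABLE(information_schema.task_history())
ORDER BY scheduled_time DESC;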

Looking Ahead…

Where will we be going from here?

Our next step will be to look at using the new Snowflake Cortex features for POC'ing and rolling out our ML models. This should provide advantages such as a quicker and easier transition from the Data Science team's POCs to implementation in our ETLs.

We will also be delving into vectorizing our Python UDFs, as we anticipate this will bring a significant improvement to their performance.

Conclusion

Overall, we've had great success so far bringing ML into our scoring algorithms. We've improved the accuracy of our scores compared with simple regex matching, we've improved the performance of our ETLs, and we're able to provide our customers with better-quality information.
