Using Native Apps to Provide Collaborative Machine Learning

Last Snowflake Summit was full of announcements. In this blog entry I want to share some ideas about two of them: Snowpark Python and Native Apps. I joined Snowflake around a couple of years ago, and back then I was already impressed with the capabilities for Building Better Machine Learning Models Using the Snowflake Data Cloud. That was good, but since then it has been incredible to see all the innovation coming to Snowflake. At that time, I was using Python on the client side with the Python connector. Now, with Snowpark Python, we have data frames that push all transformations down to Snowflake and, even more powerful, the ability to define Stored Procedures, User Defined Functions (UDFs) and User Defined Table Functions (UDTFs) that are executed on the Snowflake side, taking advantage of all the performance, elasticity and pay-per-use of the Snowflake Data Cloud Platform. These features will make it easier to bring more ML workloads into Snowflake.

The Data Cloud is evolving with Native Apps, as announced at the Summit. With the new Snowflake Native Application Framework, currently in private preview, developers can build applications and monetize them on the Snowflake Marketplace, and consumers can securely install and run those applications directly in their Snowflake accounts, reducing the need to move data. As a specific example, I have implemented a Native App that runs Machine Learning for Credit Card Fraud Analysis. It is based on the previous work Building Fast and Furious: Streamlining your Data Pipelines, plus the idea of adding collaborative Machine Learning. Using the new Snowpark Python capabilities, I am now able to embed all the logic within Stored Procedures and UDFs and run everything inside Snowflake.

Native Apps come into play in the collaborative effort. The previous example used transactions from a single bank. What if I could join transactions from two banks to build a better model while protecting PII and confidential information at the same time? Understanding a customer’s transactions across several banks would provide a better profile for feature engineering, but the algorithm training the model would have to see the transactions of all banks to generate the profile for each customer. This presents a challenge, as such data is not typically shared because it contains sensitive information, including Personally Identifiable Information (PII). ML is making great advances with federated learning, but in this case the approach with Native Apps is quite straightforward.

In the proposed architecture, one of the banks serves the procedures to build the features (based on the Fraud Detection Handbook), train the model and perform the scoring. Those are the services offered in the App. This is the diagram:

The Native App will be serving the fe_training() and fe_scoring() procedures. The consumer will create its own database based on the share bank1_app_db from the provider. The database will include an installer provided by the App with all the objects to be created on the consumer side. That includes the creation of a schema called app, where the fe_scoring() procedure will create a table with the results of the scoring.
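As a rough sketch of what such an installer could set up on the consumer side (the exact Native App installer syntax is in private preview and may differ; the results-table columns are my own illustrative assumption):

```sql
-- Hypothetical sketch; Native App installer syntax is in private preview and may change.
-- The consumer creates a database from the provider's share:
CREATE DATABASE bank1_app_db FROM SHARE provider_account.bank1_app_db;

-- The installer then creates the consumer-side objects, including the app schema:
CREATE SCHEMA IF NOT EXISTS app;

-- fe_scoring() will write its results into a table like this (columns are illustrative):
CREATE TABLE IF NOT EXISTS app.scoring_results (
    transaction_id NUMBER,
    customer_id    NUMBER,
    fraud_score    FLOAT
);
```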

But the greatest advantage of the Native App framework is that all objects are hidden by default. That means the App provider determines which objects (tables, schemas, stages, etc.) will be visible to the consumer.

The fe_training() procedure will be able to join the transactions stored in the provider account, within its own app_data schema, with the tables on the consumer side that the admin grants the App access to. In the diagram above, the consumer (bank2) provides read access to the tables in its own database bank2_source_db to the procedures served by the App. The procedure will be able to join those tables, generate the features and create the model. These are the high-level steps of what the procedure does:
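To make the idea concrete, here is a toy, plain-Python stand-in for the feature-engineering step (inside the real App this would be Snowpark Python pushing the join down to Snowflake; the column names and features are my own illustrative assumptions, not the actual procedure code):

```python
from collections import defaultdict

def build_customer_features(provider_txns, consumer_txns):
    """Join transactions from both banks and derive simple per-customer
    profile features (transaction count and average amount), as a toy
    stand-in for the feature engineering done by fe_training()."""
    totals = defaultdict(lambda: {"count": 0, "amount": 0.0})
    # The "join" across banks: the procedure sees both sets of rows,
    # but neither bank ever sees the other's raw transactions.
    for txn in provider_txns + consumer_txns:
        t = totals[txn["customer_id"]]
        t["count"] += 1
        t["amount"] += txn["amount"]
    return {
        cust: {"txn_count": t["count"],
               "avg_amount": t["amount"] / t["count"]}
        for cust, t in totals.items()
    }

# Toy data: bank1 (provider) and bank2 (consumer) each see only part of
# customer 42's activity; only the combined view gives the full profile.
bank1 = [{"customer_id": 42, "amount": 100.0},
         {"customer_id": 42, "amount": 300.0}]
bank2 = [{"customer_id": 42, "amount": 200.0}]
features = build_customer_features(bank1, bank2)
print(features[42])  # {'txn_count': 3, 'avg_amount': 200.0}
```

Note how neither bank alone would have computed the true average of 200.0; that is the whole point of the collaborative join.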

This is all Python code that will be executed on the consumer side. The advantage of the Native App framework is that the procedure can read data from both Snowflake accounts, but that data is never exposed outside of the procedure. The provider cannot exfiltrate any data from the consumer, as the App only writes into consumer schemas, and the consumer cannot steal any data from the provider, as it is the App that controls the visibility of its objects. The App will not be able to write or update any data back into the provider account.

In the first diagram, the model is hidden from the consumer. The app developer has the choice to make it visible to the consumer and let them build their own inference UDFs. In this approach, the model is used later by fe_scoring(), the procedure invoked by the consumer, which takes new transactions and scores them. Before running the scoring, new features have to be created for each of those new transactions. The App follows the same logic, also using the transactions of the provider. This is the approach:
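Continuing the toy illustration, a minimal stand-in for the scoring step could look like this (the real fe_scoring() would apply the trained ML model; here a simple deviation-from-profile threshold plays that role, and all names and values are illustrative assumptions):

```python
def score_transactions(new_txns, customer_features, threshold=3.0):
    """Toy stand-in for fe_scoring(): flag a new transaction as suspicious
    when its amount deviates strongly from the customer's historical
    average. A real App would invoke the trained model instead."""
    results = []
    for txn in new_txns:
        profile = customer_features.get(txn["customer_id"])
        if profile is None:
            ratio = float("inf")  # unknown customer: no profile to compare against
        else:
            ratio = txn["amount"] / profile["avg_amount"]
        results.append({"transaction_id": txn["transaction_id"],
                        "suspicious": ratio > threshold})
    return results

# Profile built from the combined transactions of both banks:
profiles = {42: {"txn_count": 3, "avg_amount": 200.0}}
new = [{"transaction_id": 1, "customer_id": 42, "amount": 150.0},
       {"transaction_id": 2, "customer_id": 42, "amount": 900.0}]
scored = score_transactions(new, profiles)
print(scored)  # transaction 1 is normal, transaction 2 is flagged
```

Inside the App, the equivalent of `scored` would be written into the app.scoring_results table on the consumer side.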

The magic that lets Native Apps choose which objects are made visible resides in new “database roles”, defined and created by users and apps, and in the system APP_EXPORTER role. APP_EXPORTER is the one used to provide visibility to the consumer side. In this diagram, you can see how the stored procedures are granted to the APP_EXPORTER role, while access to the provider’s tables is only granted to the role used by the app, and not exposed to any users on the consumer side:
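The grants described above might look roughly like this (a hypothetical sketch: the private-preview syntax may differ, and the role name app_internal is my own placeholder for the role the app runs as):

```sql
-- Hypothetical sketch; private-preview grant syntax may differ.
-- Procedures exposed to the consumer via the APP_EXPORTER database role:
GRANT USAGE ON PROCEDURE app.fe_training() TO DATABASE ROLE app_exporter;
GRANT USAGE ON PROCEDURE app.fe_scoring()  TO DATABASE ROLE app_exporter;

-- Provider tables granted only to the internal role the app uses,
-- so they are never visible to consumer-side users:
GRANT SELECT ON ALL TABLES IN SCHEMA app_data TO DATABASE ROLE app_internal;
```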

Native Apps together with Snowpark Python open up a bunch of new possibilities and use cases. Note that in this framework there was no data movement between banks, no extra copies and no exposure of PII. Furthermore, you can implement other differential privacy measures to help keep the data of individuals safe and private. There are other use cases where this is applicable, such as money laundering detection and fraud detection in health care.

For ML practitioners, it can make building models more powerful while protecting your intellectual property. As in this case, you can build a model that uses your own data but allows customers to bring their own data sets without having to copy, move, expose or merge them.

If you are a Global System Integrator, reach out to your Partner Sales Engineer if you want to see this demo running.

Please be aware that at the time of writing this blog the Native Application Framework is in Private Preview and therefore subject to change. And of course, these are all my personal opinions and not those of my current employer (Snowflake).

Carlos.-
