Sharing non-public data in Jupyter notebooks
Keeping data safe while collaborating with others
In our Watson Developer Advocacy team we’ve been collecting statistics to measure our reach for several years now. Traditionally we’ve used Business Intelligence software, such as Looker, or custom-built web applications to create interactive reports to facilitate analysis.
For a recent project that analyzes sample application deployments to Bluemix, I evaluated Jupyter notebooks as an option for users who want to run their own advanced analytics over the data that our Deployment Tracker Service collects. Notebooks are commonly used in data science, where they provide an interactive computing environment for small- or large-scale data engineering and analysis.
Based on the project’s findings, this post outlines two data-security aspects that one must consider before making non-public relational data available through notebooks.
A notebook is made of “cells,” and there are two types. Input cells contain executable code (many languages are supported) or markup that produces output (like text or graphics). Output cells display this output.
Data can be loaded into a notebook from a variety of local and remote sources — both public and private — by running code in input cells. This code requires credentials, such as host, user id and password to connect to the data source. Credentials can be embedded in the notebook or read from sources like environment variables or a file when a cell is executed.
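As a minimal sketch of the environment-variable option, the following cell reads the connection settings at execution time instead of embedding them in the notebook. The variable names (TRACKER_DB_HOST and so on) and the helper name are hypothetical, chosen only for illustration.

```python
import os


def load_db_credentials(env=None):
    """Read connection settings from environment variables.

    The variable names below are hypothetical; substitute whatever
    your own deployment defines.
    """
    if env is None:
        env = os.environ
    required = ("TRACKER_DB_HOST", "TRACKER_DB_USER", "TRACKER_DB_PASSWORD")
    missing = [name for name in required if name not in env]
    if missing:
        # Fail early with a clear message rather than with a connection error.
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "host": env["TRACKER_DB_HOST"],
        "user": env["TRACKER_DB_USER"],
        "password": env["TRACKER_DB_PASSWORD"],
    }
```

Calling load_db_credentials() in an input cell keeps the values out of the saved notebook file, but anyone who can run cells can still print them.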
Credentials can be hidden from users who only view the notebook, but they are generally exposed to users who run it. (Unlike an application, a notebook has no built-in abstraction layer that hides connection information.) To prevent manipulation of data sets or unauthorized access to other data sets in the same data source, the credentials must therefore carry only minimal authority.
After a data set is loaded into a notebook by running the appropriate cell(s), all data values can be accessed by the user, including information that might be sensitive or irrelevant to the analysis. To avoid potential exposure, filters must therefore be put in place on the data source that hide, mask or anonymize data as needed.
Limiting access to data sets
Access to data stored in relational databases can usually be restricted in many ways (no/read/write access, restricted catalog visibility, row-level data access control, etc.), making it easy to implement a basic strategy:
- A dedicated user is used, making it easy to monitor and audit all data access requests.
- The user is granted SELECT (read-only) access to the relevant database objects, preventing intentional or unintentional attempts to modify the source data.
- The user’s access to catalog tables is restricted. The database therefore cannot easily be explored using the exposed credentials.
- Only data that’s deemed to be of relevance to the analysis is exposed, as described in the following section.
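The effect of read-only access can be tried out locally. The sketch below uses SQLite from Python's standard library purely as a stand-in: SQLite has no users or GRANT statements, so a connection opened in read-only mode plays the role of the dedicated SELECT-only user. The table name, column names and sample row are made up for the demonstration.

```python
import sqlite3
import tempfile
from pathlib import Path


def demo_read_only_access():
    """Show that a read-only connection can query but not modify data."""
    db_path = Path(tempfile.mkdtemp()) / "tracker_demo.db"

    # An administrative connection creates and fills the (hypothetical) table.
    admin = sqlite3.connect(str(db_path))
    admin.execute("CREATE TABLE deployments (app_name TEXT, service_type TEXT)")
    admin.execute("INSERT INTO deployments VALUES ('sample-app', 'Cloudant NoSQL DB')")
    admin.commit()
    admin.close()

    # The notebook would use a connection like this one: read-only mode.
    reader = sqlite3.connect(db_path.as_uri() + "?mode=ro", uri=True)
    rows = reader.execute("SELECT app_name, service_type FROM deployments").fetchall()
    try:
        reader.execute("DELETE FROM deployments")
        write_blocked = False
    except sqlite3.OperationalError:
        # SQLite rejects writes on a read-only connection.
        write_blocked = True
    reader.close()
    return rows, write_blocked
```

On a multi-user database server the same outcome would come from the privileges granted to the dedicated user rather than from a connection flag.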
Enforcing data privacy
Data that’s loaded into the notebook from any type of data source is no longer protected from prying eyes. Therefore, each piece of information in the source data set needs to be classified into one of these categories:
- Category 1: values that are not sensitive and are potentially relevant to the analysis; these can be shared as-is. In our deployment tracking service for Bluemix, one such value is the type of Bluemix service bound to a particular application, for example a Cloudant NoSQL database.
- Category 2: values that are sensitive or irrelevant to the analysis — these must not be shared. While the type of a bound Bluemix service instance is relevant, the assigned globally unique identifier 9ab…32f for the Cloudant NoSQL database does not contribute to new insights.
- Category 3: values that are sensitive but relevant to the analysis — or at least required in order to perform the analysis — these must not be shared as-is. An example of such a value would be a key that can be used to combine data sets, or a key that represents a unique identifier that’s needed to perform aggregations.
To ensure data privacy we created a view of the data in the source database objects that provides raw, filtered or masked access, as needed:
- Category 1 values are represented in their original form.
- Category 2 values are omitted from the data set.
- Category 3 values are processed using a one-way cryptographic hash function. The calculated hash values retain the identifying properties needed to correlate records within a data set and could still be used to combine data from multiple data sets, without exposing the true values.
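The three rules can be sketched in a few lines of Python. In our setup the filtering happens in a database view, so this is only an illustration of the logic, not the view definition itself; the column names and their classification are hypothetical, and SHA-256 is an assumed choice of hash function.

```python
import hashlib

# Hypothetical classification of columns into the three categories.
SHARE_AS_IS = {"service_type"}  # category 1
OMIT = {"service_guid"}         # category 2
HASH = {"app_id"}               # category 3


def mask_value(value):
    """Return a one-way SHA-256 digest of the value as a hex string."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def masked_view(rows):
    """Apply the category rules to a list of row dictionaries."""
    masked = []
    for row in rows:
        out = {}
        for column, value in row.items():
            if column in SHARE_AS_IS:
                out[column] = value
            elif column in HASH:
                out[column] = mask_value(value)
            # Category 2 and unclassified columns are dropped.
        masked.append(out)
    return masked
```

Equal input values produce equal digests, so masked keys can still be grouped and joined. Note that short or guessable identifiers can be recovered from an unkeyed hash by trying candidate values, so a keyed hash with a secret may be a better fit for such columns.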
A relational solution
In this post I’ve outlined some of the lessons we’ve learned during our first attempt to expose non-public data from a relational data source in a Jupyter notebook to enable internal users to perform their own custom analysis.
Some of the approaches we’ve taken are likely not suitable for other data source types or other types of data, and more investigation is needed on how to properly protect those. Let us know how you’ve secured data access for your notebooks in the comments below.