Data Privacy Stewardship

How to have your app data and analyze it, too: a guide to upholding data privacy standards and best practices for application developers.

Anne Nasato
Slalom Build
12 min read · Jan 24, 2023


Data privacy often comes across as an additional feature to bolt onto digital systems, if not a nuisance to think about and comply with. It is less embedded in the development process than security and operations, but one could argue that it is equally important, and its importance is only increasing. In fact, these aspects of digital design and development are not mutually exclusive; they are highly intertwined components of well-architected, well-engineered systems.

Photo by AbsolutVision on Unsplash

The purpose of this article is to increase awareness of data privacy in a design and development context, while also demonstrating that data privacy need not be as burdensome or complicated as one might perceive.

A brief history of privacy

It may sound strange to learn that data privacy as a legal concept has its origins in the late 1700s with the creation of the US Constitution. Several amendments refer to the individual’s right to privacy, although at that time privacy concerned tangible items. While the concept of privacy long predates the modern era, data privacy really started to take shape in the 1960s and 70s, when personal information was increasingly collected in the form of surveys, mail, and other hard-copy formats. A series of legal proceedings evolved the definition of privacy from referring solely to tangible items to anything one “seeks to preserve as private.”

Around this time, the use of computers was also becoming more widespread. One might expect this to have put a spotlight on privacy as well, but it did not; privacy concerns centred on wiretaps and the opening of paper mail, while security was the priority for computer systems.

Despite the creation of regulations such as the EU Data Protection Directive (1995), the Health Insurance Portability and Accountability Act (HIPAA, 1996), and the Children’s Online Privacy Protection Act (COPPA, 1998), online data privacy did not get its moment until the late 2010s. While security has always been front and centre (with good reason), privacy only really became a talking point with the EU’s General Data Protection Regulation (GDPR) in 2018. It was at this time that even the everyday computer user started to think about how their personal information was being stored, processed, and used.

What is PII?

At Slalom, when we enter project engagements with our clients, we often explicitly and (importantly) contractually call out that we will not work with PII in non-production environments. PII stands for Personally Identifiable Information, which the US Department of Labor defines as follows:

Any representation of information that permits the identity of an individual to whom the information applies to be reasonably inferred by either direct or indirect means.

Some examples of PII include full name, social security number, and email address. It doesn’t stop there, though; there is also such a thing as “composite PII.” This is data that may not directly identify someone on its own, but can identify individuals when combined with other seemingly benign data. These composite datasets must be treated as PII.

There are several ways to approach designing systems to uphold privacy best practices and regulations, some of which will be covered in this article. Before proceeding, it’s important to understand that while privacy and security go hand in hand, they are not the same thing. A secure system is not necessarily a private system, and vice versa. More on this below.

Designing (and developing) for privacy

As data engineers and architects, we have a responsibility to keep data privacy top of mind throughout the design and development process. There are several aspects to this, for example:

  • Classifying data according to risk level
  • Understanding how data will be stored, used, and accessed
  • Understanding regulations impacting the industry of the entity that owns the data
  • Understanding regulations impacting the geographic regions in which that entity operates

While this list is not exhaustive, these data privacy components and practices are layered on top of a secure architecture. A secure architecture, as a baseline, ensures certain properties: data is encrypted at rest and in transit; the principle of least privilege is followed when granting system access to people and other systems; and the network configuration allows required traffic while denying anything and everything else.

Data privacy encompasses the above, and then some. When evaluating how data should be handled, we often ask the following questions:

  1. Where is this data sourced from?
  2. How is this data consumed? (Typically either by downstream users or applications. This may also encompass how often the data is consumed.)
  3. Where does this data need to reside? (This refers to the geographic region of the hardware housing the data. There are often legal requirements that dictate this.)
  4. How long does the data need to be retained? (Also important: how should this data be handled when it is decommissioned?)
  5. Who should have access to the data? (And are there different views for different consumers? Meaning, it is often the case that certain users can access all data, including high-risk, while others only have access to desensitized data.)
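As a lightweight illustration, the answers to these questions can be recorded as a per-dataset handling profile that travels with the design documentation. The sketch below is hypothetical; the field names and policy values are assumptions, not a standard.

```python
# Hypothetical per-dataset "handling profile" capturing the answers to the
# five questions above; names and values are illustrative only.
from dataclasses import dataclass

@dataclass
class DataHandlingProfile:
    source: str                   # 1. where the data is sourced from
    consumers: list[str]          # 2. who/what consumes it, and how often
    residency: str                # 3. region the data must reside in
    retention_days: int           # 4. how long it must be retained
    access_tiers: dict[str, str]  # 5. who sees what (full vs. desensitized)

claims_profile = DataHandlingProfile(
    source="claims application (CosmosDB)",
    consumers=["analytics team (daily)", "claims app (real time)"],
    residency="canada-central",
    retention_days=7 * 365,
    access_tiers={"super_user": "all fields", "analyst": "non-PII subset"},
)
```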

The responses to questions such as those listed above, together with a secure platform architecture, are key to helping shape a system that ensures data privacy. The importance of this cannot be overstated. Data privacy protects both the data subject and the data owner.

Have your app data & analyze it, too

For anyone who works with software engineers, it is no surprise that NoSQL databases such as Azure CosmosDB are preferred for many applications. NoSQL databases can offer greater performance and flexibility than their relational counterparts, which translates to a better user experience and more easily evolving application components. Lower-latency reads enable a snappier interface, and flexible schemas allow fields to be added, removed, and evolved as application use cases and requirements change.

However, application owners commonly want to analyze this collected data in order to glean insights that will help improve their business and/or operations. While a NoSQL database can serve the application layer well (meaning it is performant and can provide the necessary security and privacy controls), it may not be the best choice for the analytics layer, where security and privacy controls are typically more difficult to implement. This is because granular access control at the field level is more challenging in a NoSQL database than in a relational (SQL) database. It is possible to achieve field-level access control within a NoSQL table item, but it will likely require a custom-coded solution, which in turn requires additional custom testing and maintenance. These challenges are further amplified as the database itself evolves.
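To make that concrete, below is a minimal sketch of the kind of custom masking logic such a solution tends to involve; the sensitive field names are hypothetical.

```python
# Minimal sketch of custom field-level masking for NoSQL documents.
# The sensitive field names below are hypothetical examples.
SENSITIVE_FIELDS = {"ssn", "claimantName", "claimantEmail"}

def mask_document(doc: dict, viewer_allowed_fields: set) -> dict:
    """Return a copy of a document with sensitive fields redacted,
    unless the viewer is explicitly allowed to see them."""
    return {
        key: ("***REDACTED***"
              if key in SENSITIVE_FIELDS and key not in viewer_allowed_fields
              else value)
        for key, value in doc.items()
    }

# Example: an analyst sees redacted values; a super user sees everything.
doc = {"claimId": "CLM-001", "claimantName": "Jane Doe", "claimAmount": 2450.0}
print(mask_document(doc, viewer_allowed_fields=set()))
print(mask_document(doc, viewer_allowed_fields=SENSITIVE_FIELDS))
```

Every read path has to apply this logic consistently, and the sensitive-field list must be kept in lockstep with the evolving schema, which is exactly the maintenance burden described above.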

A more straightforward solution is to house the analytics data in a SQL database. This enables responsible data handling out of the box, without a need for custom development or maintenance, resulting in increased peace of mind around an entity’s data stewardship practices. The main requirement here is to understand the data being collected at the field level, in order to perform a sensitivity analysis. From here, there are a couple of options.

Sample architecture for separate application and analytics databases on Azure. This focuses specifically on database options and not downstream consumers such as data visualization tools or long-term data storage.

Enabling app analytics without compromising privacy

Once a field-level sensitivity analysis has been performed (including analysis for composite PII), PII fields can be distinguished from non-PII fields.
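A hypothetical output of such an analysis for a claims application might look like the following; the field names and classifications are illustrative, not prescriptive.

```python
# Hypothetical result of a field-level sensitivity analysis for a claims
# application; field names and classifications are illustrative only.
FIELD_CLASSIFICATION = {
    "claimId":       {"pii": False, "composite_pii": False},
    "claimStatus":   {"pii": False, "composite_pii": False},
    "claimAmount":   {"pii": False, "composite_pii": False},
    "submittedDate": {"pii": False, "composite_pii": False},
    "claimantName":  {"pii": True,  "composite_pii": False},
    "claimantEmail": {"pii": True,  "composite_pii": False},
    # Postal code or birth date alone may seem benign, but combined they
    # can identify individuals, so both are flagged as composite PII.
    "postalCode":    {"pii": False, "composite_pii": True},
    "birthDate":     {"pii": False, "composite_pii": True},
}

# Fields eligible for replication to the analytics database:
ANALYTICS_FIELDS = [
    field for field, c in FIELD_CLASSIFICATION.items()
    if not (c["pii"] or c["composite_pii"])
]
# -> ["claimId", "claimStatus", "claimAmount", "submittedDate"]
```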

One potential design is to load all the data from the NoSQL application database to the SQL analytics database.

Pros:

  • Straightforward pipeline setup (essentially a SELECT * on all fields). However, if hashing or encryption is to be applied as part of data processing, this “Pro” becomes a “Con,” as the pipeline setup is no longer simpler than the alternative solution.
  • Minimal future pipeline maintenance.

Cons:

  • Requires more attention to how sensitive fields are handled in the analytics database (fortunately, most modern cloud relational database systems have features for this out of the box; see Azure SQL Database Dynamic Data Masking, sketched after this list).
  • Introduces more risk of exposing sensitive data to unauthorized users if obfuscation is incorrectly applied, accidentally removed, etc.
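As a hedged sketch of what that out-of-the-box handling can look like, the snippet below applies Dynamic Data Masking to hypothetical columns via T-SQL submitted from Python; the table name, column names, and connection string are all assumptions.

```python
# Hedged sketch: applying Azure SQL Database Dynamic Data Masking so that
# unprivileged queries see obfuscated values. The table/column names and
# connection string are assumptions for illustration.
import pyodbc

conn = pyodbc.connect("<analytics-database-connection-string>")
cursor = conn.cursor()

# Built-in masking functions include default(), email(), and partial().
cursor.execute(
    "ALTER TABLE dbo.Claims ALTER COLUMN claimantEmail "
    "ADD MASKED WITH (FUNCTION = 'email()');"
)
cursor.execute(
    "ALTER TABLE dbo.Claims ALTER COLUMN claimantName "
    "ADD MASKED WITH (FUNCTION = 'default()');"
)
conn.commit()
```

Users granted the UNMASK permission see cleartext values; everyone else sees the masked output.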

Another potential design is to only load a subset of fields from the NoSQL application database to the SQL analytics database. Specifically, fields that will enable meaningful analytics while also minimizing the risk of mishandling sensitive data.

Pros:

  • Lower risk of exposing sensitive data to unauthorized users, since PII (including composite PII) is not available in the analytics database.
  • For the improved risk profile, the pipeline maintenance effort is minimal and comparable to that of the first option; the initial query setup is more involved, but field additions, removals, and changes beyond that are straightforward.

Cons:

  • Increased potential for iteration on the initial pipeline and database setup. May require some testing with analytics users to determine whether the available data is sufficient to drive valuable insights, or whether additional fields are required.

Architecting for apps & analytics

In architecting an application solution with privacy in mind, the consumption and utilization of data for both the application and analytics need to be carefully considered.

Without any custom code to mask individual key fields upon document retrieval from the NoSQL application database, users can view all the data in a given document when performing analytics. While the application API layer may only serve certain fields based on the request, this is not as efficient for analytics users who typically perform analysis and aggregations over the complete dataset.

Data in the application database should be encrypted at rest, with fine-grained access control applied. While authentication and authorization are handled by the application, users should be granted access at the document level. This can be achieved via resource tokens.

For example, if your application processes claims data, then the claimant and the rep should only have access to the document(s) containing information related to their particular claims. This can be achieved by leveraging users’ unique user IDs within the documents along with application authentication and authorization.

Granting users access at the item (also referred to as “document”) level is the most granular way to control application database access in CosmosDB without requiring custom code.
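As a rough sketch of what this might look like with the azure-cosmos Python SDK (the account, database, container, IDs, and partition-key scheme are all illustrative assumptions):

```python
# Hedged sketch: issuing a CosmosDB resource token that scopes a claimant
# to read-only access over their own partition. Names and the partition
# key scheme (claimantId) are illustrative assumptions.
from azure.cosmos import CosmosClient

client = CosmosClient(
    url="https://claims-account.documents.azure.com:443/",
    credential="<account-key>",
)
database = client.get_database_client("ClaimsDB")
container = database.get_container_client("Claims")

# Create an application-level user, then scope a read-only permission to
# the claimant's own partition.
user = database.create_user(body={"id": "claimant-12345"})
permission = user.create_permission(body={
    "id": "read-own-claims",
    "permissionMode": "Read",
    "resource": container.container_link,
    "resourcePartitionKey": ["claimant-12345"],
})
# The token is returned to the client app, which uses it in place of keys.
resource_token = permission.properties["_token"]
```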

Additionally, it is common to have a small (but mighty) group of super users who can access everything, but only in extenuating circumstances and not as part of day-to-day business operations.

Analytics users should have their own roles and permissions. Based on the segregated data architecture proposed in this article, an analytics user’s permissions set should not include access to the application database, and instead focus solely on the analytics database. Azure SQL Database allows for permissions to be applied at the logical server level as well as at the database level.
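For instance, a read-only analytics principal might be provisioned along these lines; the names are hypothetical, and the statements assume a connection as an admin of the analytics database.

```python
# Hedged sketch: an analytics reader that exists only in the analytics
# database and has no grants on the application database at all.
import pyodbc

conn = pyodbc.connect("<analytics-database-admin-connection-string>")
cursor = conn.cursor()

# Map an Azure AD group to a contained database user, then grant read-only.
cursor.execute("CREATE USER [analytics-readers] FROM EXTERNAL PROVIDER;")
cursor.execute("ALTER ROLE db_datareader ADD MEMBER [analytics-readers];")
conn.commit()
```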

Creating a separate database for analytics users is a good practice to lower the risk of improper PII data handling. While Azure SQL Database is the recommended database in this article, there are other options available, such as separate Azure CosmosDB containers.

The implication of having an application database and an analytics database is that there are now two databases in the system architecture, as opposed to just one.

This setup may seem daunting or overly complicated, but the two databases serve separate layers of the overall system. Additionally, this setup is the simplest way to support a performant and flexible application while enabling an analytics environment with which to drive valuable insights.

If this system is not immediately feasible to implement, a decision must be made about which type of database to proceed with. If the application is the priority over analytics, a NoSQL database likely makes the most sense. However, if the application and analytics are of equal importance, or if analytics is the priority, a SQL database can serve the application features while also enabling responsible analytics. This forces the system owners to proceed with the first option in “Enabling app analytics” above.

The next section assumes that the dual-database design can be achieved, and thus follows the second option, in which only a subset of fields is made available for analysis.

Sample Solution in Azure

While this article abstracts away the complexity of developing an application and interfacing it with databases, it is intended to communicate the feasibility of developing privacy-compliant data systems.

The following demonstration leverages Azure services but could be achieved in AWS, GCP, or any other platform (cloud or on-premises). The services used are the following:

  • CosmosDB for the application database
  • Synapse Analytics (Pipelines) for data transfer between the application database and analytics database
  • Azure SQL Database for the analytics database

The sample use case for this demonstration is an insurance claims application. This data often contains PII but may also need to be analyzed, so it is a fitting real-world example for this topic.

Claims data is written to CosmosDB by the application. Items in CosmosDB are in JSON format. Access to this dataset should be locked down such that only users whose “repId” or “claimantId” matches a given claim, or users with super-user status, can view the data.

Azure Data Explorer view of CosmosDB items.
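For illustration, a single claim item might be shaped like the following; apart from “repId” and “claimantId,” the field names and all values are fabricated.

```python
# Hypothetical claim item as stored in the CosmosDB application database.
claim_item = {
    "id": "c0a8f3e2-7d1b-4f6e-9a2d-1b2c3d4e5f60",
    "claimId": "CLM-2023-000123",
    "claimantId": "claimant-12345",       # drives document-level access
    "repId": "rep-67890",                 # drives document-level access
    "claimantName": "Jane Doe",           # PII: stays out of analytics
    "claimantEmail": "jane@example.com",  # PII: stays out of analytics
    "claimStatus": "OPEN",
    "claimAmount": 2450.00,
    "submittedDate": "2023-01-03",
}
```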

At this time, the analytics database is empty.

Azure Data Studio Notebook view of analytics database “CREATE TABLE” and “SELECT” statements, with no data present (yet). Pay special attention to the data types used for fields in the analytics database. The above image is a simplified example.
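Since the image is simplified, here is a hedged sketch of what such a table definition might look like; the column names and types are assumptions, chosen to match the non-PII subset identified earlier.

```python
# Hedged sketch: creating the (initially empty) analytics table in Azure
# SQL Database. Note that no PII columns exist in this schema at all.
import pyodbc

conn = pyodbc.connect("<analytics-database-connection-string>")
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE dbo.Claims (
        claimId       VARCHAR(36)   NOT NULL PRIMARY KEY,
        claimStatus   VARCHAR(20)   NOT NULL,
        claimAmount   DECIMAL(12,2) NOT NULL,
        submittedDate DATE          NOT NULL
    );
""")
conn.commit()

cursor.execute("SELECT COUNT(*) FROM dbo.Claims;")
print(cursor.fetchone()[0])  # 0 rows, as expected before the pipeline runs
```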

The subset of data from the application database corresponding with the columns in the analytics database will be transferred via Synapse Pipelines. To set this up, the CosmosDB dataset and the Azure SQL Database dataset both need to be configured as Integration Datasets in the Linked Data section of Synapse.

Synapse Workspace Linked Data, Integration datasets with the CosmosDB application database (AppData) and the Azure SQL Database analytics database (AnalyticsData).

In order to enable Synapse Analytics to read data from the application database, the CosmosDB account Networking rules will need to be updated to allow access from Synapse Workspace. The same applies to the Azure SQL Server on which the Azure SQL Database is hosted. This can be configured on the Networking blade from the left-side menu in each respective service interface.

Data movement will occur via a “Copy data” activity in Synapse Pipelines. This is under the “Move & transform” Activities category. This is one of the most basic activity types available in Azure Synapse Analytics.

Synapse Workspace Copy data activity for data pipelines.

The “Copy data” activity requires both the “Source” and the “Sink” to be configured. In this example, the source is the application database (CosmosDB).

Source configuration for Synapse Copy data activity.

The “Query” field specifies the fields to be selected from the application database and copied into the analytics database as part of the Copy data activity.
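For example, a source query limited to the non-sensitive subset might look like the following; the field names are the same illustrative ones used throughout this article.

```python
# Hedged sketch: the CosmosDB SQL query pasted into the "Query" field of
# the Copy data activity's Source; it selects only the non-PII fields.
SOURCE_QUERY = """
SELECT c.claimId, c.claimStatus, c.claimAmount, c.submittedDate
FROM c
"""
```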

The sink is the analytics database (Azure SQL Database).

Sink configuration for Synapse Copy data activity.

After configuration is complete, all edits can be validated and then published. For this basic tutorial, there is no Git repository configured and Synapse Live mode is utilized.

There are several ways to initiate pipelines in Synapse: they can be triggered by events, run on a schedule, or kicked off manually. For the purposes of this tutorial, the pipeline is run in Debug mode. The pipeline is first “Queued,” then its status becomes “In Progress,” and the final outcome is “Succeeded” or “Failed.” This is visible in the Debug interface of the workspace.
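As an alternative to the Debug UI, a run can also be kicked off and polled programmatically; the sketch below assumes the azure-synapse-artifacts package, and the workspace endpoint and pipeline name are hypothetical.

```python
# Hedged sketch: triggering the pipeline and polling its status with the
# azure-synapse-artifacts SDK. Endpoint and pipeline name are assumptions.
from azure.identity import DefaultAzureCredential
from azure.synapse.artifacts import ArtifactsClient

client = ArtifactsClient(
    endpoint="https://claims-workspace.dev.azuresynapse.net",
    credential=DefaultAzureCredential(),
)

run = client.pipeline.create_pipeline_run("CopyClaimsToAnalytics")
status = client.pipeline_run.get_pipeline_run(run.run_id).status
print(status)  # e.g. "Queued", "InProgress", "Succeeded", or "Failed"
```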

Successful Synapse Pipelines run.

To confirm the pipeline has successfully completed, check for data in the analytics database.

Specified subset of application data from CosmosDB in Azure SQL Database analytics database.

The data that now exists in the Azure SQL Database upholds data privacy while enabling analytics. Should a record ever need to be mapped back to its cleartext source, users with appropriate access can query the “claimId” against the application database. However, this is rarely required for analytics, which relies more on aggregations for trend mapping and pattern identification in the data.

Closing notes & recommended resources

While the ingestion, storage, and consumption of sensitive data all require awareness of regulations and best practices, the decommissioning of this data also calls for special care in its handling. For example, the state of California requires the “shredding” of PII at this phase in its lifecycle.

Different regions have different rules for data privacy compliance, and it is the responsibility of architects and engineers to be aware of these standards. One go-to resource worth bookmarking is the International Association of Privacy Professionals’ (IAPP) “Tools & Trackers” collection, especially the Global Privacy Law and DPA Directory.

Another useful guiding resource is Privacy by Design. This framework consists of seven principles and was developed by Ann Cavoukian in her role as Information and Privacy Commissioner of Ontario, Canada, in 2009. Privacy by Design is less specific and prescriptive than existing rules and regulations, but it provides a good sense of how to think about building systems that ensure responsible privacy stewardship.

As per the fourth Privacy by Design principle, developing privacy-compliant systems is a “win-win” for the data subject and the data owner. While data privacy is often discussed in high-profile trials and regulatory hearings, it does not necessarily require high-complexity systems to be achieved. Hopefully, through time and practice, data privacy will become as embedded in the design and development process as security and operations. Awareness of data privacy is the first step.
