Security in Advanced Analytics and Machine Learning Environments

Jan D'Herdt
Aug 25, 2022 · 11 min read


1. Introduction to Advanced Analytics Environments

According to Gartner, Advanced Analytics (AA) is the autonomous or semi-autonomous examination of data or content using sophisticated techniques and tools, typically beyond those of traditional business intelligence (BI), to discover deeper insights, make predictions, or generate recommendations. Advanced analytic techniques include data/text mining, machine learning, pattern matching, forecasting, visualization, semantic analysis, sentiment analysis, network and cluster analysis, multivariate statistics, graph analysis, simulation, complex event processing, and neural networks.[1]

You can look at AA as an umbrella term for a wide variety of analytics techniques and tools that work together. Most commonly, the term refers to data mining, machine learning, big data analytics, forecasting and, more generally, finding patterns in data.[2]

Tasks that can be executed using advanced analytics include:

Segmentation — grouping items based on similarities (a minimal sketch follows this list)

Classification — assigning unclassified elements to categories based on shared qualities

Correlation — identifying relationships between element properties

Forecasting — derivation of future values

Association — identification of the frequency of joint occurrences and the derivation of rules such as “C usually follows A and B”.
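
To make this concrete, here is a minimal sketch of the segmentation task using k-means clustering; it assumes scikit-learn and NumPy are installed, and the data and number of clusters are purely illustrative.

```python
# Minimal segmentation sketch: group customers by two numeric features.
# Assumes scikit-learn and numpy are installed; the data and the choice
# of three clusters are illustrative only.
import numpy as np
from sklearn.cluster import KMeans

# Toy feature matrix: [monthly_spend, visits_per_month]
customers = np.array([
    [20, 1], [25, 2], [22, 1],      # low-spend, infrequent
    [80, 8], [95, 10], [90, 9],     # high-spend, frequent
    [50, 4], [55, 5], [48, 4],      # mid-range
])

model = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = model.fit_predict(customers)
print(labels)  # one cluster label per customer (label numbers are arbitrary)
```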

Advanced Analytics Environments (AAEs) used by data scientists and data engineers are (cloud) environments in which users need access to set up and use Compute Instances (CIs), production/real data, models, repositories, etc.

The AAE thus combines development-style access requirements with the availability of real production data.

2. Security challenges within AA

Looking at the requirements and the expectations for an Advanced Analytics Environment (AAE) from a security perspective, we notice the following elements:

a) Development-like work in a production environment

An AAE is production level and should be treated as such: the infrastructure and data are to be considered production. At the same time, data scientists do development-like work while exploring data, require full data access, and experiment with the technologies available to them. One could argue that data analysts do the same type of work and require full access as well.

Most companies fall into the trap of name juggling between developing a model and productionizing it. Confusion over terms and environments happens, especially when talking to infrastructure and other teams who are accustomed to these terms from typical software engineering practices.

Once the technology choices are made and models are prepared and trained, you could (and should) switch to another activity model, more in line with software engineering practices: build and package a model in development, deploy it to acceptance for testing, and eventually run it in the production environment to train models, track progress, select best-fit models, deploy, and so on.
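
As a minimal illustration of the packaging step, the sketch below serializes a trained model together with a small manifest that the acceptance and production environments could verify before deployment. The file names, manifest fields, and stage names are assumptions for illustration, not a prescribed format.

```python
# Package a trained model plus a manifest for promotion to acceptance/production.
# Assumes scikit-learn and joblib are installed; paths, model name, and manifest
# fields are illustrative only.
import hashlib, json, joblib
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit([[0, 0], [1, 1]], [0, 1])  # placeholder training
joblib.dump(model, "model.joblib")

with open("model.joblib", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

manifest = {
    "name": "churn-model",          # hypothetical model name
    "version": "0.1.0",
    "sha256": digest,               # lets acceptance/production verify the artifact
    "target_stage": "acceptance",
}
with open("manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```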

b) Elevated/privileged access to the environment for users

The access data scientists require is similar to the access a developer would require: they need to be able to set up, deploy, and access resources, libraries, containers, CIs, etc.

Additionally, data scientists require access to the data. Their models and analyses should run on real(-time) data, not on scrambled or randomized data.

c) Ability (and will) to run unknown or custom code (mostly Python)

Data scientists will want to run unknown or custom code, and code re-use is a given. To support this, stricter validation needs to be in place. Not only do we require validation, we also need to monitor and follow up on future vulnerabilities, just as you would with any other software running in your environment.
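
One way to keep following up on vulnerabilities in the dependencies of that custom code is to run a dependency audit as part of the validation step. The sketch below wraps the pip-audit CLI, assuming it is installed (flags and exit-code behaviour can differ between versions), and fails the check when findings are reported.

```python
# Fail a validation step when dependency vulnerabilities are found.
# Assumes the pip-audit CLI is installed; the requirements file path is illustrative.
import subprocess
import sys

result = subprocess.run(
    ["pip-audit", "-r", "requirements.txt"],
    capture_output=True,
    text=True,
)
print(result.stdout)
if result.returncode != 0:
    # pip-audit exits non-zero when vulnerabilities (or errors) are reported.
    sys.exit("Dependency audit failed: review the findings before running this code.")
```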

d) Volatile environment with CIs being spun up and down constantly

Often there are requirements for more compute power (CPU or GPU). This comes at a cost, and we want to avoid massive running costs without proper utilization. This is where the cloud environment comes in nicely[3]: the cloud allows us to scale up and down whenever we require. We must ensure those CIs don't allow compromise of our environment; we wouldn't want an attacker with the right timing to gain access to a temporary CI and from there compromise the rest of the environment.

e) CIs should be Cattle[4]

CIs are often still treated too much like pets: they are long-lived, assigned to individuals, and so on.

Ideally, all infrastructure is elastic in both directions, helping to optimize costs: scaling up if required and scaling down to zero if unused. By treating CIs as Cattle, you assume there will be a few bad ones in the bunch that you need to be able to (easily) replace, rebuild, and destroy. It also means you won't be micromanaging your CIs. As a result, we will need to build additional fences around our Cattle and ensure we can let them run free in our contained environment.

3. Data Driven Security

By focusing on data, we focus on the company's core assets. The interesting part when looking at it from a data perspective is that we have CIs that we should treat as Cattle. The Cattle, however, are eating our grass, which is our production data. From a data security perspective, the Cattle could create a risk to Confidentiality, Integrity, and Availability.

3.1 Integrity and Availability

Integrity and availability can be addressed by hosting the root data on data stores that are set up redundantly and configured as read-only for the Cattle. As a result, the Cattle can consume the root data but not alter or destroy it. If changes to the data need to be made, a local or specific copy of the data set is created, leaving the original data untouched. This gives you the ability to compare your results to the original input and potentially provides an audit trail of your data and data alterations (sometimes a very strict business requirement).

From a security perspective, we want to ensure the root data is only writable by the corresponding ingestion process. You can have data derived from the root data, or copies of the data on which you are allowed to make alterations.

Policies and access rules can be written to enforce these access and data principles. Functional usage won't be impacted, because users can still easily create and work with variations of the root data.
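
What such a policy looks like depends on the platform. As one hedged example, assuming the root data lives in an AWS S3 bucket, the sketch below denies writes and deletes to everyone except a hypothetical ingestion-pipeline role; the bucket name and role ARN are made up.

```python
# Make the root data bucket writable only by the ingestion pipeline.
# Assumes AWS S3 and boto3; the bucket name and role ARN are hypothetical.
import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "RootDataReadOnlyForEveryoneElse",
        "Effect": "Deny",
        "NotPrincipal": {"AWS": "arn:aws:iam::123456789012:role/ingestion-pipeline"},
        "Action": ["s3:PutObject", "s3:DeleteObject"],
        "Resource": "arn:aws:s3:::root-data-bucket/*",
    }],
}

s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket="root-data-bucket", Policy=json.dumps(policy))
```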

3.2 Confidentiality

This is the more challenging area, since we have users with elevated privileges who want to run their own code from different repositories, containers, and locations.

Recent history has shown that software supply chain attacks[5] through repositories or container registries can introduce vulnerabilities or even malicious content[6]. Once such code runs on the CIs, it is possible that a C2[7], CryptoLocker[8], or other malicious payload is running in the environment.

We would like to ensure these CIs can only talk to the resources they require, to prevent potential contamination in case of infection or compromise. Since the CIs are short-lived and very volatile, we can't rely on antivirus or antimalware on the CIs (also note that most containers don't ship with a virus or malware scanner), so we need additional controls.

One of these additional controls is a firewall. We would like to ensure that the CIs can't talk to the outside world unless they really need to, and that they can't talk to each other (segmentation). Connectivity needs to be limited to avoid potential data leakage; at the same time, it also helps reduce the attack surface.
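
How the firewall rules are expressed again depends on the platform. As a hedged example, assuming the CIs sit behind an AWS security group, the sketch below removes the default allow-all egress rule and only allows HTTPS towards an internal repository subnet; the group ID and CIDR range are made up.

```python
# Restrict CI egress: drop the default allow-all rule, then allow HTTPS to an
# internal repository subnet only. Assumes AWS EC2 security groups and boto3;
# the group ID and CIDR range are hypothetical.
import boto3

ec2 = boto3.client("ec2")
group_id = "sg-0123456789abcdef0"

# Remove the default "allow all outbound" rule.
ec2.revoke_security_group_egress(
    GroupId=group_id,
    IpPermissions=[{"IpProtocol": "-1", "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}],
)

# Allow HTTPS only towards the internal repository manager subnet.
ec2.authorize_security_group_egress(
    GroupId=group_id,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "IpRanges": [{"CidrIp": "10.0.5.0/24", "Description": "internal repo manager"}],
    }],
)
```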

To address the security of the containers and repositories, you can consider placing a repository manager[9] between the CIs and public repositories. This way you can enforce which repositories are and are not allowed to be used. It also gives you an overview of which repositories are in use and how many vulnerable ones you are consuming. Additionally, you would place a container registry scanning solution in front of your container deployments. A scan provides insight into the vulnerabilities in a container, allowing you to detect bad (or outdated) containers and prevent them from being used. Misconfigurations would also be detected by these tools.
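
A lightweight way to gate deployments on scan results is to call a scanner such as Trivy from the deployment pipeline and stop when high or critical findings appear. The sketch below assumes the Trivy CLI is installed; the image name is illustrative and flags may differ between versions.

```python
# Block deployment when the container image has HIGH/CRITICAL vulnerabilities.
# Assumes the Trivy CLI is installed; the image name is illustrative.
import subprocess
import sys

image = "registry.internal/analytics/notebook:latest"
result = subprocess.run(
    ["trivy", "image", "--severity", "HIGH,CRITICAL", "--exit-code", "1", image],
    capture_output=True,
    text=True,
)
print(result.stdout)
if result.returncode != 0:
    sys.exit(f"Image {image} failed the vulnerability scan; do not deploy.")
```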

Network access to these CIs should also be limited and only granted on a real need based on data flows; preferably, end-user access to the CIs is only possible via dedicated jump hosts with strong MFA (identity is the new perimeter). This will help reduce the threat landscape and attack surface in case you are running vulnerable CIs.

CIs are also often built once and never changed afterwards, implying that a security scan at build time is sufficient. When these CIs are long-lived and users can install packages, you can still end up with a vulnerable CI. It is crucial that a CI is built from a registry image that has been scanned and contains no known vulnerabilities, and every time a new CI is created from that registry, a verification should be made to ensure no vulnerabilities are present. The CIs should be decommissioned as soon as they are no longer required; this can be after several hours or even a few days, but a week or more should be avoided.

4. Reference architecture with security controls

Advanced analytics platforms typically have the following phases[10]:

  • data ingestion
  • data storage
  • processing
  • presentation (and visualization)

The four phases of advanced analytics platforms

Every phase has different actors involved and, as a result, different threats. As there are different actors, it is important to implement Role-Based Access Control and ensure access rights are granted only when needed.

Next, we will go through each of the phases and highlight some potential threats.

4.1 Data ingestion

In the data ingestion phase, data from different sources is accessed and ingested for consumption and processing. The data can come from public or private sources and is ideally ingested via pipelines. To protect against potentially malicious data, we need to verify the data before consuming it, for example by running an anti-virus/anti-malware check on the data, preferably before storage. Additionally, we want to avoid arbitrary data being ingested; this can be done by configuring firewall rules so that only authorized sources can be accessed. Sometimes you might not need certain elements of a data set, and due to regulations or privacy concerns you might want to mask, obfuscate, or alter data before storage. This is typically done in the ingestion pipeline.
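
As a small illustration of masking in the ingestion pipeline, the sketch below hashes an email column before the data is written to storage; the column names and the salt handling are assumptions for illustration only.

```python
# Mask direct identifiers in the ingestion pipeline before storage.
# Assumes pandas is installed; column names and the salt are illustrative.
import hashlib
import pandas as pd

SALT = "replace-with-a-secret-salt"  # in practice, load this from a secret store

def mask(value: str) -> str:
    """One-way hash so the value can still be used as a join key."""
    return hashlib.sha256((SALT + value).encode()).hexdigest()

raw = pd.DataFrame({
    "email": ["alice@example.com", "bob@example.com"],
    "purchases": [3, 7],
})
raw["email"] = raw["email"].map(mask)
raw.to_csv("customers_masked.csv", index=False)  # only the masked version is stored
```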

4.2 Data storage

The data storage phase ensures that all data is brought together in a consumable location; data lakes are often used for this. As highlighted in the ingestion phase, the data should be verified and scanned before it is consumed. The data should only be written by the ingestion phase, and no other actor can or should write to the data storage. As mentioned in section 3, to ensure data integrity it is best to set the data to read-only. When confidential data is being dealt with, it is advised to encrypt it. It is also important to understand that not all data needs to be stored: only data that is to be consumed should be stored.
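
For confidential data, one option on top of storage-level encryption is to encrypt sensitive blobs client-side before they land in the lake. Below is a minimal sketch using the cryptography package's Fernet recipe; key handling is deliberately simplified, as in practice the key would come from a key management service.

```python
# Encrypt a sensitive blob before writing it to the data lake.
# Assumes the "cryptography" package; key handling is simplified on purpose:
# in practice the key comes from a key management service, not generated inline.
from cryptography.fernet import Fernet

key = Fernet.generate_key()        # illustrative only; store/retrieve via a KMS
fernet = Fernet(key)

plaintext = b"customer_id,balance\n42,1337.50\n"
ciphertext = fernet.encrypt(plaintext)

with open("balances.enc", "wb") as f:
    f.write(ciphertext)

# Later, an authorized consumer holding the key can decrypt:
assert fernet.decrypt(ciphertext) == plaintext
```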

4.3 Processing

During the processing phase, instances are provisioned and decommissioned. The instances are built from a container registry and use the repositories and libraries available to them.

We already pointed out in section 3 that the Compute Instances running in this phase should be segregated and not be able to talk to each other.

The container images and repositories or libraries are often available on public resources (e.g. GitHub, Anaconda). To ensure the CIs don't load malicious libraries or container images with vulnerabilities, we can use a library and repository manager together with a container registry scanning tool. We could start with a whitelist or blacklist approach for the registries and repositories/libraries.
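
For Python libraries specifically, pointing pip at the repository manager instead of the public index is a simple way to enforce which sources are allowed. A sketch, assuming a hypothetical internal proxy URL:

```python
# Install packages only through the internal repository manager (proxy).
# Assumes pip is available; the internal index URL is hypothetical.
import subprocess
import sys

INTERNAL_INDEX = "https://repo.internal.example.com/api/pypi/pypi-proxy/simple"

subprocess.run(
    [sys.executable, "-m", "pip", "install",
     "--index-url", INTERNAL_INDEX,   # bypass the public PyPI index entirely
     "pandas", "scikit-learn"],
    check=True,
)
```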

It might be worth considering running vulnerability scans on the instances at run-time/in real time. For short-lived instances, however, this might produce output that is looked at and addressed too late: the vulnerability might no longer be relevant, or even exist, because the instance is no longer running.

Access to the CIs should be limited. Since we would also like to avoid everyone being able to talk to the CIs, it is advised to make use of a jump host/proxy/bastion/… to access the CIs.

There are also known vulnerabilities specific to machine learning, such as poisoning the training data or crafting customized input to force (wrong) outcomes[11]. With these vulnerabilities, it is possible to manipulate and influence a model or its outcome.[12] When a model needs to be exposed to a wider audience, it is important to test and train the ML models against these kinds of attacks. If the model is in a contained environment (limited users, controlled datasets, …), the likelihood of these vulnerabilities being exploited goes down significantly. There are tools[13][14] available to test models against these kinds of vulnerabilities.
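
To make the crafted-input attack concrete, the sketch below implements a tiny fast-gradient-style perturbation against a logistic regression model. This is a toy illustration of the idea, not one of the referenced testing tools, and the data and epsilon value are made up.

```python
# Toy illustration of an evasion attack: nudge an input along the gradient of the
# loss so a logistic regression model flips its prediction (FGSM-style).
# Assumes scikit-learn and numpy; data, sample point, and epsilon are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.5, (50, 2)), rng.normal(1, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
clf = LogisticRegression().fit(X, y)

x = np.array([-0.3, -0.3])           # a point the model classifies as class 0
w = clf.coef_[0]                     # loss gradient w.r.t. x is proportional to w
eps = 0.5
x_adv = x + eps * np.sign(w)         # small crafted perturbation

print("original prediction:", clf.predict([x])[0])
print("adversarial prediction:", clf.predict([x_adv])[0])  # typically flips to 1
```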

4.4 Presentation

The presentation or visualization phase is where the results are presented to the end user/consumer. Since the output might be confidential by nature, we need to ensure that the results are only viewable on a need-to-know basis. It is important to ensure proper end-user access control to the presentation layer.

5. Bringing it all together

The picture below provides a high-level overview of the different phases and security controls in an advanced analytics environment.

The 4 phases with security controls

From an end-user perspective we typically see three different profiles/roles to be involved.

- Data Scientist/Engineer

- Security

- Consumer

The picture below provides a high-level overview of the interactions you would expect from the different profiles/roles.

The phases with security controls and actors

6. Conclusion

When looking at Advanced Analytics environments from a security point of view, you will notice a lot of challenges. Most likely the most important concerns will be around confidentiality, especially if you want or need to do analysis on confidential data. To preserve confidentiality, you will need to look at the different phases of advanced analytics and build a defense-in-depth model, ensuring that only necessary data flows are authorized and that only people who need access to the data have it.

To do this, it is extremely important to understand the requirements of your business consumers and your data scientists or engineers. Unfortunately, there is no one-size-fits-all solution, but I hope this document has provided some insight into how you could tackle or think about some of these concerns from a security perspective.

As the field of Advanced Analytics Environments grows and matures, so will security and the security requirements of the platform. Over the last year I have already seen big improvements from products and vendors, and I am convinced this will continue in the years to come. Unfortunately, at this moment (anno 2022) we will sometimes have to look for non-conventional solutions to address one or more security risks in these platforms.


Jan D'Herdt

An information security professional and enthusiast with a special interest in how organizations manage information security during these interesting times