Security-aware MLOps

Vechtomova Maria
Published in Marvelous MLOps
5 min read · Dec 3, 2023

Machine learning engineers and data scientists are not always aware of security best practices, which creates real risks. MLSecOps has emerged as a separate discipline covering five specific areas:

  • Supply Chain Vulnerability addresses the risks associated with the dependencies and components of ML systems, such as third-party dependencies, data sources, and infrastructure.
  • Model Provenance refers to the traceability and reproducibility of ML systems, such as the ability to identify which code, data, model artifacts, infrastructure, and environment were used for a given model deployment.
  • Governance, Risk, and Compliance focuses on establishing procedures and controls to ensure that ML systems comply with regulations and internal security standards.
  • Trusted AI refers to the transparency and explainability of ML systems.
  • Adversarial Machine Learning focuses on defending ML systems against adversarial attacks, such as input perturbations or poisoning attacks.

This article focuses on the risks posed by third-party dependencies, more specifically: Python packages, Docker images, and external GitHub Actions.

Python security threats: pip install malware

It might not be obvious to everyone, but PyPI also hosts malicious software. PyPI is an open package registry, and even though its security team does great work taking down malicious packages, it can’t catch every problem. If you spot a security issue, you can report it here: https://pypi.org/security/.

When it comes to security, every detail matters. PyPI contains malicious packages that rely on human errors:

  • Misspelling: the library requests is legitimate; rrequests and requesys were malware.
  • Versioning confusion: requests is legitimate; requests3 was malware.
  • Naming confusion: python-dotenv is legitimate; dotenv-python was malware.

These examples are all forms of typosquatting, which can turn small mistakes into big consequences. This is not a Python-specific problem, though: attackers also register domains that look almost identical to legitimate ones (for example, there are stories about lookalikes of twitter[.]com, and goggle[.]com mimicking google[.]com).
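The typosquatting pattern above can be flagged automatically by comparing a candidate package name against a list of well-known names. Below is a minimal sketch using string similarity from the standard library; the allowlist of popular packages is illustrative, and a real tool would compare against PyPI download statistics instead.

```python
from difflib import SequenceMatcher

# Illustrative allowlist of well-known package names; a real checker
# would use a much larger list derived from PyPI download statistics.
POPULAR_PACKAGES = {"requests", "python-dotenv", "numpy", "pandas"}

def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two package names."""
    return SequenceMatcher(None, a, b).ratio()

def flag_typosquats(name: str, threshold: float = 0.8) -> list[str]:
    """Return popular packages that `name` suspiciously resembles.

    An exact match is fine; a near-match (e.g. 'rrequests') is suspicious.
    """
    if name in POPULAR_PACKAGES:
        return []
    return [p for p in POPULAR_PACKAGES if similarity(name, p) >= threshold]

print(flag_typosquats("rrequests"))  # ['requests'] -- one letter off
print(flag_typosquats("requests"))   # [] -- exact match, nothing to flag
```

Note that this catches misspellings like rrequests or requesys but not naming confusion like dotenv-python, where the words are reordered rather than mistyped; that case needs token-level comparison.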

PyPI also contains packages that look legitimate but pull in malicious dependencies. Some packages become malicious over time; this may happen when a package gains new maintainers or when a package developer’s account is compromised. You can read the story of the compromised fastapi-toolkit package on the Datadog Security Labs blog.

Python malware typically works by compromising __init__.py files. These files look normal but contain lines that download malware; the malicious lines are often over 1000 characters long, so the code stays out of view when you casually inspect the __init__.py file.
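That overlong-line trait can be turned into a simple heuristic scan. The sketch below walks a directory tree and reports suspiciously long lines in any __init__.py it finds; the 1000-character threshold comes from the observation above and is only a heuristic, not a guarantee of detection.

```python
from pathlib import Path

SUSPICIOUS_LINE_LENGTH = 1000  # heuristic threshold, per the text above

def suspicious_lines(source: str, limit: int = SUSPICIOUS_LINE_LENGTH):
    """Yield (line_number, length) for lines long enough to hide code
    off-screen -- a common trait of trojanized __init__.py files."""
    for lineno, line in enumerate(source.splitlines(), start=1):
        if len(line) > limit:
            yield lineno, len(line)

def scan_package(path: str) -> dict[str, list[tuple[int, int]]]:
    """Scan every __init__.py under `path` and report suspicious lines."""
    report = {}
    for init in Path(path).rglob("__init__.py"):
        hits = list(suspicious_lines(init.read_text(errors="ignore")))
        if hits:
            report[str(init)] = hits
    return report

# A line padded far beyond screen width is flagged with its position:
demo = "import os\n" + " " * 1500 + "evil()"
print(list(suspicious_lines(demo)))  # [(2, 1506)]
```

A scan like this is a complement to, not a substitute for, the protections listed below.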

How to protect yourself?

  • It is always a good idea to review the source code, especially when a package is not widely used.
  • Avoid typing package names into the shell from memory (and never do it with sudo!).
  • Use automated scanning tools like Dependabot.

Check out the talk by Max Kahan for more details!

Docker security threats: docker pull malware

Docker Hub hosts a significant number of malicious Docker images, not to mention the many images with known security vulnerabilities.

Typosquatting, just as with Python packages, is the most common way of distributing malware through Docker images. The images masquerade as popular open-source software to trick users into downloading and deploying them (for example, vibersastra/ubuntu or vibersastra/golang).

Also, if you are a data scientist or machine learning engineer trying out LLMs and you have found an unofficial Docker image that looks helpful, be cautious.

The majority of malicious Docker images contain crypto-mining software. Once the docker run command is executed, the malware mines cryptocurrency and submits the results to the attacker’s wallet. Under certain conditions, the process can affect other containers and even the host.

How to protect yourself?

  • Only use official images, and update versions regularly so that the base image contains the latest security patches.
  • Inventory the image (files, packages, modules, libraries, licenses).
  • Always perform vulnerability scans for OS packages, libraries, and modules.
  • Never run as root; always run as an unprivileged user.
  • Block all egress by default; if needed, allow only specific target hosts.
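One concrete way to strengthen the first point: reference images by their immutable content digest (name@sha256:…) rather than a mutable tag like :latest, so the image you run cannot be silently replaced. A minimal sketch of a checker for this convention (the validation logic here is illustrative, not an official Docker API):

```python
import re

# An image pinned by digest (name@sha256:<64 hex chars>) always resolves
# to the exact same content; a mutable tag like ':latest' can change.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")

def is_pinned_by_digest(image_ref: str) -> bool:
    """Return True if a Docker image reference is pinned to a content digest."""
    return bool(DIGEST_RE.search(image_ref))

print(is_pinned_by_digest("python:3.12-slim"))            # False: mutable tag
print(is_pinned_by_digest("python@sha256:" + "a" * 64))   # True: immutable digest
```

A check like this can run in CI over your Dockerfiles and compose files to catch unpinned base images before they reach production.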

Read more about malicious docker images in the analysis of supply chain attacks through public docker images and an article about cryptojacking by Palo Alto Networks.

GitHub Actions security threats: third-party actions

One non-obvious security threat comes from GitHub workflows that use GitHub Actions developed by someone else. An action you depend on may be infected with malware and used to steal your secrets.


The repository where an action is defined may be taken over by attackers via repojacking, or by compromising a maintainer’s access token or credentials. Attackers can then inject malware and overwrite the tags and branches of the repository containing the action.

If a dependent repository references the action as {owner}/{repo}@{tag} or {owner}/{repo}@{branch}, the malware will be pulled into the GitHub Actions workflow.

When a job starts, the GitHub Actions runner receives all the secrets the job uses; a malicious action can therefore dump the runner’s memory and send it to the attacker, revealing every secret defined in the job.

Check out the GitHub Actions worm dependencies blog by Palo Alto Networks for more details on how this attack works.

How to protect yourself?

  • While tags and branches on GitHub can be overwritten, commit hashes cannot. Pin actions to a full commit hash (not a branch or tag!) to minimize the risk of using a maliciously modified action.
  • Set GITHUB_TOKEN and PAT permissions to the minimum required.
  • Configure branch and tag protection.
  • Monitor and limit outbound network connections from workflow runners to prevent the download of malicious code into pipelines and prevent malware from reporting to C2 servers.
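The first point (pinning to commit hashes) is easy to enforce mechanically. Below is a sketch of a linter that scans workflow YAML for uses: references not pinned to a full 40-character commit SHA; the regex-based parsing is a simplification for illustration, and the SHA in the example workflow is a placeholder, not a specific real commit.

```python
import re

# Matches lines like '- uses: owner/repo@ref' in a workflow file.
USES_RE = re.compile(r"^\s*-?\s*uses:\s*([\w.-]+/[\w./-]+)@(\S+)", re.MULTILINE)
# A full commit hash is exactly 40 lowercase hex characters.
SHA_RE = re.compile(r"^[0-9a-f]{40}$")

def unpinned_actions(workflow_yaml: str) -> list[str]:
    """Return action references that use a mutable tag or branch
    instead of an immutable full-length commit SHA."""
    flagged = []
    for match in USES_RE.finditer(workflow_yaml):
        action, ref = match.groups()
        if not SHA_RE.match(ref):
            flagged.append(f"{action}@{ref}")
    return flagged

workflow = """
jobs:
  build:
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b
"""
print(unpinned_actions(workflow))  # ['actions/checkout@v4']
```

Running a check like this in CI makes it hard for an unpinned third-party action to slip into your workflows unnoticed.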

Conclusions

The number of machine learning projects will only grow, and everyone involved in their development must understand the security implications of using third-party tools in machine learning systems.

Hopefully, this blog post helps increase that awareness in the machine learning community.
