MLOps in 10 Steps: Building and Deploying an App on GCP to Classify Ransomware Actors’ Domain Generation Algorithms

Rob Chavez
Institute for Applied Computational Science
Dec 17, 2023

This article was produced as part of the final project for Harvard’s AC215 Fall 2023 course.

Project Github Repo — https://github.com/rob-chavez/ac2152023_cybersafe

According to the US Department of Homeland Security’s 2024 Homeland Threat Assessment, “Ransomware attackers extorted at least $449.1 million globally during the first half of 2023 and are expected to have their second most profitable year [ever].”

Ransomware is illegal, and those who practice it care little for the people they harm, but it is a lucrative business. As such, companies must be prepared to face malicious actors who seek to hold their data hostage with encryption in exchange for money.

One way ransomware actors stay connected to computer hosts they have infected with malware is through the use of Domain Generation Algorithms, or DGAs, which will be the focus of this article. Here is an example of such an algorithm, taken from Wikipedia, that highlights the DGA of the ransomware group CryptoLocker:
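DGAs like this one are typically seeded on the current date, so the malware and its operator can independently compute the same candidate domains. As a simplified, illustrative sketch (not the actual CryptoLocker code), a date-seeded DGA might look like this:

# Illustrative sketch only -- NOT the actual CryptoLocker algorithm.
# A date-seeded DGA: malware and operator can both derive today's candidates.
import hashlib
from datetime import date

def generate_domains(seed_date: date, count: int = 10) -> list:
    domains = []
    for i in range(count):
        seed = f"{seed_date.isoformat()}-{i}".encode()
        digest = hashlib.md5(seed).hexdigest()
        # Map hex digits to lowercase letters to build the domain label
        label = "".join(chr(ord("a") + int(c, 16) % 26) for c in digest[:12])
        domains.append(label + ".com")
    return domains

# The operator registers a few of today's candidates; the malware tries them all.
print(generate_domains(date(2023, 12, 17), count=3))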

Let’s go a bit deeper. Pretend for a moment that you get an email that tricks you into opening a malicious Microsoft Word document that delivers various strains of malware to your computer. A hacker can now use a command and control (C&C) server to communicate with the malware on your computer, gaining unfettered access to your system and possibly the entire network it belongs to. It’s game over once a ransomware actor gets access to the data they want to encrypt and hold hostage for money.

One way to prevent such an event is to blacklist the domain names of the C&C servers that malware calls back to for commands. Shut down that communication and the hacker no longer controls your system. This works if the domain of the C&C server is hard-coded in the malware, but hackers are too smart for that: enter DGAs, algorithms that can generate hundreds of new, random-looking domains that malware can reach out to during attacks, making it much harder for victims to block them. The image below captures this idea.

Image source: https://hackersterminal.com/domain-generation-algorithm-dga-in-malware/

Now that we know a little about DGAs, let’s go over the goal of this Medium post: to capture, in 10 steps, how to use MLOps to build, deploy, and serve a BERT-based model on Google Cloud Platform (GCP) using PyTorch to classify ransomware actors’ use of DGAs. The steps incorporate machine learning operations (MLOps) practices that would allow someone to easily swap the BERT-based model for another model if it performed better. To help guide us, let’s take a look at the service and technical architecture for this project:

STEP 1: SET UP YOUR GIT REPO

Each of the steps below, with the exceptions of STEP 4 (EDA on Google Colab) and STEP 5 (setting up an account with Weights & Biases), should be containerized and pushed to GitHub, which allows anyone on a project team to pull and update the code as needed.

STEP 2: GET THE DATA

The data for this project is made up of 31 million domains representing 28 unique DGAs produced by known hacker groups, plus 1 million “legit” or benign domains. The data can be found via the following sources: (1) https://data.mendeley.com/datasets/y8ph45msv8/1 and (2) https://majestic.com/reports/majestic-million. At approximately 650 MB in size, the data was saved to a private Google Cloud bucket using a containerized extraction script that can be found here.
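A minimal sketch of such an extraction step is shown below, assuming the requests and google-cloud-storage libraries; the source URL, bucket name, and blob path are placeholders, and the project’s actual extraction script is in the repo linked above.

# Minimal sketch of an extraction step: download a source file and save it to
# a private GCS bucket. URL, bucket, and blob path below are placeholders.
import requests
from google.cloud import storage

SOURCE_URL = "https://majestic.com/reports/majestic-million"  # placeholder source
BUCKET_NAME = "my-dga-raw-data"                               # placeholder bucket
DEST_BLOB = "raw/majestic_million.csv"                        # placeholder path

def extract_to_gcs() -> None:
    response = requests.get(SOURCE_URL, timeout=60)
    response.raise_for_status()

    client = storage.Client()
    bucket = client.bucket(BUCKET_NAME)
    blob = bucket.blob(DEST_BLOB)
    blob.upload_from_string(response.content)
    print(f"Uploaded {len(response.content)} bytes to gs://{BUCKET_NAME}/{DEST_BLOB}")

if __name__ == "__main__":
    extract_to_gcs()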

STEP 3: FORMAT AND PREPROCESS THE DATA

With data in hand, the next step is to create a transformation pipeline that formats the data into parquet files with custom labeling and versioning. This sounds more complicated than it is: it boils down to labeling the data and storing it by the date it came in. It’s worth exploring different techniques for data processing; experiments using the Python library Dask for parallel processing, for example, resulted in slower processing times than simply using the standard-library module concurrent.futures. Here is an example of what the data looks like once it’s been formatted:
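To make the labeling-and-versioning idea concrete, here is a minimal sketch of a formatting step using pandas; the paths, label, and output layout are illustrative placeholders rather than the project’s actual pipeline.

# Minimal sketch of the format step: label a raw file and write it out as a
# parquet file versioned by ingest date. Paths and label are placeholders.
from datetime import date
from pathlib import Path
import pandas as pd

def format_raw_file(raw_csv: str, label: str, out_dir: str = "formatted") -> Path:
    df = pd.read_csv(raw_csv, names=["domain"])
    df["label"] = label                  # e.g. "cryptolocker" or "legit"
    version = date.today().isoformat()   # version the output by ingest date
    out_path = Path(out_dir) / version / f"{label}.parquet"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(out_path, index=False)
    return out_path

# Example: format_raw_file("raw/cryptolocker.csv", label="cryptolocker")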

STEP 4: EXPLORATORY DATA ANALYSIS AND MODEL EXPERIMENTATION

Google Colab provides cloud-based Jupyter notebooks, free access to GPUs, and is overall a great environment for conducting EDA and model experimentation. Conduct your own EDA or take a look at what was done here to provide a foundation that can be built on.
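As a starting point, a few quick checks on the formatted data might look like this; the parquet path is a placeholder.

# Quick EDA checks on a formatted parquet file (path is a placeholder).
import pandas as pd

df = pd.read_parquet("formatted/2023-12-17/all_domains.parquet")

print(df["label"].value_counts())         # class balance across the families
print(df["domain"].str.len().describe())  # overall domain length distribution

# Average domain length per family, a simple signal that separates many DGAs
df["length"] = df["domain"].str.len()
print(df.groupby("label")["length"].mean().sort_values())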

STEP 5: EXPERIMENT TRACKING

Keeping track of all your experiments in STEP 4 and STEP 6 can get out of hand once you start searching for the best models and hyperparameters. Luckily, companies such as Weights & Biases provide tools that let you track the various combinations of parameters used for model training. For this step, it’s recommended that you create an account with a service like Weights & Biases, which will be used in the next step.
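Once you have an account, logging an experiment takes only a few lines. Here is a minimal sketch using the wandb Python client; the project name, hyperparameters, and metric values are placeholders.

# Minimal sketch of tracking a training run with Weights & Biases.
# Project name, hyperparameters, and metric values are placeholders.
import wandb

run = wandb.init(
    project="dga-classification",  # placeholder project name
    config={"lr": 2e-5, "batch_size": 32, "epochs": 3, "model": "bert-base-uncased"},
)

for epoch in range(run.config["epochs"]):
    # In a real training loop these would come from the model and validation set.
    train_loss, val_accuracy = 0.0, 0.0
    wandb.log({"epoch": epoch, "train_loss": train_loss, "val_accuracy": val_accuracy})

run.finish()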

STEP 6: CONNECT SERVERLESS TRAINING TO MODEL DATA PIPELINE

GCP’s Vertex AI enables developers to leverage serverless training, which eliminates the need to manage and provision complex infrastructure. This also enhances collaboration and accelerates the development lifecycle, especially when combined with experiment tracking (see STEP 5 above). As shown here, this effort can be combined with the data pipeline steps noted in STEPS 2 and 3 to train models with the most recent versions of data. The image below shows the ML pipeline for extracting and formatting data as well as training various models on GCP’s Vertex AI.

The images below show the models that were trained with serverless training and some of the experiments performed using Weights & Biases. In this case, the best performer was a fine-tuned BERT-base-uncased model, and its model artifact was saved to a GCP bucket.
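To make the serverless training step concrete, here is a minimal sketch of submitting a containerized training job with the google-cloud-aiplatform SDK; the project ID, staging bucket, image URI, and machine settings are placeholders, not the project’s actual configuration.

# Minimal sketch of a serverless training job on Vertex AI.
# Project, bucket, image URI, and machine settings are placeholders.
from google.cloud import aiplatform

aiplatform.init(
    project="my-gcp-project",                     # placeholder project ID
    location="us-central1",
    staging_bucket="gs://my-dga-staging-bucket",  # placeholder bucket
)

job = aiplatform.CustomContainerTrainingJob(
    display_name="bert-dga-training",
    container_uri="gcr.io/my-gcp-project/dga-trainer:latest",  # placeholder image
)

job.run(
    replica_count=1,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
)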

STEP 7: OPTIMIZE MODEL

Reducing the size of a model through distillation, pruning, and/or quantization can significantly enhance its efficiency and deployment feasibility, e.g., the ability to serve on a CPU instead of a GPU. These strategies reduce the model’s computational demands, leading to faster inference times and lower memory requirements, often with little hit to performance. As shown here, PyTorch provides a simple way to quantize a model so that it is roughly three times smaller with little impact on accuracy.

PyTorch code to Quantize a Fine-Tuned BERT-based Model
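As a rough sketch of that step, assuming PyTorch’s dynamic quantization API and the fine-tuned classifier saved earlier as bert_dga_classifier.pt, quantizing the model’s linear layers might look like this:

# Rough sketch: dynamically quantize the fine-tuned classifier's linear layers
# to int8, then compare on-disk sizes of the two checkpoints.
import os
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=31)
model.load_state_dict(torch.load("bert_dga_classifier.pt", map_location="cpu"))
model.eval()

quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

torch.save(quantized_model.state_dict(), "bert_dga_classifier_quantized.pt")
print("original :", os.path.getsize("bert_dga_classifier.pt") / 1e6, "MB")
print("quantized:", os.path.getsize("bert_dga_classifier_quantized.pt") / 1e6, "MB")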

STEP 8: DEPLOYMENT

Deploying a PyTorch model to an endpoint on GCP’s Vertex AI took a few tries, so let’s give this some attention.

First, we need to create a MAR file — which is short for a Model Archive file. To do this, you will need to utilize PyTorch’s torch-model-archiver. The torch-model-archiver is specifically designed to package PyTorch models, along with their associated artifacts and dependencies, into a format that can be easily deployed using TorchServe. It allows you to create model archives, which are compressed files containing the model files, code, and configuration necessary for serving the model. Here’s how to install it:

python -m pip install torch-model-archiver

Next, let’s look at an example of how we will use the archiver to create a MAR file:

torch-model-archiver -f --model-name=model --version=1.0 \
--model-file=/home/model-server/model.safetensors \
--serialized-file=/home/model-server/bert_dga_classifier.pt \
--handler=/home/model-server/handler.py \
--extra-files "/home/model-server/config.json" \
--export-path=/home/model-server/model-store

Ok, now let’s go through each line, starting with --model-name, which is simply the name you give the model. For reasons I never quite pinned down, Vertex AI wanted it to be called model.mar, so I named it as such; then again, I might have been misinterpreting the error at the time.

Moving on: --version is self-explanatory, the version you assign to the model.

To get the values for the next command-line arguments, --serialized-file and --extra-files, run the following code, which simply saves the PyTorch model and tokenizer artifacts into folders called “my_model” and “my_tokenizer,” as shown below. This produces a *.bin (or possibly *.safetensors) file that is essentially the dumped state_dict of the trained model weights. The --extra-files argument contains the model’s configuration and the information needed to tokenize the text. The tokenizer files are optional, though, since it’s possible to initialize the tokenizer in the handler.py file, which we will get to next!

#PYTORCH
import torch

#TRANSFORMERS
from transformers import BertTokenizer
from transformers import BertForSequenceClassification

device = torch.device("cpu")
saved_model = "bert_dga_classifier.pt"

# Rebuild the classifier architecture and load the fine-tuned weights on the CPU
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=31)
model.load_state_dict(torch.load(saved_model, map_location=device))
model.to(device)
model.eval()

# Save the tokenizer and model artifacts for packaging
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer.save_pretrained('./my_tokenizer')
model.save_pretrained('./my_model')

For the --handler argument, you provide the name of the Python script responsible for data preprocessing, inference, and post-processing. Here is the script used; it’s not perfect, but it’s easy to find examples on the Internet of how others have coded this script:

from ts.torch_handler.base_handler import BaseHandler
import os
import logging

#PYTORCH
import torch

#TRANSFORMERS
from transformers import BertTokenizer
from transformers import BertForSequenceClassification

logger = logging.getLogger(__name__)

labels = {'symmi': 0, 'legit': 1, 'ranbyus_v1': 2, 'kraken_v1': 3, 'not_dga': 4, 'pushdo': 5,
          'ranbyus_v2': 6, 'zeus-newgoz': 7, 'locky': 8, 'corebot': 9, 'dyre': 10, 'shiotob': 11,
          'proslikefan': 12, 'nymaim': 13, 'ramdo': 14, 'necurs': 15, 'tinba': 16, 'vawtrak_v1': 17,
          'qadars': 18, 'matsnu': 19, 'fobber_v2': 20, 'alureon': 21, 'bedep': 22, 'dircrypt': 23,
          'rovnix': 24, 'sisron': 25, 'cryptolocker': 26, 'fobber_v1': 27, 'chinad': 28,
          'padcrypt': 29, 'simda': 30}

predict_labels = {v: k for k, v in labels.items()}


class MyHandler(BaseHandler):

    def __init__(self):
        super(MyHandler, self).__init__()
        self.initialized = False
        self.context = None
        self.model = None
        self.device = torch.device("cpu")
        self.num_labels = 31
        self.domains = None

    def initialize(self, context):

        # context contains model server system properties
        self.manifest = context.manifest
        properties = context.system_properties
        model_dir = properties.get("model_dir")
        logger.info(f"model_dir={model_dir}")

        # serialized BERT DGA classifier
        serialized_file = self.manifest['model']['serializedFile']
        model_pt_path = os.path.join(model_dir, serialized_file)

        # make sure the serialized model exists
        if not os.path.isfile(model_pt_path):
            raise RuntimeError("Missing the model.pt file")

        # download the base BERT model
        model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=self.num_labels)
        model.to(self.device)

        # load the fine-tuned state dict
        model.load_state_dict(torch.load(model_pt_path, map_location=self.device))
        self.model = model
        self.model.to(self.device)
        self.model.eval()

        self.initialized = True

    def preprocess(self, request):
        """Tokenize the input text using the suitable tokenizer and convert it to a tensor.

        Args:
            request: A list containing a dictionary, might be in the form
                of [{'body': json_file}] or [{'data': json_file}]
        """
        logger.info(f"REQUEST_MADE={request}")

        # unpack the data
        self.domains = [r.get("domain") for r in request]

        # tokenize the texts
        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        tokenized_text = [tokenizer.encode(domain,
                                           truncation=True,
                                           add_special_tokens=True,
                                           max_length=20,
                                           pad_to_max_length=True) for domain in self.domains]
        inputs = torch.LongTensor(tokenized_text)

        return inputs

    def inference(self, inputs):
        # Perform inference with the model
        outputs = self.model(inputs)
        return outputs

    def postprocess(self, results):
        # Implement any necessary postprocessing logic; get predictions
        prediction = torch.argmax(results.logits, dim=1)
        result = [predict_labels[pred.item()] for pred in prediction]
        output = [{d: r} for d, r in zip(self.domains, result)]
        return output

Finally, --export-path is the folder to which the MAR file will be exported.

Now that we are able to create a MAR file using torch-model-archiver, we can deploy it using TorchServe (see the FROM line in the Dockerfile below). In combination with torch-model-archiver, we can run all of this in a Docker container that can be served on Vertex AI. Here is the Dockerfile:

FROM pytorch/torchserve:latest-cpu

# install dependencies
RUN pip3 install transformers

# copy model artifacts, custom handler and other dependencies
COPY ./handler.py /home/model-server/
COPY ./my_model /home/model-server/
COPY ./bert_dga_classifier.pt /home/model-server/


# create torchserve configuration file
USER root
RUN printf "\nservice_envelope=json" >> /home/model-server/config.properties
RUN printf "\ninference_address=http://0.0.0.0:7080" >> /home/model-server/config.properties
RUN printf "\nmanagement_address=http://0.0.0.0:7081" >> /home/model-server/config.properties
USER model-server

# expose health and prediction listener ports from the image
EXPOSE 7080
EXPOSE 7081


# create model archive file packaging model artifacts and dependencies
RUN torch-model-archiver -f --model-name=model --version=1.0 \
--model-file=/home/model-server/model.safetensors \
--serialized-file=/home/model-server/bert_dga_classifier.pt \
--handler=/home/model-server/handler.py \
--extra-files "/home/model-server/config.json" \
--export-path=/home/model-server/model-store


# run Torchserve HTTP serve to respond to prediction requests
CMD ["torchserve", \
"--start", \
"--ts-config=/home/model-server/config.properties", \
"--models", \
"model=model.mar", \
"--model-store", \
"/home/model-server/model-store"]

Finally, here is the shell script used to build and run the container; the resulting image can be uploaded to GCP’s container registry:

#!/bin/bash

set -e

# Build the image based on the Dockerfile
docker build -t dga-deployed -f Dockerfile .

# Run Container
docker run --rm -it -p 7080:7080 -p 7081:7081 --platform=linux/amd64 --name dga-deployed dga-deployed

At this point, GCP makes it pretty easy to upload and deploy the model to an endpoint, providing all the necessary code here and here.
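A rough sketch of that upload-and-deploy step with the google-cloud-aiplatform SDK might look like the following; the display name, project, image URI, and machine type are placeholders, and the predict and health routes assume TorchServe’s defaults.

# Rough sketch: register the TorchServe container image as a Vertex AI model
# and deploy it to an endpoint. Names, project, and image URI are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-gcp-project", location="us-central1")  # placeholders

model = aiplatform.Model.upload(
    display_name="bert-dga-classifier",
    serving_container_image_uri="gcr.io/my-gcp-project/dga-deployed:latest",  # placeholder
    serving_container_predict_route="/predictions/model",  # TorchServe predict route
    serving_container_health_route="/ping",                # TorchServe health route
    serving_container_ports=[7080],
)

endpoint = model.deploy(
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=1,
)
print("Endpoint:", endpoint.resource_name)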

Finally, the following function can be used to make predictions:

from typing import Dict, List, Union

from google.cloud import aiplatform
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value


def predict_custom_trained_model_sample(
    project: str,
    endpoint_id: str,
    instances: Union[Dict, List[Dict]],
    location: str = "us-central1",
    api_endpoint: str = "us-central1-aiplatform.googleapis.com",
):
    """
    `instances` can be either a single instance of type dict or a list
    of instances.
    """
    # The AI Platform services require regional API endpoints.
    client_options = {"api_endpoint": api_endpoint}
    # Initialize client that will be used to create and send requests.
    # This client only needs to be created once, and can be reused for multiple requests.
    client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)
    # The format of each instance should conform to the deployed model's prediction input schema.
    instances = instances if isinstance(instances, list) else [instances]
    instances = [
        json_format.ParseDict(instance_dict, Value()) for instance_dict in instances
    ]
    parameters_dict = {}
    parameters = json_format.ParseDict(parameters_dict, Value())
    endpoint = client.endpoint_path(
        project=project, location=location, endpoint=endpoint_id
    )
    response = client.predict(
        endpoint=endpoint, instances=instances, parameters=parameters
    )
    print("response")
    print(" deployed_model_id:", response.deployed_model_id)
    # The predictions are a google.protobuf.Value representation of the model's predictions.
    predictions = response.predictions
    for prediction in predictions:
        print(" prediction:", dict(prediction))

STEP 9: CREATING A FRONTEND

FastAPI and React make a powerful combination for developing a modern web application. This project uses a FastAPI backend to expose the model endpoint, and a user-friendly React app lets users enter a domain manually or upload a file. The image below captures a version of the frontend, and a sketch of the backend route follows.
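Here is a rough sketch of such a backend route, reusing the prediction helper shown earlier; the project and endpoint IDs are again placeholders.

# Minimal sketch of a FastAPI backend that forwards a submitted domain to the
# Vertex AI endpoint via the prediction helper defined earlier.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class DomainRequest(BaseModel):
    domain: str

@app.post("/predict")
def classify_domain(req: DomainRequest):
    # predict_custom_trained_model_sample prints its results; a production
    # backend would parse and return the predictions to the React frontend.
    predict_custom_trained_model_sample(
        project="1234567890",      # placeholder project number
        endpoint_id="9876543210",  # placeholder endpoint ID
        instances={"domain": req.domain},
    )
    return {"domain": req.domain, "status": "submitted"}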

STEP 10: SCALING WITH KUBERNETES AND GITHUB ACTIONS (CI/CD)

Finally, we note the utility of deploying containers on Kubernetes and using continuous integration/continuous deployment (CI/CD) as part of MLOps. Kubernetes helps you organize, deploy, and manage containers effortlessly. When more people start using your app, Kubernetes automatically adds more containers to handle the load, and when things slow down, it scales down to save resources. It’s like having a team of workers that can adjust their numbers based on the workload.

Implementing CI/CD to thoroughly test, execute, monitor, deploy, and scale these components is just as important. These processes are automated with GitHub Actions workflows, which allow deployments and other pipelines to be triggered by GitHub events. The YAML files are located under .github/workflows, centralizing these operations and documenting the deployment strategies used in the repository, ensuring consistent, efficient, and dependable automated deployments.

CONCLUSION

We hope this Medium article serves as a solid introduction to MLOps for our readers. If you are interested in continuing your learning journey, we recommend the following article, which was useful to us as we embarked on this project.
