Automating the creation of your own RAG System with Terraform, AWS Kendra and AWS Sagemaker (Part Two)

Andre Botta
3 min read · Jul 6, 2024


This is Part Two of a three-part series in which I walk through how to use AWS Kendra and AWS SageMaker to deploy a RAG system. You can find Part One here.

In this article, I will cover how we can use AWS SageMaker to deploy a Llama 3 model from Hugging Face.

AWS SageMaker is a cloud-based machine learning platform that enables developers to create, train, and deploy machine learning models in the cloud. It also supports the deployment of ML models on embedded systems and edge devices [source].

Essentially, we will leverage SageMaker's capabilities to easily deploy a production-ready Large Language Model.

AWS SageMaker Setup

Let's start with the variables necessary for the Terraform scripts:

variables.tf

Again, we will have a variable to define the AWS region that our infrastructure will be deployed to. This time we also define hf_api_token, which is an Access Token from Hugging Face. If you want to follow along, the account associated with the hf_api_token also needs to have access to the Llama 3 Instruct model.

variable "aws_region" {
type = string
}

variable "hf_api_token" {
type = string
}

provider.tf

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# Deploy all resources to the region supplied through var.aws_region.
provider "aws" {
  region = var.aws_region
}

backend.tf

terraform {
  backend "http" {
  }
}

main.tf

resource "aws_sagemaker_model" "hf_model" {
name = "hfmodel"
execution_role_arn = aws_iam_role.sagemaker_iam_role.arn

primary_container {
image = "763104351884.dkr.ecr.ap-southeast-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.1-tgi2.0.0-gpu-py310-cu121-ubuntu22.04-v2.0"
environment = {
HF_TASK = "text-generation"
HF_MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
HF_API_TOKEN = var.hf_api_token
}
}
}

resource "aws_iam_role" "sagemaker_iam_role" {
name = "sagemaker_iam_role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "sagemaker.amazonaws.com"
}
},
]
})

inline_policy {
name = "terraform-inferences-policy"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow",
Action = [
"cloudwatch:PutMetricData",
"logs:CreateLogStream",
"logs:PutLogEvents",
"logs:CreateLogGroup",
"logs:DescribeLogStreams",
"s3:GetObject",
"s3:PutObject",
"s3:ListBucket",
"ecr:GetAuthorizationToken",
"ecr:BatchCheckLayerAvailability",
"ecr:GetDownloadUrlForLayer",
"ecr:BatchGetImage"
],
Resource = "*"
}
]
})
}
}

In the aws_sagemaker_model's primary_container configuration we define the image that will be used to serve our model. Here, we are using Hugging Face's Text Generation Inference (TGI) toolkit, which lets us deploy production-ready Large Language Models. We also define our model ID, the task (or pipeline) and our personal access token.
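Note that the image URI above points to the TGI Deep Learning Container registry for the ap-southeast-2 region; the registry account and region segment change if you deploy elsewhere. One way to look up the correct URI is the sagemaker Python SDK's get_huggingface_llm_image_uri helper. A minimal sketch, assuming the SDK is installed and using an illustrative TGI version:

# Sketch: look up the region-specific Hugging Face TGI container image URI.
# Assumes the `sagemaker` Python SDK is installed; the version string is illustrative.
from sagemaker.huggingface import get_huggingface_llm_image_uri

image_uri = get_huggingface_llm_image_uri(
    "huggingface",            # backend for the Hugging Face TGI containers
    region="ap-southeast-2",  # change this to your deployment region
    version="2.0.0",          # TGI version to request
)
print(image_uri)  # paste the result into the image field of primary_container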

resource "aws_sagemaker_endpoint_configuration" "huggingface_endpoint_conf" {
name = "hf-model-endpoint-conf"
production_variants {
variant_name = "AllTraffic"
model_name = aws_sagemaker_model.hf_model.name

initial_instance_count = 1
instance_type = "ml.g5.2xlarge"
}
}

resource "aws_sagemaker_endpoint" "huggingface" {
name = "myendpoint"
endpoint_config_name = aws_sagemaker_endpoint_configuration.huggingface_endpoint_conf.name
}

We then create the endpoint configuration and the endpoint for our model. The endpoint is what will actually host and run the model, and the endpoint configuration lets us define the instance type (here ml.g5.2xlarge, a GPU instance) as well as the number of instances to provision.
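Endpoint creation usually takes several minutes while SageMaker pulls the container and loads the model weights. A minimal sketch for confirming that the endpoint has reached the InService state, assuming boto3 is configured for the same account and region as the Terraform deployment and reusing the endpoint name from the configuration above:

# Sketch: wait until the SageMaker endpoint is InService before sending requests.
# Assumes boto3 credentials and region match the Terraform deployment.
import boto3

sm = boto3.client("sagemaker", region_name="ap-southeast-2")

# The built-in waiter polls DescribeEndpoint until the endpoint is InService (or fails).
sm.get_waiter("endpoint_in_service").wait(EndpointName="myendpoint")
print(sm.describe_endpoint(EndpointName="myendpoint")["EndpointStatus"])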

And that is it for our model's Terraform configuration.

Now, as the last step, we will define the .gitlab-ci.yml file that creates our GitLab pipeline:

image: registry.gitlab.com/gitlab-org/terraform-images/stable:latest

variables:
  STATE_NAME: sagemaker
  TF_ROOT: ${CI_PROJECT_DIR}
  TF_ADDRESS: ${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/terraform/state/${STATE_NAME}_state
  TF_VAR_aws_region: ${AWS_DEFAULT_REGION}
  TF_VAR_hf_api_token: ${HF_TOKEN}

cache:
  key: ${STATE_NAME}_state
  paths:
    - ${TF_ROOT}/.terraform

before_script:
  - cd ${TF_ROOT}

stages:
  - prepare
  - validate
  - build
  - deploy

init:
  stage: prepare
  script:
    - gitlab-terraform init

validate:
  stage: validate
  script:
    - gitlab-terraform validate

plan:
  stage: build
  script:
    - gitlab-terraform plan
    - gitlab-terraform plan-json
  artifacts:
    name: plan
    paths:
      - ${TF_ROOT}/plan.cache
    reports:
      terraform: ${TF_ROOT}/plan.json

apply:
  stage: deploy
  script:
    - gitlab-terraform apply
  dependencies:
    - plan
  when: manual

destroy:
  stage: deploy
  script:
    - gitlab-terraform destroy
  dependencies:
    - plan
  when: manual

There is not much to say here, since this pipeline is even simpler than the one from Part One. We simply validate and plan the Terraform scripts and then manually trigger apply to create the resources in AWS (or destroy to tear them down).

Once the apply job completes, your model should be ready to serve inferences.
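As a quick sanity check, you can invoke the endpoint directly with boto3. This is a minimal sketch, reusing the endpoint name from the configuration above; the prompt and generation parameters are only illustrative and follow TGI's JSON input format:

# Sketch: send a test prompt to the deployed TGI endpoint via the SageMaker runtime.
# Assumes boto3 credentials and region match the Terraform deployment.
import boto3
import json

runtime = boto3.client("sagemaker-runtime", region_name="ap-southeast-2")

payload = {
    "inputs": "What is Retrieval Augmented Generation?",
    "parameters": {"max_new_tokens": 256, "temperature": 0.6, "top_p": 0.9},
}

response = runtime.invoke_endpoint(
    EndpointName="myendpoint",
    ContentType="application/json",
    Body=json.dumps(payload),
)

# TGI returns a JSON list containing the generated text.
print(json.loads(response["Body"].read())[0]["generated_text"])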

We are almost there now, just one more step to go. In Part Three of this series, I will show you how to deploy an AWS Lambda function that uses LangChain to combine both services, the Kendra index and the LLM endpoint, so you can start querying your internal documentation with ease.

Resources: There are two particular resources that really helped me when setting up this part of the project, so a big thanks to Philipp Schmid:

https://github.com/philschmid/terraform-aws-sagemaker-huggingface/blob/v0.9.0/main.tf

https://www.philschmid.de/sagemaker-llama-llm
