Automating the creation of your own RAG System with Terraform, AWS Kendra and AWS SageMaker (Part One)

Andre Botta
5 min read · Jul 4, 2024


Like most companies, yours very likely has lots of internal documentation. With the advent of Generative AI, it has never been easier to search those internal documents. With RAG (Retrieval-Augmented Generation), we can build entire chatbots on top of state-of-the-art Large Language Models like Llama 3. And using services like AWS Kendra and AWS SageMaker, we can easily deploy these solutions to the cloud.

Now, while we could do all of these steps manually, I am a big fan of defining all of your infrastructure as code. Using tools like Terraform, we can easily create, modify, and destroy our infrastructure as necessary. I will also be using GitLab and its Managed Terraform State feature to integrate my code repository with my Terraform state. I will not go into much detail about these steps, since I already covered similar ground in one of my previous articles: Using Gitlab to manage Multi Environment Terraform State

I have split the setup for this series into three parts:

  • Kendra
  • SageMaker
  • Lambda

So let’s start with Part One.

AWS Kendra Setup

From AWS:

Kendra is an intelligent enterprise search service that helps you search across different content repositories with built-in connectors

So AWS Kendra will be responsible for indexing our internal documents.

Let’s start the Terraform setup.

backend.tf

terraform {
  backend "http" {
  }
}

The backend configuration will be supplied later by the GitLab pipeline.
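
For reference, the gitlab-terraform wrapper used later in the pipeline essentially runs an init along these lines. Here is a minimal sketch for running the same init locally, assuming TF_ADDRESS points at the GitLab Terraform state endpoint we define in the pipeline, and that TF_USERNAME/TF_PASSWORD hold your GitLab username and a personal access token (both assumptions for a local run):

# Configure the "http" backend against GitLab's managed Terraform state
terraform init \
  -backend-config="address=${TF_ADDRESS}" \
  -backend-config="lock_address=${TF_ADDRESS}/lock" \
  -backend-config="unlock_address=${TF_ADDRESS}/lock" \
  -backend-config="username=${TF_USERNAME}" \
  -backend-config="password=${TF_PASSWORD}" \
  -backend-config="lock_method=POST" \
  -backend-config="unlock_method=DELETE" \
  -backend-config="retry_wait_min=5"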

provider.tf

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.56.1"
    }
    awscc = {
      source  = "hashicorp/awscc"
      version = "~> 1.4.0"
    }
  }
}

variables.tf

variable "files" {
type = list(string)
default = []
}

variable "aws_region" {
type = string
}

Here we define two variables: the AWS region and a list called files that will be used later to automatically upload data to S3 (this approach serves only as an example, since AWS Kendra supports many other data source types).
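
Because files is a list(string), any value supplied from outside Terraform must be a valid list expression. For a quick local test you could pass it on the command line (the file paths here are purely illustrative):

terraform plan -var 'files=["./resources/handbook.pdf","./resources/faq.txt"]'

In the pipeline we will instead populate it through the TF_VAR_files environment variable, which Terraform parses the same way.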

main.tf

We start by creating an S3 bucket and uploading some files to it. Again, this serves only as an example, since you might have more complex scenarios with different data sources.

resource "aws_s3_bucket" "resources-bucket" {
bucket = "s3-kendra-resources"
}

resource "aws_s3_object" "object" {
for_each = { for f in var.files : f => f }
bucket = aws_s3_bucket.resources-bucket.id
key = basename(each.value)
source = each.value

etag = filemd5(each.value)
}

Next, let's define the AWS Kendra index and its respective IAM role and policy:

resource "awscc_kendra_index" "kendra_index" {
edition = "ENTERPRISE_EDITION"
name = "kendra-index"
role_arn = awscc_iam_role.kendra_iam_role.arn
description = "Kendra index"
}

resource "awscc_iam_role" "kendra_iam_role" {
role_name = "kendra_iam_role"
assume_role_policy_document = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "kendra.amazonaws.com"
}
}
]
})
max_session_duration = 7200
}

resource "awscc_iam_role_policy" "kendra_iam_role_policy" {
policy_name = "kendra_role_policy"
role_name = awscc_iam_role.kendra_iam_role.id

policy_document = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = "cloudwatch:PutMetricData"
Resource = "*"
Condition = {
"StringEquals" : {
"cloudwatch:namespace" : "AWS/Kendra"
}
}
},
{
Effect = "Allow"
Action = "logs:DescribeLogGroups"
Resource = "*"
},
{
Effect = "Allow"
Action = "kendra:BatchDeleteDocument",
Resource = "${awscc_kendra_index.kendra_index.arn}"
},
{
Effect = "Allow"
Action = "logs:CreateLogGroup",
Resource = "arn:aws:logs:${var.aws_region}:${data.aws_caller_identity.current.account_id}:log-group:/aws/kendra/*"
},
{
Effect = "Allow"
Action = [
"logs:DescribeLogStreams",
"logs:CreateLogStream",
"logs:PutLogEvents"
],
Resource = "arn:aws:logs:${var.aws_region}:${data.aws_caller_identity.current.account_id}:log-group:/aws/kendra/*:log-stream:*"
},
{
Effect = "Allow"
Action = [
"s3:GetObject",
"s3:ListBucket"
],
Resource = [
aws_s3_bucket.resources-bucket.arn,
"${aws_s3_bucket.resources-bucket.arn}/*"
]
},
]
})
}

data "aws_caller_identity" "current" {}

Finally, let's create our data source, which, for this example, will be of type S3. We will also generate two outputs that will be used later in our GitLab pipeline to automatically trigger the sync job of the AWS Kendra data source:

resource "awscc_kendra_data_source" "kendra_datasource_s3" {
index_id = awscc_kendra_index.kendra_index.id
name = "kendra-datasource-s3"
role_arn = awscc_iam_role.kendra_iam_role.arn
type = "S3"

data_source_configuration = {
s3_configuration = {
bucket_name = aws_s3_bucket.resources-bucket.bucket
}
}
}


output "kendra_index_id" {
value = awscc_kendra_index.kendra_index.id
}

output "kendra_datasource_s3_id" {
value = awscc_kendra_data_source.kendra_datasource_s3.id
}

So now, to finish Part One, let's define our GitLab pipeline file: .gitlab-ci.yml

image: registry.gitlab.com/gitlab-org/terraform-images/stable:latest

variables:
  STATE_NAME: kendra
  TF_ROOT: ${CI_PROJECT_DIR}
  TF_ADDRESS: ${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/terraform/state/${STATE_NAME}_state
  TF_VAR_aws_region: ${AWS_DEFAULT_REGION}

cache:
  key: ${STATE_NAME}_state
  paths:
    - ${TF_ROOT}/.terraform

before_script:
  - cd ${TF_ROOT}

stages:
  - prepare
  - validate
  - build
  - deploy
  - post-deploy

init:
  stage: prepare
  script:
    - gitlab-terraform init

validate:
  stage: validate
  script:
    - gitlab-terraform validate

plan:
  stage: build
  script:
    - chmod +x ./list_files.sh
    - export TF_VAR_files=$(./list_files.sh)
    - gitlab-terraform plan
    - gitlab-terraform plan-json
  artifacts:
    name: plan
    paths:
      - ${TF_ROOT}/plan.cache
    reports:
      terraform: ${TF_ROOT}/plan.json

apply:
  stage: deploy
  script:
    - gitlab-terraform apply
    - echo "KENDRA_INDEX_ID=$(gitlab-terraform output -raw kendra_index_id)" >> build.env
    - echo "KENDRA_DATASOURCE_S3_ID=$(gitlab-terraform output -raw kendra_datasource_s3_id | cut -d'|' -f1)" >> build.env
  artifacts:
    reports:
      dotenv: build.env
  dependencies:
    - plan
  when: manual

destroy:
  stage: deploy
  script:
    - gitlab-terraform destroy
  dependencies:
    - plan
  when: manual

sync:
  stage: post-deploy
  image: registry.gitlab.com/gitlab-org/cloud-deploy/aws-base:latest
  before_script:
    - aws configure set aws_access_key_id $AWS_ACCESS_KEY_ID
    - aws configure set aws_secret_access_key $AWS_SECRET_ACCESS_KEY
    - aws configure set default.region $AWS_DEFAULT_REGION
  script:
    - aws kendra start-data-source-sync-job --id $KENDRA_DATASOURCE_S3_ID --index-id $KENDRA_INDEX_ID
  needs:
    - job: apply
      artifacts: true
  when: manual

We start by defining a couple of variables, including the state address for our infrastructure, which is built from default environment variables provided by GitLab. We also set the aws_region Terraform variable from the AWS_DEFAULT_REGION environment variable. The init and validate jobs are straightforward, so let's skip them and go to the plan job, where we use a simple script to set TF_VAR_files (assuming your files are under a folder named resources):

#!/bin/sh
# Print every file under ./resources as a JSON array of paths
directory="./resources"
files=$(find "$directory" -type f | jq -R -s -c 'split("\n")[:-1]')
echo "$files"
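
Running it against a resources folder with a couple of files (names made up for illustration) produces a JSON array that Terraform parses as a list(string):

$ ./list_files.sh
["./resources/handbook.pdf","./resources/faq.txt"]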

After the plan job, the apply job runs the apply and exports KENDRA_INDEX_ID and KENDRA_DATASOURCE_S3_ID to any other job that may need them. Note the cut on the data source output: the awscc resource's id is a composite value (the data source id followed by the index id, separated by a pipe), so we keep only the first part. Finally, our sync job uses those two variables to trigger an AWS Kendra data source sync job. Once that job completes, our AWS Kendra index is ready to be used.
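
If you prefer to watch the sync from the command line rather than the AWS console, the standard AWS CLI calls below should do it (a sketch; the query text is just an example):

# List sync jobs for the data source and inspect their Status field
aws kendra list-data-source-sync-jobs \
  --id "$KENDRA_DATASOURCE_S3_ID" \
  --index-id "$KENDRA_INDEX_ID"

# Once a job reports SUCCEEDED, run a test query against the index
aws kendra query \
  --index-id "$KENDRA_INDEX_ID" \
  --query-text "What is our leave policy?"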

In Part Two of this series, we will create and deploy a Llama 3 model from Hugging Face to AWS SageMaker.
