CDK for ML: Infrastructure as Code for AI Teams

The Construct Gap Nobody Filled

AWS CDK constructs come in three levels. L1 is a one-to-one mapping to CloudFormation: every property, no opinions, no help. L2 wraps that with sane defaults and convenience methods, which is where most of the official library lives. L3 is the level where you encode an entire pattern, the kind of thing a team would otherwise rebuild from scratch in every project, into a single construct with a handful of parameters.

For web infrastructure, the L3 ecosystem is rich. There are battle-tested constructs for static sites, serverless APIs, container services. For AI and ML on AWS, the cupboard is close to empty. If you want a Bedrock Knowledge Base, a SageMaker endpoint with autoscaling, or a model-serving proxy, you are wiring L1 resources together by hand, reading the same three documentation pages every time, and rediscovering the same gotchas.

So I started building aws-ai-cdk-constructs, a Python library of opinionated L3 constructs for exactly these patterns. The first one I shipped is BedrockKnowledgeBase, and it turned out to be a perfect example of why this layer is worth building. The construct hides a problem that has nothing to do with Bedrock and everything to do with how CloudFormation thinks about the world.

What a Knowledge Base Actually Requires

On the Bedrock console, creating a Knowledge Base feels like a five-minute task. You pick a data source, choose an embedding model, and click through a wizard. Underneath, the wizard is provisioning a small constellation of resources and ordering them carefully. Reproduce that in CloudFormation and the five-minute task becomes a few hundred lines that have to be exactly right.

Here is the full set of things that have to exist before a single document gets embedded:

An S3 bucket for the source documents.
An OpenSearch Serverless collection of type VECTORSEARCH to hold the vectors.
Three separate OpenSearch Serverless policies: an encryption policy, a network policy, and a data-access policy. Miss any one and the collection either won't create or won't accept writes.
A vector index inside that collection, with a field mapping the Knowledge Base expects.
A scoped IAM service role the Knowledge Base assumes, with permission to invoke the embedding model, read the bucket, and reach the collection.
The Knowledge Base itself, plus a data source that points back at the bucket with a chunking strategy.
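
To make the three policies less abstract, here is a minimal sketch of the JSON documents they carry. The function name `oss_policies`, the choice of a public network endpoint, and the exact permission list are my assumptions for illustration, not a copy of what the construct emits.

```python
import json


def oss_policies(collection_name: str, principal_arns: list[str]) -> dict[str, str]:
    """Build the three OpenSearch Serverless policy documents as JSON strings.

    Hypothetical helper: the names and permission choices are illustrative.
    """
    collection = f"collection/{collection_name}"
    index = f"index/{collection_name}/*"

    # Encryption policy: has to exist before the collection can be created.
    encryption = {
        "Rules": [{"ResourceType": "collection", "Resource": [collection]}],
        "AWSOwnedKey": True,
    }
    # Network policy: a public endpoint is assumed here for simplicity.
    network = [{
        "Rules": [{"ResourceType": "collection", "Resource": [collection]}],
        "AllowFromPublic": True,
    }]
    # Data-access policy: which principals may create and write indexes.
    data_access = [{
        "Rules": [
            {"ResourceType": "collection", "Resource": [collection],
             "Permission": ["aoss:DescribeCollectionItems"]},
            {"ResourceType": "index", "Resource": [index],
             "Permission": ["aoss:CreateIndex", "aoss:DescribeIndex",
                            "aoss:UpdateIndex", "aoss:ReadDocument",
                            "aoss:WriteDocument"]},
        ],
        "Principal": principal_arns,
    }]
    return {
        "encryption": json.dumps(encryption),
        "network": json.dumps(network),
        "data_access": json.dumps(data_access),
    }
```

In a real stack the principals would be at least the Knowledge Base service role and the role of whatever creates the index; the first two documents feed security policies and the third feeds the access policy.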

Most of that is tedious but mechanical. One item on the list is genuinely hard, and it is the vector index.

The Vector Index Problem

When Bedrock creates a Knowledge Base, it validates that the vector index already exists in the collection. If the index isn't there, creation fails. Fine, you think, I'll just declare the index in CloudFormation alongside everything else.

You can't. There is no CloudFormation resource type for an OpenSearch Serverless index. The collection is a control-plane resource that CloudFormation manages, but the index lives on the data plane, behind the collection's own HTTPS endpoint, reachable only with a SigV4-signed request. CloudFormation has no concept of it.

This is the chicken-and-egg that traps most people writing Bedrock infrastructure as code. The Knowledge Base needs the index. The index can't be expressed in the template. The two usual escape hatches are both unsatisfying. You either create the index by hand in the console before every deploy, which defeats the entire point of infrastructure as code, or you write a bespoke Lambda-backed custom resource and reach for a heavy dependency to talk to OpenSearch, which means bundling and usually Docker in your build.
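
One way to see where the index sits is to write the dependencies down and sort them. The sketch below uses the standard library's `graphlib`; the node names and edges are my reading of the requirements listed earlier, so treat them as assumptions rather than the construct's actual dependency graph.

```python
from graphlib import TopologicalSorter

# Each key depends on every resource in its set. These edges are an assumed
# reading of the requirements above, not output from the construct.
DEPENDS_ON = {
    "collection": {"encryption_policy"},
    # The index is the one node CloudFormation cannot create natively,
    # so a custom resource has to fill this slot.
    "vector_index": {"collection", "network_policy", "data_access_policy"},
    "knowledge_base": {"vector_index", "service_role"},
    "data_source": {"knowledge_base", "source_bucket"},
}


def creation_order(graph: dict[str, set[str]]) -> list[str]:
    """Return one valid creation order for the dependency graph."""
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for step, name in enumerate(creation_order(DEPENDS_ON), start=1):
        print(step, name)
```

Whatever order the sorter picks, `vector_index` always lands after the collection and its policies and before `knowledge_base`, which is exactly the slot a plain template cannot fill.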

"The hardest part of a Knowledge Base isn't Bedrock. It's a single index that lives on the wrong side of the CloudFormation boundary."

Solving It Without Docker

The construct creates the index with a custom resource, but the interesting decision is what that custom resource is allowed to depend on. Most examples online install opensearch-py or requests-aws4auth into the Lambda, which forces a bundling step. On Windows, or in a thin CI runner, that means standing up Docker just to package one tiny function. I did not want a library that demands Docker to synthesize.

So the index Lambda uses nothing but boto3 and botocore, both already present in the Lambda runtime. It signs a plain HTTPS request to the collection endpoint with SigV4 and sends the index mapping as raw JSON over urllib. No pip install, no layer, no Docker. The asset is a single handler file.

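import hashlib
import os
import urllib.request

import boto3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest


def _signed(method, url, body):
    """Return a urllib Request for the collection endpoint, signed with SigV4."""
    creds = boto3.Session().get_credentials().get_frozen_credentials()
    region = os.environ["AWS_REGION"]
    payload = body.encode() if body else b""

    # The OpenSearch Serverless signing guide asks for the payload hash
    # in x-amz-content-sha256, so it is set before signing.
    headers = {"Content-Type": "application/json",
               "x-amz-content-sha256": hashlib.sha256(payload).hexdigest()}
    aws_req = AWSRequest(method=method, url=url, data=payload, headers=headers)
    # SigV4Auth signs the host taken from the URL and adds the date,
    # session-token, and Authorization headers.
    SigV4Auth(creds, "aoss", region).add_auth(aws_req)

    req = urllib.request.Request(url, data=payload or None, method=method)
    for key, value in aws_req.headers.items():
        req.add_header(key, value)
    return req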

The mapping it sends is the part that has to match what Bedrock expects: a knn_vector field sized to the embedding model, plus the text and metadata fields, using the same default names the console uses so the construct stays compatible with anything built the wizard way.

{
  "settings": {"index": {"knn": true}},
  "mappings": {"properties": {
    "bedrock-knowledge-base-default-vector": {
      "type": "knn_vector",
      "dimension": 1024,
      "method": {"name": "hnsw", "engine": "faiss", "space_type": "l2"}
    },
    "AMAZON_BEDROCK_TEXT_CHUNK": {"type": "text"},
    "AMAZON_BEDROCK_METADATA": {"type": "text", "index": false}
  }}
}
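
Since the only part of that mapping that changes between embedding models is the vector dimension, the body can be generated rather than hard-coded. A minimal sketch, assuming a hypothetical helper named `index_body` whose default field names are the ones shown above:

```python
import json


def index_body(
    dimension: int,
    vector_field: str = "bedrock-knowledge-base-default-vector",
    text_field: str = "AMAZON_BEDROCK_TEXT_CHUNK",
    metadata_field: str = "AMAZON_BEDROCK_METADATA",
) -> str:
    """Return the index-creation body as a JSON string (hypothetical helper)."""
    if dimension <= 0:
        raise ValueError("dimension must be a positive integer")
    body = {
        "settings": {"index": {"knn": True}},
        "mappings": {"properties": {
            # The dimension has to match the embedding model's output size.
            vector_field: {
                "type": "knn_vector",
                "dimension": dimension,
                "method": {"name": "hnsw", "engine": "faiss",
                           "space_type": "l2"},
            },
            text_field: {"type": "text"},
            # Metadata is stored but not indexed for search.
            metadata_field: {"type": "text", "index": False},
        }},
    }
    return json.dumps(body)
```

Calling `index_body(1024)` reproduces the document above; a model with a different output size only changes the argument.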

The Timing Trap

There is a second, sneakier problem hiding inside the first one. Even after CloudFormation reports the data-access policy and the collection as created, OpenSearch Serverless takes a little while to make them consistent. Fire the index request too early and you get an authorization failure for a collection that, on paper, already exists and already grants you access.

The handler retries with a delay rather than assuming the platform is ready the instant CloudFormation says so. It also treats a resource_already_exists_exception as success, so re-running a deploy is idempotent instead of explosive. Then it waits once more before returning, because the Knowledge Base will validate the index the moment the custom resource completes, and the index needs a beat to become queryable.
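
The retry logic described here can be sketched roughly as follows. This is not the library's handler: `send` is a hypothetical callable standing in for "open the signed request", and the attempt count, delays, and the set of retryable status codes are assumptions chosen for illustration.

```python
import time
import urllib.error
from typing import Callable

ALREADY_EXISTS = "resource_already_exists_exception"
RETRYABLE_STATUS = (401, 403)  # assumed: auth failures while policies settle


def create_index_with_retry(
    send: Callable[[], object],
    attempts: int = 10,
    delay_seconds: float = 15.0,
    settle_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call send() until the index exists, then pause before returning.

    Returns "created" or "exists". Re-raises any error that is neither
    retryable nor an already-exists response.
    """
    for attempt in range(1, attempts + 1):
        try:
            send()
            outcome = "created"
        except urllib.error.HTTPError as err:
            detail = err.read().decode("utf-8", "replace")
            if ALREADY_EXISTS in detail:
                # A previous deploy already made the index: treat as success.
                outcome = "exists"
            elif err.code in RETRYABLE_STATUS and attempt < attempts:
                # The policies have probably not propagated yet; wait and retry.
                sleep(delay_seconds)
                continue
            else:
                raise
        # Give the index a moment to become visible before the caller proceeds.
        sleep(settle_seconds)
        return outcome
    raise ValueError("attempts must be at least 1")
```

Injecting `sleep` keeps the function testable without real waiting; a Lambda handler would leave the default in place and size the delays against its own timeout.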

This is the value an L3 construct captures. None of this reasoning belongs in your application stack. You should declare that you want a Knowledge Base and get one. The hidden ordering, the SigV4 dance, the eventual-consistency retries, all of it should live once, in a construct, tested, and never thought about again.

What the Caller Sees

After all that, the entire surface area for a consumer is one construct and a few keyword arguments. Everything above happens behind it.

from aws_cdk import CfnOutput
from aws_ai_cdk_constructs import BedrockKnowledgeBase

# Inside a Stack's __init__, where `self` is the stack.
kb = BedrockKnowledgeBase(
    self, "DocsKb",
    embedding_model_id="amazon.titan-embed-text-v2:0",
    embedding_dimension=1024,
    chunking_max_tokens=300,
    chunking_overlap_percentage=20,
)

CfnOutput(self, "KnowledgeBaseId", value=kb.knowledge_base_id)
CfnOutput(self, "DataBucket", value=kb.data_bucket.bucket_name)

Omit data_bucket and you get a private encrypted bucket created for you. Pass an existing data_bucket and the construct ingests from it instead. The IAM role is scoped to exactly three things: invoke the one embedding model, read the one bucket, reach the one collection. No wildcard policies, no AdministratorAccess shortcut that quietly ships to production.
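
For a sense of what "scoped to exactly three things" looks like as a policy document, here is a sketch. The helper name `kb_role_policy`, the statement IDs, and the specific actions are my assumptions; the construct's real policy may differ in detail.

```python
def kb_role_policy(model_arn: str, bucket_arn: str, collection_arn: str) -> dict:
    """Build a least-privilege policy for a Knowledge Base service role.

    Hypothetical helper: one statement per thing the role must reach.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Invoke only the chosen embedding model.
                "Sid": "InvokeEmbeddingModel",
                "Effect": "Allow",
                "Action": ["bedrock:InvokeModel"],
                "Resource": [model_arn],
            },
            {   # List and read only the source bucket and its objects.
                "Sid": "ReadSourceBucket",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            },
            {   # Reach only the one vector collection's data plane.
                "Sid": "ReachCollection",
                "Effect": "Allow",
                "Action": ["aoss:APIAccessAll"],
                "Resource": [collection_arn],
            },
        ],
    }
```

Every `Resource` entry names a specific ARN; the only `*` is the object suffix inside the one bucket.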

Testing Infrastructure You Haven't Deployed

CDK has a property worth leaning on here. Because a stack synthesizes to a CloudFormation template, you can assert on the template without ever calling AWS. The construct ships with synthesis tests that confirm it emits the right shape: one collection of type VECTORSEARCH, two security policies (encryption and network) and one access policy (data access), a Knowledge Base with the correct field mapping, a data source, and a dashboard you can toggle off.

def test_creates_all_three_oss_policies():
    template = _template()
    template.resource_count_is(
        "AWS::OpenSearchServerless::SecurityPolicy", 2)
    template.resource_count_is(
        "AWS::OpenSearchServerless::AccessPolicy", 1)

Because the index Lambda is pure Python with no bundled dependencies, these tests run in a plain CI job with no Docker and no AWS credentials. That is a direct payoff of the no-dependency decision earlier. The choice that simplified the runtime also simplified the test pipeline.
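
To make concrete what a count assertion on a synthesized template checks, here is a standard-library sketch that counts resource types in a hand-written template fragment. The fragment and the helper `resource_counts` are illustrative assumptions; the real tests go through CDK's assertions module against the construct's actual output.

```python
from collections import Counter

# A hand-written stand-in for a synthesized template; logical IDs and
# property values are made up for illustration.
EXAMPLE_TEMPLATE = {
    "Resources": {
        "EncryptionPolicy": {
            "Type": "AWS::OpenSearchServerless::SecurityPolicy",
            "Properties": {"Name": "kb-enc", "Type": "encryption"},
        },
        "NetworkPolicy": {
            "Type": "AWS::OpenSearchServerless::SecurityPolicy",
            "Properties": {"Name": "kb-net", "Type": "network"},
        },
        "DataAccessPolicy": {
            "Type": "AWS::OpenSearchServerless::AccessPolicy",
            "Properties": {"Name": "kb-data", "Type": "data"},
        },
        "Collection": {
            "Type": "AWS::OpenSearchServerless::Collection",
            "Properties": {"Name": "kb", "Type": "VECTORSEARCH"},
        },
    }
}


def resource_counts(template: dict) -> Counter:
    """Count resources in a CloudFormation template by their Type."""
    return Counter(r["Type"] for r in template.get("Resources", {}).values())


if __name__ == "__main__":
    counts = resource_counts(EXAMPLE_TEMPLATE)
    assert counts["AWS::OpenSearchServerless::SecurityPolicy"] == 2
    assert counts["AWS::OpenSearchServerless::AccessPolicy"] == 1
```

Nothing here touches the network, which is the whole appeal: the template is just data, and the assertions are just lookups.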

I want to be precise about what this proves and what it doesn't. Synthesis tests prove the template has the shape I intended. They do not prove a live deploy succeeds, and the riskiest part of this construct, the SigV4 index call against a real collection and the consistency timing around it, is exactly the part synthesis can't exercise. That validation against a real account is the next step before I'd call the construct production-blessed rather than production-shaped.

Stubs With Honest APIs

The library lists four constructs. One is real. The other three, a SageMaker autoscaling endpoint, a Lambda and API Gateway model proxy, and an S3 plus CloudFront model-artifact distribution, ship as stubs. Their constructors and parameters are fully defined and documented, but instantiating one raises NotImplementedError with a link to the tracking issue.

That is a deliberate choice rather than laziness. Publishing the API shape first does two things. It lets the eventual implementation land without breaking anyone who coded against the signature, and it makes the roadmap legible: you can read the constructor for SageMakerEndpoint and see that target-tracking autoscaling on invocations-per-instance is coming, with data capture as an opt-in. A stub that lies about being finished is worse than no construct. A stub that is honest about being a placeholder, while locking the interface, is a contract.
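
A stub in this style might look like the sketch below. The class name, every parameter name and default, and the placeholder issue link are hypothetical; a real stub would extend the CDK Construct base class, which is left out here to keep the example dependency-free.

```python
import inspect

# Placeholder: the real stub would carry the actual tracking-issue link.
TRACKING_ISSUE = "<link to the tracking issue>"


class SageMakerEndpointStub:
    """Autoscaling real-time inference endpoint (interface only, for now)."""

    def __init__(
        self,
        scope,
        construct_id: str,
        *,
        model_name: str,
        instance_type: str = "ml.m5.large",
        min_instances: int = 1,
        max_instances: int = 4,
        target_invocations_per_instance: int = 100,
        enable_data_capture: bool = False,
    ) -> None:
        # The signature is the contract; the body is honest about its status.
        raise NotImplementedError(
            f"SageMakerEndpoint is not implemented yet. Track it at {TRACKING_ISSUE}"
        )


def planned_parameters(cls) -> list[str]:
    """List the keyword-only parameters a stub has committed to."""
    params = inspect.signature(cls).parameters.values()
    return [p.name for p in params if p.kind is inspect.Parameter.KEYWORD_ONLY]
```

Reading `planned_parameters(SageMakerEndpointStub)` gives the roadmap without any prose: the autoscaling target and the data-capture switch are already part of the interface, even though calling the constructor fails loudly.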

Construct	Status	What it provisions
`BedrockKnowledgeBase`	Implemented	Collection, index, policies, scoped IAM, dashboard
`SageMakerEndpoint`	Stub	Autoscaling real-time inference endpoint
`AiProxy`	Stub	API Gateway and Lambda in front of a model
`ModelArtifacts`	Stub	Versioned S3 plus CloudFront for model assets

Why This Layer Matters for AI Teams

The teams shipping AI on AWS right now are mostly not infrastructure teams. They are application and ML people who need a Knowledge Base or an endpoint to exist so they can get back to the actual work. Every hour they spend learning that OpenSearch Serverless needs three policies, or debugging why their Knowledge Base won't create because an index they can't declare doesn't exist yet, is an hour stolen from the thing they were hired to do.

L3 constructs move that knowledge out of people's heads and into code that runs. The first one in this library happens to encode a fairly nasty CloudFormation boundary problem. The next ones will encode their own. The point of the project is that you should be able to ask for AI infrastructure the way you ask for a static website today, by naming it, and trust that someone already paid the cost of getting the details right.

The source, the full BedrockKnowledgeBase implementation, the index custom resource, and a deployable example all live at github.com/ivandir/aws-ai-cdk-constructs. It is MIT licensed and early. If you have wired up a Bedrock Knowledge Base the hard way, the index trick alone might save you an afternoon.