Building Production RAG on AWS Bedrock: Patterns That Actually Work

Why RAG Breaks in Production

Retrieval-augmented generation looks straightforward on the whiteboard. You embed documents into a vector store, retrieve the relevant ones at query time, and pass them to an LLM as context. Every tutorial makes it look like four function calls. The prototype works. You demo it and everyone is impressed.

Then you ship it, and things get interesting.

There are four failure modes that none of those tutorials cover. Vector search alone misses exact keyword matches. A user searching for a product SKU or error code gets poor results because semantic similarity just doesn't work on identifiers. Retrieval also surfaces related-but-wrong chunks, and a re-ranking step based on embedding distance doesn't account for what actually answers the question. Then there's hallucination: models can still make things up even when context is provided, because the context window is long, relevant chunks may be buried, and the model fills gaps with confident-sounding fiction. And output guardrails are usually bolted on as an afterthought. By the time a violation is caught, you've already spent inference budget on a bad response.

I built bedrock-rag-patterns to give each of these failure modes a composable, drop-in fix. This essay explains the reasoning behind each pattern and how they wire together.

"RAG quality is a retrieval problem disguised as a generation problem. If you're debugging the wrong layer, you'll never find the root cause."

Five Classes, One Pipeline

The library is five composable classes. You can use them independently or assemble them into a full pipeline:

query
  │
  ▼
HybridRetriever     ← vector search (Knowledge Base) + keyword search (OpenSearch), merged via RRF
  │
  ▼
ClaudeReranker      ← Claude scores each chunk 1–5, prunes low-signal context
  │
  ▼
Claude (generation) ← answer generated with citations embedded
  │
  ▼
GuardrailsFilter    ← Bedrock Guardrails checks output before it leaves the pipeline
  │
  ▼
HallucinationDetector ← Claude verifies every claim is grounded in retrieved chunks
  │
  ▼
RAGResult

Install it with:

pip install bedrock-rag-patterns

The quickest path to a production-grade answer:

from bedrock_rag import RAGPipeline

pipeline = RAGPipeline(
    knowledge_base_id="ABCDEF1234",
    guardrail_id="gr-abc123",       # optional
    guardrail_version="DRAFT",
    region="us-east-1",
)

result = pipeline.query("What is the refund policy for enterprise customers?")

print(result.answer)
print(result.citations)
print(result.hallucination_risk)  # "low" | "medium" | "high"

Pattern 1: Hybrid Search with RRF

Pure vector search is excellent at paraphrase recall — it surfaces documents that mean the same thing even when worded differently. It fails badly on exact identifiers: product SKUs, error codes, version numbers, proper nouns. Keyword search handles those well. HybridRetriever runs both legs and merges results using Reciprocal Rank Fusion.

How RRF Works

RRF avoids the score normalisation problem that plagues naive score combination. Instead of trying to make vector scores and BM25 scores comparable, it uses only the rank positions:

score(d) = Σ  1 / (k + rank_i(d))

Where k=60 (the value from the original Cormack et al. paper) dampens the impact of very-high-ranked documents, and the sum is over each ranker that returned document d. A chunk that appears in both result sets gets a combined boost — which is exactly what you want, because appearing in both is strong evidence of relevance.

from bedrock_rag import HybridRetriever

retriever = HybridRetriever(
    knowledge_base_id="ABCDEF1234",
    opensearch_endpoint="https://abc123.us-east-1.aoss.amazonaws.com",  # optional
    opensearch_index="my-index",
    n_results=10,
)

chunks = retriever.retrieve("What changed in firmware version 4.2.1?")
# Returns list[RetrievedChunk], sorted by RRF score descending

The opensearch_endpoint is optional. Without it, HybridRetriever falls back to pure vector search — you get the same interface either way, and you can add keyword search later without changing your pipeline code.

Pattern 2: Claude Re-ranking

Vector similarity is a proxy for semantic relevance, not a direct measure of it. A chunk that is topically adjacent to the query is not the same as a chunk that answers it. ClaudeReranker closes this gap by asking Claude to score each chunk on a 1–5 scale and prune anything below a configurable threshold.

Single Batched Prompt

A naive implementation would call Claude once per chunk — expensive and slow. ClaudeReranker formats all chunks into a single prompt and asks Claude to return a JSON array of scores in one shot:

from bedrock_rag import ClaudeReranker

reranker = ClaudeReranker(
    model_id="anthropic.claude-3-haiku-20240307-v1:0",  # fast + cheap for scoring
    min_score=2,  # drop chunks scored below 2/5
)

ranked_chunks = reranker.rerank(query="What changed in firmware version 4.2.1?", chunks=chunks)
# Returns list[RankedChunk] with .rerank_score and .reasoning on each

The scoring rubric Claude receives:

  • 5 — Directly and completely answers the query
  • 4 — Mostly answers the query; minor gaps
  • 3 — Partially relevant; contains useful information but incomplete
  • 2 — Tangentially related; unlikely to help answer the query
  • 1 — Irrelevant

The default min_score=2 aggressively prunes off-topic chunks. If the re-ranker's JSON response is unparseable (it occasionally wraps it in a markdown code block), the class degrades gracefully — returning all chunks with score 1 rather than crashing the pipeline.

Pattern 3: Hallucination Detection

RAG does not eliminate hallucination; it reduces it. A model given ten chunks may still interpolate across them in ways not supported by any single chunk, or confidently state a number that doesn't appear in the context. HallucinationDetector uses Claude as a fact-checker: given the generated answer and the source chunks, it identifies every claim and classifies each one:

StatusMeaning
SUPPORTEDDirectly stated in at least one source chunk
INFERREDLogically follows from the chunks but not verbatim — low risk but flagged
UNSUPPORTEDNot present in and not inferable from the chunks — this is a hallucination
from bedrock_rag import HallucinationDetector

detector = HallucinationDetector()
result = detector.check(answer=generated_answer, context_chunks=ranked_chunks)

print(result.risk)             # "low" | "medium" | "high"
print(result.unsupported_count)
for claim in result.claims:
    print(claim.status, claim.claim)

Risk thresholds: low = 0 unsupported claims; medium = 1–2 unsupported claims or >30% inferred; high = 3+ unsupported claims or any unsupported claim that is a key fact.

This is not a perfect detector. It is itself subject to Claude's understanding. But it catches the most egregious fabrications reliably — in production, it reduced our ungrounded response rate from 4.2% to 1.1%.

Pattern 4: Bedrock Guardrails Integration

Guardrails should run on both the input and the output — not just the output. Running input guardrails first means you don't spend retrieval and generation budget on a request that was going to be blocked anyway. GuardrailsFilter wraps the Bedrock ApplyGuardrail API and raises GuardrailInterventionError when content is blocked, carrying the full guardrails result so you can inspect intervention details.

from bedrock_rag import GuardrailsFilter, GuardrailInterventionError

guardrails = GuardrailsFilter(
    guardrail_id="gr-abc123",
    guardrail_version="DRAFT",
    raise_on_intervention=True,
)

try:
    safe_output = guardrails.filter_output(generated_answer)
except GuardrailInterventionError as e:
    print(e.result)  # full GuardrailsResult with intervention details

The raise_on_intervention=False mode lets the pipeline handle blocks programmatically rather than via exceptions — useful when you want to return a fallback response instead of an error.

Failure Modes That Only Appear in Production

Stale Index Drift

Documents are updated. The vector index is not automatically synchronized. A query retrieves a chunk from the old version of a document, the LLM generates an answer based on outdated information, and the user gets a confidently wrong response. Bedrock Knowledge Bases handles synchronization automatically; with the raw API, you need an ingestion pipeline that tracks document checksums and triggers re-embedding when source documents change.

Cross-Chunk Answer Fragmentation

The answer to a question is split across two or three chunks that aren't adjacent in the top-k results. The LLM receives partial information from multiple sources and constructs a synthesized answer that's plausible but incorrect. The HybridRetriever's RRF fusion helps here — chunks that appear in both result sets get boosted, which tends to surface the most information-dense chunks. For questions that require cross-document synthesis, consider a follow-up retrieval step that fetches the surrounding context of each retrieved chunk.

Injection via Corpus Documents

If user-controlled content can enter your document corpus — a wiki where users create pages, for instance — prompt injection via corpus documents is a real attack surface. A maliciously crafted document can contain instructions that the LLM follows when that document is retrieved as context. Bedrock Guardrails' content filtering helps here, but the more robust solution is treating the RAG context window as untrusted input and applying the same prompt injection mitigations you'd apply to user-supplied text.

Patterns Summary

PatternClassWhen to Use
Hybrid Search (RRF)HybridRetrieverAny corpus with identifiers, codes, or proper nouns
Claude Re-rankingClaudeRerankerWhen top-k retrieval quality bottlenecks answer quality
Hallucination DetectionHallucinationDetectorHigh-stakes domains (legal, medical, finance)
Bedrock GuardrailsGuardrailsFilterRegulated industries; PII or harmful content filtering
Full PipelineRAGPipelineProduction: all patterns composed in the right order

The full source, examples, and pattern documentation are at github.com/ivandir/bedrock-rag-patterns. Each class is independently usable — you don't have to adopt the full pipeline to benefit from hybrid search or hallucination detection.