Why RAG Breaks in Production
Retrieval-augmented generation looks straightforward on the whiteboard. You embed documents into a vector store, retrieve the relevant ones at query time, and pass them to an LLM as context. Every tutorial makes it look like four function calls. The prototype works. You demo it and everyone is impressed.
Then you ship it, and things get interesting.
There are four failure modes that none of those tutorials cover. Vector search alone misses exact keyword matches. A user searching for a product SKU or error code gets poor results because semantic similarity just doesn't work on identifiers. Retrieval also surfaces related-but-wrong chunks, and a re-ranking step based on embedding distance doesn't account for what actually answers the question. Then there's hallucination: models can still make things up even when context is provided, because the context window is long, relevant chunks may be buried, and the model fills gaps with confident-sounding fiction. And output guardrails are usually bolted on as an afterthought. By the time a violation is caught, you've already spent inference budget on a bad response.
I built bedrock-rag-patterns to give each of these failure modes a composable, drop-in fix. This essay explains the reasoning behind each pattern and how they wire together.
"RAG quality is a retrieval problem disguised as a generation problem. If you're debugging the wrong layer, you'll never find the root cause."
Five Classes, One Pipeline
The library is five composable classes. You can use them independently or assemble them into a full pipeline:
query
│
▼
HybridRetriever ← vector search (Knowledge Base) + keyword search (OpenSearch), merged via RRF
│
▼
ClaudeReranker ← Claude scores each chunk 1–5, prunes low-signal context
│
▼
Claude (generation) ← answer generated with citations embedded
│
▼
GuardrailsFilter ← Bedrock Guardrails checks output before it leaves the pipeline
│
▼
HallucinationDetector ← Claude verifies every claim is grounded in retrieved chunks
│
▼
RAGResult
Install it with:
pip install bedrock-rag-patterns
The quickest path to a production-grade answer:
from bedrock_rag import RAGPipeline
pipeline = RAGPipeline(
knowledge_base_id="ABCDEF1234",
guardrail_id="gr-abc123", # optional
guardrail_version="DRAFT",
region="us-east-1",
)
result = pipeline.query("What is the refund policy for enterprise customers?")
print(result.answer)
print(result.citations)
print(result.hallucination_risk) # "low" | "medium" | "high"
Pattern 1: Hybrid Search with RRF
Pure vector search is excellent at paraphrase recall — it surfaces documents that mean the same thing even when worded differently. It fails badly on exact identifiers: product SKUs, error codes, version numbers, proper nouns. Keyword search handles those well. HybridRetriever runs both legs and merges results using Reciprocal Rank Fusion.
How RRF Works
RRF avoids the score normalisation problem that plagues naive score combination. Instead of trying to make vector scores and BM25 scores comparable, it uses only the rank positions:
score(d) = Σ 1 / (k + rank_i(d))
Where k=60 (the value from the original Cormack et al. paper) dampens the impact of very-high-ranked documents, and the sum is over each ranker that returned document d. A chunk that appears in both result sets gets a combined boost — which is exactly what you want, because appearing in both is strong evidence of relevance.
from bedrock_rag import HybridRetriever
retriever = HybridRetriever(
knowledge_base_id="ABCDEF1234",
opensearch_endpoint="https://abc123.us-east-1.aoss.amazonaws.com", # optional
opensearch_index="my-index",
n_results=10,
)
chunks = retriever.retrieve("What changed in firmware version 4.2.1?")
# Returns list[RetrievedChunk], sorted by RRF score descending
The opensearch_endpoint is optional. Without it, HybridRetriever falls back to pure vector search — you get the same interface either way, and you can add keyword search later without changing your pipeline code.
Pattern 2: Claude Re-ranking
Vector similarity is a proxy for semantic relevance, not a direct measure of it. A chunk that is topically adjacent to the query is not the same as a chunk that answers it. ClaudeReranker closes this gap by asking Claude to score each chunk on a 1–5 scale and prune anything below a configurable threshold.
Single Batched Prompt
A naive implementation would call Claude once per chunk — expensive and slow. ClaudeReranker formats all chunks into a single prompt and asks Claude to return a JSON array of scores in one shot:
from bedrock_rag import ClaudeReranker
reranker = ClaudeReranker(
model_id="anthropic.claude-3-haiku-20240307-v1:0", # fast + cheap for scoring
min_score=2, # drop chunks scored below 2/5
)
ranked_chunks = reranker.rerank(query="What changed in firmware version 4.2.1?", chunks=chunks)
# Returns list[RankedChunk] with .rerank_score and .reasoning on each
The scoring rubric Claude receives:
- 5 — Directly and completely answers the query
- 4 — Mostly answers the query; minor gaps
- 3 — Partially relevant; contains useful information but incomplete
- 2 — Tangentially related; unlikely to help answer the query
- 1 — Irrelevant
The default min_score=2 aggressively prunes off-topic chunks. If the re-ranker's JSON response is unparseable (it occasionally wraps it in a markdown code block), the class degrades gracefully — returning all chunks with score 1 rather than crashing the pipeline.
Pattern 3: Hallucination Detection
RAG does not eliminate hallucination; it reduces it. A model given ten chunks may still interpolate across them in ways not supported by any single chunk, or confidently state a number that doesn't appear in the context. HallucinationDetector uses Claude as a fact-checker: given the generated answer and the source chunks, it identifies every claim and classifies each one:
| Status | Meaning |
|---|---|
SUPPORTED | Directly stated in at least one source chunk |
INFERRED | Logically follows from the chunks but not verbatim — low risk but flagged |
UNSUPPORTED | Not present in and not inferable from the chunks — this is a hallucination |
from bedrock_rag import HallucinationDetector
detector = HallucinationDetector()
result = detector.check(answer=generated_answer, context_chunks=ranked_chunks)
print(result.risk) # "low" | "medium" | "high"
print(result.unsupported_count)
for claim in result.claims:
print(claim.status, claim.claim)
Risk thresholds: low = 0 unsupported claims; medium = 1–2 unsupported claims or >30% inferred; high = 3+ unsupported claims or any unsupported claim that is a key fact.
Pattern 4: Bedrock Guardrails Integration
Guardrails should run on both the input and the output — not just the output. Running input guardrails first means you don't spend retrieval and generation budget on a request that was going to be blocked anyway. GuardrailsFilter wraps the Bedrock ApplyGuardrail API and raises GuardrailInterventionError when content is blocked, carrying the full guardrails result so you can inspect intervention details.
from bedrock_rag import GuardrailsFilter, GuardrailInterventionError
guardrails = GuardrailsFilter(
guardrail_id="gr-abc123",
guardrail_version="DRAFT",
raise_on_intervention=True,
)
try:
safe_output = guardrails.filter_output(generated_answer)
except GuardrailInterventionError as e:
print(e.result) # full GuardrailsResult with intervention details
The raise_on_intervention=False mode lets the pipeline handle blocks programmatically rather than via exceptions — useful when you want to return a fallback response instead of an error.
Failure Modes That Only Appear in Production
Stale Index Drift
Documents are updated. The vector index is not automatically synchronized. A query retrieves a chunk from the old version of a document, the LLM generates an answer based on outdated information, and the user gets a confidently wrong response. Bedrock Knowledge Bases handles synchronization automatically; with the raw API, you need an ingestion pipeline that tracks document checksums and triggers re-embedding when source documents change.
Cross-Chunk Answer Fragmentation
The answer to a question is split across two or three chunks that aren't adjacent in the top-k results. The LLM receives partial information from multiple sources and constructs a synthesized answer that's plausible but incorrect. The HybridRetriever's RRF fusion helps here — chunks that appear in both result sets get boosted, which tends to surface the most information-dense chunks. For questions that require cross-document synthesis, consider a follow-up retrieval step that fetches the surrounding context of each retrieved chunk.
Injection via Corpus Documents
If user-controlled content can enter your document corpus — a wiki where users create pages, for instance — prompt injection via corpus documents is a real attack surface. A maliciously crafted document can contain instructions that the LLM follows when that document is retrieved as context. Bedrock Guardrails' content filtering helps here, but the more robust solution is treating the RAG context window as untrusted input and applying the same prompt injection mitigations you'd apply to user-supplied text.
Patterns Summary
| Pattern | Class | When to Use |
|---|---|---|
| Hybrid Search (RRF) | HybridRetriever | Any corpus with identifiers, codes, or proper nouns |
| Claude Re-ranking | ClaudeReranker | When top-k retrieval quality bottlenecks answer quality |
| Hallucination Detection | HallucinationDetector | High-stakes domains (legal, medical, finance) |
| Bedrock Guardrails | GuardrailsFilter | Regulated industries; PII or harmful content filtering |
| Full Pipeline | RAGPipeline | Production: all patterns composed in the right order |
The full source, examples, and pattern documentation are at github.com/ivandir/bedrock-rag-patterns. Each class is independently usable — you don't have to adopt the full pipeline to benefit from hybrid search or hallucination detection.