Building RAG on Azure: Chunking Strategies That Actually Work
Chunking is the step most teams rush past.
You pick a chunk size, split the document, embed everything, and move on. Then you wonder why your RAG system keeps answering questions with the right general topic but wrong specific details.
The problem is almost always chunking.
Here's what actually works — and why.
Why Chunking Matters More Than the Model
Your retrieval step fetches the top K chunks from your vector index. The LLM then answers the question using only those chunks as context.
If the passage that contains the answer is split across two chunks, neither chunk fully answers the question. The LLM gets partial context and fills in the rest from its training data, which is how hallucinated details creep in.
Document: "The weekly maintenance window is every Sunday from 2-4 AM UTC.
Deployments are blocked during this window."
Bad chunk boundary:
Chunk A: "The weekly maintenance window is every Sunday from 2-4"
Chunk B: "AM UTC. Deployments are blocked during this window."
Search: "When are deployments blocked?"
Retrieved: Chunk B ← correct content, but missing the time context
LLM answers: "Deployments are blocked during the maintenance window" ← vague non-answer
Good chunking keeps semantically related sentences together.
Strategy 1: Fixed-Size Chunking
Split every N characters (or tokens), with optional overlap.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # maximum characters per chunk (not tokens)
    chunk_overlap=50,  # characters shared between adjacent chunks
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(document_text)
Pros:
- Simple, predictable, fast
- Works well for uniform content (support tickets, short articles)
Cons:
- Splits sentences mid-thought
- Ignores document structure
- Overlap helps but doesn't fix boundary problems
When to use: Log files, structured records, content where every sentence is independent.
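For comparison, here is a minimal sketch of strictly fixed-size splitting in plain Python, with no separator fallback at all. The function name `fixed_size_chunks` is made up for illustration and is not part of LangChain.

```python
def fixed_size_chunks(text: str, size: int, overlap: int = 0) -> list[str]:
    """Cut text every `size` characters, repeating `overlap` characters between chunks."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")

    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        # Stop once this chunk reaches the end, so we never emit a
        # trailing chunk that is fully contained in the previous one.
        if start + size >= len(text):
            break
        start += step
    return chunks


doc = (
    "The maintenance window is every Sunday from 2-4 AM UTC. "
    "Deployments are blocked during this window."
)
for chunk in fixed_size_chunks(doc, size=40, overlap=8):
    print(repr(chunk))
```

Overlap only repeats the tail of one chunk at the head of the next. Any sentence longer than the overlap can still be cut in two, which is why overlap alone doesn't fix boundary problems.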
Strategy 2: Recursive Character Chunking
Split on natural language boundaries first (paragraphs → sentences → words), falling back to characters only when needed. The RecursiveCharacterTextSplitter above already does this with the separators parameter.
The key improvement: it tries \n\n first (paragraph breaks), then \n (line breaks), then . (sentence ends). Characters are a last resort, not the default.
Tune for your content type:
# For Markdown documentation: prefer breaks at headings
md_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
    # The leading "\n" stops "## " from also matching inside "### "
    separators=["\n## ", "\n### ", "\n\n", "\n", ". ", " ", ""],
)

# For Python code: prefer breaks at class and function definitions
code_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,
    chunk_overlap=200,
    separators=["\nclass ", "\ndef ", "\n\n", "\n", " ", ""],
)
This is the best general-purpose strategy for most document types.
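To make the fallback order concrete, here is a simplified sketch of the recursive idea in plain Python. It is not LangChain's implementation: it has no overlap, it drops the separator at each split point, and the name `recursive_split` is hypothetical.

```python
def recursive_split(text: str, max_len: int, separators: list[str]) -> list[str]:
    """Split on the first separator that applies; recurse with finer ones if pieces are too long."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # Last resort: hard cut every max_len characters.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    if sep == "" or sep not in text:
        return recursive_split(text, max_len, finer)

    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_len:
            current = piece
        else:
            # This piece alone is too long: try the finer separators on it.
            chunks.extend(recursive_split(piece, max_len, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = "Para one is short.\n\nPara two is a bit longer. It has two sentences."
for chunk in recursive_split(text, max_len=40, separators=["\n\n", ". ", " ", ""]):
    print(repr(chunk))
```

The first paragraph fits, so it stays whole. The second is too long, so only that paragraph is broken down further at the sentence boundary.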
Strategy 3: Semantic Chunking
Group sentences by their semantic similarity rather than by length. Sentences that are about the same topic stay together; sentences that shift topic create a chunk boundary.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import AzureOpenAIEmbeddings

embeddings = AzureOpenAIEmbeddings(
    azure_deployment="text-embedding-3-large",  # your deployment name
    azure_endpoint="https://your-resource.openai.azure.com",
    api_key="...",
    api_version="2024-02-01",
)
splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",  # or "standard_deviation"
    # Split where the distance between adjacent sentences exceeds
    # the 95th percentile of all adjacent-sentence distances
    breakpoint_threshold_amount=95,
)
chunks = splitter.split_text(document_text)
Pros:
- Chunks reflect actual topic boundaries
- Works well for heterogeneous documents (reports, PDFs with multiple sections)
- Better retrieval recall for semantic search
Cons:
- Slow — embeds every sentence individually to compute similarity
- Chunk sizes vary wildly (some chunks 100 tokens, some 2000)
- Costs more (more embedding API calls during indexing)
When to use: Long, complex documents where section boundaries matter. Annual reports, technical specifications, legal documents.
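If the percentile threshold feels abstract, this plain-Python sketch shows the core idea: measure the distance between each pair of adjacent sentences and start a new chunk wherever that distance is unusually large. It is a simplification, not the library's code, and `toy_embed` is a bag-of-words stand-in for a real embedding model so the example runs offline.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def percentile(values: list[float], pct: float) -> float:
    """Percentile with linear interpolation between neighbouring ranks."""
    ordered = sorted(values)
    rank = (pct / 100) * (len(ordered) - 1)
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def semantic_chunks(sentences: list[str], pct: float = 95, embed=toy_embed) -> list[str]:
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    vectors = [embed(s) for s in sentences]
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)

    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "The maintenance window is every Sunday.",
    "Deployments are blocked during the maintenance window.",
    "Invoices are emailed monthly.",
    "Late invoices are charged a fee.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

Because the threshold is relative to the document's own distances, the number of chunks depends on the content and not on a size limit, which is also why chunk sizes vary so much.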
Strategy 4: Parent-Child Chunking
Index fine-grained child chunks for search precision, but retrieve the larger parent chunk to give the LLM more context.
# Parent chunk: full section (up to 1000 characters), returned to the LLM
# Child chunk: a paragraph or so (up to 200 characters), embedded and searched
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import AzureSearch
from langchain_text_splitters import RecursiveCharacterTextSplitter

# chunk_size counts characters; set overlap explicitly because the default (200)
# would equal the child chunk size
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
store = InMemoryStore()  # or use Redis, Azure Blob, etc.
vectorstore = AzureSearch(...)
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(documents)
How it works: Search finds the right 200-character child chunk. Before the context goes to the LLM, the retriever swaps in the full 1000-character parent, so the model sees the surrounding context and not just the matching sentence.
When to use: Documents where you need precise matching (child chunks) but the answer requires surrounding context (parent chunks). This is the most effective strategy for dense technical documentation.
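Stripped of the framework, the mechanism is a lookup from each child chunk back to its parent. The sketch below is an illustration under simplifying assumptions: children are single sentences, and `keyword_score` is a word-overlap stand-in for vector similarity. All function names are hypothetical.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def build_index(parents: list[str]) -> tuple[dict[int, str], list[tuple[str, int]]]:
    """Store parents by id and record, for every child, which parent it came from."""
    parent_store = dict(enumerate(parents))
    child_index = [
        (child, parent_id)
        for parent_id, parent in parent_store.items()
        for child in split_sentences(parent)
    ]
    return parent_store, child_index


def keyword_score(query: str, text: str) -> int:
    """Stand-in for vector similarity: number of shared words."""
    tokens = lambda s: set(re.findall(r"\w+", s.lower()))
    return len(tokens(query) & tokens(text))


def retrieve_parents(query: str, parent_store, child_index, k: int = 1) -> list[str]:
    """Rank children, then return the distinct parents of the best matches."""
    ranked = sorted(child_index, key=lambda item: keyword_score(query, item[0]), reverse=True)
    seen, results = set(), []
    for _child, parent_id in ranked:
        if parent_id not in seen:
            seen.add(parent_id)
            results.append(parent_store[parent_id])
        if len(results) == k:
            break
    return results


parents = [
    "The maintenance window is every Sunday from 2-4 AM UTC. "
    "Deployments are blocked during this window.",
    "Invoices are emailed on the first of each month. Late payments incur a fee.",
]
parent_store, child_index = build_index(parents)
print(retrieve_parents("When are deployments blocked?", parent_store, child_index))
```

The query matches the short "Deployments are blocked" sentence, but what comes back is the whole parent section, including the time window that the child chunk alone was missing.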
Metadata Enrichment
Regardless of chunking strategy, always attach metadata to chunks before indexing.
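from datetime import datetime, timezone

# extract_title and detect_section_heading are helpers you supply
def chunk_with_metadata(doc_path: str, content: str, chunks: list[str]) -> list[dict]:
    doc_title = extract_title(content)  # same for every chunk, so compute once
    created_at = datetime.now(timezone.utc).isoformat()
    return [
        {
            "content": chunk,
            "source": doc_path,
            "doc_title": doc_title,
            "section": detect_section_heading(chunk),
            "chunk_index": i,
            "total_chunks": len(chunks),
            "word_count": len(chunk.split()),
            "created_at": created_at,
        }
        for i, chunk in enumerate(chunks)
    ]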
Why this matters: When the LLM gives a wrong answer, you need to trace it back to which chunk caused it. Without metadata, you have no way to debug retrieval quality.
Also add metadata to your Azure AI Search index as filterable fields. This lets you scope searches to specific documents, date ranges, or document types.
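Azure AI Search filters are written as OData expressions. Here is a small sketch of building one from the metadata above. It assumes `source` is indexed as a filterable string field and `created_at` as a filterable date-time field; those index settings and the helper names are assumptions, not something the service provides by default.

```python
from datetime import datetime, timezone
from typing import Optional


def odata_string(value: str) -> str:
    """Quote a string literal for OData; single quotes are escaped by doubling them."""
    return "'" + value.replace("'", "''") + "'"


def build_filter(source: Optional[str] = None, created_after: Optional[datetime] = None) -> str:
    """Combine optional conditions into one filter expression joined with 'and'."""
    clauses = []
    if source is not None:
        clauses.append(f"source eq {odata_string(source)}")
    if created_after is not None:
        # Expects a timezone-aware datetime; it is rendered in UTC.
        stamp = created_after.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        clauses.append(f"created_at ge {stamp}")
    return " and ".join(clauses)


print(build_filter(
    source="docs/runbook.md",
    created_after=datetime(2024, 1, 1, tzinfo=timezone.utc),
))
```

The resulting string is what you would typically hand to the search call's filter option, alongside the query text or vector.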
Benchmark Results (What We See in Practice)
| Strategy | Retrieval Accuracy | Index Speed | Chunk Consistency |
|---|---|---|---|
| Fixed-size | 62% | Fast | High |
| Recursive character | 74% | Fast | High |
| Semantic | 81% | Slow (10x) | Low |
| Parent-child | 83% | Medium | Medium |
These numbers come from our internal evaluations on 500-document technical knowledge bases. Your results will vary based on document type.
Retrieval accuracy is the percentage of test queries that retrieve at least one chunk containing the ground-truth answer.
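The metric is easy to compute yourself. This is a minimal sketch that assumes "contains the ground-truth answer" can be approximated by a case-insensitive substring match. The `retrieve` callable stands in for your own search function, and the names are illustrative.

```python
from typing import Callable


def retrieval_accuracy(
    test_cases: list[tuple[str, str]],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> float:
    """Fraction of queries whose top-k chunks include the expected answer text."""
    if not test_cases:
        raise ValueError("need at least one test case")
    hits = sum(
        1
        for query, answer in test_cases
        if any(answer.lower() in chunk.lower() for chunk in retrieve(query)[:k])
    )
    return hits / len(test_cases)


# Toy retriever backed by a dict, just to show the call shape.
fake_index = {
    "When are deployments blocked?": ["Deployments are blocked Sunday 2-4 AM UTC."],
    "Who approves releases?": ["Releases ship on Tuesdays."],
}
cases = [
    ("When are deployments blocked?", "2-4 AM UTC"),
    ("Who approves releases?", "the release manager"),
]
print(retrieval_accuracy(cases, lambda q: fake_index.get(q, [])))
```

Substring matching is strict. If your ground-truth answers are paraphrased in the source documents, label the expected chunk IDs per query and compare against those.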
Recommended Starting Point
For a new RAG system on Azure:
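- Start with recursive character chunking, chunk size 600–800 tokens, overlap 100 tokens (the splitter counts characters by default, so pass a token-based length_function or convert the sizes to characters)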
- Add rich metadata (source, section, date) to every chunk
- Use Azure AI Search with hybrid search (keyword + vector)
- Run evaluation queries on 50–100 test questions, check if the right chunk was retrieved
- If retrieval accuracy is below 70%, try parent-child chunking
- Only move to semantic chunking if retrieval accuracy matters more than indexing cost
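On the hybrid search step: Azure AI Search merges the keyword and vector result lists with Reciprocal Rank Fusion (RRF). The sketch below shows the idea in plain Python. The constant `k=60` is the value commonly used for RRF, and the service's exact scoring may differ in detail.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids; each appearance adds 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


keyword_hits = ["a", "b", "c"]  # ranked by keyword relevance
vector_hits = ["b", "d", "a"]   # ranked by vector similarity
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```

Chunks that rank well in both lists rise to the top, which is why hybrid search tends to be more forgiving of chunk boundaries than vector search alone.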
The right chunking strategy depends entirely on your documents. Test, measure, iterate.