Stop Using Fixed-Size Chunking
Dec 30, 2025 • 20 min read
If you split text every 512 characters, you will inevitably break a sentence in the middle. The first chunk ends with "The treatment works best when" and the next begins with "combined with physical therapy." Your retriever now has two meaningless half-ideas. This is the fundamental problem with fixed-size chunking — it destroys semantic coherence. Advanced RAG pipelines use content-aware splitting strategies: semantic chunking, parent-document retrieval, and proposition-level agentic splitting.
1. The Problem with Naive Chunking
| Strategy | How It Splits | Main Weakness |
|---|---|---|
| Fixed-size | Every N characters | Breaks sentences mid-thought; ignores content structure |
| Recursive character | Tries "\n\n", then "\n", then " ", then char | Better than fixed, but still length-based not meaning-based |
| Sentence splitting | Splits on sentence boundaries (NLTK/spaCy) | Individual sentences often lack context without surrounding sentences |
| Semantic chunking | Splits when embedding similarity drops | Requires embedding model at ingestion time; slower |
| Agentic/Proposition | LLM extracts discrete facts | Most accurate but expensive (1 LLM call per document) |
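To see the failure mode concretely, here is a minimal sketch of fixed-size chunking in plain Python (no libraries), splitting the kind of sentence described above:

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Naive fixed-size chunking: cut every `size` characters, blind to content."""
    return [text[i:i + size] for i in range(0, len(text), size)]

text = (
    "The treatment works best when combined with physical therapy. "
    "Patients should follow the prescribed schedule."
)
chunks = fixed_size_chunks(text, 40)
for c in chunks:
    print(repr(c))
# The first boundary lands mid-word ("combined w" / "ith physical therapy"),
# stranding half an idea in each chunk, exactly the problem described above.
```

Every boundary is determined purely by character count, so the odds that one lands cleanly between sentences are low.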
2. Semantic Chunking
Semantic chunking calculates the cosine similarity between each adjacent sentence pair. When similarity drops below a threshold (a "break point"), it starts a new chunk — grouping sentences that discuss the same topic:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
# SemanticChunker embeds every sentence, then splits at similarity drops
text_splitter = SemanticChunker(
OpenAIEmbeddings(model="text-embedding-3-small"),
# breakpoint_threshold_type options:
# "percentile" — split at the Nth percentile of similarity drops (default)
# "standard_deviation" — split when drop > mean - 1.5*std
# "interquartile" — split when drop in lower IQR range
breakpoint_threshold_type="percentile",
breakpoint_threshold_amount=75, # Split at drops in bottom 75th percentile
)
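For intuition, the percentile break-point rule can be sketched by hand. This is an illustrative toy over precomputed embeddings, not the library's exact implementation:

```python
import numpy as np

def percentile_breakpoints(embeddings: np.ndarray, pct: float = 75) -> list[int]:
    """Toy percentile strategy: cosine distance between each adjacent pair of
    sentence embeddings; mark a break wherever distance exceeds the pct-th
    percentile of all the distances."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = np.sum(normed[:-1] * normed[1:], axis=1)  # adjacent cosine similarity
    dists = 1 - sims                                  # similarity drop = distance
    threshold = np.percentile(dists, pct)
    return [i + 1 for i, d in enumerate(dists) if d > threshold]

# Three "medical" sentences then three "programming" sentences, faked here
# as two clusters of noisy random vectors:
rng = np.random.default_rng(0)
topic_a, topic_b = rng.normal(size=8), rng.normal(size=8)
sents = np.array(
    [topic_a + rng.normal(scale=0.05, size=8) for _ in range(3)]
    + [topic_b + rng.normal(scale=0.05, size=8) for _ in range(3)]
)
print(percentile_breakpoints(sents))  # → [3], a single break at the topic change
```

Within-topic distances are tiny, so only the cross-topic gap exceeds the percentile threshold, yielding exactly one break between sentence 3 and sentence 4.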
# Example: Medical document with two topics
text = """
Aspirin reduces fever by inhibiting prostaglandin synthesis. It is commonly
prescribed for pain relief at doses of 325mg to 650mg every 4-6 hours.
The mechanism involves blocking COX-1 and COX-2 enzymes.
The Python programming language was created by Guido van Rossum and released
in 1991. Python emphasizes code readability with its use of indentation.
It is widely used in data science, web development, and automation.
"""
docs = text_splitter.create_documents([text])
print(f"Created {len(docs)} chunks")
# → Created 2 chunks
# Chunk 1: Everything about Aspirin (all medically similar)
# -- Semantic break: medical → programming (cosine similarity drops from 0.87 to 0.23) --
# Chunk 2: Everything about Python
# Compare to fixed-size chunking (500 chars) which might split:
# Chunk 1: "...The mechanism involves blocking COX-1 and COX-2 enzymes. The Python prog..."
# ← Medical + Python mixed in the same chunk! Retrieval quality degrades.
3. Parent-Document Retrieval
Small chunks are better for search precision (narrow semantic target), but bad for generation (too little context for the LLM). Parent-document retrieval solves this tension:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
# Two splitters: small children for search, large parents for context
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200) # Small: precise search
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000) # Large: rich context
vectorstore = Chroma(embedding_function=OpenAIEmbeddings())
docstore = InMemoryStore() # Stores parent documents (use Redis in production)
retriever = ParentDocumentRetriever(
vectorstore=vectorstore, # Indexes small child chunks for search
docstore=docstore, # Stores large parent chunks for retrieval
child_splitter=child_splitter,
parent_splitter=parent_splitter,
)
# Ingest: automatically creates child chunks linked to parent chunks
retriever.add_documents(your_documents)
# Query flow:
# 1. User asks: "How does aspirin affect COX-2?"
# 2. Retriever searches vectorstore using small child chunks → finds match
# 3. Retriever looks up the parent document for that child
# 4. Returns the FULL PARENT CHUNK (2000 chars) to the LLM — rich context!
#
# Result: Search precision of 200-char chunks + generation quality of 2000-char context
results = retriever.invoke("How does aspirin affect COX-2 enzymes?")
4. Agentic / Proposition-Level Chunking
The most accurate technique: use an LLM to extract discrete atomic facts from documents. Each "proposition" is a standalone, self-contained statement:
from openai import OpenAI
from pydantic import BaseModel
client = OpenAI()
class PropositionList(BaseModel):
propositions: list[str]
def extract_propositions(text: str) -> list[str]:
"""
Use LLM to decompose text into atomic, self-contained propositions.
Each proposition should be understandable without surrounding context.
"""
response = client.beta.chat.completions.parse(
model="gpt-4o-mini", # Use mini for cost efficiency
messages=[{
"role": "system",
"content": """Decompose the following text into a list of atomic,
self-contained propositions. Each proposition must:
1. Be a single, complete factual statement
2. Be understandable without any surrounding context
3. Have any pronouns replaced with the actual noun they reference
4. Not combine multiple facts in one statement
Example:
Input: "Einstein, who was born in 1879, developed the theory of relativity."
Output:
- Albert Einstein was born in 1879.
- Albert Einstein developed the theory of relativity."""
}, {
"role": "user",
"content": text,
}],
response_format=PropositionList,
)
return response.choices[0].message.parsed.propositions
# Example output for a medical document paragraph:
propositions = extract_propositions("""
Aspirin was first synthesized by Felix Hoffmann at Bayer in 1897. It reduces
fever by inhibiting prostaglandin synthesis. Clinical trials show it reduces
the risk of cardiovascular events by 25% in high-risk patients.
""")
# → [
# "Aspirin was first synthesized by Felix Hoffmann.",
# "Aspirin was synthesized at Bayer in 1897.",
# "Aspirin reduces fever by inhibiting prostaglandin synthesis.",
# "Clinical trials show aspirin reduces cardiovascular event risk in high-risk patients.",
# "Aspirin reduces cardiovascular event risk by 25% in high-risk patients.",
# ]
# Why this is powerful for RAG:
# Query: "When was aspirin invented?"
# → Exactly matches proposition 2 (date)
# Without agentic chunking, the date might be buried in a large text block
5. Hierarchical Indexing (Multi-Vector)
# Advanced: Index multiple representations of the same document
# and search all of them, then return the original document
from langchain.retrievers.multi_vector import MultiVectorRetriever
retriever = MultiVectorRetriever(vectorstore=vectorstore, docstore=docstore)  # reuses the stores from section 3
# For each document, store multiple search targets:
# 1. Original chunk (for full-text matching)
# 2. LLM-generated summary (for high-level questions)
# 3. Hypothetical questions the chunk would answer (HyDE-style)
def generate_hypothetical_questions(chunk_text: str) -> list[str]:
"""Generate questions that this chunk would be the good answer for."""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": f"""
Generate 3 different questions that the following text would be a good answer for:
{chunk_text}
Return only the questions, one per line."""}]
)
return response.choices[0].message.content.strip().split("\n")
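Wiring the pieces together, here is a framework-free sketch of the multi-vector idea: every search record carries the id of the parent document it resolves to, which is the role the `id_key` metadata plays in LangChain's MultiVectorRetriever. The word-overlap scoring is a hypothetical stand-in for vector search:

```python
search_index: list[tuple[str, str]] = []   # (search_text, parent_doc_id)
doc_store: dict[str, str] = {}             # parent_doc_id -> full document

def index_document(doc_id: str, text: str, questions: list[str]) -> None:
    """Store the parent document once, plus multiple search targets for it:
    the original text and its hypothetical questions."""
    doc_store[doc_id] = text
    search_index.append((text, doc_id))
    for q in questions:
        search_index.append((q, doc_id))

def retrieve(query: str) -> str:
    """Stand-in for vector search: score records by word overlap, then
    return the PARENT document of the best-matching record."""
    def overlap(a: str, b: str) -> int:
        return len(set(a.lower().split()) & set(b.lower().split()))
    _, best_id = max(search_index, key=lambda r: overlap(query, r[0]))
    return doc_store[best_id]

index_document(
    "doc-1",
    "Aspirin reduces fever by inhibiting prostaglandin synthesis.",
    ["How does aspirin reduce fever?", "What enzymes does aspirin affect?"],
)
print(retrieve("How does aspirin reduce fever?"))
# → the full doc-1 text, reached via its hypothetical question
```

The query matches the generated question far better than the original chunk wording, yet the reader-facing result is still the original document.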
# Index both original chunks AND their hypothetical questions
# When a user asks a question, it matches the hypothetical question →
# retrieves the original chunk → sends to LLM
# This dramatically improves recall across diverse phrasings of queries
Frequently Asked Questions
Which chunking strategy should I start with?
Start with RecursiveCharacterTextSplitter at a 512-token chunk size and 50-token overlap (via its from_tiktoken_encoder constructor, since the default splitter counts characters, not tokens) — it's good enough for 80% of use cases and requires zero embedding model calls at ingestion. Add semantic chunking when you see quality issues with topic-mixed chunks. Add parent-document retrieval when users complain that answers lack context. Reserve agentic/proposition chunking for high-stakes applications (medical, legal, financial) where precision is critical enough to justify the 10-100x ingestion cost.
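The 512/50 starting point can be sketched as a sliding token window. Whitespace tokens stand in for real model tokens here; production code would count tokens with a real tokenizer:

```python
def sliding_window_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Token-windowed chunking: each new chunk starts (chunk_size - overlap)
    tokens after the previous one, so neighbors share `overlap` tokens."""
    tokens = text.split()
    step = chunk_size - overlap
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]

doc = " ".join(f"tok{i}" for i in range(1000))  # a fake 1000-token document
chunks = sliding_window_chunks(doc)
print(len(chunks))  # → 3
# The last 50 tokens of each chunk repeat as the first 50 of the next, so a
# sentence cut by one boundary survives intact in the neighboring chunk.
```

Overlap is what makes fixed windows tolerable: a thought severed at one boundary is usually whole in the adjacent chunk.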
What chunk size is optimal?
There's no universal optimal chunk size — it depends on your embedding model and content. GPT-4o can handle 8k context, so retrieved chunks can be long. A practical approach: start at 512 tokens with 50-token overlap, compute eval metrics (answer relevance, context precision), then try 256 and 1024 to see if scores improve. For technical documentation with long code examples, 1024-2048 tokens often works better since code examples shouldn't be split.
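That tuning loop can be sketched as a simple sweep. `evaluate_pipeline` below is a hypothetical stub (returning canned scores so the harness runs); in practice it would rebuild the index at each size and compute your real eval metrics:

```python
def evaluate_pipeline(chunk_size: int, overlap: int) -> float:
    """HYPOTHETICAL stub: replace with a real harness that re-ingests the
    corpus at this chunk size and scores answer relevance / context precision."""
    return {256: 0.71, 512: 0.78, 1024: 0.74}[chunk_size]

candidates = [256, 512, 1024]
scores = {size: evaluate_pipeline(size, overlap=size // 10) for size in candidates}
best = max(scores, key=scores.get)
print(f"best chunk size: {best}")  # with the canned scores above → 512
```

The point is to let measured retrieval quality, not intuition, pick the size for your corpus.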
Conclusion
Your RAG pipeline's quality ceiling is determined by your chunking strategy. Fixed-size chunking is a starting point, not a final solution. Semantic chunking ensures semantically coherent chunks, parent-document retrieval balances search precision with generation context, and proposition-level splitting maximizes retrieval precision for complex documents. Most production RAG systems use a combination: semantic chunking for initial ingestion, with parent-document retrieval to serve full context to the LLM at generation time.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.