
In this blog post, How Text Chunking Works for RAG Pipelines and Search Quality, we unpack what text chunking is, why it matters, and how to do it well in real systems.

Text chunking is the practice of splitting large documents into smaller, coherent pieces so downstream systems—like retrieval-augmented generation (RAG) pipelines, search engines, and summarizers—can index and reason over them efficiently. Do it right and you boost recall, precision, and answer quality. Do it wrong and you get context gaps, higher cost, and hallucinations.

What is text chunking

At a high level, chunking turns long, messy input (PDFs, pages, transcripts) into bite-sized, retrievable units. Each chunk carries content plus metadata (source, position, headings). Retrieval uses these chunks to find relevant context that fits within an LLM’s context window and is semantically focused on the question.

For CloudProinc.com.au customers building RAG or enterprise search, chunking is one of the highest-leverage knobs. It’s simple to start, but nuanced to optimize.

The technology behind text chunking

Tokenization and context windows

Modern LLMs process tokens, not characters. Tokenizers (e.g., byte-pair encoding) split text into subword units. LLMs accept a limited number of tokens (“context window”). Chunking manages content so relevant material fits within that limit with minimal noise.
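
As a quick illustration, here is a token count with tiktoken. The cl100k_base encoding and the sample string are assumptions; use the encoding that matches your target LLM.

```python
# Counting tokens with tiktoken. The cl100k_base encoding is an
# assumption -- pick the encoding that matches your target LLM.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Chunking manages content so relevant material fits the context window."
tokens = enc.encode(text)
print(len(tokens))         # how many tokens the model actually sees
print(enc.decode(tokens))  # decoding round-trips back to the original text
```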

Embeddings and vector similarity

Chunked texts are embedded into numerical vectors. Vector databases (or libraries) use approximate nearest neighbor (ANN) algorithms to quickly retrieve semantically similar chunks. Good chunk boundaries preserve topic coherence, which improves embedding quality and retrieval precision.
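
To make "semantically similar" concrete, here is a small cosine-similarity sketch with NumPy. The toy vectors are placeholders for real embeddings, and a production vector store would use an ANN index rather than this brute-force comparison.

```python
# Cosine similarity between a query vector and candidate chunk vectors.
# The three-dimensional toy vectors stand in for real embeddings.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = np.array([0.1, 0.7, 0.2])
chunk_vecs = [np.array([0.1, 0.6, 0.3]),   # topically close chunk
              np.array([0.9, 0.0, 0.1])]   # unrelated chunk

print([round(cosine_similarity(query_vec, v), 3) for v in chunk_vecs])
# The first chunk scores higher, so it would be retrieved first.
```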

Sentence boundary detection and semantics

Basic chunking splits on raw character or token counts. More advanced methods use sentence boundaries, paragraph markers, or even semantic similarity to avoid splitting ideas mid-thought and to keep related sentences together.

Vector stores and metadata

Chunks are stored with metadata: source URL, section, page, position, timestamps, permissions, and version. This powers filtering, traceability, and security. It also supports re-chunking without re-ingesting the entire corpus.
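
For illustration, one possible shape for a stored chunk record. The field names and values are hypothetical, not a required schema.

```python
# A sketch of a chunk record with metadata; adapt the fields to your store.
chunk_record = {
    "id": "doc-42#chunk-7",
    "text": "Refunds are processed within 14 days of receiving the return...",
    "metadata": {
        "source_url": "https://example.com/policies/refunds",  # hypothetical source
        "section": "Refund policy",
        "page": 3,
        "position": 7,              # chunk index within the document
        "version": "2024-06-01",    # lets you re-chunk and roll back
        "acl": ["support-team"],    # permissions enforced at retrieval time
    },
}
```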

Common chunking strategies

  • Fixed-size token chunks: Split every N tokens with overlap. Simple, fast, reproducible. Risk: can cut sentences or tables awkwardly.
  • Sentence-aware chunks: Pack whole sentences until you hit a token budget. Better coherence, slightly more compute.
  • Semantic splitting: Use embeddings to place boundaries when topic similarity drops. Best coherence but more compute and tuning.
  • Hierarchical chunks: Two levels—small chunks for retrieval, larger sections for re-ranking or context expansion. Balances recall and depth.
  • Domain-aware rules: Preserve code blocks, bullets, tables, or headings. Useful for technical docs and transcripts.

Choosing chunk size and overlap

There is no universal best size, but these guidelines work well:

  • Start with 200–400 tokens per chunk.
  • Use 10–20% overlap to reduce boundary loss.
  • Shorter chunks improve recall and indexing speed; longer chunks improve coherence but risk dilution.
  • Use larger chunks for summarization tasks, smaller for precise Q&A.

Monitor retrieval quality vs. cost. Overlap and larger chunks both increase tokens stored and processed.

Practical implementation steps

  1. Normalize: Clean text, remove boilerplate, preserve semantic markers (headings, lists, code blocks).
  2. Tokenize: Choose a tokenizer comparable to your target LLM.
  3. Chunk: Start with sentence-aware packing under a token budget plus overlap.
  4. Embed: Use a fit-for-purpose embedding model (speed vs. accuracy trade-off).
  5. Store: Save chunks + metadata in a vector store or index.
  6. Retrieve: Top-k semantic search; optionally hybrid with BM25 for lexical recall (a minimal retrieval sketch follows this list).
  7. Augment: Feed retrieved chunks to the LLM with source citations.
  8. Evaluate: Measure retrieval recall and answer correctness; iterate on sizes and overlap.
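
To make steps 5–7 concrete, here is a minimal in-memory retrieval sketch. The all-MiniLM-L6-v2 model, the sample chunks, and the query are assumptions; a real deployment would embed at ingestion time and query a vector store instead of a Python list.

```python
# A minimal top-k retrieval sketch (embed, store, retrieve).
# The embedding model and sample data are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Invoices are issued on the first business day of each month.",
    "Refunds are processed within 14 days of receiving the return.",
    "The API rate limit is 100 requests per minute per key.",
]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)  # "store" step

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the top-k chunks by cosine similarity to the query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q               # cosine similarity (normalized vectors)
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

print(retrieve("How long do refunds take?"))
```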

Code examples

The snippets below illustrate token-based and sentence-aware chunking using tiktoken. They’re intentionally compact and can be adapted to your stack.

Token-based chunking with overlap
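
A minimal sketch of this approach, assuming tiktoken’s cl100k_base encoding; the defaults mirror the 300-token, 15%-overlap guideline from the checklist further down. Swap the encoding to match your target LLM.

```python
# Token-based chunking with overlap using tiktoken.
# The cl100k_base encoding is an assumption.
import tiktoken

def chunk_by_tokens(text: str, max_tokens: int = 300, overlap: int = 45) -> list[str]:
    """Split text into chunks of at most max_tokens tokens, stepping forward
    by max_tokens - overlap so neighbouring chunks share context."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        chunks.append(enc.decode(window))
        if start + max_tokens >= len(tokens):
            break  # the final window already reaches the end of the text
    return chunks
```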

Sentence-aware chunking with semantic hinting
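
A sketch of the sentence-aware variant: sentences are packed under a token budget, and a new chunk starts early when topic similarity drops below sim_threshold. The regex sentence splitter and the all-MiniLM-L6-v2 embedding model are assumptions, and overlap here is counted in sentences rather than tokens.

```python
# Sentence-aware chunking with a semantic hint: break on token budget
# or when adjacent sentences are semantically dissimilar.
# The sentence splitter and embedding model are assumptions.
import re
import numpy as np
import tiktoken
from sentence_transformers import SentenceTransformer

enc = tiktoken.get_encoding("cl100k_base")
model = SentenceTransformer("all-MiniLM-L6-v2")

def chunk_by_sentences(text: str, max_tokens: int = 300,
                       overlap: int = 1, sim_threshold: float = 0.5) -> list[str]:
    """Pack whole sentences under a token budget; start a new chunk when the
    budget is exceeded or the next sentence drifts off-topic. `overlap` is the
    number of sentences carried into the next chunk."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    embs = model.encode(sentences, normalize_embeddings=True)

    chunks, current, current_tokens = [], [], 0
    for i, sent in enumerate(sentences):
        n_tokens = len(enc.encode(sent))
        # Cosine similarity to the previous sentence (vectors are normalized).
        sim = float(embs[i] @ embs[i - 1]) if i > 0 else 1.0
        if current and (current_tokens + n_tokens > max_tokens or sim < sim_threshold):
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []   # sentence-level overlap
            current_tokens = sum(len(enc.encode(s)) for s in current)
        current.append(sent)
        current_tokens += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks
```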

Notes:

  • The token-based method is fast and deterministic—great as a baseline.
  • The sentence-aware variant keeps ideas intact and uses cosine similarity to avoid merging unrelated topics.
  • Tune max_tokens, overlap, and sim_threshold for your corpus.

Evaluating and tuning chunking

Measure before and after changes. A simple framework:

  • Retrieval Recall@k: For each question with a known gold answer, does the correct chunk appear in the top-k?
  • Precision/MRR/nDCG: Rank-sensitive metrics that reflect how high the right chunks appear.
  • Answer quality: Human or LLM-graded correctness with citations.
  • Operational metrics: Index size, embedding time, query latency, token usage.

Iterate: adjust chunk sizes and overlap; try sentence-aware vs. fixed; add hybrid retrieval (semantic + BM25). Keep a hold-out set to avoid overfitting.
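
As a starting point, here is a minimal Recall@k sketch. The question-to-chunk-id mappings are hypothetical and stand in for however you log retrieval results and gold labels.

```python
# Recall@k: for each question, did any gold chunk id appear in the top-k?
def recall_at_k(results: dict[str, list[str]],
                gold: dict[str, set[str]], k: int = 5) -> float:
    """results: question -> ranked list of retrieved chunk ids
    gold: question -> set of chunk ids known to contain the answer"""
    hits = sum(
        1 for q, retrieved in results.items()
        if gold.get(q) and set(retrieved[:k]) & gold[q]
    )
    return hits / len(results) if results else 0.0

# Example with hypothetical ids:
results = {"How long do refunds take?": ["doc-42#chunk-7", "doc-42#chunk-2"]}
gold = {"How long do refunds take?": {"doc-42#chunk-7"}}
print(recall_at_k(results, gold, k=5))  # 1.0
```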

Operational tips

  • Metadata first-class: Store source, section, page/time offsets, and version.
  • Re-chunking: When you change strategy, version chunks so you can roll back.
  • Hybrid indexes: Combine vector and keyword search for best recall.
  • Caching and batching: Batch embeddings, cache frequent queries, and pre-compute reranked results for hot content.
  • Security and tenancy: Keep ACLs with chunks; enforce at retrieval time.
  • Cost control: Smaller chunks and reasonable overlap manage storage and token costs.

Pitfalls to avoid

  • Over-chunking: Tiny chunks hurt coherence and increase reassembly overhead.
  • No overlap: Boundaries can drop crucial context like definitions or variables.
  • Ignoring structure: Breaking tables, code blocks, or bullet lists harms semantics.
  • Mismatched tokenizer: Token counts differ by model; use one aligned to your target LLM.
  • No evaluation loop: Always test with real queries and gold answers.

A quick checklist

  • Start with 300-token chunks, 15% overlap.
  • Prefer sentence-aware packing; adopt semantic boundaries if needed.
  • Store rich metadata and version your chunker.
  • Measure retrieval and answer metrics; iterate.
  • Keep cost and latency visible in your dashboards.

Wrap-up

Text chunking turns sprawling documents into high-signal building blocks for RAG and search. With sensible sizes, light overlap, and sentence-aware boundaries, you’ll see better retrieval and fewer hallucinations—without blowing out cost.

If you’re modernizing search or building a RAG pipeline at CloudProinc.com.au scale, treat chunking as a product feature, not a preprocessing footnote. Design it, measure it, and iterate—your users will feel the difference.

