In this blog post, Mastering Document Definition in LangChain for Reliable RAG, we will explore what a Document means in LangChain, why it matters, and how to structure, chunk, and store it for robust retrieval-augmented generation (RAG).
At a high level, LangChain uses a simple but powerful idea: treat every piece of content as a Document with two parts—text and metadata. This small abstraction drives everything from loading files, splitting text into chunks, embedding into vector stores, to filtering and ranking results at query time. Get it right, and your RAG system becomes accurate, explainable, and cost-efficient. Get it wrong, and you’ll fight noisy answers, governance gaps, and rising GPU bills.
This post focuses on the Document definition in LangChain and the technology behind it—how the schema flows through loaders, text splitters, vector stores, and retrievers—and gives you practical steps and code to implement a clean, scalable approach.
What Document means in LangChain
In LangChain, a Document is the fundamental data unit passed between components. It has:
- page_content: the string text the LLM should reason over
- metadata: a JSON-serializable dict describing the content (source, page, tenant, tags, etc.)
Newer LangChain versions expose this as langchain_core.documents.Document. Older code may import from langchain.schema.
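If your codebase needs to run against both, a small import shim (a sketch, assuming only these two historical paths) keeps things working:

try:
    from langchain_core.documents import Document
except ImportError:  # older LangChain releases
    from langchain.schema import Document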
Why it matters
- Retrieval quality: Good metadata enables precise filtering and re-ranking.
- Governance: Trace source, version, and access controls per tenant or user.
- Cost control: Right-sized chunks reduce embedding and context costs.
- Debuggability: When answers go wrong, documents with rich metadata make root-cause analysis easy.
The core schema
from langchain_core.documents import Document

doc = Document(
    page_content="Acme Corp quarterly report Q2 2025...",
    metadata={
        "source": "s3://docs/acme/q2-2025.pdf",
        "source_type": "pdf",
        "page": 12,
        "tenant": "acme",
        "version": "2025-07-15",
    },
)
Notes:
- Metadata should be JSON-serializable (strings, numbers, booleans, lists/dicts).
- LangChain does not enforce a document ID. If you need stable IDs, put them in metadata (e.g., doc_id) and/or pass ids when adding to vector stores.
Create documents from common sources
Loaders live in langchain_community and return a list of Documents.
from langchain_community.document_loaders import PyPDFLoader, WebBaseLoader, TextLoader
# One Document per page for PDFs
pdf_docs = PyPDFLoader("q2-2025.pdf").load()
# Web pages
web_docs = WebBaseLoader(["https://example.com/blog"]).load()
# Plain text files
txt_docs = TextLoader("handbook.txt").load()
Each loader sets sensible defaults in metadata (e.g., source, page). You can standardize or enrich that metadata for your system.
Metadata that scales
Treat metadata as a contract for your retrieval layer and governance needs. Practical keys:
- source: URI or path to the canonical file
- source_type: pdf, html, md, txt, email, etc.
- tenant: for multi-tenant isolation
- page, section, heading: navigational anchors
- version or doc_version: content versioning
- labels/tags: topic, department, confidentiality
- ingested_at: ISO timestamp as a string
- doc_id: your stable content identifier (hash, UUID)
Keep metadata compact—some retrievers include metadata in prompts. Large metadata inflates token usage and costs.
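One lightweight way to pin the contract down in code is a TypedDict. This is a sketch: the field names mirror the list above, and total=False keeps every key optional.

from typing import TypedDict

class DocMeta(TypedDict, total=False):
    # Mirrors the metadata keys listed above; all values JSON-serializable.
    source: str         # URI or path to the canonical file
    source_type: str    # "pdf", "html", "md", "txt", "email", ...
    tenant: str         # multi-tenant isolation key
    page: int           # keep one type per field (int, never "12")
    section: str
    heading: str
    version: str
    tags: list[str]
    ingested_at: str    # ISO timestamp as a string
    doc_id: str         # stable content identifier (hash or UUID)

The helper below then normalizes loader output against this contract: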
import hashlib, datetime as dt
from langchain_core.documents import Document

def normalize_documents(docs, tenant: str, source: str):
    norm = []
    for d in docs:
        meta = {**d.metadata}
        meta.setdefault("tenant", tenant)
        meta.setdefault("source", source)
        meta.setdefault("source_type", source.split(".")[-1].lower())
        # Timezone-aware timestamp (datetime.utcnow() is deprecated)
        meta.setdefault("ingested_at", dt.datetime.now(dt.timezone.utc).isoformat())
        # Stable ID: hash of content (first 12 chars) with tenant prefix
        content_hash = hashlib.sha256(d.page_content.encode("utf-8")).hexdigest()[:12]
        meta.setdefault("doc_id", f"{tenant}-{content_hash}")
        norm.append(Document(page_content=d.page_content, metadata=meta))
    return norm
Chunking that helps retrieval
Most vector stores perform best with chunked text. Chunks should be large enough to preserve context but small enough for precise retrieval. As a rule of thumb: 500–1,000 tokens for token-based splitters, or 600–1,200 characters for character-based splitters, with 50–150 units of overlap.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=120,
    separators=["\n\n", "\n", " ", ""],
)

norm = normalize_documents(pdf_docs, tenant="acme", source="q2-2025.pdf")
chunked_docs = splitter.split_documents(norm)

# Tag chunk index to assist tracing and stable ids
for idx, d in enumerate(chunked_docs):
    d.metadata["chunk"] = idx
Prefer semantic boundaries (paragraphs, headings) where possible. For token-accurate splits, consider TokenTextSplitter with a tokenizer like cl100k_base when using OpenAI models.
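If tiktoken is installed, a token-based splitter might look like this (a sketch; the chunk sizes are illustrative, not a recommendation):

from langchain_text_splitters import TokenTextSplitter

# Counts length in cl100k_base tokens rather than characters
token_splitter = TokenTextSplitter(
    encoding_name="cl100k_base",
    chunk_size=512,
    chunk_overlap=64,
)
token_chunks = token_splitter.split_documents(norm)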
Persist and query with a vector store
Once chunked, embed and store Documents. Chroma is a popular local option; production systems often use managed stores (e.g., Elastic, Pinecone, Weaviate, Qdrant).
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
store = Chroma(collection_name="acme-knowledge", embedding_function=embeddings)

ids = [f"{d.metadata['doc_id']}-{d.metadata.get('chunk', 0)}" for d in chunked_docs]
store.add_documents(documents=chunked_docs, ids=ids)

# Query with metadata filters. Recent Chroma versions require an explicit
# $and to combine multiple conditions in one filter.
query = "What did Acme report about revenue in Q2 2025?"
results = store.similarity_search(
    query,
    k=4,
    filter={"$and": [{"tenant": "acme"}, {"source_type": "pdf"}]},
)
for r in results:
    print(r.page_content[:120], r.metadata)
Filtering is driven by your metadata schema. This is where consistent keys and value types pay off.
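The same filter carries over when you wrap the store as a retriever for use in a chain. A minimal sketch (search_kwargs are forwarded to the underlying similarity search):

retriever = store.as_retriever(
    search_kwargs={
        "k": 4,
        "filter": {"$and": [{"tenant": "acme"}, {"source_type": "pdf"}]},
    }
)
docs = retriever.invoke("What did Acme report about revenue in Q2 2025?")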
Avoid common pitfalls
- Metadata bloat: Giant metadata dicts increase prompt size; keep only what you’ll use.
- Inconsistent types: Don’t mix strings and numbers for the same field (e.g., page). It breaks filters.
- Missing lineage: Always include source and version to make answers explainable and auditable.
- Over/under chunking: Very small chunks lose context; huge chunks hurt retrieval precision and cost.
- Unstable IDs: If you deduplicate or update content, use a stable doc_id strategy to avoid duplicates.
- Leaky multi-tenancy: Always stamp tenant and filter by it on both write and read paths.
- PDF quirks: PDF loaders often return a Document per page. Keep page metadata and combine only when needed.
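Several of these pitfalls can be caught at ingest time with a small validation pass. A sketch, assuming the expected types below match your schema:

# Hypothetical type contract; adapt the keys and types to your schema
EXPECTED_TYPES = {
    "source": str,
    "source_type": str,
    "tenant": str,
    "page": int,
    "doc_id": str,
}

def validate_metadata(docs):
    for d in docs:
        # Required lineage and tenancy keys (missing lineage, leaky multi-tenancy)
        for required in ("source", "tenant", "doc_id"):
            if required not in d.metadata:
                raise ValueError(f"missing required metadata key: {required}")
        # One type per field (inconsistent types break filters)
        for key, expected in EXPECTED_TYPES.items():
            if key in d.metadata and not isinstance(d.metadata[key], expected):
                raise TypeError(
                    f"{key} should be {expected.__name__}, "
                    f"got {type(d.metadata[key]).__name__}"
                )
    return docs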
A quick checklist
- Define a standard metadata schema (source, tenant, version, page/section, tags).
- Enforce JSON-serializable metadata values.
- Create stable doc_id values and chunk indices.
- Split text into 500–1,000-token chunks with overlap.
- Use filters at query time to isolate tenant/source/type.
- Log and monitor which Documents power answers for traceability.
Final thoughts
LangChain’s Document abstraction is deceptively simple, but it shapes the reliability, security, and cost of your entire RAG stack. By standardizing metadata, right-sizing chunks, and enforcing stable IDs, you give your retriever and LLM the best shot at accurate, auditable answers. Start with a clear schema, automate normalization, and let your Documents do the heavy lifting.