In this blog post, Mastering Document Definition in LangChain for Reliable RAG, we will explore what a Document means in LangChain, why it matters, and how to structure, chunk, and store it for robust retrieval-augmented generation (RAG).
At a high level, LangChain uses a simple but powerful idea: treat every piece of content as a Document with two parts—text and metadata. This small abstraction drives everything from loading files, splitting text into chunks, embedding into vector stores, to filtering and ranking results at query time. Get it right, and your RAG system becomes accurate, explainable, and cost-efficient. Get it wrong, and you’ll fight noisy answers, governance gaps, and rising GPU bills.
This post focuses on the Document definition in LangChain and the technology behind it—how the schema flows through loaders, text splitters, vector stores, and retrievers—and gives you practical steps and code to implement a clean, scalable approach.
What Document means in LangChain
In LangChain, a Document is the fundamental data unit passed between components. It has:
- page_content: the string text the LLM should reason over
- metadata: a JSON-serializable dict describing the content (source, page, tenant, tags, etc.)
Newer LangChain versions expose this as langchain_core.documents.Document. Older code may import from langchain.schema.
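If your codebase needs to run against both, a small import shim (a sketch, assuming only these two historical paths) keeps things working:

try:
    from langchain_core.documents import Document
except ImportError:  # older LangChain releases
    from langchain.schema import Document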
Why it matters
- Retrieval quality: Good metadata enables precise filtering and re-ranking.
- Governance: Trace source, version, and access controls per tenant or user.
- Cost control: Right-sized chunks reduce embedding and context costs.
- Debuggability: When answers go wrong, documents with rich metadata make root-cause analysis easy.
The core schema
from langchain_core.documents import Document

doc = Document(
    page_content="Acme Corp quarterly report Q2 2025...",
    metadata={
        "source": "s3://docs/acme/q2-2025.pdf",
        "source_type": "pdf",
        "page": 12,
        "tenant": "acme",
        "version": "2025-07-15",
    },
)
Notes:
- Metadata should be JSON-serializable (strings, numbers, booleans, lists/dicts).
- LangChain does not enforce a document ID. If you need stable IDs, put them in metadata (e.g., doc_id) and/or pass ids when adding to vector stores.
Create documents from common sources
Loaders live in langchain_community and return a list of Documents.
from langchain_community.document_loaders import PyPDFLoader, WebBaseLoader, TextLoader
# One Document per page for PDFs
pdf_docs = PyPDFLoader("q2-2025.pdf").load()
# Web pages
web_docs = WebBaseLoader(["https://example.com/blog"]).load()
# Plain text files
txt_docs = TextLoader("handbook.txt").load()
Each loader sets sensible defaults in metadata (e.g., source, page). You can standardize or enrich that metadata for your system.
Metadata that scales
Treat metadata as a contract for your retrieval layer and governance needs. Practical keys:
- source: URI or path to the canonical file
- source_type: pdf, html, md, txt, email, etc.
- tenant: for multi-tenant isolation
- page, section, heading: navigational anchors
- version or doc_version: content versioning
- labels/tags: topic, department, confidentiality
- ingested_at: ISO timestamp as a string
- doc_id: your stable content identifier (hash, UUID)
Keep metadata compact—some retrievers include metadata in prompts. Large metadata inflates token usage and costs.
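One lightweight way to pin the contract down in code is a TypedDict. This is a sketch: the field names mirror the list above, and total=False keeps every key optional.

from typing import TypedDict

class DocMeta(TypedDict, total=False):
    # Mirrors the metadata keys listed above; all values JSON-serializable.
    source: str         # URI or path to the canonical file
    source_type: str    # "pdf", "html", "md", "txt", "email", ...
    tenant: str         # multi-tenant isolation key
    page: int           # keep one type per field (int, never "12")
    section: str
    heading: str
    version: str
    tags: list[str]
    ingested_at: str    # ISO timestamp as a string
    doc_id: str         # stable content identifier (hash or UUID)

The helper below then normalizes loader output against this contract: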
import hashlib, datetime as dt
from langchain_core.documents import Document

def normalize_documents(docs, tenant: str, source: str):
    norm = []
    for d in docs:
        meta = {**d.metadata}
        meta.setdefault("tenant", tenant)
        meta.setdefault("source", source)
        meta.setdefault("source_type", source.split(".")[-1].lower())
        # Timezone-aware timestamp (datetime.utcnow() is deprecated)
        meta.setdefault("ingested_at", dt.datetime.now(dt.timezone.utc).isoformat())
        # Stable ID: hash of content (first 12 chars) with tenant prefix
        content_hash = hashlib.sha256(d.page_content.encode("utf-8")).hexdigest()[:12]
        meta.setdefault("doc_id", f"{tenant}-{content_hash}")
        norm.append(Document(page_content=d.page_content, metadata=meta))
    return norm
Chunking that helps retrieval
Most vector stores perform best with chunked text. Chunks should be large enough to preserve context but small enough for precise retrieval. As a rule of thumb: 500–1,000 tokens for token-based splitters, or 600–1,200 characters for character-based splitters, with 50–150 units of overlap.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=120,
    separators=["\n\n", "\n", " ", ""],
)

norm = normalize_documents(pdf_docs, tenant="acme", source="q2-2025.pdf")
chunked_docs = splitter.split_documents(norm)

# Tag chunk index to assist tracing and stable ids
for idx, d in enumerate(chunked_docs):
    d.metadata["chunk"] = idx
Prefer semantic boundaries (paragraphs, headings) where possible. For token-accurate splits, consider TokenTextSplitter with a tokenizer like cl100k_base when using OpenAI models.
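If tiktoken is installed, a token-based splitter might look like this (a sketch; the chunk sizes are illustrative, not a recommendation):

from langchain_text_splitters import TokenTextSplitter

# Counts length in cl100k_base tokens rather than characters
token_splitter = TokenTextSplitter(
    encoding_name="cl100k_base",
    chunk_size=512,
    chunk_overlap=64,
)
token_chunks = token_splitter.split_documents(norm)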
Persist and query with a vector store
Once chunked, embed and store Documents. Chroma is a popular local option; production systems often use managed stores (e.g., Elastic, Pinecone, Weaviate, Qdrant).
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
store = Chroma(collection_name="acme-knowledge", embedding_function=embeddings)

ids = [f"{d.metadata['doc_id']}-{d.metadata.get('chunk', 0)}" for d in chunked_docs]
store.add_documents(documents=chunked_docs, ids=ids)

# Query with metadata filters. Recent Chroma versions require an explicit
# $and to combine multiple conditions in one filter.
query = "What did Acme report about revenue in Q2 2025?"
results = store.similarity_search(
    query,
    k=4,
    filter={"$and": [{"tenant": "acme"}, {"source_type": "pdf"}]},
)
for r in results:
    print(r.page_content[:120], r.metadata)
Filtering is driven by your metadata schema. This is where consistent keys and value types pay off.
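The same filter carries over when you wrap the store as a retriever for use in a chain. A minimal sketch (search_kwargs are forwarded to the underlying similarity search):

retriever = store.as_retriever(
    search_kwargs={
        "k": 4,
        "filter": {"$and": [{"tenant": "acme"}, {"source_type": "pdf"}]},
    }
)
docs = retriever.invoke("What did Acme report about revenue in Q2 2025?")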
Avoid common pitfalls
- Metadata bloat: Giant metadata dicts increase prompt size; keep only what you’ll use.
- Inconsistent types: Don’t mix strings and numbers for the same field (e.g., page). It breaks filters.
- Missing lineage: Always include source and version to make answers explainable and auditable.
- Over/under chunking: Very small chunks lose context; huge chunks hurt retrieval precision and cost.
- Unstable IDs: If you deduplicate or update content, use a stable doc_id strategy to avoid duplicates.
- Leaky multi-tenancy: Always stamp tenant and filter by it on both write and read paths.
- PDF quirks: PDF loaders often return a Document per page. Keep page metadata and combine only when needed.
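Several of these pitfalls can be caught at ingest time with a small validation pass. A sketch, assuming the expected types below match your schema:

# Hypothetical type contract; adapt the keys and types to your schema
EXPECTED_TYPES = {
    "source": str,
    "source_type": str,
    "tenant": str,
    "page": int,
    "doc_id": str,
}

def validate_metadata(docs):
    for d in docs:
        # Required lineage and tenancy keys (missing lineage, leaky multi-tenancy)
        for required in ("source", "tenant", "doc_id"):
            if required not in d.metadata:
                raise ValueError(f"missing required metadata key: {required}")
        # One type per field (inconsistent types break filters)
        for key, expected in EXPECTED_TYPES.items():
            if key in d.metadata and not isinstance(d.metadata[key], expected):
                raise TypeError(
                    f"{key} should be {expected.__name__}, "
                    f"got {type(d.metadata[key]).__name__}"
                )
    return docs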
A quick checklist
- Define a standard metadata schema (source, tenant, version, page/section, tags).
- Enforce JSON-serializable metadata values.
- Create stable doc_id values and chunk indices.
- Split text into 500–1,000-token chunks with overlap.
- Use filters at query time to isolate tenant/source/type.
- Log and monitor which Documents power answers for traceability.
Final thoughts
LangChain’s Document abstraction is deceptively simple, but it shapes the reliability, security, and cost of your entire RAG stack. By standardizing metadata, right-sizing chunks, and enforcing stable IDs, you give your retriever and LLM the best shot at accurate, auditable answers. Start with a clear schema, automate normalization, and let your Documents do the heavy lifting.