
In this blog post, Understanding OpenAI Embedding Models and Practical Ways to Use Them, we will unpack what OpenAI’s embedding models are, how they work under the hood, and how to put them to work in search, retrieval-augmented generation (RAG), clustering, and analytics.

At a high level, an embedding is a numerical representation of text (or other data) that places similar things near each other in a high-dimensional space.

If two pieces of text mean similar things, their vectors will be close by—so you can search, rank, or cluster by measuring geometric distance instead of doing brittle keyword matches. OpenAI’s embedding models generate those vectors from raw text using state-of-the-art transformer architectures, making semantic operations fast, flexible, and language-aware.

What OpenAI’s embedding models are

OpenAI provides encoder-style transformer models that map text to dense vectors. Common choices include:

  • text-embedding-3-small: a cost-efficient, 1536-dimensional embedding for most production search/RAG workloads.
  • text-embedding-3-large: a higher-accuracy, 3072-dimensional embedding for precision-sensitive ranking, deduplication, or analytics.

Both models return fixed-length float arrays. You can compare vectors with cosine similarity, dot product, or Euclidean distance—cosine is the most common for text.

How the technology works (without the jargon overload)

Under the hood, embedding models are transformer encoders. Here’s the gist:

  • Tokenization breaks text into subword tokens.
  • A deep transformer network processes those tokens to capture context and meaning.
  • A final projection layer produces a single vector per input (often after pooling token states).
  • Training objectives nudge semantically similar texts closer together and dissimilar texts farther apart. This often combines next-token prediction pretraining with contrastive or similarity-focused fine-tuning.

The result is a vector space where distances reflect semantic similarity. Because the model encodes context, embeddings can match paraphrases and synonyms—even when keywords differ.

Why embeddings are useful

  • Semantic search: Rank results by meaning rather than exact words.
  • RAG for LLMs: Retrieve relevant passages to ground model outputs in your data.
  • Clustering and topic discovery: Group documents by meaning to explore large corpora.
  • Deduplication and near-duplicate detection: Spot overlap in content at scale.
  • Recommendation and matching: Connect users with similar items, profiles, or questions.

Key concepts you should know

  • Vector dimensionality: Larger vectors (e.g., 3072 dims) can capture more nuance but cost more to compute and store.
  • Similarity metric: Cosine similarity is standard. Normalize vectors to unit length for consistent comparisons.
  • Chunking: Break long documents into chunks (often 200–400 tokens) with small overlaps so each chunk conveys a coherent idea; a chunking sketch follows this list.
  • ANN indexing: Use approximate nearest neighbor indexes (e.g., HNSW, IVF) in a vector database to keep query latency low.
  • Versioning: Store the model name alongside each embedding. Re-embed if you upgrade models to maintain consistency.
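
As a concrete sketch of the chunking idea above, here is one way to split text into overlapping token windows. The tiktoken encoding name and the default sizes are assumptions to tune per corpus.

    import tiktoken

    def chunk_text(text: str, max_tokens: int = 300, overlap: int = 40) -> list[str]:
        """Split text into overlapping windows of roughly max_tokens tokens."""
        enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding for recent OpenAI models
        tokens = enc.encode(text)
        chunks = []
        step = max_tokens - overlap
        for start in range(0, len(tokens), step):
            window = tokens[start:start + max_tokens]
            chunks.append(enc.decode(window))
            if start + max_tokens >= len(tokens):
                break
        return chunks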

Practical steps to build with embeddings

  1. Define the task: search, RAG, deduplication, clustering, or recommendations.
  2. Pick a model: start with text-embedding-3-small; move to -large for harder ranking problems.
  3. Prepare text: clean, normalize whitespace, strip boilerplate, and chunk long docs.
  4. Generate embeddings: batch requests for throughput; retry on transient errors (a batching sketch follows this list).
  5. Store vectors: use a vector DB (Pinecone, Qdrant, Weaviate), pgvector on Postgres, or FAISS for local search.
  6. Query and rank: embed the query, search the index, and re-rank if needed.
  7. Evaluate: measure relevance and latency; tune chunk sizes, filters, and model choice.
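
To make step 4 concrete, here is a minimal batching sketch using the official openai Python SDK; the batch size, retry count, and backoff policy are assumptions to adapt to your rate limits.

    import time
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def embed_batch(texts: list[str], model: str = "text-embedding-3-small",
                    batch_size: int = 64, max_retries: int = 3) -> list[list[float]]:
        """Embed texts in batches, retrying transient failures with exponential backoff."""
        vectors: list[list[float]] = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            for attempt in range(max_retries):
                try:
                    resp = client.embeddings.create(model=model, input=batch)
                    vectors.extend(item.embedding for item in resp.data)
                    break
                except Exception:
                    if attempt == max_retries - 1:
                        raise
                    time.sleep(2 ** attempt)  # simple backoff before retrying
        return vectors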

Getting started with the OpenAI API

Python example
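
A minimal request with the official openai Python SDK might look like this; the input strings are illustrative.

    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=["How do I reset my password?", "Steps to recover account access"],
    )

    for item in response.data:
        print(len(item.embedding))  # 1536 floats per input for -small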

JavaScript example
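
The official openai Node.js SDK follows the same pattern: create an embeddings request with a model name and an input string or array, then read each vector from the response's data array.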

Cosine similarity helper (Python)
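
One way to write it, assuming plain Python lists of floats as returned by the API:

    import math

    def cosine_similarity(a: list[float], b: list[float]) -> float:
        """Cosine similarity between two equal-length vectors."""
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)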

Storing vectors and searching

You can use many backends. For Postgres with pgvector:
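
A minimal sketch driven from Python with psycopg2; the connection string, table layout, and column names are illustrative, and the vector dimension matches text-embedding-3-small.

    import psycopg2

    # `content` and `embedding` would come from the embedding call shown earlier;
    # tiny stand-ins are used here so the sketch runs end to end.
    content = "Example chunk of text"
    embedding = [0.1] * 1536

    conn = psycopg2.connect("dbname=docs user=postgres")  # illustrative connection string
    cur = conn.cursor()
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS doc_chunks (
            id bigserial PRIMARY KEY,
            content text NOT NULL,
            embedding vector(1536),   -- dimension of text-embedding-3-small
            model text NOT NULL       -- store the model name alongside the vector
        );
    """)
    cur.execute(
        "INSERT INTO doc_chunks (content, embedding, model) VALUES (%s, %s::vector, %s);",
        (content, "[" + ",".join(map(str, embedding)) + "]", "text-embedding-3-small"),
    )
    conn.commit()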

When querying, embed the user query, then search with your vector index using the cosine distance operator your pgvector version supports. Always store the model name with each row so you know which embeddings you’re searching over.
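
Continuing the illustrative schema above, a cosine-distance query could look like the sketch below; <=> is pgvector's cosine distance operator, and the query vector is a stand-in for a real query embedding.

    import psycopg2

    conn = psycopg2.connect("dbname=docs user=postgres")  # same illustrative connection string
    cur = conn.cursor()

    query_embedding = [0.1] * 1536  # stand-in for embedding the user query via the API
    query_vec = "[" + ",".join(map(str, query_embedding)) + "]"

    # <=> returns cosine distance, so 1 - distance gives cosine similarity.
    cur.execute(
        """
        SELECT content, 1 - (embedding <=> %s::vector) AS cosine_similarity
        FROM doc_chunks
        WHERE model = %s
        ORDER BY embedding <=> %s::vector
        LIMIT 5;
        """,
        (query_vec, "text-embedding-3-small", query_vec),
    )
    for content, score in cur.fetchall():
        print(f"{score:.3f}  {content[:60]}")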

Embedding best practices

  • Normalize inputs: lowercase where appropriate, remove markup that doesn’t carry meaning, keep numbers/IDs if they’re important for retrieval.
  • Chunk smartly: aim for 200–400 tokens; include brief overlap (10–20%) so context isn’t cut mid-thought.
  • Batch requests: send 16–256 texts per API call to reduce overhead, respecting rate limits.
  • Normalize vectors: many libraries expect unit-length vectors for cosine similarity.
  • Hybrid search: combine BM25/keyword with embeddings for the best of precision and recall.
  • Cache and deduplicate: hash content to avoid re-embedding unchanged text (see the sketch after this list).
  • Track metadata: source, timestamp, language, and model name; it’s invaluable for audits and reprocessing.
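
As a sketch of the caching idea above, here is a minimal in-memory cache keyed by a hash of the model name and content; swap in your own persistence layer for production use.

    import hashlib
    from openai import OpenAI

    client = OpenAI()
    _cache: dict[str, list[float]] = {}  # in-memory stand-in for a persistent cache

    def embed_with_cache(text: str, model: str = "text-embedding-3-small") -> list[float]:
        """Embed text, skipping the API call if this (model, content) pair was seen before."""
        key = hashlib.sha256(f"{model}:{text}".encode("utf-8")).hexdigest()
        if key not in _cache:
            resp = client.embeddings.create(model=model, input=text)
            _cache[key] = resp.data[0].embedding
        return _cache[key]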

RAG in a nutshell

Retrieval-augmented generation uses embeddings to fetch relevant context, then feeds that context to an LLM to answer questions grounded in your data.

  1. Embed and index your documents.
  2. Embed the user query.
  3. Vector search to get top-k chunks.
  4. Compose a prompt with the retrieved chunks.
  5. Call your chosen LLM to generate the answer.

Quality tips: use domain-specific chunking, filter by metadata (e.g., product, region), and consider re-ranking the top results before prompting the LLM.
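
Putting the five steps together, a minimal sketch could look like this; search_index stands in for whatever retrieval function your vector store provides, and the chat model name is illustrative.

    from openai import OpenAI

    client = OpenAI()

    def answer(question: str, search_index, top_k: int = 5) -> str:
        """Minimal RAG loop: embed the query, retrieve chunks, prompt the LLM."""
        # Embed the user query.
        q_vec = client.embeddings.create(
            model="text-embedding-3-small", input=question
        ).data[0].embedding

        # Vector search for the top-k chunks (search_index is your own retrieval function).
        chunks = search_index(q_vec, top_k)

        # Compose a prompt with the retrieved context.
        context = "\n\n".join(chunks)
        prompt = (
            "Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )

        # Call the LLM to generate a grounded answer.
        completion = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model choice
            messages=[{"role": "user", "content": prompt}],
        )
        return completion.choices[0].message.content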

Evaluating quality

  • Create a small labeled set of queries and expected documents.
  • Measure recall@k and MRR for semantic search (a small sketch follows this list).
  • For RAG, score final answers for groundedness and factual accuracy.
  • Try both text-embedding-3-small and -large; measure the trade-off in accuracy vs. cost/latency.
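
A small sketch of those two metrics, assuming each query has a single relevant document id and a ranked list of retrieved ids:

    def recall_at_k(relevant_id: str, retrieved_ids: list[str], k: int = 5) -> float:
        """1.0 if the relevant document appears in the top k results, else 0.0."""
        return 1.0 if relevant_id in retrieved_ids[:k] else 0.0

    def reciprocal_rank(relevant_id: str, retrieved_ids: list[str]) -> float:
        """1 / rank of the first relevant result, or 0.0 if it was not retrieved."""
        for rank, doc_id in enumerate(retrieved_ids, start=1):
            if doc_id == relevant_id:
                return 1.0 / rank
        return 0.0

    # Average these over your labeled query set to report recall@k and MRR.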

Performance and cost considerations

  • Latency: pre-embed your corpus offline; only the query embedding is real-time.
  • Storage: 1536-dim vectors consume less space than 3072-dim; consider product quantization or scalar quantization if your DB supports it.
  • Throughput: prefer batch embedding; parallelize across workers where safe.
  • Costs: embeddings are billed per input token. Shorter chunks and deduplication reduce spend—check the current pricing page before large-scale runs.
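
Because billing is per input token, a quick token count with tiktoken helps size a run up front; cl100k_base is assumed here as the encoding for these models, so verify it for the model you use.

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding for the embedding models

    corpus = ["First document chunk...", "Second document chunk..."]  # your prepared chunks
    total_tokens = sum(len(enc.encode(text)) for text in corpus)
    print(f"{total_tokens} input tokens to embed")
    # Multiply by the current per-token price from the pricing page to estimate spend.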

Security and data handling

  • Minimize sensitive data in embeddings; avoid unnecessary PII.
  • Store raw text and vectors securely with appropriate access controls.
  • Review your provider’s data use and retention policies. With OpenAI’s API, customer data sent via the API is not used to train OpenAI models by default; verify current terms for your account.

Common pitfalls

  • Mixing models: don’t compare vectors across different embedding models or dimensions.
  • Ignoring normalization: cosine math assumes unit-length vectors.
  • Overly large chunks: long, unfocused chunks hurt retrieval precision.
  • One-size-fits-all thresholds: tune similarity cutoffs per domain.
  • Skipping evaluation: always test with real queries and iterate.

When to choose small vs. large

  • Use text-embedding-3-small for most apps: general search, RAG, support bots, analytics at scale.
  • Use text-embedding-3-large when mis-rankings are costly: critical search, legal/medical domains, high-stakes deduplication, or when you need the last bit of recall.

Wrapping up

OpenAI’s embedding models turn text into vectors that capture meaning, enabling semantic search, RAG, clustering, and more. Start small: pick a model, chunk your data, index with a vector database, and measure results. With a few best practices—normalization, hybrid search, and careful evaluation—you’ll get reliable, scalable semantic capabilities into production quickly.

