Deep Dives · 12 min read · 25 May 2026

Tokens, Embeddings, and Vector Search: The Building Blocks of Modern AI

Behind every chatbot, semantic search, and RAG system are three concepts: tokens, embeddings, and vector databases. Master them and you understand AI at the infrastructure level.

In GPT-4's tokeniser, the word "unhappiness" tokenises as ["un", "happiness"] → [403, 88751], and "Tokenization" as ["Token", "ization"] → [3561, 1634]. "Hello, world!" becomes ["Hello", ",", " world", "!"] → [9906, 11, 1917, 0]. Note that the leading space is folded into the token " world" rather than emitted as a separate token, while the punctuation marks "," and "!" each get a token of their own in this example.
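To make the "space belongs to the token" behaviour concrete, here is a minimal sketch. It assumes a hypothetical seven-entry vocabulary with made-up IDs and a greedy longest-match rule. That is a simplification: real tokenisers such as tiktoken apply learned byte-pair merges over a vocabulary with on the order of 100,000 entries or more, so treat this as an illustration of the idea, not of the actual algorithm.

```python
# Toy vocabulary: the pieces and IDs are hypothetical, NOT real GPT-4 token IDs.
TOY_VOCAB = {"Hello": 0, ",": 1, " world": 2, "!": 3, "un": 4, "happiness": 5, " ": 6}


def toy_tokenise(text, vocab=TOY_VOCAB):
    """Split text into vocabulary pieces using greedy longest-match."""
    pieces = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, so " world" wins over a bare " ".
        for end in range(len(text), pos, -1):
            if text[pos:end] in vocab:
                pieces.append(text[pos:end])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {pos}")
    return pieces


def toy_encode(text, vocab=TOY_VOCAB):
    """Map each piece to its (hypothetical) integer ID."""
    return [vocab[piece] for piece in toy_tokenise(text, vocab)]


print(toy_tokenise("Hello, world!"))  # ['Hello', ',', ' world', '!']
print(toy_encode("Hello, world!"))    # [0, 1, 2, 3]
print(toy_tokenise("unhappiness"))    # ['un', 'happiness']
```

The space never shows up as its own piece because the longer entry " world" matches first. Joining the pieces back together reproduces the original string exactly.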

python

# pip install tiktoken
import tiktoken

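# gpt-4o uses the o200k_base encoding; GPT-4 uses cl100k_base, so token IDs
# and counts differ between the two models.
enc = tiktoken.encoding_for_model("gpt-4o")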

examples = [
    "The quick brown fox jumps over the lazy dog.",
    "안녕하세요, 반갑습니다.",  # Korean: Hello, nice to meet you
    "def calculate_revenue(price, quantity): return price * quantity",
    '{"name": "Alice", "role": "engineer", "department": "product"}',
]

for text in examples:
    tokens = enc.encode(text)
    token_count = len(tokens)
    word_count = len(text.split())
    print(f"Text: {text[:50]}...")
    print(f"  Tokens: {token_count}, Words: {word_count}, Ratio: {token_count/word_count:.2f} tokens/word")
    print()

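# Illustrative ratios ("words" are whitespace-separated chunks; exact values
# depend on the encoding and on the text itself):
# English sentence: ~1.2 tokens/word
# Korean text: ~3.5 tokens/word
# Python code: ~1.8 tokens/word
# Dense JSON: ~2.4 tokens/word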

The classic demonstration: embed "king", "man", "woman", and "queen". The vector arithmetic king − man + woman lands close to queen in embedding space. The effect was popularised with word2vec-style word embeddings and holds only approximately, but it suggests the model has learned gender and royalty as roughly independent directions that combine linearly. This is not programmed; it emerges from training on text where these relationships are statistically consistent.
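The arithmetic can be sketched with hand-made vectors. The 2-D "embeddings" below are hypothetical and chosen so that one axis loosely means royalty and the other gender. Real embeddings have hundreds or thousands of dimensions with no labelled axes, and the analogy holds only approximately there.

```python
import math

# Hypothetical 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
# These values are invented for illustration, not taken from any model.
toy_vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.05],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to a - b + c.

    The three input words are excluded from the candidates, which is the
    usual convention when evaluating analogies.
    """
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy("king", "man", "woman", toy_vectors))  # queen
```

Subtracting "man" removes the gender component, adding "woman" puts the opposite one back, and the royalty component is untouched, which is what "independent directions that combine linearly" means in practice.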

For semantic search with text embeddings, cosine similarity is the standard choice, and OpenAI's embedding documentation recommends it for their models. If you normalise all vectors to unit length (L2 normalisation), cosine similarity and dot product give identical scores, so the cheaper dot product can be used instead. OpenAI's embeddings are already returned normalised to unit length, and some pipelines normalise vectors at ingestion time for the same reason.
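The equivalence is easy to check numerically. The sketch below uses small made-up vectors (not real embeddings) and plain Python. It shows that the dot product of unit-length vectors equals the cosine similarity of the originals, and that a raw dot product on unnormalised vectors can rank a longer but less aligned vector first.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def l2_normalise(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]


def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))


# Made-up vectors: docs[0] points the same way as the query,
# docs[1] is much longer but slightly off-direction.
query = [3.0, 4.0]
docs = [[6.0, 8.0], [40.0, 30.0], [-3.0, 4.0]]

q_unit = l2_normalise(query)
docs_unit = [l2_normalise(d) for d in docs]

for d, d_unit in zip(docs, docs_unit):
    print(
        f"cosine={cosine(query, d):.2f}  "
        f"dot(normalised)={dot(q_unit, d_unit):.2f}  "
        f"dot(raw)={dot(query, d):.0f}"
    )

best_by_cosine = max(range(len(docs)), key=lambda i: cosine(query, docs[i]))
best_by_raw_dot = max(range(len(docs)), key=lambda i: dot(query, docs[i]))
print(best_by_cosine, best_by_raw_dot)  # 0 1
```

The first two columns agree on every row, while the raw dot product prefers the long vector at index 1. That magnitude sensitivity is the reason to normalise before switching to dot product.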

python

# pip install openai numpy
import openai
import numpy as np

client = openai.OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Embed a list of texts, return as numpy array."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return np.array([d.embedding for d in response.data])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Compute cosine similarity between query vector a and matrix b."""
    a_norm = a / np.linalg.norm(a)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return b_norm @ a_norm

# ── Document corpus ────────────────────────────────────────────────────────────
documents = [
    "How to set up automatic billing with Stripe Checkout",
    "Getting started with React and TypeScript",
    "Understanding credit card payment processing fees",
    "Introduction to PostgreSQL indexing strategies",
    "Webhook verification and security best practices",
    "Building real-time features with WebSockets",
    "How to handle payment failures and retries",
    "Database migration strategies for production systems",
]

print("Embedding documents...")
doc_embeddings = embed(documents)
print(f"Document matrix shape: {doc_embeddings.shape}")  # (8, 1536)

# ── Semantic search ────────────────────────────────────────────────────────────
query = "What happens when a payment fails?"
print(f"\nQuery: '{query}'")

query_embedding = embed([query])[0]
similarities = cosine_similarity(query_embedding, doc_embeddings)

# Get top 3 most similar
top_indices = np.argsort(similarities)[::-1][:3]

print("\nTop 3 results:")
for i, idx in enumerate(top_indices):
    print(f"  {i+1}. [{similarities[idx]:.3f}] {documents[idx]}")

# Illustrative output (exact scores will vary):
# 1. [0.847] How to handle payment failures and retries
# 2. [0.723] How to set up automatic billing with Stripe Checkout
# 3. [0.681] Understanding credit card payment processing fees
#
# Notice: the query says "fails" while the top document says "failures and
# retries". Apart from the shared word "payment", the match comes from
# meaning rather than exact keyword overlap.

python

# pip install supabase openai
import openai
from supabase import create_client

openai_client = openai.OpenAI()
supabase = create_client("YOUR_SUPABASE_URL", "YOUR_SERVICE_ROLE_KEY")

# ── SQL to run in Supabase first: ─────────────────────────────────────────────
# CREATE EXTENSION IF NOT EXISTS vector;
# CREATE TABLE documents (
#   id BIGSERIAL PRIMARY KEY,
#   content TEXT,
#   embedding VECTOR(1536)
# );
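# -- Build the index after loading data: IVFFlat picks its centroids from existing rows.
# CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
#   WITH (lists = 100);  -- pgvector guidance: rows / 1000 up to 1M rows, sqrt(rows) beyond that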

def embed(text: str) -> list[float]:
    resp = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return resp.data[0].embedding

def index_document(content: str):
    """Embed and store a document."""
    embedding = embed(content)
    supabase.table("documents").insert({
        "content": content,
        "embedding": embedding
    }).execute()

def search(query: str, top_k: int = 5) -> list[dict]:
    """Find the most semantically similar documents to a query."""
    query_embedding = embed(query)
    # pgvector cosine distance: <=> operator (1 - cosine_similarity)
    result = supabase.rpc("match_documents", {
        "query_embedding": query_embedding,
        "match_count": top_k
    }).execute()
    return result.data

# Corresponding SQL function in Supabase:
# CREATE OR REPLACE FUNCTION match_documents(
#   query_embedding VECTOR(1536),
#   match_count INT
# ) RETURNS TABLE (id BIGINT, content TEXT, similarity FLOAT) AS $$
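#   -- Columns are table-qualified to avoid clashing with the output column names.
#   SELECT documents.id, documents.content,
#          1 - (documents.embedding <=> query_embedding) AS similarity
#   FROM documents
#   ORDER BY documents.embedding <=> query_embedding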
#   LIMIT match_count;
# $$ LANGUAGE SQL;
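The `ivfflat` index in the SQL above trades a little accuracy for speed by grouping vectors into `lists` and scanning only some of them per query. The sketch below is a toy illustration of that idea in plain Python, not pgvector's implementation. It assumes random 8-dimensional vectors and picks centroids by random sampling, whereas a real index learns them with k-means. The names `build_lists` and `ivf_search` are hypothetical.

```python
import math
import random


def cosine_distance(a, b):
    """1 - cosine similarity, the quantity pgvector's <=> operator returns."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def build_lists(vectors, n_lists, seed=0):
    """Sample n_lists centroids and bucket each vector under its nearest one."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, n_lists)  # simplification: real IVFFlat uses k-means
    lists = [[] for _ in centroids]
    for idx, vec in enumerate(vectors):
        nearest = min(range(n_lists), key=lambda c: cosine_distance(vec, centroids[c]))
        lists[nearest].append(idx)
    return centroids, lists


def ivf_search(query, vectors, centroids, lists, probes, top_k):
    """Scan only the `probes` lists whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: cosine_distance(query, centroids[c]))
    candidates = [i for c in order[:probes] for i in lists[c]]
    return sorted(candidates, key=lambda i: cosine_distance(query, vectors[i]))[:top_k]


# Random stand-ins for document embeddings.
rng = random.Random(42)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
query = [rng.gauss(0, 1) for _ in range(8)]

centroids, lists = build_lists(vectors, n_lists=10)

exact = ivf_search(query, vectors, centroids, lists, probes=10, top_k=5)  # scans every list
approx = ivf_search(query, vectors, centroids, lists, probes=2, top_k=5)  # scans 2 of 10
overlap = len(set(exact) & set(approx))
print(f"exact top 5: {exact}")
print(f"approx top 5: {approx} ({overlap}/5 shared with exact)")
```

Probing every list reproduces a brute-force scan, while probing fewer lists does less work and may miss some true neighbours. In pgvector the equivalent query-time knob is the `ivfflat.probes` setting: raising it improves recall at the cost of speed.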
