How Generative AI Works — Tokens, Embeddings and the Transformer

March 5, 2026 · Updated March 20, 2026 · 8 min read

Using a generative AI tool is easy. Understanding what it is actually doing is harder — and most explanations either stay so surface-level they are not useful, or dive into mathematics so quickly that they lose most readers.

This post lands in the middle. It explains the mechanics — tokens, embeddings, transformers, attention — clearly enough that you understand why the model behaves the way it does. No equations. Just the concepts.

🔗 Foundation posts

Read What is a Large Language Model (LLM)? before this one — it covers what an LLM is at a high level. This post goes one level deeper into the mechanics. Also connects to AI vs ML vs Deep Learning — generative AI is a deep learning application, and the transformer is the deep learning architecture that powers it.

Step 1 — Text becomes tokens

A generative AI model does not process words. It processes tokens — small chunks of text that might be a word, part of a word, or a punctuation mark. Before anything else happens, your input is broken into tokens.

Text	How it tokenises	Token count
Hello	1 token — a single common word	1
Hello, world!	Hello / , / world / !	4
tokenisation	token / isa / tion — uncommon words split further	3
ChatGPT	Chat / G / PT	3
A typical paragraph	Approximately 1 token per 4 characters of English text	75-100

Why tokens and not words? Because the same tokeniser must handle every language, every technical term and every programming language. Word boundaries are not universal. Tokens are a more robust unit for the model to work with.

💡 Why token count matters for you

Every interaction with a generative AI model is measured in tokens — both the input you send and the output it returns. Models have a context window — a maximum token limit for one interaction. GPT-4o has a 128,000-token context window. Claude supports up to 200,000 tokens. Understanding tokens helps you design prompts that stay within limits and use context efficiently.

Step 2 — Tokens become numbers (embeddings)

The model cannot work with text directly. Each token is converted into an embedding — a vector of hundreds or thousands of numbers that represents the meaning of that token in context.

The critical property of embeddings: tokens with similar meanings produce vectors that are mathematically close to each other. ‘King’ and ‘Queen’ are close. ‘King’ and ‘bicycle’ are far apart. This mathematical proximity encodes semantic meaning.

The embedding is where language becomes mathematics. It is the bridge that lets a model work with meaning numerically.

Step 3 — The Transformer processes everything

The Transformer is the architecture that underpins almost every modern generative AI model — GPT, Claude, Gemini, Llama, SAP Joule. It was introduced in a 2017 paper titled ‘Attention Is All You Need’ by researchers at Google.

Before Transformers, language models processed text sequentially — left to right, one token at a time. Transformers changed this fundamentally: they process all tokens simultaneously, with every token able to attend to every other token in the sequence.

Self-attention — the key mechanism

Self-attention is how the Transformer understands context. For each token in the sequence, it computes how much attention to pay to every other token when building the representation of that token.

Consider: ‘The bank can guarantee deposits will cover future tuition costs.’ The word ‘bank’ could mean a financial institution or a river bank. Self-attention looks at ‘deposits’ and ‘tuition costs’ — other tokens in the sentence — and determines the financial meaning is correct. This happens across all tokens simultaneously.

Simplified self-attention for the word ‘bank’:

Token Attention score

‘The’ 0.02 (not very relevant)

‘guarantee’ 0.08 (somewhat relevant)

‘deposits’ 0.71 (highly relevant — financial context clue)

‘tuition’ 0.19 (relevant — supports financial meaning)

Weighted by these scores, ‘bank’ gets a representation that

reflects its financial meaning — not its river bank meaning.

Multiple attention heads

Transformers use multiple attention heads — typically 12 to 96 in modern large models. Each head learns to attend to different types of relationships simultaneously. One head might track grammatical agreement. Another might track which pronoun refers to which noun. Another might track topic consistency.

The outputs of all heads are combined into a rich representation of each token. This is why Transformers handle long-range context — a reference at the start of a long document connecting to a detail at the end — far better than earlier architectures.

Layers — building understanding level by level

A Transformer has many layers — GPT-3 has 96 layers. Each layer refines the representation of each token. Early layers capture word-level and syntactic patterns. Later layers capture semantic and conceptual relationships. The final layer’s representation is what the model uses to predict the next token.

Step 4 — The model generates one token at a time

After the Transformer processes the input, the model produces a probability distribution over every possible next token — the entire vocabulary, which can be 50,000 or more tokens.

It selects one token based on that distribution. That token is added to the context, and the entire process runs again to select the next token. This continues until the response is complete.

What controls token selection	How it works	Effect
Greedy decoding	Always pick the highest-probability token	Deterministic but often repetitive — same prompt gives same answer
Temperature	A scaling factor on the probability distribution. Low = more concentrated, high = more spread out.	Low (0.1-0.3): predictable, conservative. High (0.8-1.0): creative, more varied.
Top-p sampling	Only consider tokens that together make up p% of the probability mass	Balances variety with coherence — most production systems use this
Top-k	Only consider the top k most likely tokens	Simple constraint on the candidate pool

💡 Why the same prompt gives different answers

Temperature and sampling mean the model does not always pick the most likely token. Randomness is intentional — it produces more varied and natural responses. Set temperature to 0 if you need consistent, deterministic output. Increase it for more creative variety.
The full guide to temperature, top-p and top-k — what each does and what to set for your task — is in Temperature and Top-p — Controlling LLM Output.

Why this matters for how you use AI

Understanding the generation mechanism explains behaviours that otherwise seem random:

Hallucinations happen because the model selects the statistically plausible next token — not the verified true one. There is no lookup step.
Responses differ between sessions because sampling introduces randomness — the same prompt at temperature 0.7 will not always produce identical output.
Models get confused in very long conversations because earlier parts of the conversation eventually fall outside the context window.
Asking the model to think step by step (chain-of-thought prompting) works because it forces the intermediate reasoning tokens into the context, improving the quality of the tokens that follow.
SAP Joule, Microsoft Copilot and every other AI assistant you use runs on a transformer — the same architecture. The differences between products are in training data, fine-tuning, RAG grounding and guardrails — not in the fundamental generation mechanism.

At a glance — the mechanics

Concept	One-line summary
Token	The unit the model works with — a word fragment, punctuation or short word
Context window	The maximum tokens the model can process at once — input plus output combined
Embedding	A numerical vector representing a token’s meaning — similar meanings produce close vectors
Transformer	The architecture that processes all tokens simultaneously using self-attention
Self-attention	For each token, deciding how much to focus on every other token to build its representation
Attention head	One self-attention computation — multiple heads run in parallel, each learning different relationships
Layers	The depth of the Transformer — early layers learn syntax, later layers learn semantics
Temperature	Controls randomness in token selection — 0 for deterministic, higher for creative
Token-by-token generation	The model produces one token at a time, adding each to context before selecting the next

What to take away

Generative AI is not a search engine and it is not a database. It is a model that has learned statistical patterns in language at enormous scale and uses those patterns to predict the most plausible next token — one at a time.

The Transformer architecture, self-attention and embeddings are what make this work at the quality level we see today. Knowing these three concepts does not make you a researcher. It does make you a more informed user, a better prompt writer and a more credible evaluator of AI systems — whether you are working with SAP Joule, Microsoft Copilot or any other transformer-based tool.

🔗 Related posts on this site

What is a Large Language Model (LLM)? — the LLM post explains what the model is; this post explains how it generates. Together they give the complete picture.
AI Hallucinations — Why They Happen — hallucination is a direct consequence of token-by-token prediction without verification — now you understand why.
AI vs ML vs Deep Learning — the transformer is a deep learning architecture — that post explains the broader landscape.
Fine-Tuning vs Prompt Engineering vs RAG — knowing how generation works helps you choose the right customisation approach.
How LLMs Are Trained — Pretraining, Fine-Tuning and RLHF — understanding how generation works is only half the picture. This post explains how the model got there in the first place.
Temperature and Top-p — Controlling LLM Output — a full guide to the sampling parameters that decide how the model picks each token: when to use temperature 0 for precision and when to go higher for creativity.

Published on rakeshnarayan.com — Articles

URL: https://rakeshnarayan.com/articles/how-generative-ai-works-tokens-embeddings-and-the-transformer/

How Generative AI WorksTransformer ArchitectureTokens ExplainedEmbeddings AIAttention MechanismHow LLMs WorkTokenisationSelf-AttentionGenerative AI MechanicsNeural Networks TextTemperature AIGPT ArchitectureHow ChatGPT WorksAI Fundamentals

How Generative AI Works — Tokens, Embeddings and the Transformer

Step 1 — Text becomes tokens

Step 2 — Tokens become numbers (embeddings)

Step 3 — The Transformer processes everything

Self-attention — the key mechanism

Multiple attention heads

Layers — building understanding level by level

Step 4 — The model generates one token at a time

Why this matters for how you use AI

At a glance — the mechanics

What to take away

0 Comments

Leave a Comment

Step 1 — Text becomes tokens

Step 2 — Tokens become numbers (embeddings)

Step 3 — The Transformer processes everything

Self-attention — the key mechanism

Multiple attention heads

Layers — building understanding level by level

Step 4 — The model generates one token at a time

Why this matters for how you use AI

At a glance — the mechanics

What to take away

0 Comments

Leave a Comment

Related Articles

Open Source vs Closed Source AI Models — The Real Trade-offs

How LLMs Are Trained — Pretraining, Fine-Tuning and RLHF