How Generative AI Works — Tokens, Embeddings and the Transformer
Using a generative AI tool is easy. Understanding what it is actually doing is harder — and most explanations either stay so surface-level they are not useful, or dive into mathematics so quickly that they lose most readers.
This post lands in the middle. It explains the mechanics — tokens, embeddings, transformers, attention — clearly enough that you understand why the model behaves the way it does. No equations. Just the concepts.
🔗 Foundation posts
Read What is a Large Language Model (LLM)? before this one — it covers what an LLM is at a high level. This post goes one level deeper into the mechanics. Also connects to AI vs ML vs Deep Learning — generative AI is a deep learning application, and the transformer is the deep learning architecture that powers it.
Step 1 — Text becomes tokens
A generative AI model does not process words. It processes tokens — small chunks of text that might be a word, part of a word, or a punctuation mark. Before anything else happens, your input is broken into tokens.
| Text | How it tokenises | Token count |
|---|---|---|
| Hello | 1 token — a single common word | 1 |
| Hello, world! | Hello / , / world / ! | 4 |
| tokenisation | token / isa / tion — uncommon words split further | 3 |
| ChatGPT | Chat / G / PT | 3 |
| A typical paragraph | Approximately 1 token per 4 characters of English text | 75-100 |
Why tokens and not words? Because the same tokeniser must handle every language, every technical term and every programming language. Word boundaries are not universal. Tokens are a more robust unit for the model to work with.
💡 Why token count matters for you
Every interaction with a generative AI model is measured in tokens — both the input you send and the output it returns. Models have a context window — a maximum token limit for one interaction. GPT-4o has a 128,000-token context window. Claude supports up to 200,000 tokens. Understanding tokens helps you design prompts that stay within limits and use context efficiently.
Step 2 — Tokens become numbers (embeddings)
The model cannot work with text directly. Each token is converted into an embedding — a vector of hundreds or thousands of numbers that represents the meaning of that token in context.
The critical property of embeddings: tokens with similar meanings produce vectors that are mathematically close to each other. ‘King’ and ‘Queen’ are close. ‘King’ and ‘bicycle’ are far apart. This mathematical proximity encodes semantic meaning.
The embedding is where language becomes mathematics. It is the bridge that lets a model work with meaning numerically.
Step 3 — The Transformer processes everything
The Transformer is the architecture that underpins almost every modern generative AI model — GPT, Claude, Gemini, Llama, SAP Joule. It was introduced in a 2017 paper titled ‘Attention Is All You Need’ by researchers at Google.
Before Transformers, language models processed text sequentially — left to right, one token at a time. Transformers changed this fundamentally: they process all tokens simultaneously, with every token able to attend to every other token in the sequence.
Self-attention — the key mechanism
Self-attention is how the Transformer understands context. For each token in the sequence, it computes how much attention to pay to every other token when building the representation of that token.
Consider: ‘The bank can guarantee deposits will cover future tuition costs.’ The word ‘bank’ could mean a financial institution or a river bank. Self-attention looks at ‘deposits’ and ‘tuition costs’ — other tokens in the sentence — and determines the financial meaning is correct. This happens across all tokens simultaneously.
Simplified self-attention for the word ‘bank’:
Token Attention score
‘The’ 0.02 (not very relevant)
‘guarantee’ 0.08 (somewhat relevant)
‘deposits’ 0.71 (highly relevant — financial context clue)
‘tuition’ 0.19 (relevant — supports financial meaning)
Weighted by these scores, ‘bank’ gets a representation that
reflects its financial meaning — not its river bank meaning.
Multiple attention heads
Transformers use multiple attention heads — typically 12 to 96 in modern large models. Each head learns to attend to different types of relationships simultaneously. One head might track grammatical agreement. Another might track which pronoun refers to which noun. Another might track topic consistency.
The outputs of all heads are combined into a rich representation of each token. This is why Transformers handle long-range context — a reference at the start of a long document connecting to a detail at the end — far better than earlier architectures.
Layers — building understanding level by level
A Transformer has many layers — GPT-3 has 96 layers. Each layer refines the representation of each token. Early layers capture word-level and syntactic patterns. Later layers capture semantic and conceptual relationships. The final layer’s representation is what the model uses to predict the next token.
Step 4 — The model generates one token at a time
After the Transformer processes the input, the model produces a probability distribution over every possible next token — the entire vocabulary, which can be 50,000 or more tokens.
It selects one token based on that distribution. That token is added to the context, and the entire process runs again to select the next token. This continues until the response is complete.
| What controls token selection | How it works | Effect |
|---|---|---|
| Greedy decoding | Always pick the highest-probability token | Deterministic but often repetitive — same prompt gives same answer |
| Temperature | A scaling factor on the probability distribution. Low = more concentrated, high = more spread out. | Low (0.1-0.3): predictable, conservative. High (0.8-1.0): creative, more varied. |
| Top-p sampling | Only consider tokens that together make up p% of the probability mass | Balances variety with coherence — most production systems use this |
| Top-k | Only consider the top k most likely tokens | Simple constraint on the candidate pool |
💡 Why the same prompt gives different answers
Temperature and sampling mean the model does not always pick the most likely token. Randomness is intentional — it produces more varied and natural responses. Set temperature to 0 if you need consistent, deterministic output. Increase it for more creative variety.
Why this matters for how you use AI
Understanding the generation mechanism explains behaviours that otherwise seem random:
- Hallucinations happen because the model selects the statistically plausible next token — not the verified true one. There is no lookup step.
- Responses differ between sessions because sampling introduces randomness — the same prompt at temperature 0.7 will not always produce identical output.
- Models get confused in very long conversations because earlier parts of the conversation eventually fall outside the context window.
- Asking the model to think step by step (chain-of-thought prompting) works because it forces the intermediate reasoning tokens into the context, improving the quality of the tokens that follow.
- SAP Joule, Microsoft Copilot and every other AI assistant you use runs on a transformer — the same architecture. The differences between products are in training data, fine-tuning, RAG grounding and guardrails — not in the fundamental generation mechanism.
At a glance — the mechanics
| Concept | One-line summary |
|---|---|
| Token | The unit the model works with — a word fragment, punctuation or short word |
| Context window | The maximum tokens the model can process at once — input plus output combined |
| Embedding | A numerical vector representing a token’s meaning — similar meanings produce close vectors |
| Transformer | The architecture that processes all tokens simultaneously using self-attention |
| Self-attention | For each token, deciding how much to focus on every other token to build its representation |
| Attention head | One self-attention computation — multiple heads run in parallel, each learning different relationships |
| Layers | The depth of the Transformer — early layers learn syntax, later layers learn semantics |
| Temperature | Controls randomness in token selection — 0 for deterministic, higher for creative |
| Token-by-token generation | The model produces one token at a time, adding each to context before selecting the next |
What to take away
Generative AI is not a search engine and it is not a database. It is a model that has learned statistical patterns in language at enormous scale and uses those patterns to predict the most plausible next token — one at a time.
The Transformer architecture, self-attention and embeddings are what make this work at the quality level we see today. Knowing these three concepts does not make you a researcher. It does make you a more informed user, a better prompt writer and a more credible evaluator of AI systems — whether you are working with SAP Joule, Microsoft Copilot or any other transformer-based tool.
🔗 Related posts on this site
What is a Large Language Model (LLM)? — the LLM post explains what the model is; this post explains how it generates. Together they give the complete picture. AI Hallucinations — Why They Happen — hallucination is a direct consequence of token-by-token prediction without verification — now you understand why. AI vs ML vs Deep Learning — the transformer is a deep learning architecture — that post explains the broader landscape. Fine-Tuning vs Prompt Engineering vs RAG — knowing how generation works helps you choose the right customisation approach.
Published on rakeshnarayan.com — Articles
URL: https://rakeshnarayan.com/articles/how-generative-ai-works-tokens-embeddings-and-the-transformer/


