The basic unit of text that an LLM processes. A token can be a whole word, part of a word, a space, or even a single character. For example, "unbelievable" might be split into ["un", "believ", "able"]. LLMs don't read words — they read tokens. Typical LLMs use 30,000–100,000+ unique tokens. See Lesson 1.
The process of converting raw text into a sequence of tokens. Different models use different tokenization algorithms (e.g., BPE, WordPiece, or the Unigram model; SentencePiece is a widely used library that implements BPE and Unigram). The same text can produce different token counts in different models. See Lesson 1.
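As a rough illustration of how one string can yield different token counts, here is a minimal sketch of a greedy longest-match tokenizer. This is not BPE or WordPiece, only a simplified stand-in; the function name `tokenize` and both toy vocabularies are hypothetical.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization (a simplified stand-in, not BPE)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # Nothing matched: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens


# Two hypothetical vocabularies standing in for two different models.
VOCAB_A = {"un", "believ", "able"}
VOCAB_B = {"unbeliev", "able"}

tokens_a = tokenize("unbelievable", VOCAB_A)  # ['un', 'believ', 'able']
tokens_b = tokenize("unbelievable", VOCAB_B)  # ['unbeliev', 'able']
```

The same word costs three tokens under the first vocabulary and two under the second, which is why token counts have to be measured with the target model's own tokenizer.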
A dense vector of numbers (e.g., 4,096 floats) that represents the "meaning" of a token in a high-dimensional space. Tokens with similar meanings have embeddings that are close together in this space. Embeddings are learned during training. See Lesson 1.
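One common way to measure "closeness" between embeddings is cosine similarity. The sketch below assumes made-up 3-dimensional vectors for readability; real embeddings are learned and have far more dimensions, and the values here are purely illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical, hand-picked embeddings (not taken from any real model).
EMBEDDINGS = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["dog"])
sim_cat_car = cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["car"])
# With these toy values, "cat" is much closer to "dog" than to "car".
```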
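The complete set of tokens that a model knows. Each token is assigned a unique integer ID (0, 1, 2, …). The vocabulary is a fixed lookup table, set when the tokenizer is built: the model cannot process a token that is not in it, so an unfamiliar word is either split into smaller tokens that are in the vocabulary or mapped to a special "unknown" token. See Lesson 1.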
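The lookup-table idea can be sketched with two dictionaries. The four-entry vocabulary and the `[UNK]` fallback token are assumptions made for illustration; many tokenizers avoid an unknown token entirely by falling back to smaller pieces or bytes.

```python
# A hypothetical four-entry vocabulary; real ones hold tens of thousands.
VOCAB = ["[UNK]", "un", "believ", "able"]

token_to_id = {token: idx for idx, token in enumerate(VOCAB)}
id_to_token = {idx: token for token, idx in token_to_id.items()}


def encode(tokens):
    """Map tokens to integer IDs, using [UNK] for anything not in the table."""
    unk_id = token_to_id["[UNK]"]
    return [token_to_id.get(token, unk_id) for token in tokens]


def decode(ids):
    """Map integer IDs back to their tokens."""
    return [id_to_token[i] for i in ids]


ids = encode(["un", "believ", "able"])   # [1, 2, 3]
unknown_ids = encode(["xyz"])            # [0], i.e. the [UNK] entry
```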
The maximum number of tokens an LLM can process in a single request, counting input and output together. E.g., Claude 3.5 Sonnet has a 200K token context window. Text beyond this limit is invisible to the model. This is one of the most important constraints for application developers.
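Because input and output share the same budget, applications usually have to trim the prompt before sending it. The sketch below shows one possible policy (keep the most recent tokens); the function name `fit_to_context` and its parameters are hypothetical, and other policies such as summarizing older text are equally valid.

```python
def fit_to_context(prompt_tokens, context_window, max_output_tokens):
    """Trim a prompt so that prompt plus reserved output fits the window.

    Assumed policy: drop the oldest tokens and keep the most recent ones.
    """
    budget = context_window - max_output_tokens
    if budget <= 0:
        raise ValueError("max_output_tokens leaves no room for the prompt")
    return prompt_tokens[-budget:]


# Toy numbers: a window of 8 tokens with 3 reserved for the reply.
trimmed = fit_to_context(list(range(10)), context_window=8, max_output_tokens=3)
# trimmed keeps the last 5 tokens: [5, 6, 7, 8, 9]
```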
The neural network architecture underlying nearly all modern LLMs. Introduced in the 2017 paper "Attention Is All You Need." Its key innovation is the self-attention mechanism, which lets the model weigh the importance of different parts of the input when producing each output token.
The core mechanism of Transformers. For each token in a sequence, it computes how much that token should "pay attention to" every other token. It works by projecting each token's embedding into three vectors, Query (Q), Key (K), and Value (V), then computing attention scores via dot products of Q and K (scaled by the square root of the key dimension), normalizing those scores with softmax, and producing an output as a weighted sum of all V vectors. This allows every token to incorporate information from the entire context. See Lesson 2.
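The computation can be sketched in plain Python for a single attention head. The scores are divided by the square root of the key dimension, as in the original Transformer paper. The helper names, the tiny 2-dimensional inputs, and the identity projection matrices are all assumptions chosen to keep the example readable; real models learn these matrices.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(xs):
    """Turn a list of scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention."""
    q = matmul(x, w_q)  # queries, one row per token
    k = matmul(x, w_k)  # keys
    v = matmul(x, w_v)  # values
    d_k = len(k[0])
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)  # every query dotted with every key
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Three toy token embeddings of dimension 2 (hypothetical values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projections
output, weights = self_attention(x, identity, identity, identity)
```

Each row of `weights` says how strongly one token attends to every token in the sequence, and each row of `output` is the corresponding weighted mix of value vectors.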
Three vectors derived from each token's embedding in self-attention. Query represents "what am I looking for?", Key represents "what am I?", and Value represents "what information do I provide?". The dot product of Query and Key gives the attention score (relevance), which is then used to weight the Value vectors. See Lesson 2.
A mathematical function that converts a vector of arbitrary numbers into a probability distribution (all values between 0 and 1, summing to 1). In attention, softmax is applied to the raw attention scores to produce attention weights. High scores get amplified, low scores get suppressed. Also used in the output layer to produce token probabilities.
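A minimal sketch of the function, written with the standard library only. Subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result; the input scores are arbitrary example values.

```python
import math


def softmax(scores):
    """Convert arbitrary real scores into probabilities that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
# Roughly [0.66, 0.24, 0.10]: the ordering is preserved,
# and the largest score takes most of the probability mass.
```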
Running multiple self-attention operations in parallel, each with different weight matrices. Each "head" can learn to focus on different types of relationships (syntax, semantics, coreference, etc.). The outputs of all heads are concatenated and projected. The largest GPT-3 model (175B parameters) uses 96 attention heads per layer. See Lesson 2.
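The split-and-concatenate structure can be sketched as follows. This is a deliberate simplification: each head here just takes a slice of the embedding as its Q, K, and V, whereas a real model gives every head its own learned projection matrices and applies a final learned output projection, which is omitted. All names and values are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, k, v):
    """Scaled dot-product attention for one head (lists of vectors)."""
    d = len(q[0])
    out = []
    for q_i in q:
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d)
                  for k_j in k]
        w = softmax(scores)
        out.append([sum(w_j * v_j[c] for w_j, v_j in zip(w, v))
                    for c in range(len(v[0]))])
    return out


def multi_head(x, num_heads):
    """Run one attention per head on its slice, then concatenate."""
    d_model = len(x[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        # Simplification: slicing stands in for learned per-head projections.
        part = [row[h * d_head:(h + 1) * d_head] for row in x]
        head_outputs.append(attend(part, part, part))
    # Concatenate the heads' outputs for each token position.
    return [[value for head in head_outputs for value in head[t]]
            for t in range(len(x))]


x = [[1.0, 0.0, 0.5, 0.5],
     [0.0, 1.0, 0.5, -0.5],
     [1.0, 1.0, 0.0, 0.0]]
out = multi_head(x, num_heads=2)  # 3 tokens, 4 dims: two heads of size 2
```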
A mechanism in autoregressive models (like GPT) that prevents tokens from attending to future positions. Implemented by setting attention scores for future tokens to negative infinity before softmax. This ensures the model can only use information from current and previous tokens when generating the next token, maintaining the left-to-right generation property. See Lesson 2.
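The masking step can be sketched directly: scores above the diagonal (future positions) are replaced with negative infinity, so softmax assigns them a weight of exactly zero. The score matrix below is a made-up example, and the function names are hypothetical.

```python
import math

NEG_INF = float("-inf")


def causal_mask(scores):
    """Replace scores for future positions (column > row) with -inf."""
    n = len(scores)
    return [[scores[i][j] if j <= i else NEG_INF for j in range(n)]
            for i in range(n)]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Raw attention scores for a 3-token sequence (arbitrary values).
scores = [[1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0]]

weights = [softmax(row) for row in causal_mask(scores)]
# Row 0 can only attend to token 0, so it becomes [1.0, 0.0, 0.0].
# Row 2 (the last token) still sees all three positions.
```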