# Lesson 2: Self-Attention — Core mechanism taught

Covered Self-Attention: the Q/K/V (query/key/value) mechanism, dot-product attention scores, softmax normalization, the weighted sum of values, multi-head attention, and causal masking. Used the "library search" analogy for Q/K/V. Connected causal masking to autoregressive generation, and the O(n²) cost in sequence length to why long context is expensive.
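
For reinforcement, the steps covered in this lesson can be sketched in plain Python. This is a minimal single-head illustration, not the lesson's own code: the helper names (`matmul`, `softmax`, `causal_self_attention`), the identity projection matrices, and the toy inputs are all assumptions made for the example. It also divides scores by √d_k, the common "scaled" variant, which these notes do not mention explicitly. Multi-head attention is left out; it would run several such heads with different projections and concatenate their outputs.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    """Turn a list of scores into non-negative weights that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention with a causal mask (illustrative sketch)."""
    # Project each token into a query, a key, and a value.
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    n, d_k = len(X), len(Q[0])
    weights = []
    for i in range(n):  # n queries times n keys: the O(n^2) part
        scores = [
            # Dot product of query i with key j, scaled by sqrt(d_k) (assumption).
            sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_k)
            if j <= i
            else -math.inf  # causal mask: position i cannot see the future
            for j in range(n)
        ]
        weights.append(softmax(scores))
    # Each output row is a weighted sum of the value vectors.
    return matmul(weights, V), weights

# Toy input: 3 tokens with 2 features each; identity projections (hypothetical).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I = [[1.0, 0.0], [0.0, 1.0]]
out, weights = causal_self_attention(X, I, I, I)
for row in weights:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, so its weight row is `[1.0, 0.0, 0.0]`; every later row spreads its weight over the current and earlier positions only. The nested loop over queries and keys is where the O(n²) cost comes from.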

**Implications**: Self-Attention is the foundation for everything that follows (output layer, generation, positional encoding). The user should be comfortable with the Q/K/V intuition before moving on. The causal masking discussion leads naturally into the next lesson on the output layer and the generation process. Key concern: the math (dot products, softmax) was introduced gently, but the user may need reinforcement; watch for confusion in future lessons.
