What Is Multi-Head Attention?
Multi-head attention runs several attention operations, called heads, in parallel. Each head has its own learned query, key, and value projections, so it can learn to focus on a different aspect of the input. The outputs of all heads are concatenated and passed through a final linear projection to form a richer representation. This lets transformers capture multiple types of relationships between tokens simultaneously.
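To make the idea concrete, here is a minimal NumPy sketch of multi-head self-attention over a single sequence. It is an illustration under stated assumptions, not a reference implementation: the function name `multi_head_attention`, the weight names `w_q`, `w_k`, `w_v`, `w_o`, the random weights, and the toy sizes are all hypothetical, and details such as masking, biases, batching, and dropout are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Self-attention over one sequence x of shape (seq_len, d_model).

    Hypothetical helper for illustration: no masking, biases, or batching.
    """
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Project the input, then give each head its own slice of the projection.
    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # (heads, seq, d_head)

    # Concatenate the heads and mix them with the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

# Toy example with assumed sizes and random (untrained) weights.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)      # (4, 8)
print(weights.shape)  # (2, 4, 4)
```

Each head produces its own `seq_len × seq_len` attention pattern in `weights`, which is where different heads can attend to different token relationships. The output keeps the input shape, so the layer can be stacked.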