I was asking ChatGPT a similar question and it gave me a very nice answer, so I thought I would share it here in case it helps anyone.
My exact question was: "Why does attention use keys and queries? Could it not use the key representation of each element and compare it to the rest of keys instead of learning a different query representation?"
The core idea of attention is to determine how much one element in a sequence should pay attention to others. This is done by computing similarity between:
Query (what we’re looking for)
Key (what each element offers)
By comparing query and key, attention determines how relevant each key is to the query — this yields the attention score.
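To make the score computation concrete, here is a minimal sketch in plain Python. It assumes the usual scaled dot-product form (dot product divided by the square root of the key dimension, then a softmax over keys). The function names and the toy vectors are made up for illustration and are not from any particular library.

```python
import math

def softmax(row):
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    # One row of weights per query, one column per key.
    d = len(keys[0])
    scores = [
        [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        for q in queries
    ]
    return [softmax(row) for row in scores]

# Toy data (hypothetical): 2 queries, 3 keys, dimension 2.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights = attention_weights(queries, keys)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row sums to 1 and tells you how strongly that query attends to each key. The first query points along the first axis, so it gives more weight to the first and third keys than to the second.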
You're asking:
“Why can’t we just compare each key to the other keys?”
In other words:
“Why not just use a single representation for both roles?”
Asymmetry is essential:
Attention isn't just measuring similarity between tokens — it's measuring how relevant other tokens are to a specific position (the query).
This directionality matters: token A may attend to token B differently than token B attends to A.
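If you use only keys (i.e. compare each token's key to every other token's key), the raw score for the pair (A, B) always equals the score for (B, A). The score matrix is forced to be symmetric, which severely reduces expressive power.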
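The symmetry point is easy to check numerically. The sketch below uses small made-up matrices (the names `x`, `w_q`, `w_k` and all values are hypothetical) and compares the raw, pre-softmax scores from a keys-only scheme against the usual query-key scheme.

```python
def matmul(a, b):
    # Plain-Python matrix product: rows of a against columns of b.
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(r) for r in zip(*m)]

def is_symmetric(m):
    n = len(m)
    return all(m[i][j] == m[j][i] for i in range(n) for j in range(n))

# Three tokens with 2-dimensional embeddings (toy values).
x = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
# Hypothetical projection matrices for queries and keys.
w_q = [[1.0, 0.0], [2.0, 1.0]]
w_k = [[1.0, 1.0], [0.0, 1.0]]

q = matmul(x, w_q)
k = matmul(x, w_k)

keys_only = matmul(k, transpose(k))   # score[i][j] = k_i . k_j
query_key = matmul(q, transpose(k))   # score[i][j] = q_i . k_j

print(is_symmetric(keys_only))   # True
print(is_symmetric(query_key))   # False
print(query_key[0][1], query_key[1][0])   # 2.0 5.0
```

With keys only, token 0 scores token 1 exactly as token 1 scores token 0. With a separate query projection, token 0 scores token 1 at 2.0 while token 1 scores token 0 at 5.0, which is the directionality described above.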
Queries allow flexible, learned focus:
The query vector lets each position decide for itself what to pay attention to — not just based on raw similarity, but based on a learned transformation of the input.
For example, in machine translation, a word's query may learn to look for subject-verb agreement, semantic similarity, or positional relationships. A separate query projection gives the model room to express these different kinds of "looking for".
Architectural decoupling:
By having separate projection weights for queries and keys (and values), the model has more capacity and flexibility.
It can learn richer transformations — e.g., keys can capture context, while queries capture intent or focus.
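Written out as a formula, the decoupling looks like this. The notation is an assumption, not something from the answer above: X is the matrix of token embeddings (one row per token), and W_Q, W_K are the learned query and key projection matrices.

```latex
\[
\begin{aligned}
S_{QK} &= (X W_Q)(X W_K)^\top = X \,\bigl(W_Q W_K^\top\bigr)\, X^\top \\
S_{KK} &= (X W_K)(X W_K)^\top = X \,\bigl(W_K W_K^\top\bigr)\, X^\top
\end{aligned}
\]
```

In the keys-only case the middle matrix W_K W_K^T is always symmetric and positive semidefinite, so S_KK equals its own transpose. With separate projections the middle matrix W_Q W_K^T has no such constraint, so the model can learn scores where position i attends to j strongly while j attends to i weakly.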
Imagine you're in a library:
Query: your mental image of the book you're looking for (e.g., "I want books on quantum computing").
Keys: the labels on the bookshelves (e.g., "Physics", "Computer Science", "Math").
If you compare your query to each key, you can find the most relevant shelf.
But if you only compared shelf labels to each other, you'd just know how similar the categories are — you wouldn’t know which one is relevant to your need.