79637687

Date: 2025-05-25 13:10:18
Score: 2

I asked ChatGPT a similar question and it gave me a very nice answer, so I thought I would share it here in case it helps anyone.

My exact question was: "Why does attention use keys and queries? Could it not use the key representation of each element and compare it to the rest of keys instead of learning a different query representation?"

Answer:

The core idea of attention is to determine how much one element in a sequence should pay attention to others. This is done by computing similarity between:

Query (what we’re looking for)
Key (what each element offers)

By comparing query and key, attention determines how relevant each key is to the query — this yields the attention score.
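As a minimal sketch of that comparison, assuming the standard scaled dot-product form (the division by the square root of the dimension and the softmax are not spelled out above, and the toy vectors and the name `attention_weights` are made up for illustration):

```python
import numpy as np

def attention_weights(query, keys):
    """Score one query against every key, then normalize with softmax."""
    d = query.shape[-1]
    scores = keys @ query / np.sqrt(d)   # one scaled dot product per key
    scores = scores - scores.max()       # subtract the max for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()               # weights are positive and sum to 1

# Toy data: three keys in a 2-dimensional space (made-up values).
keys = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 1.0]])
query = np.array([2.0, 0.0])             # points along the first axis

weights = attention_weights(query, keys)
print(weights)
```

The first and third keys both have a component along the query's direction, so they receive equal weight, and the second key, which is orthogonal to the query, receives the least.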

You're asking:

“Why can’t we just compare each key to the other keys?”
In other words:
“Why not just use a single representation for both roles?”

Here's why that's limiting:

  1. Asymmetry is essential:

    • Attention isn't just measuring similarity between tokens — it's measuring how relevant other tokens are to a specific position (the query).

    • This directionality matters: token A may attend to token B differently than token B attends to A.

    • If you use only keys (i.e. self-similarity), the raw score matrix becomes symmetric: the score from A to B always equals the score from B to A, which severely reduces expressive power.

  2. Queries allow flexible, learned focus:

    • The query vector lets each position decide for itself what to pay attention to — not just based on raw similarity, but based on a learned transformation of the input.

    • For example, in machine translation, a word may learn to query for subject-verb agreement, or semantic similarity, or positional relationships — these all require a distinct query vector.

  3. Architectural decoupling:

    • By having separate projection weights for queries and keys (and values), the model has more capacity and flexibility.

    • It can learn richer transformations — e.g., keys can capture context, while queries capture intent or focus.
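To make the symmetry point concrete, here is a small NumPy sketch. The random matrices only stand in for learned projection weights, the names `W_q` and `W_k` and all sizes are assumptions for illustration, and the comparison is on pre-softmax scores.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 4, 8, 3

X = rng.normal(size=(n_tokens, d_model))   # token embeddings (random stand-ins)
W_q = rng.normal(size=(d_model, d_head))   # query projection (hypothetical weights)
W_k = rng.normal(size=(d_model, d_head))   # key projection (hypothetical weights)

Q = X @ W_q
K = X @ W_k

scores_qk = Q @ K.T   # entry [i, j]: score of token i attending to token j
scores_kk = K @ K.T   # keys compared with keys only

print(np.allclose(scores_kk, scores_kk.T))  # True: K K^T is always symmetric
print(np.allclose(scores_qk, scores_qk.T))  # False for these random weights
```

Algebraically, the query-key scores are X (W_q W_kᵀ) Xᵀ, where W_q W_kᵀ can be any matrix of rank at most `d_head`, while the keys-only scores are X (W_k W_kᵀ) Xᵀ, which is restricted to symmetric positive semidefinite forms. A row-wise softmax would make the final weights differ slightly between the two directions even in the keys-only case, but the underlying scores stay tied together.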

Analogy:

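Imagine you're in a library. Your query is the question you want answered, and the keys are the labels on the shelves; comparing your question with each label tells you which shelves are worth visiting.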

But if you only compared shelf labels to each other, you'd just know how similar the categories are — you wouldn’t know which one is relevant to your need.

Posted by: dunc