This is the way that I understand it.
Let me first illustrate the forward pass.
Self-attention is Softmax(QK^T)V (I'm ignoring the scaling factor in the FLOP calculation, and sorry for using the same notation for different things!).
Since we only care about the FLOPs of a single token, our query Q has size (1xQ). K and V each have size (TxQ); these are what the query interacts with to attend over the neighboring tokens.
If we focus on just 1 head of 1 layer, we can ignore the number of layers (L) and the number of heads (H) for now. QK^T is a multiplication between a (1xQ) vector and a (QxT) matrix, which takes ~2QT operations (Q multiplies and Q adds for each of the T entries). This yields a single vector of size (1xT).
But there is still the product between Softmax(QK^T) and V. This is a product between a (1xT) vector and a (TxQ) matrix, which again takes ~2QT operations.
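To make the shapes and counts concrete, here's a small numpy sketch of the two products for one query token (the Q=64 head dimension and T=1024 context length are just placeholder numbers):

```python
import numpy as np

Q = 64      # per-head dimension (the "Q" in the sizes above)
T = 1024    # number of neighboring tokens the query attends to

q = np.random.randn(1, Q)   # single-token query, (1 x Q)
K = np.random.randn(T, Q)   # keys,   (T x Q)
V = np.random.randn(T, Q)   # values, (T x Q)

# Step 1: q @ K^T -> (1 x T). Each of the T scores needs Q multiplies
# and Q adds, so ~2QT operations.
scores = q @ K.T
flops_qk = 2 * Q * T

# Step 2: softmax(scores) @ V -> (1 x Q). Again a (1 x T) times (T x Q)
# product, so another ~2QT operations.
weights = np.exp(scores - scores.max())
weights /= weights.sum()
out = weights @ V
flops_av = 2 * Q * T

print(out.shape, flops_qk + flops_av)   # (1, 64) 262144, i.e. 2(2QT)
```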
Combining both steps, we get 2(2QT) = 4QT per head, per layer. Then we scale by the number of heads (H) and the number of layers (L), giving 2LH(2QT) = 4LHQT for the forward pass. If we take the backward pass to be twice the FLOPs of the forward pass, we get:
2LH(2QT) * (1 + 2) = 6LH(2QT) = 12LHQT FLOPs per token.
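And as a quick sanity check of the arithmetic, a tiny helper function (the layer/head/dimension numbers in the example are made up, roughly GPT-2-small-sized):

```python
def attention_flops_per_token(L, H, Q, T, backward_multiplier=2):
    """Per-token attention FLOPs: forward is 2(2QT) per head per layer,
    and the backward pass is taken to be backward_multiplier x the forward."""
    forward = L * H * 2 * (2 * Q * T)
    return forward * (1 + backward_multiplier)

# Example: 12 layers, 12 heads, 64 dims per head, 1024-token context.
print(attention_flops_per_token(L=12, H=12, Q=64, T=1024))
# 113246208, which matches 12 * L * H * Q * T
```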