The object queries in DETR are learned vectors, trained via backpropagation just like the learned positional embeddings in models such as the segmentation transformer (SETR) or BERT (bidirectional encoder representations from transformers). In the DETR base model, which assumes at most 100 objects per image, they are 100 randomly initialized 256-dimensional vectors, trained together with the rest of the network on the COCO data set.
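A minimal sketch of this idea in PyTorch (illustrative names and the module layout are assumptions, not DETR's exact code): the object queries are simply an embedding table whose weights receive gradients like any other parameter.

```python
import torch.nn as nn

# Sketch: object queries as a learned embedding table (names are illustrative).
num_queries = 100   # max objects per image in the DETR base model
hidden_dim = 256    # transformer embedding size in the base model

# Randomly initialized; updated by backpropagation during training.
query_embed = nn.Embedding(num_queries, hidden_dim)

# All 100 query vectors are used for every image, regardless of its content.
object_queries = query_embed.weight
print(object_queries.shape)  # torch.Size([100, 256])
```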
DETR is a transformer-based model that includes a decoder stack. A decoder stack requires input vectors, just as decoder-based NLP models such as Llama or GPT require a prompt; the set of object queries plays the role of the prompt in DETR. Because the decoder emits exactly one output vector per input vector, we need as many object queries as the maximum number of objects we expect to detect; query slots that match no real object are assigned to a special "no object" class.
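The sketch below illustrates this one-output-per-query property with a plain `nn.TransformerDecoder` and hypothetical shapes; it is not DETR's actual forward pass (which, among other things, adds positional encodings and predicts boxes with an MLP), but it shows why the query count fixes the prediction count.

```python
import torch
import torch.nn as nn

batch_size, num_queries, hidden_dim = 1, 100, 256
num_classes = 91  # COCO classes; one extra logit is added for "no object"

# nn.TransformerDecoder expects (target_len, batch, dim) by default.
decoder_layer = nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=8)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

queries = torch.zeros(num_queries, batch_size, hidden_dim)  # stand-in for the learned queries
memory = torch.rand(600, batch_size, hidden_dim)            # stand-in for encoder output (flattened image features)

hs = decoder(queries, memory)  # shape: (100, 1, 256) -- exactly one output vector per query
class_logits = nn.Linear(hidden_dim, num_classes + 1)(hs)  # +1 slot for the "no object" class
boxes = nn.Linear(hidden_dim, 4)(hs).sigmoid()             # normalized box coordinates
print(class_logits.shape, boxes.shape)  # (100, 1, 92) and (100, 1, 4)
```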
See the original DETR paper, "End-to-End Object Detection with Transformers" (Carion et al., 2020), for more details.