I haven't seen an answer to this question anywhere, even on the Ultralytics pages, so after working with these tools for the past month I decided to write up my conclusions:

Processing one image at a time makes the first dimension 1.
Working with N classes makes the second dimension N + 4.
Lastly, by default a YOLO model exported to tfjs format produces 8400 bounding box predictions, and I don't know whether that can be changed.
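
One possible explanation for the 8400: it seems to follow from the input size rather than being a fixed constant. Assuming the usual three detection strides of 8, 16 and 32 (my assumption about the YOLO11 head, not something I verified in the exported model), a 640 x 640 input gives 80 x 80 + 40 x 40 + 20 x 20 = 8400 grid cells, with one prediction per cell. A small sketch of that arithmetic, where `num_predictions` is a hypothetical helper and not part of any library:

```python
def num_predictions(imgsz, strides=(8, 16, 32)):
    """Count grid cells over all detection scales for a square input.

    The strides are an assumption about the detection head, not read
    from the exported model.
    """
    return sum((imgsz // s) ** 2 for s in strides)


print(num_predictions(640))  # 80*80 + 40*40 + 20*20 = 8400
print(num_predictions(320))  # 40*40 + 20*20 + 10*10 = 2100
```

If that reasoning holds, exporting with a different `imgsz` should change the last dimension accordingly, but I have not tested this with the tfjs export.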

To begin with, my research wouldn't have been possible without the "comments" section on this page: Ultralytics tfjs integration. After some experimenting with those tips, I chose to download the prediction data in .txt format and analyze it in Python.

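This image shows all the values in those predictions: output of yolo11n model in tfjs format. Seeing the values at the beginning (between 0 and 640) and those after index ~40000 (around 0), we can conclude that the first few rows describe box geometry and the rest are confidence scores. This fits the layout: 4 rows of 8400 values give 33600 coordinates, and everything after that is a score. From this image (first four rows) I read those four rows as x_min, y_min, x_max, y_max. Note, however, that Ultralytics' own post-processing treats these four rows as x_center, y_center, width, height and converts them to corners itself, so check which interpretation matches your export before drawing boxes.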
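
The split between box rows and score rows can be sketched like this. It is only an illustration: the data here is random, standing in for the values from the .txt dump, and it assumes the [1, 84, 8400] tensor was flattened row by row (that ordering is an assumption, check it against your own dump).

```python
import numpy as np

NUM_CLASSES = 80   # COCO, so 80 + 4 = 84 rows
NUM_BOXES = 8400

# Stand-in for the flat values read from the .txt file (random numbers here;
# real data would be loaded from the downloaded dump instead).
rng = np.random.default_rng(0)
flat = rng.random(size=(4 + NUM_CLASSES) * NUM_BOXES, dtype=np.float32)

# Assumes row-major order: all 8400 values of row 0, then row 1, and so on.
pred = flat.reshape(4 + NUM_CLASSES, NUM_BOXES)   # shape (84, 8400)

boxes = pred[:4]    # rows 0..3: box geometry in pixels
scores = pred[4:]   # rows 4..83: one row of confidences per class

print(boxes.shape, scores.shape)  # (4, 8400) (80, 8400)
```

With this layout the scores start at flat index 4 * 8400 = 33600, which matches where the values in the plot drop to around 0.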

To keep this answer clear, I will only post images of the first and last 100 boxes: first 100 bounding boxes, last 100 bounding boxes. You can see that detection starts at the upper left corner and sweeps across almost the whole image. Because I was in the middle of the image and took up most of it, the confidence scores for class 0 ("person") from the COCO dataset are around 0 for the first ~8200 boxes and higher than 0.8 for those at the end: confidence scores of class 0. The confidence scores of the other classes are negligible, as you can see here: confidence scores of class "bus".

Taking only the max value for every class (reshaping the data to 84 x 8400 and computing the max along axis 1) gives a clear view of what is in the image: maximum confidence per class.
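
A minimal sketch of that last step, on synthetic data rather than my real predictions: the array below is made up so that class 0 fires on the last 200 boxes, roughly imitating the plots above. I drop the first four rows before taking the max so that only class scores are compared.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic (84, 8400) prediction: box rows left at 0, low noise in all
# class rows, and a made-up strong response for class 0 on the last boxes.
pred = np.zeros((84, 8400), dtype=np.float32)
pred[4:] = rng.random(size=(80, 8400), dtype=np.float32) * 0.05
pred[4, -200:] = 0.9

scores = pred[4:]                       # skip the 4 box rows
max_per_class = scores.max(axis=1)      # shape (80,), one max per class
best_class = int(max_per_class.argmax())

print(best_class, float(max_per_class[best_class]))
```

Here class 0 stands out while every other class stays below 0.05, which is the same picture the per-class maximum gave for my image.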
