This may be an old post, but one thing you can do now is use a vision-language model like Florence-2 or Qwen for object detection (https://huggingface.co/microsoft/Florence-2-large).
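Here is a rough sketch of what the Florence-2 route could look like, following the usage shown on the model card. The function names `detect_with_florence` and `pair_boxes` are just illustrative, and the generation settings (`max_new_tokens=1024`, `num_beams=3`) are assumptions you may want to tune. Florence-2 loads custom code from the Hub, so `trust_remote_code=True` is needed.

```python
def detect_with_florence(image, model_id="microsoft/Florence-2-large"):
    """Run Florence-2's <OD> (object detection) task on a PIL image."""
    # Imported lazily so pair_boxes() below works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoProcessor

    # Florence-2 ships its own modeling code, hence trust_remote_code=True.
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    task = "<OD>"  # task prompt for plain object detection
    inputs = processor(text=task, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,  # assumed value, adjust as needed
        num_beams=3,          # assumed value, adjust as needed
    )
    text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

    # Converts the generated location tokens into pixel-space boxes.
    parsed = processor.post_process_generation(
        text, task=task, image_size=(image.width, image.height)
    )
    return parsed[task]  # expected shape: {"bboxes": [...], "labels": [...]}


def pair_boxes(od_result):
    """Turn {"bboxes": [...], "labels": [...]} into a list of (label, box) tuples."""
    return list(zip(od_result["labels"], od_result["bboxes"]))
```

You would call it with a PIL image, for example `pair_boxes(detect_with_florence(Image.open(path)))`, where `path` is whatever image you want to test on. The first call downloads the model weights, so expect it to take a while.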
Another option is to run OWLv2 for zero-shot object detection (https://huggingface.co/docs/transformers/en/model_doc/owlv2).
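For OWLv2, the simplest way to try it is probably the `zero-shot-object-detection` pipeline in transformers. The sketch below assumes the `google/owlv2-base-patch16-ensemble` checkpoint (my pick, any OWLv2 checkpoint should work), and the helper names and the 0.3 score cutoff are only illustrative.

```python
def detect_with_owlv2(image, candidate_labels,
                      model_id="google/owlv2-base-patch16-ensemble"):
    """Zero-shot detection: find the given text labels in a PIL image."""
    # Imported lazily so keep_confident() below works without transformers installed.
    from transformers import pipeline

    detector = pipeline("zero-shot-object-detection", model=model_id)
    # Returns a list of dicts with "score", "label" and "box"
    # (box is a dict with xmin, ymin, xmax, ymax in pixels).
    return detector(image, candidate_labels=candidate_labels)


def keep_confident(detections, min_score=0.3):
    """Drop detections below min_score (0.3 is an arbitrary starting point)."""
    return [d for d in detections if d["score"] >= min_score]
```

Usage would be something like `keep_confident(detect_with_owlv2(image, ["cat", "dog"]))`. Unlike Florence-2's `<OD>` task, here you tell the model which objects to look for, so it helps when you only care about a few specific classes.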
You can also fine-tune the model if you want to.