Your issue sounds like systematic accuracy degradation over the course of a bulk-processing run, and switching models was a good first step to rule out model-specific behavior. Here are a few potential causes and mitigation strategies:
Hidden Throttling or Rate Limiting
If the provider starts queuing or throttling your requests partway through a long run, responses can come back delayed, truncated, or incomplete, which shows up as a gradual accuracy drop rather than hard errors.

Token Usage and Context Carryover
If each request appends to a shared conversation or running context instead of starting fresh, later items are processed against a longer, noisier prompt than earlier ones (a minimal stateless-request sketch follows these causes).

Prompt Compression Due to Model Memory Constraints
Once accumulated context approaches the model's context window, earlier instructions can be truncated or compressed away, so the model effectively stops following your original prompt late in the run.

Concept Drift or Model Adaptation Over Time
A stateless API model does not learn from your requests within a run, but a provider silently updating or swapping the underlying model during a long job can shift behavior mid-batch.

Server-Side Caching Issues
Cached responses for similar or repeated inputs can be stale or mismatched, which looks like degradation when many near-duplicate products are processed in sequence.
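
One concrete way to rule out context carryover is to make every request fully stateless, i.e. build the prompt from scratch for each product instead of appending to a running conversation. A minimal sketch, assuming the OpenAI Python SDK (v1+); the model name, system prompt, and classify() wrapper are illustrative placeholders, not your tool's actual code:

```python
# Minimal sketch: send each product as an independent, stateless request so no
# context accumulates between items. Assumes the OpenAI Python SDK (v1+);
# the model name and system prompt are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = "Classify the product into one of the allowed categories."

def classify(product_text: str) -> str:
    # Fresh messages list on every call -- nothing from previous items is
    # carried over, so request #5000 sees the same context as request #1.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": product_text},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```

If your current pipeline keeps one long conversation open instead, the growing context alone can explain why accuracy looks fine early in the run and drops later.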
Next Steps
Run a small batch of 100-500 requests with different inter-request (throttling) delays to see whether accuracy stays stable (see the diagnostic sketch after this list).
Test with different API keys or different inference servers.
Implement session resets if applicable.
Shuffle product inputs randomly to check for caching effects.
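
To make those checks repeatable, a small diagnostic harness that shuffles the inputs, applies a configurable inter-request delay, and reports accuracy per batch is usually enough. A rough sketch, assuming you have a classify() function (such as the stateless one above) and a list of (product_text, expected_label) pairs to score against:

```python
# Illustrative diagnostic harness: shuffle the inputs, add a configurable
# inter-request delay, and log accuracy per batch so you can see whether
# quality drops by position in the run. `classify` and `labeled_items` are
# assumed to exist already; nothing here is tied to a specific provider.
import random
import time

def run_diagnostic(labeled_items, classify, batch_size=100, delay_s=0.5, seed=42):
    items = list(labeled_items)
    random.Random(seed).shuffle(items)  # break any ordering/caching correlation

    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        correct = 0
        for product_text, expected in batch:
            predicted = classify(product_text)
            correct += (predicted == expected)
            time.sleep(delay_s)  # vary this (e.g. 0, 0.5, 2.0) across runs
        print(f"batch {start // batch_size + 1}: "
              f"accuracy {correct / len(batch):.2%} at delay {delay_s}s")
```

Run it a few times with different delay values and with shuffling turned on and off: accuracy that only degrades at zero delay points toward throttling, while accuracy that recovers when inputs are shuffled points toward caching or input-ordering effects.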