I haven't looked at the implementation details, but at a systems level, transferring data from CPU memory to GPU memory is a time-consuming operation. For many workloads it is more expensive than the actual matrix computation on the GPU itself.
It looks like the library is somehow detecting that we are making GPU inferences iteratively while a ton of GPU memory is still free, and is therefore prompting us to send more data to GPU memory.
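A back-of-envelope estimate illustrates the transfer-vs-compute point. The bandwidth and throughput figures below are hypothetical round numbers (roughly PCIe 3.0 x16 and a mid-range fp32 GPU); real hardware varies, so treat this as a sketch rather than a benchmark:

```python
# Rough comparison: host-to-device copy time vs. on-GPU matmul time.
# Assumed (hypothetical) hardware numbers:
PCIE_BW = 16e9       # bytes/second over the CPU-GPU link
GPU_FLOPS = 10e12    # fp32 FLOP/s for a dense matmul

def transfer_ms(n):
    """Time to copy one n x n fp32 matrix to GPU memory, in ms."""
    return n * n * 4 / PCIE_BW * 1e3

def matmul_ms(n):
    """Time for an n x n fp32 matmul (~2*n^3 FLOPs), in ms."""
    return 2 * n**3 / GPU_FLOPS * 1e3

for n in (512, 1024, 4096):
    print(f"n={n}: transfer {transfer_ms(n):.3f} ms, "
          f"matmul {matmul_ms(n):.3f} ms")
```

Because transfer cost grows as n² while compute grows as n³, small matrices (the typical per-request inference case) are transfer-bound, which is why batching more data into GPU memory per copy tends to pay off.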