79239152

Date: 2024-11-30 09:04:50
Score: 1
Natty:

From experience, it is not so simple. You need to take into account:

  1. the engine used for inference (TGI? pure transformers? llama.cpp?)
  2. the GPU type (it really matters whether it is an H100, an L40S, or an A100)
  3. the batch size
  4. is it a chatbot-like experience, or do you need to process offline in batches?
  5. what is the maximum context length you would like to process?

On the basis of these, you need to run some benchmarks and generalize from the results, as in the sketch below.
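
As a rough illustration, here is a minimal throughput-benchmark sketch using plain `transformers` (one of the engine options from point 1). The model name, batch size, prompt length, and generation length are placeholder assumptions; you would sweep them across your own hardware and workload rather than treat this as a definitive harness.

```python
# Minimal tokens/sec benchmark sketch, assuming plain `transformers`
# (not TGI or llama.cpp) on a single CUDA GPU. MODEL_NAME, BATCH_SIZE,
# PROMPT_TOKENS, and NEW_TOKENS are hypothetical values to vary.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder model
BATCH_SIZE = 8        # factor 3: batch size
PROMPT_TOKENS = 1024  # factor 5: context length to test
NEW_TOKENS = 256      # generation length per request

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="cuda"
)

# Build a batch of identical prompts truncated to PROMPT_TOKENS, so all
# sequences have the same length and no padding is needed.
prompt = "Summarize the following text. " * 200
inputs = tokenizer(
    [prompt] * BATCH_SIZE,
    return_tensors="pt",
    truncation=True,
    max_length=PROMPT_TOKENS,
).to(model.device)

# Warm-up run so one-time CUDA setup does not skew the measurement.
model.generate(**inputs, max_new_tokens=8,
               pad_token_id=tokenizer.eos_token_id)

torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=NEW_TOKENS,
                     do_sample=False,
                     pad_token_id=tokenizer.eos_token_id)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

# Count only newly generated tokens across the whole batch.
generated = (out.shape[1] - inputs["input_ids"].shape[1]) * BATCH_SIZE
print(f"{generated / elapsed:.1f} generated tokens/s "
      f"(batch={BATCH_SIZE}, prompt={PROMPT_TOKENS})")
```

Re-running this grid over batch sizes, prompt lengths, engines, and GPUs gives you the table you can actually generalize from; a single number in isolation tells you very little.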

Reasons:
  • No code block (0.5):
  • Contains question mark (0.5):
Posted by: artona