From experience - it is not so simple. You need to take into account:
- engine used for inference (TGI? plain transformers? llama.cpp?)
- GPU type (it really matters whether it is an H100, an L40S, or an A100)
- batch size
- is it a chatbot-like experience, or do you need to process offline in batches?
- What is the maximum context you would like to process?
On the basis of this you need to run some benchmarks and generalize from the results; a rough sketch of what that could look like is below.
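A minimal throughput benchmark sketch using plain transformers, assuming a CUDA card is available. The model name, batch size, prompt length, and generation length are all placeholders - swap in whatever matches your actual workload (engine, card, and context size will each shift the numbers significantly, which is exactly why you benchmark your own setup).

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder model
BATCH_SIZE = 8          # vary this: throughput scales very differently per card
PROMPT_TOKENS = 1024    # approximate the maximum context you expect in production
NEW_TOKENS = 256        # typical response length for a chatbot-like workload

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

# Build a synthetic batch roughly the size of your real prompts.
prompt = "Summarize the following text. " * (PROMPT_TOKENS // 6)
inputs = tokenizer(
    [prompt] * BATCH_SIZE, return_tensors="pt",
    truncation=True, max_length=PROMPT_TOKENS,
).to(model.device)

# Warm-up run so kernel compilation / caches do not skew the timing.
model.generate(**inputs, max_new_tokens=8)
torch.cuda.synchronize()

start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=NEW_TOKENS)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

# Count only the newly generated tokens across the whole batch.
generated = (out.shape[1] - inputs["input_ids"].shape[1]) * BATCH_SIZE
print(f"{generated / elapsed:.1f} generated tokens/sec "
      f"(batch={BATCH_SIZE}, prompt={PROMPT_TOKENS}, new={NEW_TOKENS})")
```

Run it over a grid of batch sizes and prompt lengths on the card you actually plan to deploy on, and you get a table you can generalize from instead of guessing.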