When you run the vLLM server directly, for example:
```bash
export MAX_MODEL_LEN=16000
vllm serve $HOME/Documents/work/large-language-models/DeepSeek-R1-Distill-Qwen-7B --host 0.0.0.0 --port 8000 --max-model-len=$MAX_MODEL_LEN --uvicorn-log-level debug
```
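Once the server is up, you can sanity-check it from any OpenAI-compatible client. This is a minimal sketch, assuming the server above is reachable on localhost:8000 and the `openai` Python package is installed:

```python
# Minimal sketch: query the vLLM OpenAI-compatible server started above.
# Assumes the server is listening on localhost:8000; the model id must match
# whatever the server reports under /v1/models.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# List the served models to confirm the server is up and get the exact model id.
model_id = client.models.list().data[0].id

completion = client.completions.create(
    model=model_id,
    prompt="Explain KV-cache memory in one sentence:",
    max_tokens=128,
)
print(completion.choices[0].text)
```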
If you don't explicitly set MAX_MODEL_LEN, vLLM uses the default context length of the model you are serving. From the vLLM documentation for the flag:
> `--max-model-len`
> Model context length. If unspecified, will be automatically derived from the model config.
That derived default is what's causing your GPU to run out of memory: the model ships with a long context window, and vLLM has to reserve KV-cache space for it. Limit it by setting `export MAX_MODEL_LEN=<appropriate value>` (or passing `--max-model-len` directly).
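If you want to see which default vLLM would derive, you can read the model config yourself before picking a value. A quick sketch, assuming `transformers` is installed; the path is a placeholder for your local model directory:

```python
# Sketch: inspect the context length vLLM would derive if --max-model-len is unset.
# Assumes `transformers` is installed; replace the path with your local model directory.
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "/path/to/DeepSeek-R1-Distill-Qwen-7B"  # hypothetical local path
)
# Qwen2-based configs expose the native context window here (assumption for other architectures).
print(config.max_position_embeddings)
```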
This problem is on the server side. LangChain's `VLLM` wrapper doesn't take a `max_model_len` argument the way the CLI does; see this [example](https://github.com/langchain-ai/langchain/blob/c74e7b997daaae36166e71ce685850f8dc9db28e/docs/docs/integrations/llms/vllm.ipynb#L278):
```python
from langchain_community.llms import VLLM

llm = VLLM(
    model="meta-llama/Llama-3.2-3B-Instruct",
    max_new_tokens=300,
    top_k=1,
    top_p=0.90,
    temperature=0.1,
    vllm_kwargs={
        "gpu_memory_utilization": 0.5,
        "enable_lora": True,
        "max_model_len": 350,
    },
)
```
It passes the limit through `vllm_kwargs`, so the configuration is applied on the client side: LangChain loads the model in-process instead of talking to a separately running server.
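If you want to keep the server-side `vllm serve` setup and still call it from LangChain, the same integration page also has an OpenAI-compatible client wrapper. A sketch along those lines; the URL and model id below are assumptions matching the command above:

```python
# Sketch: talk to a separately running `vllm serve` instance from LangChain.
# Assumes the server from the command above is listening on localhost:8000
# and that the model id matches what the server reports under /v1/models.
from langchain_community.llms import VLLMOpenAI

llm = VLLMOpenAI(
    openai_api_key="EMPTY",
    openai_api_base="http://localhost:8000/v1",
    model_name="DeepSeek-R1-Distill-Qwen-7B",  # hypothetical served model id
    max_tokens=300,
)
print(llm.invoke("What is the capital of France?"))
```

In this setup the context-length limit stays where it belongs, on the server, and the client only controls per-request options such as `max_tokens`.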