When you run the vLLM server directly, for example:
```bash
export MAX_MODEL_LEN=16000
vllm serve $HOME/Documents/work/large-language-models/DeepSeek-R1-Distill-Qwen-7B --host 0.0.0.0 --port 8000 --max-model-len=$MAX_MODEL_LEN --uvicorn-log-level debug
```
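Once the server is up, you can sanity-check it from any OpenAI-compatible client. This is a minimal sketch, assuming the server above is reachable on localhost:8000 and the `openai` Python package is installed:

```python
# Minimal sketch: query the vLLM OpenAI-compatible server started above.
# Assumes the server is listening on localhost:8000; the model id must match
# whatever the server reports under /v1/models.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# List the served models to confirm the server is up and get the exact model id.
model_id = client.models.list().data[0].id

completion = client.completions.create(
    model=model_id,
    prompt="Explain KV-cache memory in one sentence:",
    max_tokens=128,
)
print(completion.choices[0].text)
```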
If you don't explicitly set MAX_MODEL_LEN, vLLM uses the default context length of the model you are serving. From the vLLM documentation for the flag:
> `--max-model-len`
> Model context length. If unspecified, will be automatically derived from the model config.
That derived default is what's causing your GPU to run out of memory: the model ships with a long context window, and vLLM has to reserve KV-cache space for it. Limit it by setting `export MAX_MODEL_LEN=<appropriate value>` (or passing `--max-model-len` directly).
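If you want to see which default vLLM would derive, you can read the model config yourself before picking a value. A quick sketch, assuming `transformers` is installed; the path is a placeholder for your local model directory:

```python
# Sketch: inspect the context length vLLM would derive if --max-model-len is unset.
# Assumes `transformers` is installed; replace the path with your local model directory.
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "/path/to/DeepSeek-R1-Distill-Qwen-7B"  # hypothetical local path
)
# Qwen2-based configs expose the native context window here (assumption for other architectures).
print(config.max_position_embeddings)
```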
This problem is on the server side. LangChain's `VLLM` wrapper doesn't take a `max_model_len` argument the way the CLI does; see this [example](https://github.com/langchain-ai/langchain/blob/c74e7b997daaae36166e71ce685850f8dc9db28e/docs/docs/integrations/llms/vllm.ipynb#L278):
```python
from langchain_community.llms import VLLM

llm = VLLM(
    model="meta-llama/Llama-3.2-3B-Instruct",
    max_new_tokens=300,
    top_k=1,
    top_p=0.90,
    temperature=0.1,
    vllm_kwargs={
        "gpu_memory_utilization": 0.5,
        "enable_lora": True,
        "max_model_len": 350,
    },
)
```
It passes the limit through `vllm_kwargs`, so the configuration is applied on the client side: LangChain loads the model in-process instead of talking to a separately running server.
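If you want to keep the server-side `vllm serve` setup and still call it from LangChain, the same integration page also has an OpenAI-compatible client wrapper. A sketch along those lines; the URL and model id below are assumptions matching the command above:

```python
# Sketch: talk to a separately running `vllm serve` instance from LangChain.
# Assumes the server from the command above is listening on localhost:8000
# and that the model id matches what the server reports under /v1/models.
from langchain_community.llms import VLLMOpenAI

llm = VLLMOpenAI(
    openai_api_key="EMPTY",
    openai_api_base="http://localhost:8000/v1",
    model_name="DeepSeek-R1-Distill-Qwen-7B",  # hypothetical served model id
    max_tokens=300,
)
print(llm.invoke("What is the capital of France?"))
```

In this setup the context-length limit stays where it belongs, on the server, and the client only controls per-request options such as `max_tokens`.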