`linux/amd64` architecture, preventing potential compatibility issues when running on x86-64 cloud servers.
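If you want to confirm that the image you built actually targets this platform, a quick check with `docker image inspect` works; the image tag below is just a placeholder for whatever name you used:

```bash
# Print the OS/architecture the image was built for.
# Replace "my-vllm-image:latest" with your own tag.
docker image inspect --format '{{.Os}}/{{.Architecture}}' my-vllm-image:latest
# Expected output for an x86-64 build: linux/amd64
```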
| Parameter | Example | Description |
|---|---|---|
| `--trust-remote-code` | `vllm serve state-spaces/mamba-130m-hf --trust-remote-code --host 0.0.0.0 --port 80` | Allows running models that rely on custom Python code from the Hugging Face Hub. By default this setting is `False`, meaning vLLM will not execute any remote code unless explicitly allowed. |
| `--max-model-len` | `vllm serve state-spaces/mamba-130m-hf --max-model-len 2048 --host 0.0.0.0 --port 80` | Sets the maximum number of tokens the model can process in a single request. This helps control memory usage and ensures proper processing of long texts. |
| `--tensor-parallel-size` | `vllm serve state-spaces/mamba-130m-hf --tensor-parallel-size 4 --host 0.0.0.0 --port 80` | Splits the model tensors across multiple GPUs to balance memory usage and allow larger models to fit into available hardware. |
| `--enable-prefix-caching` | `vllm serve state-spaces/mamba-130m-hf --enable-prefix-caching --host 0.0.0.0 --port 80` | Improves efficiency by caching repeated prompt prefixes, reducing redundant computation and speeding up inference for similar queries. |
| `--served-model-name` | `vllm serve state-spaces/mamba-130m-hf --served-model-name custom-model --host 0.0.0.0 --port 80` | Specifies a custom name for the deployed model, making it easier to reference in API requests. |
| `--enable-reasoning` | `vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --enable-reasoning --reasoning-parser deepseek_r1 --host 0.0.0.0 --port 80` | Enables reasoning mode, allowing models to provide intermediate reasoning steps for better insight into the decision-making process. Requires `--reasoning-parser` to extract reasoning content from the model output. |
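Once the server is running, it exposes an OpenAI-compatible HTTP API. As a quick sanity check, the sketch below assumes a server started with `--served-model-name custom-model` on port 80; the host and prompt are placeholders:

```bash
# List the models the server advertises (should include "custom-model").
curl http://localhost:80/v1/models

# Send a simple completion request using the custom model name.
curl http://localhost:80/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "custom-model",
        "prompt": "The capital of France is",
        "max_tokens": 16
      }'
```

Because the name set with `--served-model-name` is what appears in `/v1/models`, API clients never need to know the underlying Hugging Face repository ID.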