CPU¶
vLLM is a Python library that supports the following CPU variants. Select your CPU type to see vendor-specific instructions:
vLLM supports basic model inference and serving on the x86 CPU platform, with data types FP32, FP16 and BF16.
vLLM has been adapted to work on ARM64 CPUs with NEON support, leveraging the CPU backend initially developed for the x86 platform.
The ARM CPU backend currently supports the FP32, FP16 and BF16 data types.
Warning
There are no pre-built wheels or images for this device, so you must build vLLM from source.
vLLM has experimental support for macOS with Apple silicon. For now, users must build vLLM from source to run it natively on macOS.
Currently, the CPU implementation for macOS supports the FP32 and FP16 data types.
Warning
There are no pre-built wheels or images for this device, so you must build vLLM from source.
vLLM has experimental support for the s390x architecture on the IBM Z platform. For now, users must build vLLM from source to run it natively on IBM Z.
Currently, the CPU implementation for s390x supports the FP32 data type only.
Warning
There are no pre-built wheels or images for this device, so you must build vLLM from source.
Requirements¶
- Python: 3.9–3.12

x86
- OS: Linux
- CPU flags: avx512f, avx512_bf16 (optional), avx512_vnni (optional)
Tip
Use lscpu to check the CPU flags (see the example after this list).

ARM64
- OS: Linux
- Compiler: gcc/g++ >= 12.3.0 (optional, recommended)
- Instruction Set Architecture (ISA): NEON support is required

Apple silicon
- OS: macOS Sonoma or later
- SDK: Xcode 15.4 or later with Command Line Tools
- Compiler: Apple Clang >= 15.0.0

IBM Z (s390x)
- OS: Linux
- SDK: gcc/g++ >= 12.3.0 or later with Command Line Tools
- Instruction Set Architecture (ISA): VXE support is required. Works with Z14 and above.
- Build-install Python packages: pyarrow, torch and torchvision
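For example, on x86 you can list the relevant flags mentioned above with a simple pipeline:
lscpu | grep -o 'avx512[a-z_0-9]*' | sort -u   # prints each AVX-512 related flag reported by the CPU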
Set up using Python¶
Create a new Python environment¶
It's recommended to use uv, a very fast Python environment manager, to create and manage Python environments. Please follow the documentation to install uv. After installing uv, you can create a new Python environment and install vLLM using the following commands:
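A minimal sketch, assuming uv is already installed (the environment name and Python version are placeholders; since there are no pre-built CPU wheels, vLLM itself is then built from source as described below):
uv venv vllm-cpu-env --python 3.12 --seed   # create a new environment with pip available
source vllm-cpu-env/bin/activate            # activate it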
Pre-built wheels¶
Currently, there are no pre-built CPU wheels.
Build wheel from source¶
First, install the recommended compiler. We recommend using gcc/g++ >= 12.3.0 as the default compiler to avoid potential problems. For example, on Ubuntu 22.04, you can run:
sudo apt-get update -y
sudo apt-get install -y --no-install-recommends ccache git curl wget ca-certificates gcc-12 g++-12 libtcmalloc-minimal4 libnuma-dev ffmpeg libsm6 libxext6 libgl1 jq lsof
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
Second, clone the vLLM project:
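For example:
git clone https://github.com/vllm-project/vllm.git
cd vllm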
Third, install the Python packages required to build the vLLM CPU backend:
pip install --upgrade pip
pip install -v -r requirements/cpu-build.txt --extra-index-url https://download.pytorch.org/whl/cpu
pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
Finally, build and install the vLLM CPU backend:
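A minimal sketch, mirroring the s390x build commands later on this page (the exact invocation may differ between vLLM versions):
VLLM_TARGET_DEVICE=cpu python setup.py bdist_wheel && \
pip install dist/*.whl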
If you want to develop vLLM, install it in editable mode instead.
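A sketch for an editable (development) install, assuming the same VLLM_TARGET_DEVICE=cpu convention applies:
VLLM_TARGET_DEVICE=cpu pip install -e . --no-build-isolation   # reuse the CPU torch installed from the requirements above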
Note
If you are building vLLM from source and not using the pre-built images, remember to set LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD" on x86 machines before running vLLM.
The ARM64 build follows the same steps as the x86 build above.
Testing has been conducted on AWS Graviton3 instances for compatibility.
After installing Xcode and the Command Line Tools, which include Apple Clang, execute the following commands to build and install vLLM from source.
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements/cpu.txt
pip install -e .
Note
On macOS the VLLM_TARGET_DEVICE is automatically set to cpu, which currently is the only supported device.
Troubleshooting
If the build fails because standard C++ headers cannot be found, try removing and reinstalling your Command Line Tools for Xcode.
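One common way to do this (a sketch; the path below is the default Command Line Tools location):
sudo rm -rf /Library/Developer/CommandLineTools   # remove the existing Command Line Tools
xcode-select --install                            # trigger a fresh installation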
Install the following packages from the package manager before building vLLM. For example, on RHEL 9.4:
dnf install -y \
    which procps findutils tar vim git gcc g++ make patch cython zlib-devel \
libjpeg-turbo-devel libtiff-devel libpng-devel libwebp-devel freetype-devel harfbuzz-devel \
openssl-devel openblas openblas-devel wget autoconf automake libtool cmake numactl-devel
Install Rust >= 1.80, which is needed to install the outlines-core and uvloop Python packages.
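For example, using rustup (a sketch; see https://rustup.rs for the official instructions):
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
. "$HOME/.cargo/env"
rustc --version   # should report 1.80 or newer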
Execute the following commands to build and install vLLM from source.
Tip
Please build the following dependencies, torchvision and pyarrow, from source before building vLLM.
sed -i '/^torch/d' requirements-build.txt    # remove torch from requirements-build.txt since we use nightly builds
pip install -v \
    --extra-index-url https://download.pytorch.org/whl/nightly/cpu \
    -r requirements-build.txt \
    -r requirements-cpu.txt
VLLM_TARGET_DEVICE=cpu python setup.py bdist_wheel && \
pip install dist/*.whl
Set up using Docker¶
Pre-built images¶
Pre-built vLLM CPU images are available at https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo.
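A pull command would look like the following; the registry path is inferred from the gallery URL above and the tag is a placeholder, so check the gallery for available versions:
docker pull public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:<version>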
Warning
If you deploy the pre-built images on machines that only support avx512f, an Illegal instruction error may be raised. It is recommended to build images for such machines with --build-arg VLLM_CPU_AVX512BF16=false and --build-arg VLLM_CPU_AVX512VNNI=false.
Build image from source¶
docker build -f docker/Dockerfile.cpu \
--build-arg VLLM_CPU_AVX512BF16=false (default)|true \
--build-arg VLLM_CPU_AVX512VNNI=false (default)|true \
--tag vllm-cpu-env \
--target vllm-openai .
# Launching OpenAI server
docker run --rm \
--privileged=true \
--shm-size=4g \
-p 8000:8000 \
-e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
-e VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> \
vllm-cpu-env \
--model=meta-llama/Llama-3.2-1B-Instruct \
--dtype=bfloat16 \
other vLLM OpenAI server arguments
docker build -f docker/Dockerfile.arm \
--tag vllm-cpu-env .
# Launching OpenAI server
docker run --rm \
--privileged=true \
--shm-size=4g \
-p 8000:8000 \
-e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
-e VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> \
vllm-cpu-env \
--model=meta-llama/Llama-3.2-1B-Instruct \
--dtype=bfloat16 \
other vLLM OpenAI server arguments
docker build -f docker/Dockerfile.s390x \
--tag vllm-cpu-env .
# Launching OpenAI server
docker run --rm \
--privileged=true \
--shm-size=4g \
-p 8000:8000 \
-e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
-e VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> \
vllm-cpu-env \
--model=meta-llama/Llama-3.2-1B-Instruct \
--dtype=float \
other vLLM OpenAI server arguments
Related runtime environment variables¶
- VLLM_CPU_KVCACHE_SPACE: specify the KV cache size (e.g., VLLM_CPU_KVCACHE_SPACE=40 means 40 GiB of space for the KV cache); a larger setting allows vLLM to run more requests in parallel. This parameter should be set based on the hardware configuration and memory management pattern of users. Default value is 0.
- VLLM_CPU_OMP_THREADS_BIND: specify the CPU cores dedicated to the OpenMP threads. For example, VLLM_CPU_OMP_THREADS_BIND=0-31 means 32 OpenMP threads bound on CPU cores 0-31. VLLM_CPU_OMP_THREADS_BIND=0-31|32-63 means there will be 2 tensor parallel processes: the 32 OpenMP threads of rank0 are bound on CPU cores 0-31, and the OpenMP threads of rank1 are bound on CPU cores 32-63. When set to auto, the OpenMP threads of each rank are bound to the CPU cores in each NUMA node. When set to all, the OpenMP threads of each rank use all CPU cores available on the system. Default value is auto.
- VLLM_CPU_NUM_OF_RESERVED_CPU: specify the number of CPU cores which are not dedicated to the OpenMP threads for each rank. The variable only takes effect when VLLM_CPU_OMP_THREADS_BIND is set to auto. Default value is 0.
- VLLM_CPU_MOE_PREPACK (x86 only): whether to use prepack for the MoE layer. This will be passed to ipex.llm.modules.GatedMLPMOE. Default is 1 (True). On unsupported CPUs, you might need to set this to 0 (False).
- VLLM_CPU_SGL_KERNEL (x86 only, experimental): whether to use small-batch optimized kernels for the linear layer and MoE layer, especially for low-latency requirements like online serving. The kernels require the AMX instruction set, BFloat16 weight type and weight shapes divisible by 32. Default is 0 (False).
FAQ¶
Which dtype should be used?¶
- Currently, vLLM CPU uses the model's default setting as dtype. However, due to unstable float16 support in torch CPU, it is recommended to explicitly set dtype=bfloat16 if there are any performance or accuracy problems.
How to launch a vLLM service on CPU?¶
- When using online serving, it is recommended to reserve 1-2 CPU cores for the serving framework to avoid CPU oversubscription. For example, on a platform with 32 physical CPU cores, reserve CPU 31 for the serving framework and use CPU 0-30 for inference threads:
export VLLM_CPU_KVCACHE_SPACE=40
export VLLM_CPU_OMP_THREADS_BIND=0-30
vllm serve facebook/opt-125m --dtype=bfloat16
or using default auto thread binding:
export VLLM_CPU_KVCACHE_SPACE=40
export VLLM_CPU_NUM_OF_RESERVED_CPU=1
vllm serve facebook/opt-125m --dtype=bfloat16
How to decide VLLM_CPU_OMP_THREADS_BIND?¶
- Bind each OpenMP thread to a dedicated physical CPU core, or use the auto thread-binding feature by default. For example, on a hyper-threading enabled platform with 16 logical CPU cores / 8 physical CPU cores:
Commands
$ lscpu -e # check the mapping between logical CPU cores and physical CPU cores
# The "CPU" column means the logical CPU core IDs, and the "CORE" column means the physical core IDs. On this platform, two logical cores are sharing one physical core.
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ MHZ
0 0 0 0 0:0:0:0 yes 2401.0000 800.0000 800.000
1 0 0 1 1:1:1:0 yes 2401.0000 800.0000 800.000
2 0 0 2 2:2:2:0 yes 2401.0000 800.0000 800.000
3 0 0 3 3:3:3:0 yes 2401.0000 800.0000 800.000
4 0 0 4 4:4:4:0 yes 2401.0000 800.0000 800.000
5 0 0 5 5:5:5:0 yes 2401.0000 800.0000 800.000
6 0 0 6 6:6:6:0 yes 2401.0000 800.0000 800.000
7 0 0 7 7:7:7:0 yes 2401.0000 800.0000 800.000
8 0 0 0 0:0:0:0 yes 2401.0000 800.0000 800.000
9 0 0 1 1:1:1:0 yes 2401.0000 800.0000 800.000
10 0 0 2 2:2:2:0 yes 2401.0000 800.0000 800.000
11 0 0 3 3:3:3:0 yes 2401.0000 800.0000 800.000
12 0 0 4 4:4:4:0 yes 2401.0000 800.0000 800.000
13 0 0 5 5:5:5:0 yes 2401.0000 800.0000 800.000
14 0 0 6 6:6:6:0 yes 2401.0000 800.0000 800.000
15 0 0 7 7:7:7:0 yes 2401.0000 800.0000 800.000
# On this platform, it is recommended to only bind OpenMP threads on logical CPU cores 0-7 or 8-15
$ export VLLM_CPU_OMP_THREADS_BIND=0-7
$ python examples/offline_inference/basic/basic.py
- When deploying the vLLM CPU backend on a multi-socket machine with NUMA and enabling tensor parallelism or pipeline parallelism, each NUMA node is treated as a TP/PP rank. Make sure the CPU cores of a single rank stay on the same NUMA node to avoid cross-NUMA-node memory access (see the sketch below).
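For example, a sketch for a hypothetical 2-socket machine where CPU cores 0-31 belong to NUMA node 0 and cores 32-63 to NUMA node 1 (the core ranges and model are placeholders):
export VLLM_CPU_KVCACHE_SPACE=40
export VLLM_CPU_OMP_THREADS_BIND="0-31|32-63"   # one core range per TP rank, each within a single NUMA node
vllm serve facebook/opt-125m --dtype=bfloat16 --tensor-parallel-size 2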
How to decide VLLM_CPU_KVCACHE_SPACE?¶
- This value is 4 GiB by default. A larger space can support more concurrent requests and longer context lengths. However, users should take care of the memory capacity of each NUMA node. The memory usage of each TP rank is the sum of the weight shard size and VLLM_CPU_KVCACHE_SPACE; if it exceeds the capacity of a single NUMA node, the TP worker will be killed with exitcode 9 due to out-of-memory.
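As a rough, illustrative example: an 8B-parameter model in BF16 takes about 16 GiB of weights, so with TP=2 each rank holds roughly 8 GiB; with VLLM_CPU_KVCACHE_SPACE=40 each rank then needs at least 8 + 40 = 48 GiB, plus runtime overhead, within its NUMA node's memory.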
Which quantization configs does vLLM CPU support?¶
- vLLM CPU supports the following quantization methods:
- AWQ (x86 only)
- GPTQ (x86 only)
- compressed-tensor INT8 W8A8 (x86, s390x)
(x86 only) What is the purpose of VLLM_CPU_MOE_PREPACK and VLLM_CPU_SGL_KERNEL?¶
- Both of them require the amx CPU flag.
- VLLM_CPU_MOE_PREPACK can provide better performance for MoE models.
- VLLM_CPU_SGL_KERNEL can provide better performance for MoE models and small-batch scenarios.
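For example, to check for AMX support and enable the experimental small-batch kernels (a sketch; the model is just the placeholder used elsewhere on this page):
lscpu | grep -c amx                                        # a non-zero count indicates AMX support
VLLM_CPU_SGL_KERNEL=1 vllm serve facebook/opt-125m --dtype=bfloat16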