To run GLM-5 locally for a quick test, the simplest approach is an inference server that already supports GLM-5 and exposes an OpenAI-compatible endpoint. That lets you verify the model works before you build any product integration. The official GLM-5 repo and model page both state that vLLM, SGLang, and xLLM support local deployment, and they include concrete install and serve instructions. In practice, a “quick test” means: download the weights, start a local server, and send a single chat/completions request to confirm that tokenization, generation, and latency are sane. Start with the BF16 weights if your GPUs support them; use the FP8 variant if you have the right hardware and runtime and want faster inference. Primary references: GLM-5 on GitHub and GLM-5 on Hugging Face.
A straightforward vLLM-based smoke test looks like this (Linux example). First, install vLLM and the required Transformers version (the GLM-5 docs recommend upgrading Transformers from source for compatibility). Then launch the model server and call it from a tiny client:
# 1) Install vLLM (nightly) and a compatible Transformers
pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
pip install git+https://github.com/huggingface/transformers.git
# 2) Start a local server (adjust tensor parallel size to your GPUs)
vllm serve zai-org/GLM-5 \
  --served-model-name glm-5 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90
Then, in another terminal:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5",
    "messages": [{"role": "user", "content": "Write a Python function that validates UUIDv4 strings."}],
    "temperature": 0.2,
    "max_tokens": 256
  }'
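If you would rather script the same check, the server speaks the standard OpenAI chat-completions protocol, so the official openai Python package works against it. The sketch below assumes the server from step 2 is listening on localhost:8000 with --served-model-name glm-5 (both taken from the commands above); the api_key value is a placeholder, since a default local vLLM launch does not check it unless you pass --api-key:

# Minimal Python smoke test against the local OpenAI-compatible server.
# Assumes `pip install openai` and the serve command shown above.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # local vLLM endpoint
    api_key="not-needed",                 # placeholder; no auth on a default local launch
)

resp = client.chat.completions.create(
    model="glm-5",
    messages=[{"role": "user", "content": "Write a Python function that validates UUIDv4 strings."}],
    temperature=0.2,
    max_tokens=256,
)

print(resp.choices[0].message.content)
print("usage:", resp.usage)  # token counts help sanity-check prompt size and latency

If this prints coherent code and a plausible token count, the end-to-end path (tokenizer, weights, server, client) is working.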
If you prefer SGLang or xLLM, the GLM-5 sources provide equivalent launch patterns; the key is to pick one path and do a single end-to-end request. If your output is empty, garbled, or tool parsing fails, it’s usually a version mismatch between the serving engine and Transformers, or missing model artifacts (tokenizer/config). Keep your first test small: short prompt, low max_tokens, low temperature, and no tool calling until basic generation works.
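A quick way to rule out the most common failure mode is to print the installed engine and Transformers versions and confirm that the model artifacts load at all. A minimal diagnostic sketch, reusing the repo id from the serve command above:

# Confirm library versions and that the tokenizer/config resolve for the model repo.
import vllm
import transformers
from transformers import AutoConfig, AutoTokenizer

print("vllm:", vllm.__version__)
print("transformers:", transformers.__version__)

# If either call raises, the tokenizer/config artifacts are missing or the installed
# Transformers is too old to know this architecture (add trust_remote_code=True if
# the model card says the repo ships custom code).
config = AutoConfig.from_pretrained("zai-org/GLM-5")
tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-5")
print("architectures:", config.architectures)
print("sample tokens:", tokenizer.encode("hello GLM")[:8])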
Once local inference works, most teams quickly move to a “real app loop”: retrieval + generation + validation. Instead of pasting huge docs or code into the prompt, store your knowledge in a vector database such as Milvus or Zilliz Cloud (managed Milvus). Your quick prototype can be: embed question → retrieve top 8 chunks → ask GLM-5 to answer only from those chunks. This keeps the prompt compact and makes behavior testable. Even for a local demo, you can measure: retrieval latency, total tokens, and how often the model’s answer is supported by retrieved text. That’s the fastest way to turn “it runs” into “it’s useful.”
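A minimal version of that loop is sketched below. It assumes (none of this comes from the GLM-5 docs) that you have already chunked and embedded your documents into a Milvus collection named "docs" with the raw chunk text stored in a "text" field, that Milvus Lite via pymilvus is enough for a local demo (swap the path for a Zilliz Cloud URI in production), and that the embedding model shown is an arbitrary sentence-transformers choice which must match whatever you used at indexing time:

# RAG smoke test: embed question -> retrieve top 8 chunks from Milvus -> answer with GLM-5.
import time
from openai import OpenAI
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder; must match the index
milvus = MilvusClient("rag_demo.db")                # Milvus Lite file; or a Zilliz Cloud URI
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

question = "How do I rotate API keys in our billing service?"  # hypothetical question

# 1) Embed the question and retrieve the top 8 chunks.
t0 = time.time()
hits = milvus.search(
    collection_name="docs",
    data=[embedder.encode(question).tolist()],
    limit=8,
    output_fields=["text"],
)[0]
retrieval_ms = (time.time() - t0) * 1000
context = "\n\n".join(hit["entity"]["text"] for hit in hits)

# 2) Ask GLM-5 to answer only from the retrieved chunks.
resp = llm.chat.completions.create(
    model="glm-5",
    temperature=0.2,
    max_tokens=512,
    messages=[
        {"role": "system", "content": "Answer using only the provided context. "
                                      "If the context does not contain the answer, say so."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)

print(f"retrieval: {retrieval_ms:.0f} ms, total tokens: {resp.usage.total_tokens}")
print(resp.choices[0].message.content)

Even this toy loop surfaces the numbers called out above: retrieval latency, total tokens per request, and whether the answer actually leans on the retrieved text rather than the model's prior knowledge.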