DeepSeek-V3.2-Exp is a 685B-parameter MoE model with ~37B active parameters per token, which places it firmly in the category of data-center-class models. Running the full model in real time requires multiple high-end GPUs (A100/H100/H200 class) connected by high-bandwidth interconnects. Even at 4-bit precision, total VRAM requirements land in the 350–400 GB range once the KV cache and runtime overhead are included, so typical setups use distributed inference with tensor parallelism. At 16-bit precision, the weights alone exceed 1 TB of VRAM, which rules out single-GPU or workstation-level hardware for production-speed inference.
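A quick back-of-the-envelope calculation shows where those figures come from. The sketch below counts weight bytes only; KV cache, activations, and framework overhead add to these numbers in practice.

```python
# Rough VRAM estimate for model weights only (KV cache, activations,
# and framework overhead come on top of this).
TOTAL_PARAMS = 685e9  # total parameters, all MoE experts included

def weight_memory_gb(total_params: float, bits_per_param: float) -> float:
    """Approximate weight footprint in GB at a given precision."""
    return total_params * bits_per_param / 8 / 1e9

for label, bits in [("BF16/FP16", 16), ("FP8", 8), ("4-bit", 4)]:
    print(f"{label:>9}: ~{weight_memory_gb(TOTAL_PARAMS, bits):,.0f} GB")
# BF16/FP16: ~1,370 GB   FP8: ~685 GB   4-bit: ~343 GB
```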
Developers who want to run DeepSeek-V3.2-Exp locally generally have three realistic options. The first is to host a multi-GPU inference cluster using vLLM or a similar framework, which can shard the MoE layers and KV caches across many GPUs. The second is to run quantized versions (4-bit or FP8) on a smaller GPU cluster; these variants need fewer GPUs but still require on the order of 2–4 high-end cards, depending on quantization level and per-card memory, for practical speeds. The third option is hybrid: call DeepSeek through its API while self-hosting surrounding services such as retrieval, embeddings, and business logic.
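For the first option, a minimal vLLM sketch might look like the following. It assumes a single node with eight data-center GPUs and a vLLM build that supports the DeepSeek-V3.2-Exp architecture; the model ID, parallelism degree, and context length are illustrative, not prescriptive.

```python
# Option 1 sketch: sharded inference on one 8-GPU node with vLLM.
# Assumes a vLLM version that supports this model architecture.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3.2-Exp",  # Hugging Face model ID (illustrative)
    tensor_parallel_size=8,                 # shard weights and KV cache across 8 GPUs
    trust_remote_code=True,                 # DeepSeek ships custom model code
    max_model_len=8192,                     # cap context length to limit KV-cache memory
)

params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Explain mixture-of-experts routing in two sentences."], params)
print(outputs[0].outputs[0].text)
```

Option two plugs into the same interface: point vLLM at a quantized checkpoint (or set its quantization argument) and lower the tensor-parallel degree to match the smaller cluster.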
If you plan to pair DeepSeek with a retrieval layer, factor in additional hardware for the vector database. Systems built on Milvus or Zilliz Cloud rely more on CPU, RAM, and disk bandwidth than on GPU memory, so that workload can be offloaded to cheaper hosts. This split, with GPU nodes for inference and CPU/RAM nodes for vector search, is a common production architecture. For most teams, unless you already operate multi-GPU clusters, using DeepSeek’s API and running your vector database locally gives the best trade-off between performance and cost.
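A minimal sketch of that hybrid setup is shown below: DeepSeek is called through its OpenAI-compatible API while Milvus runs locally for retrieval. The "docs" collection, the localhost URI, and the embed_query() helper are placeholders; the embedder must match whatever model was used to build the collection.

```python
# Option 3 sketch: DeepSeek via API, retrieval via a locally hosted Milvus.
from openai import OpenAI
from pymilvus import MilvusClient

milvus = MilvusClient(uri="http://localhost:19530")            # local Milvus instance
deepseek = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")

def embed_query(text: str) -> list[float]:
    """Placeholder: use the same embedding model that populated the collection."""
    raise NotImplementedError

def answer(question: str) -> str:
    hits = milvus.search(
        collection_name="docs",                                # hypothetical collection
        data=[embed_query(question)],
        limit=3,
        output_fields=["text"],
    )
    context = "\n".join(hit["entity"]["text"] for hit in hits[0])
    resp = deepseek.chat.completions.create(
        model="deepseek-chat",                                 # DeepSeek's chat endpoint
        messages=[
            {"role": "system", "content": f"Answer using this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```

The design point is that only the chat completion call leaves your infrastructure; embeddings, vector search, and business logic stay on commodity CPU hosts you already control.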