The hardware requirements for the GPT-OSS models are designed to be accessible: thanks to native MXFP4 quantization of the MoE layers, gpt-oss-120b fits on a single H100 GPU, while gpt-oss-20b runs within 16GB of memory. This represents a significant advance in efficiency, as most models of comparable capability typically require multiple high-end GPUs or substantially more memory.
For the larger model, the optimized Triton implementation can run gpt-oss-120b on a single 80GB GPU when the expandable memory allocator is enabled. The documentation notes that while a single H100 is sufficient for this optimized path, the unoptimized PyTorch reference implementation needs more resources, on the order of 4xH100 or 2xH200 GPUs. For the smaller model, OpenAI recommends at least 16GB of RAM to run gpt-oss-20b; systems with more memory will perform better, and 16GB should be treated as the floor for experimentation.
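As a minimal sketch of the single-GPU setup, the snippet below enables PyTorch's expandable memory allocator before loading the model. The loading path (Hugging Face Transformers) and the model ID are assumptions for illustration; the official Triton reference implementation in the gpt-oss repository may use a different entry point.

```python
import os

# Must be set before torch initializes its CUDA caching allocator.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-120b"  # assumed Hugging Face repo name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep the shipped weight format where supported
    device_map="auto",    # place the model on the single available GPU
)

prompt = "Explain mixture-of-experts models in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```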
The models also support deployment on Apple Silicon through a Metal implementation, though this is described as accurate to the PyTorch implementation rather than production-ready. The MXFP4 quantization scheme is particularly important here, as it enables efficient inference while maintaining model quality. Both models apply a 4-bit quantization scheme (MXFP4) only to the MoE weights, which keeps resource usage low while fast inference comes from the small number of active parameters. This approach makes the models remarkably accessible compared to other frontier-class language models, which typically require enterprise-grade hardware configurations.
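A rough back-of-the-envelope calculation illustrates why MoE-only MXFP4 quantization lets gpt-oss-120b fit on a single 80GB GPU. The total parameter count is the published approximate figure; the MoE/non-MoE split and the per-weight overhead for block scales are assumptions used only for illustration.

```python
# Assumption-laden estimate of the weight footprint of gpt-oss-120b.
total_params   = 117e9   # ~117B total parameters (published approximate figure)
moe_fraction   = 0.95    # assumed: the vast majority of weights live in the MoE experts
bits_per_mxfp4 = 4.25    # ~4 bits per weight plus shared block scales (assumed)
bits_per_bf16  = 16      # non-MoE weights kept in bf16 (assumed)

moe_bytes   = total_params * moe_fraction * bits_per_mxfp4 / 8
dense_bytes = total_params * (1 - moe_fraction) * bits_per_bf16 / 8

total_gb = (moe_bytes + dense_bytes) / 1e9
print(f"Approximate weight footprint: {total_gb:.0f} GB")  # ~71 GB, under the 80 GB budget
```

Under these assumptions the weights alone land around 71 GB, leaving headroom on an 80GB card for activations and the KV cache; without MXFP4, the same weights in bf16 would occupy roughly 234 GB and require multiple GPUs.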
For more detailed information, see: GPT-oss vs o4-mini: Edge-Ready, On-Par Performance — Dependable, Not Mind-Blowing