Running GPT-OSS-120B on a 6 GB GPU and Accelerating to 30 t/s

A practical guide to running the massive GPT-OSS-120B MoE model on consumer GPUs with as little as 6 GB VRAM by leveraging the -cmoe flag in llama.cpp to offload expert layers to CPU while keeping attention on GPU.

More and more large MoE (Mixture of Experts) models with a small number of active parameters are being released. This is great news for local LLM enthusiasts — it means you can run powerful models even on GPUs with limited VRAM. Today we'll look at how to run GPT-OSS-120B on modest hardware and achieve comfortable generation speeds.

MoE vs Dense: What's the Difference?

What sets an MoE architecture apart from a dense model is that the total parameter count far exceeds the number of parameters activated to generate each new token. An MoE model contains separate expert sub-networks, and a learned routing network selects which experts process each token.
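To make the routing step concrete, here is a toy sketch of top-k expert selection for a single token. The function name route_top_k and the random router scores are illustrative assumptions, and applying softmax over only the selected scores is one common variant, not necessarily the exact scheme every MoE model uses.

```python
import math
import random


def route_top_k(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Pick the k experts with the largest router scores.
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over the selected scores only, so the k weights sum to 1.
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(128)]  # toy router output for one token
chosen = route_top_k(logits, k=4)
for expert, weight in chosen:
    print(f"expert {expert:3d} weight {weight:.3f}")
```

Only the four selected experts run for this token; the other 124 are never touched, which is the property the rest of this article exploits.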

In dense models, all parameters in every layer are activated for every token. In MoE models, only a subset is used. GPT-OSS-120B activates just 4 of the 128 experts in each MoE layer, so only about 5B of its roughly 120B parameters take part in each token. That is approximately 24 times less compute per token than an equivalent dense model with the same total parameter count.
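The arithmetic behind these numbers can be checked in a few lines. This sketch uses the rounded figures quoted in this article (about 120B total and about 5B active parameters); the exact counts differ slightly.

```python
TOTAL_EXPERTS = 128
ACTIVE_EXPERTS = 4
TOTAL_PARAMS_B = 120.0   # rounded total parameter count, in billions
ACTIVE_PARAMS_B = 5.0    # rounded active parameter count, in billions

expert_fraction = ACTIVE_EXPERTS / TOTAL_EXPERTS
param_ratio = TOTAL_PARAMS_B / ACTIVE_PARAMS_B

print(f"experts active per token: {expert_fraction:.1%}")
print(f"total / active parameters: {param_ratio:.0f}x")
```

The parameter ratio (24x) is smaller than the expert ratio (32x) because attention and other shared weights run for every token regardless of which experts are selected.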

The Problem: VRAM Waste

When you load a model onto GPU using the standard -ngl (number of GPU layers) parameter, entire layers are loaded into VRAM. But in MoE models, each layer contains dozens or hundreds of experts, and only a few are actually used per token. This means most of the VRAM is occupied by "sleeping" experts that aren't contributing to the current computation.

The Solution: The -cmoe Parameter

The -cmoe (or --cpu-moe) flag in llama.cpp redistributes tensors strategically. Instead of loading entire layers to GPU, this parameter:

  • Keeps attention tensors and shared layers on GPU — these are used for every token
  • Moves expert parameters to CPU/RAM — since only a few are active per token, CPU access speed is sufficient
  • Increases GPU utilization efficiency
  • Reduces wasted VRAM on inactive experts
  • Maintains continuous GPU engagement for the operations that matter most
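The split described above can be pictured as a simple name-based rule. This is an illustrative sketch, not llama.cpp's actual code: it assumes GGUF-style tensor names in which expert weights contain ffn_gate_exps, ffn_up_exps, or ffn_down_exps, and placement_with_cmoe is a hypothetical helper.

```python
import re

# Assumption: expert weight tensors are named like "blk.<layer>.ffn_up_exps.weight".
EXPERT_TENSOR = re.compile(r"\.ffn_(gate|up|down)_exps")


def placement_with_cmoe(tensor_name):
    """Return "CPU" for expert tensors and "GPU" for everything else."""
    return "CPU" if EXPERT_TENSOR.search(tensor_name) else "GPU"


names = [
    "blk.0.attn_q.weight",        # attention projection
    "blk.0.ffn_gate_inp.weight",  # router (not an expert tensor)
    "blk.0.ffn_up_exps.weight",   # expert weights
    "blk.0.ffn_down_exps.weight", # expert weights
    "output.weight",              # output projection
]
for name in names:
    print(f"{name:30s} -> {placement_with_cmoe(name)}")
```

Everything that runs for every token stays on the GPU; only the large, sparsely used expert tensors land in system RAM.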

Basic Launch Command

.\llama-server.exe -m "path/to/model" -fa 1 -cmoe -ngl 99 -c 8192 --jinja

Let's break down the flags:

  • -fa 1 — Flash Attention, optimizes memory usage for context
  • -cmoe — CPU MoE mode, offloads experts to RAM
  • -ngl 99 — load all layers to GPU (attention parts stay on GPU, experts go to CPU)
  • -c 8192 — context window size
  • --jinja — enables chat template formatting
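For scripted launches, the same flags can be assembled programmatically. A minimal sketch: build_server_args is a hypothetical helper, the executable name is an assumption that depends on your platform, and only flags from the command above are used.

```python
import shlex


def build_server_args(model_path, context=8192, cpu_moe=True, executable="llama-server"):
    """Assemble the launch command from this section as an argument list."""
    args = [executable, "-m", model_path, "-fa", "1"]
    if cpu_moe:
        args.append("-cmoe")  # keep expert tensors in system RAM
    args += ["-ngl", "99", "-c", str(context), "--jinja"]
    return args


command = build_server_args("path/to/model")
print(shlex.join(command))
```

The resulting list can be passed directly to subprocess.run, which avoids quoting problems when the model path contains spaces.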

Benchmark Results

Testing on a system with Intel i7-14700, NVIDIA RTX 4090 (24 GB VRAM):

  • Standard -ngl (no optimization): 18.9 t/s
  • With -cmoe -ngl 99: 24.3 t/s (+29% improvement)
  • With -ncmoe 23: 34 t/s (+80% improvement!)
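The percentages follow directly from the measured speeds, with the standard -ngl run as the baseline. A quick check, using only the numbers listed above:

```python
baseline_tps = 18.9  # standard -ngl run
results = {"-cmoe -ngl 99": 24.3, "-ncmoe 23": 34.0}


def speedup_percent(new_tps, base_tps):
    """Relative improvement over the baseline, in percent."""
    return (new_tps / base_tps - 1.0) * 100.0


for flags, tps in results.items():
    print(f"{flags}: {tps} t/s ({speedup_percent(tps, baseline_tps):+.0f}%)")
```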

The -ncmoe N (or --n-cpu-moe N) parameter keeps the expert tensors of only the first N layers on the CPU and places the experts of the remaining layers on the GPU, giving finer control over the GPU/CPU split. In these tests, expert offloading brought VRAM use down to about 3 GB instead of 23 GB, a massive reduction.
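One way to pick a starting value for -ncmoe is to estimate how many layers' worth of experts fit in the VRAM left over after attention weights, KV cache, and buffers. The sketch below is a rough heuristic under stated assumptions: the layer count, per-layer expert size, and reserved amount are illustrative placeholders, not measured values, and layers_to_keep_on_cpu is a hypothetical helper.

```python
def layers_to_keep_on_cpu(n_layers, expert_gb_per_layer, vram_budget_gb, reserved_gb):
    """Estimate an -ncmoe value: layers whose experts do not fit in free VRAM."""
    # VRAM left for expert tensors after everything else is accounted for.
    free_gb = max(vram_budget_gb - reserved_gb, 0.0)
    # Whole layers of experts that fit into that space.
    layers_on_gpu = min(n_layers, int(free_gb // expert_gb_per_layer))
    return n_layers - layers_on_gpu


# Illustrative placeholder numbers; measure your own model and setup.
suggested = layers_to_keep_on_cpu(
    n_layers=36,              # assumed layer count
    expert_gb_per_layer=1.6,  # assumed size of one layer's experts
    vram_budget_gb=24.0,
    reserved_gb=4.0,          # assumed attention + KV cache + buffers
)
print(f"try -ncmoe {suggested} and adjust from there")
```

Treat the result as a first guess: if loading fails with an out-of-memory error, raise the value; if VRAM is left unused, lower it.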

Budget Hardware: AMD RX 6600

You don't need an RTX 4090 to benefit from this approach. Testing on a budget system with AMD RX 6600 (8 GB), Ryzen 5600G, 64 GB DDR4-3600:

Result: 13.1 t/s, a comfortable speed for interactive local use. This shows that MoE models make large language models accessible well beyond high-end hardware.

Optimized Launch for Maximum Throughput

.\llama-server.exe -hf unsloth/gpt-oss-120b-GGUF:Q8_K_XL -fa 1 -ncmoe 25 -ngl 99 -ub 4096 -b 4096 -c 65536 --jinja

Additional Optimization Parameters

  • -ub 4096 -b 4096 — increases batch sizes for faster context processing (prompt evaluation)
  • -ctk q8_0 -ctv q8_0 — KV-cache quantization when VRAM is constrained; reduces memory usage at a small quality cost
  • -fa 1 — Flash Attention is essential for efficient long-context handling
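To see why KV-cache quantization matters at long context, the cache size can be estimated from the model dimensions. This is a back-of-the-envelope sketch: the layer count, KV head count, and head dimension below are illustrative assumptions, and the formula ignores sliding-window attention and other optimizations, so real usage may be lower.

```python
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # blocks of 32 int8 values plus one 2-byte scale
}


def kv_cache_gib(n_layers, n_kv_heads, head_dim, context, cache_type):
    """Approximate KV-cache size in GiB for a full-attention model."""
    elements = 2 * n_layers * n_kv_heads * head_dim * context  # keys and values
    return elements * BYTES_PER_ELEMENT[cache_type] / 2**30


# Illustrative dimensions (assumptions), with the 65536-token context used above.
for cache_type in ("f16", "q8_0"):
    size = kv_cache_gib(n_layers=36, n_kv_heads=8, head_dim=64,
                        context=65536, cache_type=cache_type)
    print(f"{cache_type}: {size:.2f} GiB")
```

Under these assumptions, q8_0 roughly halves the cache footprint compared with f16, which can decide whether a long context fits on a small GPU at all.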

Supported MoE Models

The -cmoe technique works with many MoE models available today. Here's a catalog of notable ones:

  • Qwen3 series: From 4B to 480B parameters, with active parameters from 2.8B to 35B
  • GLM-4.5-Air: A competitive MoE model with efficient expert routing
  • DeepSeek V3.1: 671B total / 37B active parameters
  • Llama-4-Maverick: Meta's MoE entry
  • Kimi K2: From Moonshot AI
  • GPT-OSS-120B: 120B total / ~5B active — the star of this article

Models with small active parameter counts (3B-5B) can even run entirely on CPU for users without dedicated GPUs, though at lower speeds.
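A rough upper bound on CPU generation speed comes from memory bandwidth: each token requires reading the active weights from RAM once. The sketch below applies that rule of thumb. The bits-per-weight and bandwidth values are illustrative assumptions, and real throughput will be lower because of compute cost, cache effects, and other overhead.

```python
def max_tokens_per_second(active_params_billion, bits_per_weight, bandwidth_gb_s):
    """Bandwidth-bound ceiling: RAM bandwidth divided by bytes read per token."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# Illustrative numbers (assumptions): ~5B active parameters, ~4.25 bits per
# weight for a 4-bit quantization, and 50 GB/s of usable RAM bandwidth.
ceiling = max_tokens_per_second(active_params_billion=5.0,
                                bits_per_weight=4.25,
                                bandwidth_gb_s=50.0)
print(f"bandwidth-bound ceiling: about {ceiling:.0f} t/s")
```

The same estimate explains why a small active parameter count matters more than the total size: the ceiling scales with the weights read per token, not with the weights stored.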

Conclusion

The MoE architecture combined with the -cmoe optimization in llama.cpp is a game-changer for local LLM deployment. You no longer need a $2,000 GPU to run a 120-billion-parameter model. A mid-range GPU with 6-8 GB of VRAM and sufficient system RAM is enough to achieve practical generation speeds of 13-34 tokens per second. The key insight is simple: if only 4 out of 128 experts are active at any time, there's no reason to waste precious VRAM storing the other 124.
