Running Real DeepSeek R1 671B on a Gaming PC — and Testing It on a 160k-Token Context

A hands-on experiment running the full 671-billion-parameter DeepSeek R1 model on consumer hardware — an i7-14700, 192 GB DDR5 RAM, and a single RTX 4090 — using aggressive IQ1_S_R4 quantization. The model maintains coherent reasoning even at the maximum 160k-token context window.

Almost all mainstream local models are launched in essentially the same way. This article focuses on running the full DeepSeek R1 671B model on consumer hardware and explores whether it stays coherent at very large context sizes — specifically 160,000 tokens, the model's published maximum.

Hardware Used

CPU: Intel Core i7-14700 (20 cores)
RAM: 192 GB DDR5 at 4800 MT/s (4 × 48 GB)
GPU tested separately: RTX 4090 (24 GB VRAM) and RTX 4060 Ti (16 GB VRAM)

Hardware setup for running DeepSeek R1 671B

The Core Tool: ik_llama.cpp

The standard inference engine for quantized models is llama.cpp, which uses the GGUF format. For this experiment the author uses ik_llama.cpp — a fork that significantly improves CPU performance and adds specific optimisations for Mixture-of-Experts (MoE) architectures like DeepSeek R1.

The Key Technique: Selective Tensor Offloading (`-ot`)

The breakthrough that makes this setup feasible is the -ot (override-tensor) parameter. DeepSeek R1 is a MoE model: at any given step only a small subset of its expert networks are activated. This creates a natural split:

GPU: Attention tensors — lightweight, used on every single forward pass.
CPU: FFN expert weights — very large, but each expert is accessed variably across tokens.

This hybrid placement doubles generation speed compared to a pure CPU run, achieving roughly 7 tokens per second.

Tensor offloading diagram — attention on GPU, experts on CPU

The Quantization: IQ1_S_R4 at 130 GB

The full-precision DeepSeek R1 model requires roughly 1.3 TB of storage. The IQ1_S_R4 quantization compresses it to approximately 130 GB — about a 10× reduction — while preserving more quality than simpler 1-bit schemes thanks to a refined rounding strategy. The author measures quality using KL Divergence (KLD) rather than perplexity, because KLD compares the full output probability distribution token-by-token rather than averaging across the sequence, making it more sensitive to subtle degradation.

Multi-Head Latent Attention (MLA)

DeepSeek R1 uses Multi-Head Latent Attention, which compresses the KV-cache by approximately 25× with no measurable quality loss. This is critical: without MLA, the KV-cache alone for a 160k-token context would consume tens of gigabytes of VRAM, making the experiment impossible on consumer hardware.

RTX 4090: supports up to 80k context at 200–300 tokens/sec prefill speed
RTX 4060 Ti: supports up to 32k context at around 60 tokens/sec prefill speed

Building ik_llama.cpp

Single GPU build:

cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF
cmake --build build --config Release -j28

Multi-GPU build with MoE optimisations:

cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF \
  -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1
cmake --build build --config Release -j28

Launch Command

./llama-server \
  -m "DeepSeek-R1-0528-IQ1_S_R4-00001-of-00003.gguf" \
  -mla 3 -fa \
  -ctk q8_0 \
  -amb 512 \
  -fmoe \
  -ot exps=CPU \
  -ngl 99 \
  -b 4096 -ub 4096 \
  -t 20 \
  -c 163840

Key flags explained:

-mla 3 — enable Multi-Head Latent Attention with absorption mode 3
-fa — Flash Attention for reduced VRAM usage
-ctk q8_0 — quantize the KV-cache to 8-bit to save VRAM
-fmoe — enable fused MoE kernel for faster expert dispatch
-ot exps=CPU — place all expert tensors on the CPU
-ngl 99 — offload all non-expert layers to GPU
-c 163840 — 160k token context window

llama-server running with DeepSeek R1 671B

Real-World Context Test: A Full Novel

To validate long-context coherence, the author loaded the entire Russian fantasy novel Labyrinth of Reflections (Лабиринт Отражений) by Sergei Lukyanenko — roughly 215,000 tokens — into the model's context. The quantized model was asked questions about passages from the middle of the book, character arcs, and plot details spanning the full narrative.

Results were impressive: the model accurately quoted passages from the middle of the book, correctly described character development across the entire story, and maintained factual consistency despite the 100k+ token context. Degradation only appeared when a rolling-buffer context shift was forced, causing the model to lose access to earlier text and begin fabricating citations.

Model correctly quoting from a Russian novel at 100k token context

Comparative Models Tested

DeepSeek V3 (Lighter Alternative)

An IQ1_S_R4 quantized version of DeepSeek V3 is available and fits in less RAM, but shows noticeably degraded KLD metrics compared to R1. For long-context reasoning tasks, R1 remains the better choice.

Llama 4 Maverick (MoE, 401B total / 17B active)

Llama 4 Maverick also uses a MoE architecture and supports up to 215k tokens via Sliding Window Attention (SWA). It maintains accuracy through 70k-token tests but shows degradation when queries reference text beyond the attention window — a fundamental limitation of the SWA design.

Gemma 3 27B (Dense Model)

A dense (non-MoE) model tested for comparison. Gemma 3 27B performs poorly on long contexts: it begins looping at 32k+ tokens and requires aggressive KV-cache quantization just to run. It is not suitable for 100k+ context tasks despite advertising SWA support.

Comparison of model performance at various context lengths

Performance Summary

Generation speed (hybrid CPU+GPU): ~7 tokens/sec
Prefill speed on RTX 4090 with batch 4096: 200–300 tokens/sec
Generation speed at 160k context (CPU-bound): ~0.5 tokens/sec
Memory bottleneck: DDR5 at 4800 MT/s provides ~70 GB/s bandwidth; 100+ GB/s would noticeably improve throughput

Tools for Less Technical Users

LM Studio — graphical interface, the easiest starting point
Jan — OpenAI API-compatible desktop client
text-generation-webui (oobabooga) — advanced parameter control, supports the -ot override-tensor feature

Conclusion

Even at the maximum 160k-token context, the aggressively quantized IQ1_S_R4 version of DeepSeek R1 responds coherently. The combination of MoE architecture, Multi-Head Latent Attention, and selective tensor offloading makes running a 671B-parameter model on consumer hardware not just possible, but practically useful. The experiment exceeded expectations: 192 GB of DDR5 RAM plus a single consumer GPU is enough to have a capable reasoning model available locally.

Running Real DeepSeek R1 671B on a Gaming PC — and Testing It on a 160k-Token Context

Hardware Used

The Core Tool: ik_llama.cpp

The Key Technique: Selective Tensor Offloading (`-ot`)

The Quantization: IQ1_S_R4 at 130 GB

Multi-Head Latent Attention (MLA)

Building ik_llama.cpp

Launch Command

Real-World Context Test: A Full Novel

Comparative Models Tested

DeepSeek V3 (Lighter Alternative)

Llama 4 Maverick (MoE, 401B total / 17B active)

Gemma 3 27B (Dense Model)

Performance Summary

Tools for Less Technical Users

Conclusion

Further reading

Why Airships Never Took Off. Part 12: Italian Semi-Rigid Airships

Why Airships Never Took Off. Part 11: Aircraft Carriers in the Sky

Why Airships Never Took Off. Part 10: The Most Famous and Successful Zeppelin

Why Airships Never Took Off. Part 9: Ashes of War and New Opportunities

Hardware Used

The Core Tool: ik_llama.cpp

The Key Technique: Selective Tensor Offloading (-ot)

The Quantization: IQ1_S_R4 at 130 GB

Multi-Head Latent Attention (MLA)

Building ik_llama.cpp

Launch Command

Real-World Context Test: A Full Novel

Comparative Models Tested

DeepSeek V3 (Lighter Alternative)

Llama 4 Maverick (MoE, 401B total / 17B active)

Gemma 3 27B (Dense Model)

Performance Summary

Tools for Less Technical Users

Conclusion

Further reading

Why Airships Never Took Off. Part 12: Italian Semi-Rigid Airships

Why Airships Never Took Off. Part 11: Aircraft Carriers in the Sky

Why Airships Never Took Off. Part 10: The Most Famous and Successful Zeppelin

Why Airships Never Took Off. Part 9: Ashes of War and New Opportunities

The Key Technique: Selective Tensor Offloading (`-ot`)