The Go Scheduler: The Most Detailed Guide in Plain Language

A deep dive into how the Go runtime scheduler works — from the fundamental problem of OS threads all the way to work stealing, the network poller, and asynchronous preemption — built up step by step so that every design decision makes intuitive sense.

Go is built around three core concurrency primitives: goroutines that can run concurrently and independently of each other, a scheduler that manages them, and channels that help goroutines exchange data. This guide focuses on the scheduler — arguably the most important and least-understood piece of the runtime.

Before diving in, a quick note on prerequisites: to get the most out of this article you should already be comfortable with goroutines and basic Go concepts.

The Foundation: Concurrency vs. Parallelism

These two terms are often used interchangeably but mean different things. Concurrency is about structure — it describes independent processes that can make progress without waiting on each other. Parallelism is about execution — it means multiple things are running at the exact same instant on multiple CPU cores.

You can have concurrency without parallelism (a single core interleaving tasks), parallelism without concurrency (e.g. SIMD-style execution, where many units apply the same operation to different data in lockstep), or both at once. Go aims for both.

OS Threads and Why They Are Expensive

At the operating system level, a thread can be in one of three states:

  • Executing — running on a CPU core right now.
  • Runnable — ready to run but waiting for an available core.
  • Waiting — blocked on I/O, a lock, or a system call.

The OS scheduler moves threads between these states. The operation of saving one thread's state and loading another's is called a context switch. Context switches are expensive: they involve saving and restoring CPU registers, flushing CPU caches, and kernel-mode transitions. When an application creates thousands of threads, context-switch overhead can dominate total CPU time.

There is also a memory cost: each OS thread reserves roughly 2 MB of virtual memory for its stack (the exact default varies by OS). A program with 10,000 threads therefore reserves ~20 GB just for stacks — before doing any work.

Building the Go Scheduler, Step by Step

The Go scheduler did not spring into existence fully formed. Understanding why it is designed the way it is requires walking through the problems it was designed to solve, one at a time.

Step 1 — The 1:1 Model: One Thread per Goroutine

The simplest possible design: every goroutine gets its own OS thread. Simple to reason about, but fatally flawed at scale. With 100,000 goroutines you need 100,000 threads, 200 GB of stack memory, and catastrophic context-switch overhead.

Step 2 — Thread Pooling: Reuse Instead of Destroy

Rather than destroying a thread when its goroutine finishes, keep the thread alive and assign it a new goroutine. This eliminates the cost of thread creation and destruction, but the fundamental 1:1 ratio remains — and with it, the memory and context-switch problems.

Step 3 — The M:N Model: Many Goroutines, Few Threads

Limit the number of OS threads running Go code to something close to the number of CPU cores (configurable via runtime.GOMAXPROCS). Maintain a Global Run Queue (GRQ) of goroutines waiting to be scheduled. Each OS thread picks a goroutine from the queue, runs it, and when the goroutine yields or blocks, picks the next one.

This is the M:N model: M goroutines multiplexed onto N OS threads. The memory problem is solved (goroutine stacks start at about 2 KB and grow on demand), and context switches happen in user space rather than the kernel — much cheaper.

But a new problem appears: all threads contend for the single global queue under a mutex. With many cores, this mutex becomes a scalability bottleneck.

Step 4 — Local Run Queues: Introducing the P

The solution is to introduce a new abstraction — the Processor (P). Each P owns a local run queue (LRQ) of goroutines. The three entities of the GMP model are now:

  • G (Goroutine) — the unit of concurrent work.
  • M (Machine) — an OS thread.
  • P (Processor) — a logical processor that holds a local run queue and acts as the bridge between G and M.

The number of Ps is set by GOMAXPROCS and defaults to the number of available CPU cores. Each M must hold a P to run goroutines. When a goroutine is created it is placed in the current P's local queue. Goroutines in the LRQ do not need a global lock to be scheduled — eliminating the bottleneck.

Step 5 — Work Stealing: Keeping Every Core Busy

With local queues, a new problem surfaces: one P's queue might be empty while another's is overflowing. Idle Ps waste CPU capacity.

The solution is work stealing: when a P's local queue is empty, it first checks the Global Run Queue, then randomly picks another P and steals half of its goroutines. This keeps all cores busy with minimal coordination.

The Global Run Queue still matters: even when its local queue is non-empty, a P checks the GRQ once every 61 scheduling ticks, so goroutines that land there are not starved by a steady stream of local work.

Step 6 — Syscall Handoff: Blocking Without Blocking

When a goroutine makes a blocking system call (like reading from a file), the underlying OS thread blocks. If that M holds a P, the P becomes stranded — no other goroutines can run on it while the syscall completes.

The handoff mechanism solves this: when a goroutine blocks on a syscall, its P is detached from the blocking M and handed off to a different M (possibly a newly created one). The other goroutines in the LRQ can now continue running. When the syscall eventually completes, the M tries to reacquire its old P, or any idle P; if none is available, the goroutine is placed on the Global Run Queue and the thread is parked for reuse.

Step 7 — The Network Poller: Asynchronous I/O Without Blocking Threads

The handoff mechanism works for blocking syscalls, but network I/O is different — and very common in Go programs. Blocking a thread on every network read or write would be extremely wasteful.

Go uses the OS's asynchronous I/O mechanisms (epoll on Linux, kqueue on macOS/BSD, IOCP on Windows) through an internal component called the Network Poller. When a goroutine performs a network operation that would block, it is parked and registered with the Network Poller instead of blocking its M. The M is immediately freed to run other goroutines. When the OS signals that the I/O is ready, the goroutine is placed back into a run queue and eventually scheduled again.

This is why Go can efficiently handle hundreds of thousands of concurrent network connections with a small, fixed pool of OS threads.

Step 8 — Preemption: Fairness and Latency

All the mechanisms above assume goroutines cooperate — that they occasionally yield control. But what if a goroutine runs a tight compute loop with no function calls, no I/O, and no channel operations? Without preemption it would monopolize its P forever, starving all other goroutines on that processor.

Early versions of Go relied on cooperative preemption: the compiler inserted a check into each function prologue (piggybacking on the stack-growth check), and a goroutine could yield only at those points. This failed for tight loops that make no function calls.

Go 1.14 introduced asynchronous preemption: a background monitoring thread called sysmon watches for goroutines that have been executing for more than 10 ms and asks the runtime to preempt them — on Unix-like systems by sending a SIGURG signal to the goroutine's thread. The signal handler records a safe preemption point, and the goroutine is forced to yield — even mid-loop. This guarantees fair scheduling and bounded worst-case latency regardless of what user code does.

Sysmon: The Scheduler's Watchdog

Sysmon is a special OS-level thread (it does not need a P) that runs in the background and performs several housekeeping tasks:

  • Detects goroutines that have been running too long and marks them for preemption.
  • Polls the Network Poller for completed I/O and moves the associated goroutines back into run queues.
  • Retakes Ps from threads that have been blocked in syscalls for too long.
  • Forces GC-related operations if the GC has not run recently.

The Full Picture

Putting it all together, the Go scheduler is a user-space, work-stealing, M:N scheduler built around the GMP model. Its key properties:

  • Goroutines are cheap to create (small initial stack, no kernel involvement) and cheap to switch between (user-space context switch).
  • Local run queues minimize lock contention across cores.
  • Work stealing keeps all processors busy.
  • Syscall handoff prevents blocking calls from stranding local queues.
  • The Network Poller handles async I/O without tying up OS threads.
  • Asynchronous preemption (Go 1.14+) guarantees fairness even for CPU-bound loops.

This design is what allows a Go program to spawn a million goroutines, saturate all CPU cores, and still handle network I/O efficiently — all without the programmer needing to think about thread pools, mutexes on queues, or I/O completion callbacks.
