Neural Networks for Beginners. Part Zero. An Overview

A comprehensive introduction to neural network architecture and large language models, covering everything from basic neurons and layers to attention mechanisms, training via backpropagation, and parallelization strategies needed for training on GPU clusters.

What Is a Neural Network?

At its core, a neural network consists of three main components: neurons, layers with weights, and activation functions. There is no magic here — this is just many simple operations with numbers performed on specialized chips.

Neurons

The concept of artificial neurons was inspired by biological research from the 1940s. A neuron can be thought of as a simple function that takes input signals from several neighbors, computes a result, and if the result exceeds a certain threshold, the neuron activates. Each circle in the simple example below is a separate neuron in an artificial neural network.

Natural neurons transmit signals through synapses, while artificial ones model this process mathematically through weighted connections between nodes in the network.
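The description above can be sketched in a few lines of Python. This is a toy illustration with made-up weights and a hard threshold, not a real trained neuron:

```python
import numpy as np

def neuron(inputs, weights, bias):
    """A single artificial neuron: weighted sum of inputs plus bias,
    passed through a threshold activation (fires if the sum is positive)."""
    z = np.dot(inputs, weights) + bias   # weighted sum of incoming signals
    return 1.0 if z > 0 else 0.0         # threshold: activate or stay silent

# Three input signals from neighboring neurons
x = np.array([0.5, -0.2, 0.8])
w = np.array([0.9, 0.3, -0.5])   # connection weights
b = 0.1                          # bias shifts the firing threshold

output = neuron(x, w, b)
```

Real networks replace the hard threshold with smooth activation functions so that gradients can flow through during training.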

Layers and Weights

Neural networks are organized into structured layers: input, hidden, and output. Each layer contains neurons connected to the next layer through connections that have specific weights.

From layer to layer, the model recognizes increasingly complex patterns and relationships in the input data. The weights of these connections are the model's parameters, learned during training and fixed during inference.

Activations

Without nonlinear activation functions between layers, all layers could be mathematically collapsed into a single linear transformation. The expressive power and flexibility of neural networks come from the nonlinear activation function applied within each layer.

Activation functions (sigmoid, ReLU, GELU) amplify useful signals and suppress noise, giving the system its nonlinearity.
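A minimal NumPy sketch of the three activation functions just mentioned (the GELU here is the common tanh approximation used in GPT-style models):

```python
import numpy as np

def sigmoid(x):
    # Squashes any input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Passes positive values through unchanged, zeroes out negatives
    return np.maximum(0.0, x)

def gelu(x):
    # Smooth ReLU-like curve (tanh approximation of GELU)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.array([-2.0, 0.0, 2.0])
```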

Concrete Example — Multi-Layer Perceptron (MLP)

The multi-layer perceptron (MLP) is the simplest neural network architecture; the classic example task is recognizing handwritten digits. The system consists of an input layer (784 neurons for a 28x28 pixel image), two hidden layers of 16 neurons each, and an output layer (10 neurons for digits 0-9).

Each connection between layers has a weight that determines the connection strength. The total number of parameters is approximately 13,002 (weights plus biases).

In the first hidden layer, the neurons that light up are those detecting simple shapes (edges, curves), while in the second layer, neurons recognizing more complex structures activate, allowing the model to identify digits.
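The parameter count quoted above can be verified directly from the layer sizes given in the text:

```python
# Parameter count for the MLP described above: 784 inputs -> 16 -> 16 -> 10 outputs
layers = [784, 16, 16, 10]

# Each pair of adjacent layers is fully connected: one weight per connection
weights = sum(a * b for a, b in zip(layers, layers[1:]))  # 12,544 + 256 + 160
# Each non-input neuron also has one bias
biases = sum(layers[1:])                                  # 16 + 16 + 10
total = weights + biases                                  # 13,002
```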

Principles of Neural Network Operation

Inference

Inference is the process of using an already trained neural network. When you ask a model to perform a task (write code, summarize text, generate an image), inference is happening. The model can run on a single server with several GPUs, and responses typically arrive within seconds.

Tokenization

The first thing the model does is split the input text into tokens — the minimal units of processing. Tokens can be whole words, word parts, or individual characters. For GPT-3, the vocabulary contains 50,257 tokens.

Embedding

Converting tokens into numerical vectors. Each word is mapped to a vector of numbers with a dimensionality equal to the model's dimension (for GPT-3, this is 12,288). The closer words are in meaning, the more similar their vectors are. Positional embeddings are added to vocabulary embeddings, describing the word's position in the text.
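A toy sketch of the embedding step. GPT-3's real sizes are vocab 50,257 and dimension 12,288; here tiny made-up sizes and random matrices stand in for the learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (GPT-3: vocab_size=50,257, d_model=12,288, context up to 2,048)
vocab_size, d_model, max_len = 100, 8, 16

token_emb = rng.normal(size=(vocab_size, d_model))  # learned in a real model
pos_emb = rng.normal(size=(max_len, d_model))       # learned in a real model

token_ids = [3, 41, 7]  # a tokenized phrase of three tokens
# Each token's input vector = its vocabulary embedding + its position embedding
x = token_emb[token_ids] + pos_emb[:len(token_ids)]
```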

Attention Mechanism (Self-Attention)

A mechanism that allows the model to find relationships between words. For each token, three vectors are computed: Query (what I want to know), Key (what I know), and Value (what I communicate). Dot products of all pairs Q_i and K_j are computed (where i and j range over all tokens in the phrase) and scaled by the square root of the key dimension. The results are transformed into probabilities via softmax, determining how strongly one word influences another.
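A minimal single-head self-attention in NumPy, with toy dimensions and random weights standing in for trained ones; the division by the square root of the key dimension follows standard transformer practice:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # numerically stable
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one head."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv    # Query, Key, Value for every token
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # dot products of all pairs (Q_i, K_j)
    weights = softmax(scores)           # each row becomes a probability dist.
    return weights @ V, weights         # weighted mix of Value vectors

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 4, 8, 4
x = rng.normal(size=(n_tokens, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out, attn = self_attention(x, Wq, Wk, Wv)
```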

FFN and Nonlinear Transformations

Feed Forward Network — a two-layer micro-network inside each transformer block. The first linear layer expands the token vector's dimensionality by a factor of 4 (in GPT-3, from 12,288 to 49,152); the second projects it back to the original size. Between the two, a nonlinear activation function (GELU, ReLU) is applied, amplifying significant connections and weakening noise.
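The FFN block in miniature, with a toy dimension in place of GPT-3's 12,288 and random weights in place of trained ones:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    """Two-layer feed-forward block: expand 4x, apply nonlinearity, project back."""
    h = gelu(x @ W1 + b1)   # d_model -> 4*d_model
    return h @ W2 + b2      # 4*d_model -> d_model

rng = np.random.default_rng(0)
d_model = 8                               # 12,288 in GPT-3
W1 = rng.normal(size=(d_model, 4 * d_model)) * 0.1
b1 = np.zeros(4 * d_model)
W2 = rng.normal(size=(4 * d_model, d_model)) * 0.1
b2 = np.zeros(d_model)

x = rng.normal(size=(3, d_model))         # three token vectors
y = ffn(x, W1, b1, W2, b2)                # same shape in, same shape out
```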

Layer Norm and Residual Connection

Layer Norm normalizes the vector to zero mean and unit variance, then applies two learned parameter vectors for scaling and shifting. The residual connection adds the block's original input vector to its output vector, improving training stability and combating vanishing gradients.
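Both operations fit in a few lines of NumPy (toy sizes; gamma and beta are learned in a real model):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance,
    then rescale (gamma) and shift (beta) with learned parameters."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
d_model = 8
x = rng.normal(size=(3, d_model)) * 5 + 2    # token vectors of arbitrary scale
gamma, beta = np.ones(d_model), np.zeros(d_model)

normed = layer_norm(x, gamma, beta)

# Residual connection: the block's input is added back to its output
block_output = rng.normal(size=(3, d_model)) # stand-in for attention/FFN output
y = x + block_output
```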

Output Layer

For text generation tasks, only the vector of the LAST token is needed. This vector is multiplied by a weight matrix, creating logits (confidence scores) for all vocabulary tokens. Softmax is applied to the logits, converting them into probabilities. Then sampling is performed to select the next word.

Model Temperature

This is how bold the model is in choosing the next word, or how creative it is. At low temperature, the model chooses the most probable words. At high temperature — more diverse and unexpected options, introducing an element of randomness into generation.
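Temperature is applied by dividing the logits before softmax. A sketch with a made-up four-token vocabulary:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Scale logits by 1/temperature, softmax into probabilities, then sample."""
    if rng is None:
        rng = np.random.default_rng()
    z = logits / temperature
    e = np.exp(z - z.max())          # numerically stable softmax
    probs = e / e.sum()
    return rng.choice(len(logits), p=probs), probs

logits = np.array([4.0, 2.0, 1.0, 0.5])  # confidence scores, 4-token toy vocab

_, cold = sample_next_token(logits, temperature=0.1)  # near-deterministic
_, hot = sample_next_token(logits, temperature=2.0)   # flatter, more random
```

At temperature 0.1 virtually all probability mass lands on the top token; at 2.0 the distribution flattens and less likely tokens get a real chance.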

KV Cache

An optimization that saves previously computed Key and Value vectors of preceding tokens. Only the new token is passed through all blocks, while values for old tokens are taken from this saved KV cache. This reduces computational costs during inference but requires significant memory. For GPT-3 with a context of 2,048 tokens, this requires approximately 9 GB of memory per request.
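The 9 GB figure can be reproduced with back-of-envelope arithmetic, assuming GPT-3's 96 transformer layers and fp16 (2-byte) storage:

```python
# KV cache size for GPT-3 at full context
n_layers = 96          # transformer blocks in GPT-3
d_model = 12288        # model dimension
context = 2048         # tokens in the context window
bytes_per_value = 2    # fp16

# For every token, each layer stores one Key and one Value vector of size d_model
kv_bytes = n_layers * 2 * context * d_model * bytes_per_value
kv_gb = kv_bytes / 1e9   # roughly 9.7 GB per request
```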

Context

The history of interaction within a single request. The model reprocesses the entire context for each newly generated token. If the dialogue exceeds the limit, either the oldest part is cut off or the history is summarized. The limit varies: from 2,048 tokens in GPT-3 to 1M in modern models. LLMs only ever predict the next probable word — this is the key concept of how they function.

Training

The training process is a lengthy stage of fitting model parameters. During this stage, millions of examples are passed through the model, with each cycle improving its performance. Training requires dedicated clusters of thousands of GPU cards connected by a specialized network and can run continuously for several months.

Key stages of one training iteration:

  1. Forward pass — feeding a phrase into the model and predicting the next word
  2. Loss function computation — determining how poorly the parameters are fitted
  3. Backward pass — computing correction matrices layer by layer in reverse direction
  4. Parameter update — applying corrections to all model parameters

This process is repeated millions of times on millions of training examples until predictions become satisfactory.
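The four stages can be seen end to end in a toy next-token model trained with plain NumPy: a single embedding matrix and output projection over a made-up 5-token vocabulary, purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab, d = 5, 8
E = rng.normal(size=(vocab, d)) * 0.1    # token embeddings
W = rng.normal(size=(d, vocab)) * 0.1    # output projection

# Training pairs: input token -> target next token
data = [(0, 1), (1, 2), (2, 3), (3, 4)]
lr = 0.5

def loss_of(E, W):
    total = 0.0
    for x, y in data:
        logits = E[x] @ W
        p = np.exp(logits - logits.max()); p /= p.sum()
        total -= np.log(p[y])            # cross-entropy on the target token
    return total / len(data)

initial_loss = loss_of(E, W)
for step in range(200):
    for x, y in data:
        # 1. forward pass: predict next-token probabilities
        logits = E[x] @ W
        p = np.exp(logits - logits.max()); p /= p.sum()
        # 2. loss gradient (softmax + cross-entropy): p - one_hot(target)
        g = p.copy(); g[y] -= 1.0
        # 3. backward pass: gradients for W and the used embedding row
        gW = np.outer(E[x], g)
        gE = W @ g
        # 4. parameter update against the gradient
        W -= lr * gW
        E[x] -= lr * gE
final_loss = loss_of(E, W)
```

The error falls with each pass, which is exactly the behavior described above at toy scale.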

Hyperparameters

Hyperparameters are global model parameters set once before training begins:

  • Embedding matrix dimensionality
  • Number of layers
  • Number of neurons in layers
  • Gradient step size
  • Block architecture
  • Number of attention heads
  • Activation functions

They are not trained, but rather define the rules of model operation. That's why they're called "hyper" parameters.

What the Training Process Looks Like

Step 0: The model contains hundreds of enormous matrices filled with random numbers. At this stage, the model cannot meaningfully predict text.

Step 1: Preparing training data. Text is sliced into examples of the form "input sequence -> target token." For example:

Cat -> sat
Cat sat -> on
Cat sat on -> the
Cat sat on the -> windowsill

Step 2: The model performs a forward pass — exactly the same as during inference. It tries to predict the next word based on current (incorrect) weights.

Step 3: The loss function is computed — a quantitative assessment of the prediction error. For instance, if the model predicted "purple" when it should have been "on," the error will be very high.

Step 4: A backward pass is performed to compute correction matrices (gradients) across all layers, from end to beginning.

Step 5: All model weights are updated based on the computed gradients in the direction of decreasing error.

The process repeats millions of times. The error should gradually decrease with each iteration.

Backpropagation

Backpropagation is the method that allows all model parameters to be efficiently corrected based on the final error. It is the mechanism thanks to which almost all modern neural networks exist.

First, the total model error (a scalar value) is computed. From this error, a correction vector is formed that indicates, for each parameter, how to change it so that the error decreases. This information is passed sequentially from the output layer back to the input layer, layer by layer.

Without backpropagation, one would have to perturb each of the billions of parameters individually and check whether the error improved, which makes training time astronomically long and is practically impossible for large models. With backpropagation, a model with billions of parameters can be trained in months instead of centuries.

Gradient Descent

The process of fitting parameters can be visualized as descending from a mountain into a valley in multidimensional space, where each dimension corresponds to one model parameter.

Imagine you're on a mountainside and want to descend to the valley (the region of minimum error). At each step, you look around, determine the direction where the descent is steepest, and take a step in that direction. The direction is determined by the gradient — a vector of partial derivatives of the loss function with respect to each parameter.

The gradient is a vector containing partial derivatives of the loss function with respect to each parameter. Each component of the gradient shows how much and in which direction the corresponding parameter needs to be changed to decrease the error.

For linear regression y = k0 + k1*x, the process looks like this: knowing the current height of the loss function, derivatives with respect to k0 and k1 are computed, producing a vector of two numbers, and the parameters are changed in the direction of this vector.
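The linear-regression case is small enough to run directly. Toy data with a known answer (k0 = 1, k1 = 2):

```python
import numpy as np

# Fit y = k0 + k1*x by gradient descent on mean squared error
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x          # ground truth: k0 = 1, k1 = 2

k0, k1 = 0.0, 0.0          # start from uninformed parameters
lr = 0.1                   # learning rate: the step size

for _ in range(2000):
    pred = k0 + k1 * x
    err = pred - y
    # Gradient: partial derivatives of MSE with respect to k0 and k1
    g0 = 2 * err.mean()
    g1 = 2 * (err * x).mean()
    # Step against the gradient, i.e. downhill
    k0 -= lr * g0
    k1 -= lr * g1
```

A larger learning rate here overshoots and diverges; a much smaller one needs many more steps, exactly the trade-off described below.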

For GPT-3, the space has 175 billion dimensions (one per parameter), the loss function is defined in this space, and the same principle is used — moving in the direction of the steepest descent.

Learning Rate: An important hyperparameter — the step size in the gradient direction. Too large a step leads to overshooting the optimum; too small leads to very slow learning.

Training Data

Effective training of modern LLMs requires an enormous number of examples. GPT-3, for instance, has seen virtually all digitized and publicly available text information — hundreds of billions of tokens.

Batch: Data is processed neither one example at a time nor as the entire dataset at once, but in portions — batches. Processing one example at a time gives too noisy gradients; processing the entire dataset at once requires an unacceptable amount of memory. During GPT-3 training, the batch size was 3.2 million tokens. A larger batch size provides more stable gradients.

Epochs and repetitions: On average, each training example passes through the model only once. However, high-quality data may be passed through several times. During fine-tuning, some examples are used multiple times.

Model Size, Quality, and MoE

There is a correlation between the number of parameters and the quality of model results. However, this only holds up to a certain point, beyond which saturation occurs.

Problems with large models:

  • Dead neurons — parameters that contribute nothing to the result yet still consume compute and electricity
  • Overfitting — the model begins to "memorize" examples instead of finding patterns, reducing quality

Mixture of Experts (MoE): A sparse activation method where the neural network is divided into groups of neurons (experts), each specializing in certain aspects of information processing. A router is placed before the experts, choosing which one will handle the request.

Advantages of MoE:

  • Increases the number of parameters without significantly increasing computational complexity
  • Uses sparse activation: only some experts participate for a given query
  • During inference, only the needed experts are used, reducing computational load

Instead of one pair of FFN matrices in each transformer block, there are several pairs of matrices (experts). The router determines which expert processes the input data. Each expert has the same FFN architecture but its own trained weights. During training, all experts and the router are used. During inference, typically only one or two experts are activated.
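A toy sketch of top-2 routing over four experts, with random weights and illustrative sizes (real MoE training also adds load-balancing objectives, omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

# Each expert is its own small FFN: same architecture, different weights
experts = [
    (rng.normal(size=(d_model, 4 * d_model)) * 0.1,
     rng.normal(size=(4 * d_model, d_model)) * 0.1)
    for _ in range(n_experts)
]
router = rng.normal(size=(d_model, n_experts)) * 0.1  # learned routing matrix

def moe_layer(x):
    """Route a token vector to the top-k experts and mix their outputs."""
    scores = x @ router
    e = np.exp(scores - scores.max()); probs = e / e.sum()
    chosen = np.argsort(probs)[-top_k:]       # sparse: only k experts run
    out = np.zeros_like(x)
    for i in chosen:
        W1, W2 = experts[i]
        h = np.maximum(0.0, x @ W1)           # expert FFN with ReLU
        out += probs[i] * (h @ W2)            # weighted by router confidence
    return out, chosen

x = rng.normal(size=d_model)
y, used = moe_layer(x)
```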

Simple Arithmetic

Memory requirements for GPT-3:

  • Model parameters (175 billion x 2 bytes): 350 GB
  • Intermediate inference data: hundreds of GB for Q, K, V vectors and activations
  • During training: 3-6 TB of memory for batches, activations, gradients, and service data

Number of GPUs needed for training:

  • To fit the model on NVIDIA H100 GPUs (80 GB memory): minimum 5 cards
  • For full training with working memory: 40-50 cards
  • For efficient training in reasonable time: thousands of cards
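The headline figures above follow from simple arithmetic, assuming fp16 (2 bytes per parameter):

```python
import math

# Memory for GPT-3's parameters alone, in fp16
n_params = 175e9
bytes_per_param = 2
param_gb = n_params * bytes_per_param / 1e9   # 350 GB

# Minimum H100 cards (80 GB each) just to hold the parameters
h100_gb = 80
min_cards = math.ceil(param_gb / h100_gb)     # 5
```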

Time costs: One training iteration can take a couple of minutes. At that rate, 100,000 iterations amount to approximately 4.6 months of continuous training.

For larger models: GPT-4 (rumored to contain 1.8 trillion parameters) requires tens of thousands of GPU cards. Even the newest NVIDIA B300 (288 GB onboard memory) is insufficient for fully training large models on a single card.

Conclusion: The only practical path for training large models is parallelizing computations across many GPUs.

Parallelization

Parallelization is the distribution of a neural network training task across multiple GPU cards to increase performance and reduce training time.

Global approaches:

  1. Distribute the model itself across multiple GPUs
  2. Parallelize training data: different copies of the model learn from different data

Usually, both approaches are combined for optimal resource utilization.

How training is parallelized directly affects the requirements for network infrastructure. If model copies on different GPUs frequently exchange data over a slow network, all cards will sit idle waiting for the exchange to complete, wasting expensive compute.

Data Parallelism (DP)

The principle: "We slice the batch of examples into smaller batches."

The batch of training examples is divided into equal parts and distributed across different GPUs. Each GPU receives a copy of the entire model and processes its portion of data independently.

The training process:

  1. N identical copies of the model with the same weights are created on N GPUs
  2. Each copy receives its own data batch
  3. Each copy performs a forward pass, computes error and gradients
  4. Copies exchange gradients through an AllReduce operation
  5. Each copy computes the average value of gradients
  6. All copies simultaneously update weights with the averaged gradient

AllReduce operation: A network operation that collects matrices of identical dimensions from all nodes and creates one resulting matrix through averaging. After AllReduce, all model copies have identical weights.
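The effect of AllReduce can be simulated on one machine. A real implementation (e.g. NCCL's ring AllReduce) exchanges chunks over the network; mathematically the result is just an element-wise average shared by all nodes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_gpus = 4

# Each GPU computed gradients for the same parameter matrix on its own
# slice of the batch (simulated here with random values)
local_grads = [rng.normal(size=(3, 3)) for _ in range(n_gpus)]

def all_reduce_mean(grads):
    """Simulate AllReduce: every node ends up with the element-wise average."""
    avg = sum(grads) / len(grads)
    return [avg.copy() for _ in grads]   # every GPU receives the same matrix

synced = all_reduce_mean(local_grads)
```

After this step all copies apply the identical averaged gradient, so their weights stay in lockstep.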

In some cases (for example, when using BatchNorm), it is necessary to synchronize not only gradients but also normalization statistics on the forward pass so that all training is consistent.

Data Parallelism has medium sensitivity to network latency and bandwidth. It is recommended to localize this type of parallelization within a single server rack to minimize network delays.

Sequence Parallelism (SP)

The principle: "We slice one long example into several shorter pieces."

This type of parallelization is used for processing very long sequences that don't fit in a single GPU's memory: genomic sequences, climate time series, scientific papers with full historical context, very long text contexts for LLMs.

One long sequence (e.g., 1 million tokens) is divided into several chunks (e.g., 10 chunks of 100,000 tokens each) and distributed across N GPUs.

The critical problem: In transformers, each token "sees" all other tokens through the attention mechanism (self-attention). When the sequence is split, a token from chunk #1 cannot "see" tokens from chunk #2. This breaks the attention mechanism, which is critical for understanding meaning.

The synchronization solution: Synchronization of attention vectors (Q, K, V) between GPUs processing neighboring chunks of the sequence is required, so that each token receives information about neighboring tokens from other GPUs.

Sequence Parallelism has high sensitivity to network bandwidth, as it requires frequent exchange of intermediate results between GPUs when processing each layer.

Both types of parallelization require synchronization between GPUs, which makes the network infrastructure critically important for the efficiency of training large models. If data moves too slowly, the cluster sits idle during months-long training runs on thousands of GPUs. This is why specialized low-latency, lossless networks are critical for preventing cluster downtime.