Efficiently Running 70B Language Models on Local Machines
Chapter 1: Understanding Large Language Models
Running massive language models like those with 70 billion parameters on local hardware often seems daunting. A key challenge is determining if a single GPU can handle such tasks and what the minimal memory requirements would be.
A 70 billion parameter model requires around 130GB just to hold its parameters in 16-bit precision. Loading the model onto GPUs therefore calls for at least two 80GB A100s. When the model processes input during inference, memory usage climbs further, largely because of the attention computation: the intermediate results of standard attention grow quadratically with the length of the input sequence, so substantial memory is needed on top of the 130GB of weights.
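As a rough check, the 130GB figure follows directly from storing 70 billion parameters in 16-bit floating point. The small calculation below is a back-of-the-envelope sketch, not an exact accounting of any particular checkpoint:

# Rough parameter-memory estimate, assuming fp16/bf16 weights (2 bytes each).
num_params = 70e9
bytes_needed = num_params * 2          # 1.4e11 bytes
print(bytes_needed / 2**30)            # ~130 GiB just to hold the weights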
However, innovative strategies make it possible to run inference on a mere 4GB GPU without sacrificing model performance, and without resorting to compression techniques such as quantization, distillation, or pruning. This article explores the methods that drastically reduce memory requirements when working with such expansive models.
Video Description: This video explores distributed inference techniques for running large language models on multiple GPUs and nodes, focusing on Llama 3.1 70B.
Section 1.1: Layer-Wise Inference Explained
Layer-wise inference takes advantage of the sequential nature of forward propagation: during inference, each layer only needs the output of the previous one. As soon as a layer has finished its calculations, its weights are no longer required, so the memory they occupy can be released, keeping the GPU footprint roughly constant throughout the forward pass.
Most large language models today, especially those based on the self-attention mechanism introduced in Google's “Attention is All You Need,” follow a standardized structure known as the Transformer model. A typical large language model begins with an embedding projection layer and is followed by numerous identical transformer layers, often totaling 80. The model concludes with a normalization layer and a fully connected layer that predicts the probabilities for the next token.
During inference, only one layer is processed at a time, making it unnecessary to retain all layers in GPU memory simultaneously. Instead, the required layer can be loaded from storage, computations can be performed, and then the memory can be cleared. Consequently, the GPU memory is primarily burdened by the parameter size of a single transformer layer, roughly 1/80th of the total model, translating to about 1.6GB.
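A minimal sketch of this load-compute-release loop is shown below. The `layer_loaders` list and its callables are purely illustrative stand-ins for whatever routine reads a single layer's weights from disk; this is not AirLLM's actual implementation.

import torch

def layerwise_forward(layer_loaders, hidden_states, device="cuda"):
    # Illustrative layer-wise inference loop: `layer_loaders` is assumed to be
    # a list of callables, each returning one transformer layer (an nn.Module)
    # with its weights read from disk. Only one layer sits on the GPU at a time.
    for load_layer in layer_loaders:
        layer = load_layer().to(device)            # ~1.6GB for a 70B model
        with torch.no_grad():
            hidden_states = layer(hidden_states)   # run just this layer
        del layer                                  # drop the reference...
        torch.cuda.empty_cache()                   # ...and free the GPU memory
    return hidden_states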
In addition, various output caches must reside in GPU memory, with the KV cache being the largest, designed to avoid redundant calculations. For example, the KV cache for a 70 billion parameter model with an input length of 100 would require approximately 30MB of GPU memory. Monitoring tools indicate that the entire inference process can utilize less than 4GB of GPU memory.
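The 30MB figure is consistent with a Llama-2-70B-style configuration (80 layers, grouped-query attention with 8 key/value heads of dimension 128, fp16 values); the arithmetic below is an illustrative estimate under those assumptions.

# KV cache estimate for 100 input tokens, assuming a Llama-2-70B-like config.
seq_len, n_layers = 100, 80
kv_heads, head_dim, bytes_fp16 = 8, 128, 2
per_layer = 2 * seq_len * kv_heads * head_dim * bytes_fp16   # K and V tensors
total = per_layer * n_layers
print(total / 2**20)   # ~31 MiB across all 80 layers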
Subsection 1.1.1: Flash Attention Optimization
An important advancement in the realm of large language models is the Flash Attention optimization, which most modern LLM implementations now share. Standard attention materializes an n × n score matrix, so its intermediate memory grows as O(n²) with sequence length. The memory-efficient attention described in "Self-attention Does Not Need O(n²) Memory" computes the result incrementally instead, bringing memory complexity down to O(log n).
Flash Attention follows the same idea with a slightly higher memory complexity of O(n), but it also optimizes CUDA memory access patterns, which yields significantly faster inference and training. By processing the sequence in smaller segments, it only ever needs memory proportional to one segment rather than storing the full O(n²) set of intermediate results.
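The sketch below illustrates the segment-wise idea in plain PyTorch: queries are processed in chunks so that only one chunk's score block is ever materialized. It is a didactic simplification (single head, no causal mask, no fused CUDA kernel), not the actual Flash Attention implementation.

import torch

def chunked_attention(q, k, v, chunk_size=256):
    # q, k, v: (n, d) tensors for a single attention head. Instead of building
    # the full (n, n) score matrix, only a (chunk_size, n) block exists at once.
    n, d = q.shape
    scale = d ** -0.5
    out = torch.empty_like(q)
    for start in range(0, n, chunk_size):
        end = min(start + chunk_size, n)
        scores = (q[start:end] @ k.T) * scale       # (chunk, n)
        weights = torch.softmax(scores, dim=-1)     # (chunk, n)
        out[start:end] = weights @ v                # (chunk, d)
    return out

# Example: out = chunked_attention(torch.randn(1024, 64), torch.randn(1024, 64), torch.randn(1024, 64))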
Section 1.2: Model File Partitioning
Another effective strategy is partitioning the model file. The original model file is typically divided into chunks of around 10GB each. Given that each layer is roughly 1.6GB, loading a full 10GB chunk for every layer results in inefficient memory and disk resource utilization, as disk read speeds often become the limiting factor.
To optimize this, the original HuggingFace model file is restructured into layer-specific shards stored in the safetensors format. This format closely mirrors the in-memory layout and uses memory mapping, which speeds up loading.
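A simplified version of such a repartitioning step might look like the following. The `model.layers.<i>.` naming convention is assumed from Llama-style HuggingFace checkpoints, and the real AirLLM splitting code is more involved.

import os
from safetensors.torch import save_file, load_file

def shard_by_layer(state_dict, num_layers, out_dir="layer_shards"):
    # Write one safetensors file per transformer layer so that a single
    # ~1.6GB shard can later be memory-mapped and loaded on its own.
    os.makedirs(out_dir, exist_ok=True)
    for i in range(num_layers):
        prefix = f"model.layers.{i}."
        layer_tensors = {k: v for k, v in state_dict.items() if k.startswith(prefix)}
        save_file(layer_tensors, os.path.join(out_dir, f"layer_{i}.safetensors"))

# Later, a single layer can be loaded without touching the rest of the model:
# layer_5 = load_file("layer_shards/layer_5.safetensors", device="cpu")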
To implement this, the HuggingFace Accelerate library's meta device is utilized. This virtual device allows for the operation of exceptionally large models by loading only the code initially and not the actual model data, effectively minimizing memory usage to zero. Specific parts of the model can be dynamically transferred from the meta device to physical devices like CPUs or GPUs as required.
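In code, the meta-device step with Accelerate looks roughly like this; the model ID is a placeholder, and the per-layer materialization described in the comments is where a library such as AirLLM does its real work.

from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

# Build the model skeleton on the "meta" device: module structure and code are
# created, but no weight memory is allocated.
config = AutoConfig.from_pretrained("meta-llama/Llama-2-70b-hf")  # placeholder ID
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# At this point every parameter lives on the meta device and uses no RAM/VRAM.
# During inference, individual layers are materialized on a real device (CPU or
# GPU) by loading their shard from disk, then released again after use.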
Video Description: NVIDIA experts break down the mechanics of LLM inference, providing insights into how AI operates behind the scenes.
Chapter 2: Practical Implementation
The open-source library AirLLM simplifies the implementation process to just a few lines of code. Below are the usage instructions and a link to the AirLLM repository on GitHub.
def get_tokens(row, tokenizer):
    # Prepare input tokens for one data row; assumes the row exposes a "text"
    # field (the exact field name depends on the dataset being used).
    return tokenizer(row["text"], return_tensors="pt", truncation=True)
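For reference, a typical inference call with AirLLM has looked like the example below. It follows the usage shown in the project's documentation at the time of writing; class names (e.g. AirLLMLlama2 versus the newer airllm.AutoModel) and the exact arguments may differ between releases, so check the repository for the current API.

from airllm import AirLLMLlama2

MAX_LENGTH = 128
model = AirLLMLlama2("garage-bAInd/Platypus2-70B-instruct")

input_text = ["What is the capital of the United States?"]
input_tokens = model.tokenizer(input_text,
                               return_tensors="pt",
                               return_attention_mask=False,
                               truncation=True,
                               max_length=MAX_LENGTH,
                               padding=True)

generation_output = model.generate(
    input_tokens["input_ids"].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True)

print(model.tokenizer.decode(generation_output.sequences[0]))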