Unlocking Inference Speed: Marlin's Breakthrough for 4-bit LLMs
Chapter 1: Introduction to Marlin
Large language models (LLMs) are often too cumbersome for typical consumer hardware. To make them more manageable, various methods have been developed to quantize LLMs and reduce their memory footprint. While many recent algorithms for 4-bit quantization come with their own optimized CUDA kernels, the actual inference speed of these quantized models still leaves much to be desired.
For instance, inference with the INT4 data type involves mixed INT4xFP16 operations: the 4-bit weights must be dequantized on the fly before they can be multiplied with the FP16 activations, which can be slow even on contemporary GPUs without a carefully optimized CUDA kernel. Researchers at the Institute of Science and Technology Austria (ISTA) have introduced Marlin, a highly optimized INT4xFP16 matmul kernel that promises nearly ideal inference speed: up to 4 times faster than FP16.
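To make this concrete, here is a minimal sketch of what an INT4xFP16 operation conceptually involves. The shapes, scales, and zero-points below are made up, and real GPTQ checkpoints store packed tensors with group-wise scales, but the idea is the same: dequantize the 4-bit weights to FP16, then run a standard FP16 matmul.
import torch
# Hypothetical 4-bit weights: integers in [0, 15] with per-column scale and zero-point.
q_weight = torch.randint(0, 16, (4096, 4096), dtype=torch.uint8)
scale = torch.rand(4096, dtype=torch.float16) * 0.01
zero = torch.full((4096,), 8, dtype=torch.float16)
x = torch.randn(1, 4096, dtype=torch.float16)
# On-the-fly dequantization to FP16, followed by the usual FP16 matmul.
# A naive kernel pays for this extra work; an optimized kernel hides it behind memory transfers.
w_fp16 = (q_weight.to(torch.float16) - zero) * scale
y = x @ w_fp16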
In this article, I will delve into how Marlin accomplishes this speed enhancement. We will also explore the process of converting existing GPTQ models to the Marlin format, using Mistral 7B as our case study, and evaluate the inference speed using vLLM.
Section 1.1: Understanding Marlin's Optimizations
As of this writing, Marlin has not been detailed in any academic paper, but a comprehensive README.md is available on Marlin's GitHub repository explaining its functionality: IST-DASLab/marlin (Apache 2.0 license).
Modern GPUs can perform roughly 100–200 times more arithmetic operations per second than they can load values from memory, so LLM inference at small batch sizes is limited by how fast the weights can be streamed, not by compute. By storing the weights in 4 bits (INT4) instead of half precision (FP16), we move four times less data, so inference can in theory become four times faster. Realizing this potential is challenging, however: it requires exploiting the GPU's memory hierarchy and keeping all of its cores busy at the same time.
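As a rough back-of-the-envelope illustration (the bandwidth figure below is my own assumption for an A100, not a number from the Marlin README), here is why a memory-bound decoding step can be close to 4 times faster with 4-bit weights:
# Rough arithmetic for a memory-bound 7B-parameter model at batch size 1.
params = 7e9
bytes_fp16 = params * 2      # 2 bytes per weight
bytes_int4 = params * 0.5    # 0.5 byte per weight (ignoring scales and zero-points)
bandwidth = 2e12             # ~2 TB/s, roughly an A100 80GB's memory bandwidth (assumption)
# Each decoding step is dominated by streaming all the weights once.
t_fp16 = bytes_fp16 / bandwidth
t_int4 = bytes_int4 / bandwidth
print(f"FP16: {t_fp16*1e3:.1f} ms/step, INT4: {t_int4*1e3:.1f} ms/step, "
      f"theoretical speedup: {t_fp16/t_int4:.1f}x")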
Marlin addresses these challenges through a variety of optimizations. It ensures efficient data retrieval from the GPU's L2 cache and maximizes data reuse, thereby minimizing delays caused by reloading data. Another significant enhancement is the use of double buffering, enabling data loading while computations are in progress, thus maintaining a smooth workflow with no unnecessary interruptions.
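Double buffering is easier to picture with a toy example. The sketch below is ordinary Python with a thread pool, nothing like Marlin's actual CUDA code, but it shows the pattern: the next tile of data is being fetched while the current one is being processed, so the compute step never sits idle waiting for a load.
import time
from concurrent.futures import ThreadPoolExecutor

def load_tile(i):
    # Stands in for a global-memory to shared-memory copy.
    time.sleep(0.01)
    return i

def compute(tile):
    # Stands in for the INT4xFP16 tile multiplication.
    time.sleep(0.01)

num_tiles = 8
with ThreadPoolExecutor(max_workers=1) as pool:
    next_tile = pool.submit(load_tile, 0)
    for i in range(num_tiles):
        tile = next_tile.result()                      # tile requested one iteration ago
        if i + 1 < num_tiles:
            next_tile = pool.submit(load_tile, i + 1)  # start loading the next tile...
        compute(tile)                                  # ...while this one is being computed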
Moreover, Marlin strategically organizes the sequence of data dequantization and computation execution during inference. This meticulous planning, paired with careful data arrangement, ensures that all GPU components remain fully engaged.
Additionally, Marlin incorporates optimizations for multi-GPU scenarios, enhancing parallel processing without needing to load additional data simultaneously, which effectively distributes the workload across GPUs.
These various enhancements lead to an almost optimal utilization of GPU resources. Even with a batch size of 1, Marlin outperforms all existing frameworks, including ExLlamaV2 and AWQ, which already leverage custom kernels for faster inference. Remarkably, for larger batch sizes starting from 8, these frameworks lag behind FP16 inference, while Marlin continues to deliver nearly 4 times the speed.
Section 1.2: GPTQ Conversion for Fast Inference
Marlin is already compatible with AutoGPTQ (MIT license), one of the most widely used libraries for quantizing LLMs with GPTQ. For those interested in quantizing a model with GPTQ, I have previously written about this process here:
Note: I've also created a notebook that implements all the code discussed in this section. You can access it here: Get the notebook (#56).
Utilizing Marlin is straightforward; there is no need to quantize the model again if it has already been quantized. Instead, we simply reformat the model for Marlin compatibility. However, it’s important to note that Marlin is only supported on Ampere and newer GPUs (RTX 30xx/40xx, A100, etc.). To verify your GPU's compatibility, run the following code:
import torch
# Compute capability 8.0 or higher corresponds to Ampere or newer GPUs.
major_version, minor_version = torch.cuda.get_device_capability()
if major_version >= 8:
    print("Your GPU supports Marlin!")
else:
    print("Your GPU doesn't support Marlin... You need an Ampere GPU or more recent (RTX 30xx/40xx, A100, H100, ...)")
If your GPU is compatible, you can proceed with the installation of the required libraries:
pip install --upgrade transformers auto-gptq accelerate optimum
For demonstration purposes, I used my own GPTQ version of Mistral 7B: kaitchup/Mistral-7B-v0.1-gptq-4bit. To convert it to Marlin's format, load the model with AutoGPTQ, passing the argument use_marlin=True:
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

GPTQ_MODEL = "kaitchup/Mistral-7B-v0.1-gptq-4bit"

# Load the tokenizer so we can save it alongside the converted model later.
tokenizer = AutoTokenizer.from_pretrained(GPTQ_MODEL)
# use_marlin=True converts the GPTQ checkpoint to the Marlin format at load time.
marlin_model = AutoGPTQForCausalLM.from_quantized(
    GPTQ_MODEL,
    use_marlin=True,
    device_map='auto')
In most cases, this conversion also reduces the model's size on disk. If it succeeds, AutoGPTQ logs the progress of the conversion.
The conversion is fast, often taking less than a minute on Google Colab with an A100 GPU, and you can expect similar performance with an RTX GPU. Furthermore, you can save the model in this format to avoid converting it again in the future:
save_dir = "Mistral-7B-v0.1-gptq-marlin-4bit"
marlin_model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
You can find my model on the Hugging Face hub: kaitchup/Mistral-7B-v0.1-gptq-marlin-4bit.
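If you want to reuse the converted checkpoint later, the following sketch reloads it with AutoGPTQ and runs a quick generation. It assumes the save_dir from above (or my Hub repository) contains both the Marlin weights and the tokenizer saved earlier.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

save_dir = "Mistral-7B-v0.1-gptq-marlin-4bit"  # or "kaitchup/Mistral-7B-v0.1-gptq-marlin-4bit"
tokenizer = AutoTokenizer.from_pretrained(save_dir)
model = AutoGPTQForCausalLM.from_quantized(save_dir, use_marlin=True, device_map='auto')

inputs = tokenizer("Tell me about gravity.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))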
Section 1.3: Benchmarking Inference Speed
I employed vLLM (Apache 2.0 license) to assess the inference speed of the Marlin model and compared it with the original GPTQ model (i.e., the non-converted version). I experimented with batch sizes of 1, 2, 4, 8, 16, 32, 64, and 128. According to the authors of Marlin, the acceleration benefits of Marlin should become increasingly evident with larger batch sizes.
The code used for this benchmarking is as follows:
import time
from vllm import LLM, SamplingParams

batch_sizes = [1, 2, 4, 8, 16, 32, 64, 128]

p = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n\nTell me about gravity."
sampling_params = SamplingParams(max_tokens=1000)

loading_start = time.time()
llm = LLM(model="kaitchup/Mistral-7B-v0.1-gptq-4bit")
print("--- Loading time: %s seconds ---" % (time.time() - loading_start))

for b in batch_sizes:
    prompts = [p] * b
    generation_time = time.time()
    outputs = llm.generate(prompts, sampling_params)
    duration = time.time() - generation_time
    # Count both prompt and generated tokens to compute the overall throughput.
    total_tokens = sum(len(output.prompt_token_ids) + len(output.outputs[0].token_ids) for output in outputs)
    print('\nBatch size: ' + str(b))
    print("--- Speed: %s tokens/second ---" % (round(total_tokens / duration, 2)))
The results (displayed on a log scale) indicate that Marlin outperforms vanilla GPTQ, particularly with larger batch sizes.
Note: Inference speed can fluctuate considerably between runs. The averages presented were obtained by executing the decoding process five times for each batch size.
Main Observations:
- Marlin demonstrates faster performance, but vLLM only takes full advantage of it for batch sizes exceeding 8.
- The performance gap between Marlin and vanilla GPTQ widens with larger batch sizes.
- Notably, vLLM is already highly optimized for single-pass decoding (batch size = 1). Batching with a size of 2 can be twice as slow as single-pass decoding.
Section 1.4: Conclusion
Marlin is designed to make 4-bit model inference nearly four times quicker than that of FP16 models. The conversion of a GPTQ model to Marlin is both swift and straightforward. While there may be challenges in achieving the maximum acceleration claimed by Marlin's developers, significant speedups are evident with batch sizes of 8 or more. It's conceivable that smaller batch sizes could yield even greater speeds with a framework and GPTQ model optimized for Marlin.
If you are utilizing GPTQ models, there seems to be no reason to avoid incorporating Marlin moving forward. The Marlin format should soon be accessible for additional quantization algorithms.
To stay updated with my work, consider subscribing to my newsletter for more articles and tutorials on the latest developments in AI: