Improving Machine Learning Performance with CPUs Amid GPU Shortages
Training AI Models on CPUs
Reassessing the Role of CPUs in Machine Learning During GPU Shortages
The recent advancements in AI are frequently linked to the rise and development of GPUs. Their architecture, which usually incorporates thousands of parallel processing cores, high-speed memory, and specialized tensor cores, is particularly well suited to the demanding requirements of AI and machine learning tasks. However, the explosive growth in AI research has led to increased demand for GPUs, making them hard to acquire. Consequently, machine learning practitioners are now investigating alternative hardware solutions for training and deploying their models. Previous discussions have highlighted dedicated AI ASICs such as Google Cloud TPU, Habana Gaudi, and AWS Trainium as potential substitutes. While these alternatives can significantly reduce costs, they may not be suitable for all ML models and can also face availability issues similar to GPUs. In this article, we turn our attention back to the traditional CPU and assess its relevance in ML applications. Although CPUs are typically less efficient for ML tasks than GPUs, their ready availability can greatly enhance development productivity.
In earlier discussions, we underscored the necessity of analyzing and optimizing the runtime performance of AI/ML workloads to speed up development and reduce costs. This remains essential across all computing platforms, although the profiling tools and optimization strategies can differ significantly. In this article, we will explore performance optimization strategies specifically for CPUs, concentrating on Intel® Xeon® processors (with Intel® AVX-512) and the PyTorch (version 2.4) framework, although similar techniques can be applied to other CPUs and frameworks. Our experiments will be conducted on an Amazon EC2 c7i instance using an AWS Deep Learning AMI. Please do not interpret our choice of cloud platform, CPU generation, ML framework, or any other tool or library mentioned here as an endorsement over their alternatives.
Our aim is to illustrate that while CPU-based ML development might not be the first option, there are strategies available to "soften the blow" and, in certain cases, make it a viable choice.
Disclaimers
This article aims to showcase a few ML optimization opportunities available on CPUs. Unlike the majority of online tutorials, which focus on CPU optimization for inference, we will concentrate on training workloads. Numerous optimization tools designed specifically for inference will not be addressed here.
This article should not be seen as a substitute for the official documentation of any tools or techniques mentioned. Given the rapid evolution of AI/ML, some content, libraries, and/or instructions may become outdated by the time of your reading. Please refer to the latest documentation available.
It's important to note that the effect of the optimizations discussed on runtime performance is likely to differ significantly based on the model and the specific environment (for example, see the performance variance among models on the official PyTorch TorchInductor CPU Inference Performance Dashboard). The performance metrics shared here are particular to the toy model and runtime environment we used. Be sure to evaluate all proposed optimizations on your own model and runtime setup.
Finally, our focus will be solely on throughput performance (measured in samples per second) rather than training convergence. However, some optimization techniques (like tuning batch size, mixed precision, etc.) might negatively impact the convergence of certain models, although appropriate hyperparameter tuning could mitigate this.
Toy Example — ResNet-50
We will conduct our experiments using a basic image classification model with a ResNet-50 backbone (from Deep Residual Learning for Image Recognition). The model will be trained on a synthetic dataset. Below is the complete training script (loosely based on an earlier example):
import torch
import torchvision
from torch.utils.data import Dataset, DataLoader
import time
# A dataset with random images and labels
class FakeDataset(Dataset):
    def __len__(self):
        return 1000000

    def __getitem__(self, index):
        rand_image = torch.randn([3, 224, 224], dtype=torch.float32)
        label = torch.tensor(data=index % 10, dtype=torch.uint8)
        return rand_image, label
train_set = FakeDataset()
batch_size = 128
num_workers = 0
train_loader = DataLoader(
    dataset=train_set,
    batch_size=batch_size,
    num_workers=num_workers
)
model = torchvision.models.resnet50()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters())
model.train()
t0 = time.perf_counter()
summ = 0
count = 0
for idx, (data, target) in enumerate(train_loader):
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
    batch_time = time.perf_counter() - t0
    if idx > 10:  # skip first steps
        summ += batch_time
        count += 1
    t0 = time.perf_counter()
    if idx > 100:
        break

print(f'average step time: {summ/count}')
print(f'throughput: {count*batch_size/summ}')
When executed on a c7i.2xlarge instance (with 8 vCPUs) and the CPU version of PyTorch 2.4, this script yields a throughput of 9.12 samples per second. For comparison, the same (unoptimized) script on an Amazon EC2 g5.2xlarge instance (with 1 GPU and 8 vCPUs) achieves a throughput of 340 samples per second. Considering the cost difference between these two instance types ($0.357 per hour for c7i.2xlarge and $1.212 for g5.2xlarge at the time of writing), training on the GPU instance delivers approximately eleven times better price performance. Hence, the preference for GPUs in training ML models is well justified. Let's explore some methods to narrow this gap.
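To spell out the arithmetic behind that comparison: 340 samples per second at $1.212 per hour versus 9.12 samples per second at $0.357 per hour gives a price-performance ratio of (340 / 1.212) / (9.12 / 0.357) ≈ 11 in favor of the GPU instance.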
PyTorch Performance Optimizations
In this section, we will examine several fundamental techniques for enhancing the runtime performance of our training workload. While some of these may seem familiar from our GPU optimization discussions, it's crucial to note a key difference: on GPU platforms, significant effort was spent maximizing parallelization between CPU (for data preprocessing) and GPU (for model training). Conversely, on CPU platforms, all processing occurs on the CPU, necessitating effective resource allocation.
Batch Size
Increasing the training batch size can potentially improve performance by reducing the frequency of model parameter updates. (On GPUs, it also minimizes overhead from CPU-GPU transactions, such as kernel loading.) However, while on GPUs we aimed for a batch size that maximizes memory usage, the same strategy can hinder performance on CPUs. CPU memory management behaves differently, and finding the optimal batch size may require trial and error. Remember that adjusting the batch size can also affect training convergence.
The following table summarizes the throughput of our training workload with various (arbitrary) batch sizes:
Unlike our GPU findings, our model on the c7i.2xlarge instance appears to favor smaller batch sizes.
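For reference, the kind of sweep behind these measurements might look like the following minimal sketch, reusing the definitions from the training script above. The candidate batch sizes and the run_steps helper are ours, for illustration only:

# Hypothetical sketch: measure throughput for a few candidate batch sizes.
# run_steps is assumed to wrap the timed training loop shown earlier and
# return the average step time after discarding warm-up steps.
for batch_size in [32, 64, 128, 256, 512]:
    train_loader = DataLoader(
        dataset=train_set,
        batch_size=batch_size,
        num_workers=num_workers
    )
    avg_step_time = run_steps(model, optimizer, criterion, train_loader)
    print(f'batch size {batch_size}: '
          f'{batch_size / avg_step_time:.2f} samples/sec')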
Multi-process Data Loading
A common practice on GPUs involves assigning multiple processes to the data loader to reduce the risk of GPU starvation. On GPU setups, a general rule is to match the number of workers to the number of CPU cores. However, on CPU platforms, where model training shares resources with the data loader, this approach might be counterproductive. Again, the optimal number of workers may be determined through trial and error. The table below shows average throughput for different num_workers settings:
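For reference, the only change relative to the baseline script is the num_workers argument of the DataLoader, for example:

# Assign two worker processes to data loading (the value is arbitrary;
# tune it for your own workload and core count).
train_loader = DataLoader(
    dataset=train_set,
    batch_size=batch_size,
    num_workers=2
)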
Mixed Precision
Utilizing lower-precision floating point data types such as torch.float16 or torch.bfloat16 is another common technique. The dynamic range of torch.bfloat16 is generally considered better suited for ML training than that of torch.float16. Reducing datatype precision can negatively affect convergence, so this should be approached cautiously. PyTorch provides torch.amp, an automatic mixed precision package, for optimizing the use of these datatypes. Intel® AVX-512 includes support for the bfloat16 datatype. The modified training step is as follows:
for idx, (data, target) in enumerate(train_loader):
    optimizer.zero_grad()
    with torch.amp.autocast('cpu', dtype=torch.bfloat16):
        output = model(data)
        loss = criterion(output, target)
    loss.backward()
    optimizer.step()
Post-optimization, the throughput rises to 24.34 samples per second, marking an 86% increase!
Channels Last Memory Format
The channels last memory format is a beta-level optimization (at the time of this writing) aimed primarily at vision models. It allows four-dimensional (NCHW) tensors to be stored with the channels as the innermost memory dimension, so that the channel values of each pixel are kept together, a layout that is more "friendly" to Intel platforms. This optimization has been reported to improve the performance of ResNet-50 on Intel® Xeon® CPUs. The updated training step is:
for idx, (data, target) in enumerate(train_loader):
    data = data.to(memory_format=torch.channels_last)
    optimizer.zero_grad()
    with torch.amp.autocast('cpu', dtype=torch.bfloat16):
        output = model(data)
        loss = criterion(output, target)
    loss.backward()
    optimizer.step()
The resulting throughput climbs to 37.93 samples per second, a further 56% improvement and roughly 4.15 times our baseline result.
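As a side note, the snippet above converts only the input batch; the official PyTorch channels last recipe additionally converts the model weights to the same memory format. Whether that helps in this setting is something to verify on your own workload; it amounts to a one-line addition:

# Optional variation (not part of the snippet above): store the model
# weights in channels last format as well
model = model.to(memory_format=torch.channels_last)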
Torch Compilation
In a previous discussion, we explored the benefits of PyTorch’s graph compilation support and its potential impact on runtime performance. Unlike the default eager execution mode, where each operation executes independently, the compile API transforms the model into an intermediate computation graph that is JIT-compiled into low-level machine code, optimized for the underlying training engine. The API supports compilation through various backend libraries and multiple configuration options. Here, we will focus on the default (TorchInductor) backend and the ipex backend from the Intel® Extension for PyTorch, which includes specific optimizations for Intel hardware. The updated model definition is:
import intel_extension_for_pytorch as ipex
model = torchvision.models.resnet50()
backend = 'inductor' # can optionally switch to 'ipex'
model = torch.compile(model, backend=backend)
For our toy model, the impact of torch compilation is noticeable only when the "channels last" optimization is disabled (around a 27% increase for each backend). When "channels last" is applied, performance actually declines. Thus, we will exclude this optimization from our subsequent tests.
Memory and Thread Optimizations
There are numerous opportunities to enhance the utilization of underlying CPU resources, including optimizing memory management and thread allocation according to the CPU architecture. Memory management can benefit from advanced memory allocators (like Jemalloc and TCMalloc) and minimizing slower memory accesses (i.e., across NUMA nodes). Thread allocation can be improved by properly configuring the OpenMP threading library and/or utilizing Intel’s OpenMP library.
Generally, these optimizations necessitate a comprehensive understanding of the CPU architecture and its supporting software stack. To simplify this, PyTorch provides the torch.backends.xeon.run_cpu script, which automatically configures the memory and threading libraries for optimal runtime performance. Before running the command below, verify that TCMalloc (conda install conda-forge::gperftools) and Intel's OpenMP library (pip install intel-openmp) are installed:
python -m torch.backends.xeon.run_cpu train.py
Employing the run_cpu script boosts our runtime performance to 39.05 samples per second. Note that this script offers many options for further performance tuning; consult the documentation to maximize its utility.
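To browse the full set of options the launcher exposes, you can print its built-in help text:

python -m torch.backends.xeon.run_cpu --help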
The Intel Extension for PyTorch
The Intel® Extension for PyTorch presents additional optimization opportunities for training via its ipex.optimize function. Below is a demonstration of its default application. Refer to the documentation for its full capabilities.
import intel_extension_for_pytorch as ipex

model = torchvision.models.resnet50()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters())
model.train()
model, optimizer = ipex.optimize(
    model,
    optimizer=optimizer,
    dtype=torch.bfloat16
)
When combined with the aforementioned memory and thread optimizations, the throughput reaches 40.73 samples per second. (This result is similar when the "channels last" configuration is disabled.)
Distributed Training on CPU
Intel® Xeon® processors are designed with Non-Uniform Memory Access (NUMA), where CPU memory is partitioned into groups, or NUMA nodes, with each CPU core assigned to one node. While any CPU core can access memory from any NUMA node, access to its local memory is significantly faster. This opens up the possibility of distributing training across NUMA nodes, treating CPU cores assigned to each NUMA node as a single process in a distributed process group, with data distribution managed by Intel® oneCCL, Intel’s dedicated collective communications library.
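Before adapting the script, it can be useful to check how many NUMA nodes your instance actually exposes. A quick (system-dependent) way to do this on Linux is:

lscpu | grep -i numa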
We can easily run data distributed training across NUMA nodes using the ipexrun utility. Below is an adaptation of our script for data distributed training (based on this example):
import os, time
import torch
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.distributed import DistributedSampler
import torch.distributed as dist
import torchvision
import oneccl_bindings_for_pytorch as torch_ccl
import intel_extension_for_pytorch as ipex
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"
os.environ["RANK"] = os.environ.get("PMI_RANK", "0")
os.environ["WORLD_SIZE"] = os.environ.get("PMI_SIZE", "1")
dist.init_process_group(backend="ccl", init_method="env://")
rank = os.environ["RANK"]
world_size = os.environ["WORLD_SIZE"]
batch_size = 128
num_workers = 0
# define dataset and dataloader
class FakeDataset(Dataset):
    def __len__(self):
        return 1000000

    def __getitem__(self, index):
        rand_image = torch.randn([3, 224, 224], dtype=torch.float32)
        label = torch.tensor(data=index % 10, dtype=torch.uint8)
        return rand_image, label
train_dataset = FakeDataset()
dist_sampler = DistributedSampler(train_dataset)
train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    sampler=dist_sampler
)
# define model artifacts
model = torchvision.models.resnet50()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters())
model.train()
model, optimizer = ipex.optimize(
    model,
    optimizer=optimizer,
    dtype=torch.bfloat16
)
# configure DDP
model = torch.nn.parallel.DistributedDataParallel(model)
# run training loop
# destroy the process group
dist.destroy_process_group()
Regrettably, the Amazon EC2 c7i instance family does not currently offer multi-NUMA instance types. To test our distributed training script, we revert to an Amazon EC2 c6i.32xlarge instance with 128 vCPUs and 2 NUMA nodes. After confirming the installation of Intel® oneCCL Bindings for PyTorch, we execute the following commands:
source $(python -c "import oneccl_bindings_for_pytorch as torch_ccl;print(torch_ccl.cwd)")/env/setvars.sh
# This command will utilize all the NUMA sockets of the processor, treating each socket as a rank.
ipexrun cpu --nnodes 1 --omp_runtime intel train.py
The following table compares performance results on the c6i.32xlarge instance with and without distributed training:
In our tests, data distribution did not enhance runtime performance. Refer to the ipexrun documentation for further performance tuning options.
CPU Training with Torch/XLA
Earlier discussions have covered the PyTorch/XLA library and its utilization of XLA compilation to facilitate PyTorch-based training on XLA devices such as TPU, GPU, and CPU. Similar to torch compilation, XLA employs graph compilation to create machine code optimized for the target device. The establishment of the OpenXLA Project aims to ensure high performance across all hardware backends, including CPUs (see the CPU RFC here). The following code snippet illustrates the modifications needed for our original (unoptimized) script to train using PyTorch/XLA:
import torch
import torchvision
import time
import torch_xla
import torch_xla.core.xla_model as xm
device = xm.xla_device()
model = torchvision.models.resnet50().to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters())
model.train()
for idx, (data, target) in enumerate(train_loader):
    data = data.to(device)
    target = target.to(device)
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
    xm.mark_step()
Unfortunately, the XLA results on our toy model appear significantly inferior to the unoptimized outcomes we observed earlier (by as much as 7X). We anticipate improvements as PyTorch/XLA’s CPU support evolves.
Results
We summarize the findings from a subset of our experiments in the table below. For comparison, we include the throughput of training our model on an Amazon EC2 g5.2xlarge GPU instance, following the optimization steps discussed in this article. The samples per dollar metric is calculated based on the Amazon EC2 On-demand pricing page ($0.357 per hour for c7i.2xlarge and $1.212 for g5.2xlarge at the time of writing).
While we significantly enhanced the training performance of our toy model on the CPU instance (to roughly 4.5 times the baseline throughput), it still lags behind the optimized performance on the GPU instance. Our analysis suggests that training on a GPU would be approximately 6.7 times more economical. It's likely that further performance tuning and/or additional optimization strategies could narrow this gap further. Again, we stress that the comparative performance results presented here are unique to this model and runtime environment.
Amazon EC2 Spot Instances Discounts
The growing availability of cloud-based CPU instance types (relative to GPU instance types) may present greater opportunities for acquiring compute at discounted rates, for example through Spot Instance usage. Amazon EC2 Spot Instances are drawn from surplus cloud capacity and offered at discounts of up to 90% off On-Demand pricing; in exchange for the lower price, AWS reserves the right to preempt the instance with little or no warning. Given the high demand for GPUs, CPU Spot Instances may be easier to procure than their GPU counterparts. At the time of writing, the Spot Instance price for c7i.2xlarge is $0.1291 per hour, which improves our samples per dollar result to 1135.76 and further narrows the price-performance gap between optimized GPU and CPU training (to 2.43X).
Although the runtime performance results for the optimized CPU training of our toy model (and the chosen environment) fell short of GPU results, it’s plausible that applying the same optimization steps to different model architectures (e.g., those incorporating components unsupported by GPUs) could lead to CPU performance matching or exceeding that of GPUs. Additionally, in scenarios where GPU compute capacity is limited, it may be justifiable to run certain ML workloads on CPUs.
Summary
Considering the widespread availability of CPUs, effectively utilizing them for training or executing ML workloads could significantly impact development productivity and deployment strategies for end products. While CPU architecture may be less suited for many ML applications compared to GPUs, numerous tools and techniques exist to enhance its performance—some of which we have discussed and demonstrated in this article.
This post concentrated on optimizing CPU training. We encourage you to explore our many other articles covering a diverse range of topics related to performance analysis and optimization in machine learning workloads.