Optimizing Deep Learning Training for Maximum GPU Utilization

One of the most common questions from new data scientists and ML engineers is whether their deep learning training process is running optimally. In this guide, we will learn how to diagnose and fix deep learning performance issues, whether we are working on a single machine or across several. The goal is to make practical, effective use of the wide variety of cloud GPUs available.

We will start by understanding what GPU utilization is, and we’ll finish by discussing the optimal batch size for maximum GPU utilization.

Note: This guide assumes a basic understanding of the Linux operating system and the Python programming language. Most cloud GPU instances come with a recent Ubuntu release pre-installed, so we can go ahead and install pip and conda, as we will use them here.

Prerequisites

In order to follow along with this article, you will need experience with Python code and a beginner’s understanding of Deep Learning. We will operate under the assumption that all readers have access to sufficiently powerful machines, so they can run the code provided.

What is GPU Utilization?

In machine learning and deep learning training sessions, GPU utilization is one of the most important metrics to observe, and it can be monitored with the GPU vendor's built-in tools (such as nvidia-smi for NVIDIA GPUs) as well as third-party monitoring tools.

We can define GPU utilization as the percentage of time, over the last sample period, during which one or more kernels were executing on the GPU. In other words, it tells us how busy the GPU is being kept by the deep learning program.
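
For example, on a machine with the NVIDIA driver installed, the built-in nvidia-smi tool reports this metric directly. The following command polls utilization and memory usage once per second:

    nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used,memory.total --format=csv -l 1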

How Do You Know You Need More GPU Compute?

Let us look at a real scenario here.

On a typical day, a data scientist gets two GPUs to work with – these "should" be sufficient resources. During most of the build phase there is no problem, since the workload consists of short GPU cycles and the workflow is smooth. Then the training phase kicks in, and suddenly the workflow demands additional GPU compute that is not readily available.

This means that more compute resources will be required to do any significant work. The following tasks, in particular, become impossible once all available GPU memory is allocated:

  • Run more experiments
  • Run multi-GPU training to speed up experiments with larger batch sizes and reach higher model accuracy
  • Work on a new model while existing training runs proceed independently

Benefits of GPU Utilization

In general, such upgrades translate into roughly a twofold increase in hardware utilization and a 100% increase in model training speed.

Monitoring GPU utilization enables us to manage resource allocation more efficiently, ultimately reducing GPU idle time and increasing cluster utilization.

From the perspective of a deep learning specialist, consuming more GPU compute power provides room for running more experiments, improving productivity, and enhancing model quality.

Additionally, IT administrators can run distributed training models using multiple GPUs.

The Optimal Batch Size for GPU Utilization

Batch size is a perennial source of confusion because there is no single "best" batch size for a given dataset and model architecture. A larger batch size trains faster and consumes more memory, but it may show lower accuracy in the end. First, let us understand what a batch size is and why we need it.

What is a Batch Size?

When training a deep learning neural network, it is crucial to define a batch size. Simply put, batch size refers to the number of training samples that are processed by the network in a single forward and backward pass.

Choosing the right batch size is important because it affects training speed, memory usage, and model performance. The batch size must be carefully tuned to ensure efficient learning while making the best use of available hardware.

Example: Understanding Batch Size

To better illustrate the concept, let’s consider a scenario where we are training a neural network to classify different cat breeds. Suppose we have a dataset consisting of 1,000 cat images.

Now, if we select a batch size of 10, it means that during training, the model will process 10 images at a time before updating its weights. Instead of feeding all 1,000 images at once, they are divided into smaller, more manageable batches.

Without batching, we could theoretically pass each image individually, but this would be inefficient in terms of computational time and learning performance. Conversely, attempting to process all 1,000 images simultaneously may exceed the system’s memory capacity.
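
As a quick sketch of this arithmetic, with 1,000 images and a batch size of 10, each epoch consists of 100 forward/backward passes, and therefore 100 weight updates:

    import math

    num_images = 1000    # total training samples (the cat-image example above)
    batch_size = 10      # samples processed per forward/backward pass

    steps_per_epoch = math.ceil(num_images / batch_size)  # ceil covers a non-divisible last batch
    print(steps_per_epoch)  # 100 weight updates per epoch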

Why Use Batches?

As previously mentioned, a larger batch size helps a model complete each epoch more quickly during training. This is because modern computing resources allow machines to process multiple samples simultaneously instead of handling one data point at a time.

However, even if our hardware is capable of handling very large batches, there are trade-offs. As batch size increases, the model’s ability to generalize on unseen data may decrease, leading to potential overfitting. This is why choosing an appropriate batch size is critical.

Batch size is an essential hyperparameter that must be fine-tuned based on the model’s performance during training. Additionally, the way a machine utilizes the GPU with different batch sizes should be carefully monitored.

Example of GPU Utilization and Batch Size

For instance, if we set the batch size to a high number, such as 100, our system might lack the capacity to process all 100 images simultaneously. If GPU memory is exhausted, that is a clear sign the batch size needs to be reduced to avoid resource bottlenecks.
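
A practical way to find that limit is to attempt a single training step and back off when the GPU runs out of memory. The sketch below is purely illustrative; `make_batch` is a hypothetical helper that returns a batch of the requested size, and `model` is assumed to already live on the GPU:

    import torch

    def fits_in_memory(model, make_batch, batch_size, device="cuda"):
        """Try one forward/backward pass; return False if the GPU runs out of memory."""
        try:
            inputs, labels = make_batch(batch_size)      # hypothetical helper
            outputs = model(inputs.to(device))
            loss = torch.nn.functional.cross_entropy(outputs, labels.to(device))
            loss.backward()
            return True
        except RuntimeError as e:                        # CUDA OOM surfaces as a RuntimeError
            if "out of memory" in str(e):
                torch.cuda.empty_cache()                 # release cached memory before retrying
                return False
            raise

    # Halve the batch size until a step fits, e.g. 100 -> 50 -> 25 ...
    # while batch_size > 1 and not fits_in_memory(model, make_batch, batch_size):
    #     batch_size //= 2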

Now that we have a foundational understanding of batch size, let’s explore how to determine the optimal batch size using PyTorch and Keras.

Find the Right Batch Size Using PyTorch

In this section, we will demonstrate how to determine the best batch size for training a ResNet18 model. To measure performance and GPU utilization, we will leverage the PyTorch Profiler.

Using TensorBoard to Monitor Performance

To provide a more in-depth analysis, we will integrate PyTorch Profiler with TensorBoard, enabling us to visualize performance metrics during training.

Follow Along with This Demo

On your cloud GPU-powered machine, use wget to download the corresponding notebook. Then, open the notebook using Jupyter Lab by executing the following commands:

    wget https://raw.githubusercontent.com/gradient-ai/batch-optimization-DL/refs/heads/main/notebook.ipynb
    jupyter lab

Setup and Preparation of Data and Model

To begin, install the required dependencies for PyTorch, torchvision, and the PyTorch Profiler:

    pip3 install torch torchvision torch-tb-profiler
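
Before loading any data, it is worth a quick check that PyTorch can actually see the GPU:

    import torch

    print(torch.__version__)
    print(torch.cuda.is_available())      # should print True on a GPU machine
    print(torch.cuda.get_device_name(0))  # name of the attached GPU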

Loading the Dataset and Training the Model

The following code loads the CIFAR-10 dataset and fine-tunes a pre-trained ResNet18 model using transfer learning.

    import torch
    import torch.nn
    import torch.optim
    import torch.profiler
    import torch.utils.data
    import torchvision.datasets
    import torchvision.models
    import torchvision.transforms as T

    # Prepare input data and transformation
    transform = T.Compose([
        T.Resize(224),
        T.ToTensor(),
        T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

    # Load CIFAR-10 dataset
    train_set = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
    # Start with a batch size of 1; larger batch sizes can be compared later using the profiler results
    train_loader = torch.utils.data.DataLoader(train_set, batch_size=1, shuffle=True, num_workers=4)

    # Initialize model, loss function, and optimizer
    device = torch.device("cuda:0")
    model = torchvision.models.resnet18(pretrained=True).cuda(device)
    criterion = torch.nn.CrossEntropyLoss().cuda(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    model.train()

    # Define training function
    def train(data):
        inputs, labels = data[0].to(device), data[1].to(device)
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Enabling Profiling for Performance Optimization

We have successfully set up our basic model; now we will enable some optional features in the profiler to record more information during the training process. Let's include the following parameters:

  • schedule – a callable that takes the current step (int) and returns the ProfilerAction to perform at that step; here, torch.profiler.schedule is used to wait, warm up, and then actively record for a set number of steps.
  • profile_memory – records tensor memory allocations and releases on the GPU; setting it to True may add extra profiling overhead.
  • with_stack – records source information (file and line number) for the traced operations.

Now that we understand these terms, we can return to the code:

    with torch.profiler.profile(
        schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/resnet18_batchsize1'),
        record_shapes=True,
        profile_memory=True,
        with_stack=True
    ) as prof:
        for step, batch_data in enumerate(train_loader):
            if step >= (1 + 1 + 3) * 2:
                break
            train(batch_data)
            prof.step() # Notify profiler of step boundary
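
Once these steps have run, the traces are written to ./log (the directory passed to tensorboard_trace_handler). The torch-tb-profiler package installed earlier adds a PyTorch Profiler view to TensorBoard, so we can inspect GPU utilization, step time, and memory usage by launching:

    tensorboard --logdir=./log

To compare batch sizes, the same profiling loop can simply be repeated with different DataLoader settings, writing each run to its own log directory so the runs appear side by side in TensorBoard. This is a sketch rather than part of the downloaded notebook; the batch size values are illustrative:

    # Repeat the profiling run for several candidate batch sizes
    for bs in [1, 8, 32]:
        train_loader = torch.utils.data.DataLoader(
            train_set, batch_size=bs, shuffle=True, num_workers=4)

        with torch.profiler.profile(
            schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
            on_trace_ready=torch.profiler.tensorboard_trace_handler(f'./log/resnet18_batchsize{bs}'),
            record_shapes=True,
            profile_memory=True,
            with_stack=True
        ) as prof:
            for step, batch_data in enumerate(train_loader):
                if step >= (1 + 1 + 3) * 2:
                    break
                train(batch_data)
                prof.step()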

Find the Right Batch Size Using Keras

We are going to use an arbitrary Sequential model in this case.

Defining the Model

Below is an example of a basic Sequential model in Keras, together with the imports it needs:

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    from tensorflow.keras import regularizers

    model = Sequential([
        Dense(units=16, input_shape=(1,), activation='relu'),
        Dense(units=32, activation='relu', kernel_regularizer=regularizers.l2(0.01)),
        Dense(units=2, activation='sigmoid')
    ])

Training the Model with a Defined Batch Size

The batch_size parameter determines how many samples are processed at once before the model's weights are updated. In the code below, we compile the model and then train it with batch_size=10, meaning 10 samples are processed in a single step; the optimizer and loss passed to compile are illustrative choices.

    # Compile before fitting (illustrative optimizer and loss)
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

    model.fit(
        x=scaled_train_samples,
        y=train_labels,
        validation_data=valid_set,
        batch_size=10,
        epochs=20,
        shuffle=True,
        verbose=2
    )

Batch Size Trade-offs

A larger batch size allows a model to complete each epoch faster, as multiple data points are processed simultaneously. However, it also has drawbacks:

  • Memory Usage: Larger batches require more GPU memory.
  • Model Accuracy: Excessively large batch sizes may reduce accuracy and generalization.
  • GPU Utilization: Proper batch tuning ensures balanced GPU workload.

For example, setting a batch size of 100 may exceed GPU capacity, requiring a reduction.

Important Things to Pay Attention to

When performing multi-GPU training, batch size affects speed, memory usage, convergence, and model stability. Below are the critical aspects to consider:

1. Speed and Memory

Training with larger batch sizes generally improves speed. However, small batches incur more overhead due to frequent weight updates. Large batch sizes also demand more GPU memory, leading to potential out-of-memory errors.

2. Model Convergence

Batch size affects how well a model generalizes. If using stochastic gradient descent (SGD) or its variants, larger batches may result in suboptimal generalization. Common batch sizes in computer vision range from 32 to 512 images.

3. Potential GPU Issues

During multi-GPU training, improper batch sizing can lead to problems:

  • If the dataset size is not divisible by the batch size, the last batch may contain fewer samples.
  • Certain layers, such as Batch Normalization, may fail when handling inconsistent batch sizes, leading to NaN values.

To avoid this, we recommend keeping the batch size fixed and consistent across all GPUs. If a batch does not match the expected size, it can either be discarded or padded with repeated samples.
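
In PyTorch, the simplest way to discard an incomplete final batch is the DataLoader's drop_last flag. A minimal sketch, reusing the train_set from the earlier example with an illustrative per-GPU batch size:

    from torch.utils.data import DataLoader

    train_loader = DataLoader(
        train_set,
        batch_size=32,     # illustrative per-GPU batch size
        shuffle=True,
        num_workers=4,
        drop_last=True,    # discard the final batch if it has fewer than 32 samples
    )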

How Incomplete Batches Can Corrupt Training

During multi-GPU training, the last batch of an epoch may have fewer samples than expected. If this happens:

  • Some GPUs may receive no data.
  • Batch Normalization layers may fail, producing NaN values.
  • Training might appear to proceed normally while silently corrupting the model weights.

To prevent this:

  • Ensure batch sizes are fixed.
  • Reject incomplete batches or pad them to fit the required size.
  • Use batch sizes larger than the total number of GPUs.

Conclusion

In this article, we explored various methods to maximize GPU utilization by determining the optimal batch size. Proper batch size selection enhances training efficiency, memory usage, and overall model performance.

As long as you choose a reasonable batch size (16 or more) and keep the number of iterations and epochs the same, the batch size itself will have minimal impact on final performance. Training time, however, will be affected by the batch size.

For multi-GPU training, it is recommended to use the smallest batch size possible per GPU to ensure that each GPU can train at full capacity. A good starting point is 16 samples per GPU to achieve a balance between speed and performance.
