Guide to Running Large Language Models Using Ollama on H100 GPUs

This article is a guide to running large language models with Ollama on NVIDIA H100 GPUs. With support for H100 GPUs, users can accelerate AI/ML development, testing, deployment, and optimization seamlessly, without the extensive setup or maintenance typically associated with traditional platforms.

Ollama is an open source tool which provides access to a diverse library of pre-trained models, offers effortless installation and setup across different operating systems, and exposes a local API for seamless integration into applications and workflows. Users can customize and fine-tune LLMs, optimize performance with hardware acceleration, and benefit from interactive user interfaces for intuitive interactions.

Prerequisites

  • Access to H100 GPUs: Ensure you have access to NVIDIA H100 GPUs, either through on-premise hardware or cloud VMs equipped with GPUs.
  • Python and Linux Basics: Familiarity with Python and common Linux commands.
  • CUDA and cuDNN Installed: Ensure NVIDIA CUDA and cuDNN libraries are installed for optimal GPU performance.
  • Sufficient Storage and Memory: Have ample storage and memory available to handle large model datasets and weights.
  • Basic Understanding of LLMs: A foundational understanding of large language models and their structure to effectively manage and optimize them.

These prerequisites help ensure a smooth and efficient experience when running LLMs with Ollama on H100 GPUs.

What is Ollama?

Ollama offers a way to download a large language model from its vast model library, which includes Llama 3.1, Mistral, Code Llama, Gemma, and many more. Ollama combines model weights, configuration, and data into one package, specified by a Modelfile.

Ollama provides a flexible platform for creating, importing, and using custom or pre-existing language models, ideal for creating chatbots, text summarization, and much more. It emphasizes privacy, integrates seamlessly with Windows, macOS, and Linux, and is free to use. Ollama also allows users to deploy models locally with ease. Further, the platform also supports real-time interactions via a REST API.
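The REST API mentioned above can be called from plain Python with nothing beyond the standard library. Below is a minimal sketch, assuming a local Ollama server listening on its default port, 11434, and a model (here llama3.1) that has already been pulled:

```python
import json
import urllib.request

# Ollama's default local endpoint for one-shot text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model, prompt):
    """Build the JSON body for a non-streaming /api/generate request."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model, prompt):
    """Send a prompt to a locally running Ollama server and return the text."""
    body = json.dumps(build_payload(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example usage (requires a running Ollama server and a pulled model):
# print(generate("llama3.1", "Explain what a Modelfile is in one sentence."))
```

With stream left at its default of true, Ollama instead returns one JSON object per generated token, which suits chat-style interfaces; setting it to false, as above, returns a single response.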

It’s perfect for LLM-powered web apps and tools, and it works much like Docker: just as Docker pulls images from a central hub and runs them in containers, Ollama pulls models from its library and runs them locally. Ollama also allows us to customize models by creating a Modelfile. Below is an example Modelfile:

FROM llama2

# Set the temperature
PARAMETER temperature 1

# Set the system prompt

SYSTEM """ 
You are a helpful teaching assistant created by DO. 
Answer questions asked based on Artificial Intelligence, Deep Learning. """ 

Next, create and run the custom model:

ollama create MLexp -f ./Modelfile
ollama run MLexp

The Power of NVIDIA H100 GPUs

The H100 is NVIDIA’s most powerful GPU, designed specifically for artificial intelligence applications. With 80 billion transistors (roughly 1.5 times as many as the A100’s 54 billion), it can process large datasets much faster than other GPUs on the market.

AI applications are data-hungry and computationally expensive. To manage workloads of this scale, the H100 is considered the best choice.

The H100 features fourth-generation tensor cores and a transformer engine with FP8 precision. The H100 triples the floating-point operations per second (FLOPS) compared to previous models, delivering 60 teraflops of double-precision (FP64) computing, which is crucial for precise calculations in HPC tasks. It can perform single-precision matrix-multiply operations at one petaflop throughput using TF32 precision without requiring any changes to existing code, making it user-friendly for developers.

The H100 introduces DPX instructions that significantly boost performance for dynamic programming tasks, achieving 7X better performance than the A100 and 40X faster than CPUs for specific algorithms like DNA sequence alignment.

H100 GPUs provide the necessary computational power, offering 3 terabytes per second (TB/s) of memory bandwidth per GPU. This high performance allows for efficient handling of large datasets.

The H100 supports scalability through technologies like NVLink and NVSwitch™, which allow multiple GPUs to work together effectively.

Why Run LLMs with Ollama on H100 GPUs?

To run Ollama efficiently and hassle-free, an NVIDIA GPU is strongly recommended; on a CPU alone, users can expect slow responses.

  • H100, due to its advanced architecture, offers exceptional computing power, which helps to significantly speed up the efficiency of LLMs.
  • Ollama lets users customize and fine-tune LLMs to meet their specific needs, enabling prompt engineering, few-shot learning, and tailored fine-tuning to align models with desired outcomes. Pairing Ollama with H100 GPUs enhances model inference and training times for developers and researchers.
  • H100 GPUs have the capacity to handle models such as Falcon 180B, which makes them ideal for creating and deploying Gen AI tools like chatbots or RAG applications.
  • H100 GPUs come with hardware optimizations like tensor cores, which significantly accelerate tasks involving LLMs, especially when dealing with matrix-heavy operations.

Setting Up Ollama with H100 GPUs

Ollama is compatible with Windows, macOS, and Linux. The examples here use Linux.

Run the command below in your terminal to check the GPU specification:

nvidia-smi

Next, install Ollama from the same terminal.

curl -fsSL https://ollama.com/install.sh | sh

This will instantly start the Ollama installation.

Once the installation is done, we can pull any LLM, such as Llama 3.1, Phi-3, Mistral, or Gemma 2, and start working with it.

To run and chat with a model, use the commands below; feel free to substitute any model that fits your requirements. Running a model with Ollama is straightforward, and on the powerful H100, response generation is fast and efficient.

ollama run example_model
ollama run qwen2:7b

Handling Connection Errors in Ollama

If you encounter the error “could not connect to ollama app, is it running?”, use the commands below to enable and start the Ollama service:

sudo systemctl enable ollama
sudo systemctl start ollama

Supported Models in Ollama

Ollama supports a wide list of models. Here are some example models that can be downloaded and used:

Model          Parameters   Size     Download Command
Llama 3.1      8B           4.7GB    ollama run llama3.1
Llama 3.1      70B          40GB     ollama run llama3.1:70b
Llama 3.1      405B         231GB    ollama run llama3.1:405b
Phi 3 Mini     3.8B         2.3GB    ollama run phi3
Phi 3 Medium   14B          7.9GB    ollama run phi3:medium
Gemma 2        27B          16GB     ollama run gemma2:27b
Mistral        7B           4.1GB    ollama run mistral
Code Llama     7B           3.8GB    ollama run codellama

With Ollama, once a model has been downloaded, users can run it without an internet connection, as the model and its dependencies are stored locally.
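The download sizes in the table above track the parameter counts closely, because Ollama's default downloads are quantized to roughly 4 to 5 bits per weight. A back-of-the-envelope sketch (the 4.7 bits-per-weight constant is an approximation, not an official Ollama figure) helps check whether a model fits in the H100's 80 GB of memory:

```python
def approx_size_gb(params_billion, bits_per_weight=4.7):
    """Rough size of a quantized model in gigabytes.

    A bits_per_weight of ~4.7 approximates 4-bit quantization plus
    per-block scaling overhead; this is an estimate, not an Ollama spec.
    """
    # 1 billion parameters at 8 bits per weight would be ~1 GB.
    return params_billion * bits_per_weight / 8

for params in (8, 70, 405):
    print(f"{params}B parameters -> about {approx_size_gb(params):.0f} GB")
```

The estimates land close to the table (8B gives about 4.7 GB, 70B about 41 GB), and they show why the 405B model at roughly 231GB exceeds a single H100's 80 GB and calls for a multi-GPU setup over NVLink.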

Python Code for Fibonacci Series

Below is a Python script to generate a Fibonacci sequence:

def fibonacci(n):
    """
    Print the first n numbers of the Fibonacci sequence.

    Parameters:
        n (int): The number of elements of the sequence to print.

    Returns:
        None
    """
    # Initialize the first two numbers of the Fibonacci sequence.
    a, b = 0, 1

    # Generate and print the sequence.
    for _ in range(n):
        print(a)
        # Advance to the next number in the sequence.
        a, b = b, a + b

# Test the function with the first 10 numbers of the Fibonacci sequence.
if __name__ == "__main__":
    fibonacci(10)

This Python code defines a simple fibonacci function that takes an integer argument and prints the first n numbers in the Fibonacci sequence. The Fibonacci sequence starts with 0 and 1, and each subsequent number is the sum of the previous two.

The if __name__ == "__main__": block at the end tests this function by calling it with a parameter value of 10, which prints out the first 10 numbers in the Fibonacci sequence.
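A small variant of the script above (not part of the original example) returns the values instead of printing them, which makes the function easier to reuse and test:

```python
def fibonacci_list(n):
    """Return the first n Fibonacci numbers as a list."""
    seq = []
    a, b = 0, 1
    for _ in range(n):
        seq.append(a)
        # Advance to the next number in the sequence.
        a, b = b, a + b
    return seq

print(fibonacci_list(10))  # → [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

Returning a list keeps the computation separate from the output, so the same function can feed a print loop, a plot, or a unit test.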

Conclusion

Ollama is a new Gen-AI tool for working with large language models locally, offering enhanced privacy, customization, and offline accessibility. It makes working with LLMs simpler, enabling users to explore and experiment with open-source models directly on their machines, which promotes innovation and a deeper understanding of AI.
