Guide to Running Large Language Models Using Ollama on H100 GPUs

This article is a guide to running large language models with Ollama on NVIDIA H100 GPUs. With support for H100 GPUs, users can accelerate AI/ML development, testing, deployment, and optimization seamlessly, without the extensive setup or maintenance typically associated with traditional platforms.

Ollama is an open source tool which provides access to a diverse library of pre-trained models, offers effortless installation and setup across different operating systems, and exposes a local API for seamless integration into applications and workflows. Users can customize and fine-tune LLMs, optimize performance with hardware acceleration, and benefit from interactive user interfaces for intuitive interactions.

Prerequisites

  • Access to H100 GPUs: Ensure you have access to NVIDIA H100 GPUs, either through on-premise hardware or cloud VMs equipped with GPUs.
  • Python and Linux Basics: Familiarity with Python and common Linux commands.
  • CUDA and cuDNN Installed: Ensure NVIDIA CUDA and cuDNN libraries are installed for optimal GPU performance.
  • Sufficient Storage and Memory: Have ample storage and memory available to handle large model datasets and weights.
  • Basic Understanding of LLMs: A foundational understanding of large language models and their structure to effectively manage and optimize them.

These prerequisites help ensure a smooth and efficient experience when running LLMs with Ollama on H100 GPUs.

What is Ollama?

Ollama offers a way to download a large language model from its vast model library, which includes Llama 3.1, Mistral, Code Llama, Gemma, and many more. Ollama combines model weights, configuration, and data into one package, specified by a Modelfile.

Ollama provides a flexible platform for creating, importing, and using custom or pre-existing language models, ideal for creating chatbots, text summarization, and much more. It emphasizes privacy, integrates seamlessly with Windows, macOS, and Linux, and is free to use. Ollama also allows users to deploy models locally with ease. Further, the platform also supports real-time interactions via a REST API.
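The REST API mentioned above can be called from plain Python with nothing beyond the standard library. Below is a minimal sketch, assuming a local Ollama server listening on its default port, 11434, and a model (here llama3.1) that has already been pulled:

```python
import json
import urllib.request

# Ollama's default local endpoint for one-shot text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model, prompt):
    """Build the JSON body for a non-streaming /api/generate request."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model, prompt):
    """Send a prompt to a locally running Ollama server and return the text."""
    body = json.dumps(build_payload(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example usage (requires a running Ollama server and a pulled model):
# print(generate("llama3.1", "Explain what a Modelfile is in one sentence."))
```

With stream left at its default of true, Ollama instead returns one JSON object per generated token, which suits chat-style interfaces; setting it to false, as above, returns a single response.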

It’s perfect for LLM-powered web apps and tools, and it works much like Docker: just as Docker pulls images from a central hub and runs them in containers, Ollama pulls models from its library and runs them locally. Ollama also allows us to customize models by creating a Modelfile. Below is an example Modelfile:

FROM llama2

# Set the temperature
PARAMETER temperature 1

# Set the system prompt

SYSTEM """ 
You are a helpful teaching assistant created by DO. 
Answer questions asked based on Artificial Intelligence, Deep Learning. """ 

Next, create and run the custom model:

ollama create MLexp -f ./Modelfile
ollama run MLexp

The Power of NVIDIA H100 GPUs

The H100 is NVIDIA’s most powerful GPU, designed specifically for artificial intelligence applications. With 80 billion transistors (roughly 1.5 times as many as the A100’s 54 billion), it can process large datasets much faster than other GPUs on the market.

AI applications are data-hungry and computationally expensive. To manage workloads of this scale, the H100 is considered the best choice.

The H100 features fourth-generation tensor cores and a transformer engine with FP8 precision. The H100 triples the floating-point operations per second (FLOPS) compared to previous models, delivering 60 teraflops of double-precision (FP64) computing, which is crucial for precise calculations in HPC tasks. It can perform single-precision matrix-multiply operations at one petaflop throughput using TF32 precision without requiring any changes to existing code, making it user-friendly for developers.

The H100 introduces DPX instructions that significantly boost performance for dynamic programming tasks, achieving 7X better performance than the A100 and 40X faster than CPUs for specific algorithms like DNA sequence alignment.

H100 GPUs provide the necessary computational power, offering 3 terabytes per second (TB/s) of memory bandwidth per GPU. This high performance allows for efficient handling of large datasets.

The H100 supports scalability through technologies like NVLink and NVSwitch™, which allow multiple GPUs to work together effectively.

Why Run LLMs with Ollama on H100 GPUs?

To run Ollama efficiently and hassle-free, an NVIDIA GPU is strongly recommended; on a CPU alone, users can expect slow responses.

  • H100, due to its advanced architecture, offers exceptional computing power, which helps to significantly speed up the efficiency of LLMs.
  • Ollama lets users customize and fine-tune LLMs to meet their specific needs, enabling prompt engineering, few-shot learning, and tailored fine-tuning to align models with desired outcomes. Pairing Ollama with H100 GPUs enhances model inference and training times for developers and researchers.
  • H100 GPUs have the capacity to handle models such as Falcon 180B, which makes them ideal for creating and deploying Gen AI tools like chatbots or RAG applications.
  • H100 GPUs come with hardware optimizations like tensor cores, which significantly accelerate tasks involving LLMs, especially when dealing with matrix-heavy operations.

Setting Up Ollama with H100 GPUs

Ollama is compatible with Windows, macOS, and Linux. The examples here use Linux.

Run the command below in your terminal to check the GPU specification:

nvidia-smi

Next, install Ollama from the same terminal.

curl -fsSL https://ollama.com/install.sh | sh

This will instantly start the Ollama installation.

Once the installation is done, we can pull any LLM, such as Llama 3.1, Phi-3, Mistral, or Gemma 2, and start working with it.

To run and chat with a model, use the commands below; feel free to substitute any model that fits your requirements. Running a model with Ollama is straightforward, and on the powerful H100, response generation is fast and efficient.

ollama run example_model
ollama run qwen2:7b

Handling Connection Errors in Ollama

If you encounter the error “could not connect to ollama app, is it running?”, use the commands below to enable and start the Ollama service:

sudo systemctl enable ollama
sudo systemctl start ollama

Supported Models in Ollama

Ollama supports a wide list of models. Here are some example models that can be downloaded and used:

Model          Parameters   Size     Download Command
Llama 3.1      8B           4.7GB    ollama run llama3.1
Llama 3.1      70B          40GB     ollama run llama3.1:70b
Llama 3.1      405B         231GB    ollama run llama3.1:405b
Phi 3 Mini     3.8B         2.3GB    ollama run phi3
Phi 3 Medium   14B          7.9GB    ollama run phi3:medium
Gemma 2        27B          16GB     ollama run gemma2:27b
Mistral        7B           4.1GB    ollama run mistral
Code Llama     7B           3.8GB    ollama run codellama

With Ollama, once a model has been downloaded, users can run it without an internet connection, as the model and its dependencies are stored locally.
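The download sizes in the table above track the parameter counts closely, because Ollama's default downloads are quantized to roughly 4 to 5 bits per weight. A back-of-the-envelope sketch (the 4.7 bits-per-weight constant is an approximation, not an official Ollama figure) helps check whether a model fits in the H100's 80 GB of memory:

```python
def approx_size_gb(params_billion, bits_per_weight=4.7):
    """Rough size of a quantized model in gigabytes.

    A bits_per_weight of ~4.7 approximates 4-bit quantization plus
    per-block scaling overhead; this is an estimate, not an Ollama spec.
    """
    # 1 billion parameters at 8 bits per weight would be ~1 GB.
    return params_billion * bits_per_weight / 8

for params in (8, 70, 405):
    print(f"{params}B parameters -> about {approx_size_gb(params):.0f} GB")
```

The estimates land close to the table (8B gives about 4.7 GB, 70B about 41 GB), and they show why the 405B model at roughly 231GB exceeds a single H100's 80 GB and calls for a multi-GPU setup over NVLink.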

Python Code for Fibonacci Series

Below is a Python script to generate a Fibonacci sequence:

def fibonacci(n):
    """
    Print the first n numbers of the Fibonacci sequence.

    Parameters:
        n (int): The number of elements of the sequence to print.

    Returns:
        None
    """
    # Initialize the first two numbers of the Fibonacci sequence.
    a, b = 0, 1

    # Generate and print the sequence.
    for _ in range(n):
        print(a)
        # Advance to the next number in the sequence.
        a, b = b, a + b

# Test the function with the first 10 numbers of the Fibonacci sequence.
if __name__ == "__main__":
    fibonacci(10)

This Python code defines a simple fibonacci function that takes an integer argument and prints the first n numbers in the Fibonacci sequence. The Fibonacci sequence starts with 0 and 1, and each subsequent number is the sum of the previous two.

The if __name__ == "__main__": block at the end tests this function by calling it with a parameter value of 10, which prints out the first 10 numbers in the Fibonacci sequence.
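A small variant of the script above (not part of the original example) returns the values instead of printing them, which makes the function easier to reuse and test:

```python
def fibonacci_list(n):
    """Return the first n Fibonacci numbers as a list."""
    seq = []
    a, b = 0, 1
    for _ in range(n):
        seq.append(a)
        # Advance to the next number in the sequence.
        a, b = b, a + b
    return seq

print(fibonacci_list(10))  # → [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

Returning a list keeps the computation separate from the output, so the same function can feed a print loop, a plot, or a unit test.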

Conclusion

Ollama is a new Gen-AI tool for working with large language models locally, offering enhanced privacy, customization, and offline accessibility. It makes working with LLMs simpler, enabling users to explore and experiment with open-source models directly on their machines, which promotes innovation and a deeper understanding of AI.
