Ansible – updating proxmox host kernel with LXC shared GPU

This automates fixing up a Proxmox host after a kernel update, which breaks the LXC container's link to the GPU.

Done by hand, the fix is simply to reinstall the graphics driver and reboot.

Run this after you have done an update / upgrade of your Proxmox host.
You will have to change the IP addresses to match your setup.

In my setup:
10.77.69.2 – Proxmox Host
10.77.69.103 – LXC Plex
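
The playbook below targets two inventory groups, nvidia and plex, and its second play looks up hostvars['10.77.69.2'], so the Proxmox host has to appear in the inventory under exactly that name. Here is a minimal sketch of writing a matching inventory. The function name write_inventory and the file names inventory.ini and playbook.yml are placeholders of mine, not something from the original setup.

```shell
#!/bin/sh
# Write a minimal INI inventory for the two plays.
# Assumption: hosts are addressed by IP, so the inventory hostname of the
# Proxmox host is literally "10.77.69.2" (the key hostvars[...] looks up).
write_inventory() {
    # $1: path of the inventory file to create (hypothetical name)
    cat > "$1" <<'EOF'
[nvidia]
10.77.69.2

[plex]
10.77.69.103
EOF
}

# Example:
#   write_inventory inventory.ini
#   ansible-playbook -i inventory.ini playbook.yml
```

If your inventory uses hostnames instead of IPs, the hostvars lookups in the second play need to use the same names.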

########
- hosts: nvidia
  become: true
  become_user: root
  tasks:
    - name: Wait for 10.77.69.2 to become available
      wait_for_connection:
        delay: 5
        timeout: 300

    - name: Check if NVIDIA kernel module is loaded
      shell: lsmod | grep -q '^nvidia'
      register: nvidia_module_check
      # A non-zero rc just means "not loaded"; it is not a task failure
      changed_when: false
      failed_when: false

    - name: Set NVIDIA module check result as fact
      set_fact:
        nvidia_module_rc: "{{ nvidia_module_check.rc }}"

    - name: Reinstall NVIDIA driver if module is not loaded
      shell: sh /root/NVIDIA-Linux-x86_64-535.154.05.run --silent
      args:
        executable: /bin/bash
      when: nvidia_module_check.rc != 0

    - name: Set fact if NVIDIA driver was installed
      set_fact:
        driver_installed: true
      when: nvidia_module_check.rc != 0

    - name: Reboot system if NVIDIA driver was reinstalled
      reboot:
      when: nvidia_module_check.rc != 0

    - name: Wait for 10.77.69.2 to become available after reboot
      wait_for_connection:
        delay: 10
        timeout: 600
      when: nvidia_module_check.rc != 0

########
- hosts: plex
  become: true
  become_user: root
  tasks:
    - name: Install NVIDIA driver in LXC
      shell: sh /root/NVIDIA-Linux-x86_64-535.154.05.run --no-kernel-module --silent
      args:
        executable: /bin/bash
      when: hostvars['10.77.69.2'].driver_installed | default(false)

    - name: Reboot 10.77.69.103
      reboot:
      when: hostvars['10.77.69.2'].driver_installed | default(false)

    - name: Wait for 10.77.69.103 to become available
      wait_for_connection:
        delay: 10
        timeout: 300
      when: hostvars['10.77.69.2'].driver_installed | default(false)


This checks whether the NVIDIA kernel module (my GPU is an NVIDIA card) is loaded on the host; if it is not, the driver is reinstalled in silent mode and the host is rebooted. It also sets a fact in Ansible so that the driver is reinstalled in the LXC as well, but only when it is needed.
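
To try the same decision by hand before trusting the playbook, the check can be sketched as a small shell function. This is only an illustration of the logic; the function names nvidia_module_loaded and driver_action are made up for this example.

```shell
#!/bin/sh
# Succeeds (exit 0) when lsmod-style text on stdin lists a module whose
# name starts with "nvidia", the same test the playbook runs.
nvidia_module_loaded() {
    grep -q '^nvidia'
}

# Decide what to do from lsmod-style text on stdin.
# Prints "reinstall" when the module is missing, "ok" otherwise.
driver_action() {
    if nvidia_module_loaded; then
        echo ok
    else
        echo reinstall
    fi
}

# Example on a real host:
#   lsmod | driver_action
```

Reading from stdin keeps the function easy to test with pasted lsmod output from another machine.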

Nvidia GPU passthrough in LXC

Link to original article here.

1. Install host drivers
When doing PCIe passthrough, the first step is to blacklist the driver to ensure the host kernel doesn’t try to load the device. Here we need to do the opposite: Install and configure the correct drivers.

You’ll need to install the actual nvidia drivers. The easiest way to do this is to download the driver from nvidia.com. This not only ensures you’re using the latest driver, but means it won’t accidentally update during a system update, as it’s important that the host and guest OS have the exact same driver version. You can still install it using the system package manager, just be aware of updates – especially if the guest and host OS are different distributions.

wget https://us.download.nvidia.com/XFree86/Linux-x86_64/450.80.02/NVIDIA-Linux-x86_64-450.80.02.run
sh NVIDIA*


Next you’ll need to make sure the drivers are loaded on boot. To do this, add the following to this file:

nano /etc/modules-load.d/modules.conf
# Nvidia modules
nvidia
nvidia_uvm
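
If you would rather script this than edit the file in nano, one possible approach is a small helper that appends a line only when it is missing, so it is safe to run again. The helper name ensure_line is hypothetical.

```shell
#!/bin/sh
# Append a line to a file only if that exact line is not already present.
ensure_line() {
    # $1: file path, $2: line to ensure
    touch "$1"
    grep -qxF -- "$2" "$1" || printf '%s\n' "$2" >> "$1"
}

# Example on the host (as root):
#   ensure_line /etc/modules-load.d/modules.conf nvidia
#   ensure_line /etc/modules-load.d/modules.conf nvidia_uvm
```

The -x and -F flags make grep match the whole line literally, so an existing nvidia_uvm entry is not mistaken for nvidia.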


Once that’s done, you’ll need to update the initramfs with:

update-initramfs -u -k all


The final step is to add a udev rule to create the required device files for the nvidia driver, which for some reason aren’t created automatically. This is done in:

nano /etc/udev/rules.d/70-nvidia.rules
KERNEL=="nvidia", RUN+="/bin/bash -c '/usr/bin/nvidia-smi -L && /bin/chmod 666 /dev/nvidia*'"
KERNEL=="nvidia_uvm", RUN+="/bin/bash -c '/usr/bin/nvidia-modprobe -c0 -u && /bin/chmod 0666 /dev/nvidia-uvm*'"


Now you can reboot, and check the GPU is being detected correctly with:

nvidia-smi


The output should look like:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 760     Off  | 00000000:03:00.0 N/A |                  N/A |
|  0%   34C    P0    N/A /  N/A |      0MiB /  1996MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+


There’s my GPU being detected correctly, using driver version 450.80.02 – we’ll be needing this later.
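
Since the version is needed again for the guest, it can help to pull it out of the banner rather than copy it by eye. A minimal sketch, assuming the banner keeps the "Driver Version: x.y.z" layout shown above; the function name driver_version is my own.

```shell
#!/bin/sh
# Extract the driver version from nvidia-smi's banner text on stdin.
driver_version() {
    sed -n 's/.*Driver Version: \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Example on a real host:
#   nvidia-smi | driver_version
```

nvidia-smi can also report this directly with --query-gpu=driver_version --format=csv,noheader, which avoids parsing the banner at all.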

2. Configure container
Next, create your container. There’s nothing special about this process, just choose the OS and resources that suit you.

Before starting your container, we need to make some changes to the config file directly to pass through the GPU. This config file will probably live in:

nano /etc/pve/lxc/<id>.conf # ID of the LXC


Where id is the id of your container. You need to add the following lines:

# Allow cgroup access
lxc.cgroup2.devices.allow: c 195:* rwm
lxc.cgroup2.devices.allow: c 243:* rwm

# Pass through device files
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-modeset dev/nvidia-modeset none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file


These lines allow the container to communicate with the nvidia driver, and pass through the control files needed for the guest to actually communicate with the GPU. These lines probably won’t work out of the box, so we need to compare them to our actual control files:

ls -l /dev/nvidia*


The output should look like:

crw-rw-rw- 1 root root 195, 254 Dec 22 20:51 /dev/nvidia-modeset
crw-rw-rw- 1 root root 243,   0 Dec 22 20:51 /dev/nvidia-uvm
crw-rw-rw- 1 root root 243,   1 Dec 22 20:51 /dev/nvidia-uvm-tools
crw-rw-rw- 1 root root 195,   0 Dec 22 20:51 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Dec 22 20:51 /dev/nvidiactl


Note: If you don’t see all 5 files, it probably means the drivers haven’t loaded correctly. Best check the logs.

These files are character devices (as shown by the c at the start of the line), which the kernel module uses to communicate with the hardware. lxc.mount.entry binds these into the container.
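
Rather than typing the mount entries by hand, they can be generated from whatever device files exist on your host. This is a sketch under the assumption that every path you pass in should be bind-mounted at the same location inside the container; the function name mount_entry_lines is hypothetical.

```shell
#!/bin/sh
# Print one lxc.mount.entry line per device path given as an argument.
# The container-side path is the host path without the leading slash.
mount_entry_lines() {
    for dev in "$@"; do
        printf 'lxc.mount.entry: %s %s none bind,optional,create=file\n' \
            "$dev" "${dev#/}"
    done
}

# Example on the host:
#   mount_entry_lines /dev/nvidia*
```

Review the output before pasting it into the container config, since the glob may match entries you don’t want to pass through.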

The lxc.cgroup2.devices.allow lines tell the container’s device cgroup which character devices it may access, identified by their major device number (the first of the two comma-separated numbers in the ls output). For the files we have here, 195:* matches nvidia0, nvidiactl and nvidia-modeset, and 243:* matches the two uvm files. If the numbers on your system don’t match, you’ll need to change the config. Note that the order doesn’t matter, so long as the cgroup lines are before the mounts.
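
The numbers that belong in the allow lines are the major device numbers, i.e. the first number before the comma in the ls -l listing. As a rough sketch, they can be read straight from that listing instead of by eye. The function name cgroup_allow_lines is my own, and it assumes the ls -l column layout shown above.

```shell
#!/bin/sh
# Read `ls -l /dev/nvidia*` output on stdin and print one
# lxc.cgroup2.devices.allow line per distinct major device number.
cgroup_allow_lines() {
    # Field 5 of a character-device line is the major number plus a comma
    awk '/^c/ { sub(/,/, "", $5); print $5 }' \
        | sort -un \
        | while read -r major; do
              printf 'lxc.cgroup2.devices.allow: c %s:* rwm\n' "$major"
          done
}

# Example on the host:
#   ls -l /dev/nvidia* | cgroup_allow_lines
```

It is worth re-running this after a driver or kernel change and comparing the result with the container config, in case the numbers on your system are different afterwards.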

3. Install guest drivers
Now that the host is configured, and the control files passed through, the guest needs configuring.

The gist of the configuration is to also install the nvidia drivers, but without the kernel modules. The simplest way to do this is to use the same driver binary downloaded from nvidia.com, and run it with:

wget https://us.download.nvidia.com/XFree86/Linux-x86_64/450.80.02/NVIDIA-Linux-x86_64-450.80.02.run
sh NVIDIA* --no-kernel-module
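
Because the host and guest must run the exact same driver version, a quick sanity check before running the installer in the container can save a confusing failure later. A minimal sketch, assuming the installer keeps NVIDIA’s usual NVIDIA-Linux-x86_64-<version>.run file name; the function names installer_version and same_version are hypothetical.

```shell
#!/bin/sh
# Extract the version from an installer file name such as
# NVIDIA-Linux-x86_64-450.80.02.run
installer_version() {
    basename "$1" .run | sed 's/^NVIDIA-Linux-x86_64-//'
}

# Succeeds only when the installer version equals the given driver version.
same_version() {
    # $1: installer path, $2: driver version reported by the host
    [ "$(installer_version "$1")" = "$2" ]
}

# Example inside the container, with the host version noted earlier:
#   same_version NVIDIA-Linux-x86_64-450.80.02.run 450.80.02 && echo "versions match"
```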


4. Test it
Now, from your container, you should be able to run nvidia-smi, and it’ll show the right GPU and driver version:

nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 760     Off  | 00000000:03:00.0 N/A |                  N/A |
|  0%   34C    P0    N/A /  N/A |      0MiB /  1996MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+


This shows the GPU is detected correctly, but doesn’t prove it’s working correctly. The best way to do this is to actually try and use it. For me this involved installing Jellyfin, loading in some content and checking the GPU was doing the transcoding, not the CPU – Which it was!

Because it’s simply passing through the device files rather than the actual PCIe device, you can repeat this process multiple times for multiple containers.