This playbook automates the fix needed on a Proxmox host when a kernel update breaks the LXC link to the GPU.
Done by hand, the fix is simply to reinstall the graphics driver and reboot.
Run it after you have done an update / upgrade of your Proxmox host. You will have to change the IP addresses for your setup.
For me:
10.77.69.2 – Proxmox host
10.77.69.103 – Plex LXC
########
- hosts: nvidia
  become: true
  become_user: root
  tasks:
    - name: Wait for 10.77.69.2 to become available
      wait_for_connection:
        delay: 5
        timeout: 300

    - name: Check if NVIDIA kernel module is loaded
      shell: lsmod | grep -q '^nvidia'
      register: nvidia_module_check
      ignore_errors: true

    - name: Set NVIDIA module check result as fact
      set_fact:
        nvidia_module_rc: "{{ nvidia_module_check.rc }}"

    - name: Reinstall NVIDIA driver if module is not loaded
      shell: sh /root/NVIDIA-Linux-x86_64-535.154.05.run --silent
      args:
        executable: /bin/bash
      when: nvidia_module_check.rc != 0

    - name: Set fact if NVIDIA driver was installed
      set_fact:
        driver_installed: true
      when: nvidia_module_check.rc != 0

    - name: Reboot system if NVIDIA driver was reinstalled
      reboot:
      when: nvidia_module_check.rc != 0

    - name: Wait for 10.77.69.2 to become available after reboot
      wait_for_connection:
        delay: 10
        timeout: 600
      when: nvidia_module_check.rc != 0

########
- hosts: plex
  become: true
  become_user: root
  tasks:
    - name: Install NVIDIA driver in LXC
      shell: sh /root/NVIDIA-Linux-x86_64-535.154.05.run --no-kernel-module --silent
      args:
        executable: /bin/bash
      when: hostvars['10.77.69.2'].driver_installed | default(false)

    - name: Reboot 10.77.69.103
      reboot:
      when: hostvars['10.77.69.2'].driver_installed | default(false)

    - name: Wait for 10.77.69.103 to become available
      wait_for_connection:
        delay: 10
        timeout: 300
      when: hostvars['10.77.69.2'].driver_installed | default(false)
This checks whether the NVIDIA kernel module (NVIDIA is my GPU) is loaded on the host; if it is not, the driver is reinstalled in silent mode and the host is rebooted. The reinstall also sets a fact in Ansible, so the second play reinstalls the GPU driver in the LXC as well, but only when it is needed.
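The playbook targets two inventory groups, nvidia and plex, and the second play reads hostvars['10.77.69.2'], so the Proxmox host has to appear in the inventory under exactly that name. A minimal sketch of a matching inventory, assuming a hypothetical file name of inventory.ini and the IP addresses from my setup:

```shell
# Write a minimal Ansible inventory whose group names match the playbook.
# "inventory.ini" is a hypothetical file name; swap in your own IP addresses.
cat > inventory.ini <<'EOF'
[nvidia]
10.77.69.2

[plex]
10.77.69.103
EOF
```

You could then run the playbook with ansible-playbook -i inventory.ini followed by whatever you named the playbook file. If you list the host under a different name (a DNS name, say), the hostvars lookup in the second play needs changing to match.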
1. Install host drivers

When doing PCIe passthrough, the first step is to blacklist the driver to ensure the host kernel doesn’t try to load the device. Here we need to do the opposite: Install and configure the correct drivers.
You’ll need to install the actual nvidia drivers. The easiest way to do this is to download the driver from nvidia.com. This not only ensures you’re using the latest driver, but means it won’t accidentally update during a system update, as it’s important that the host and guest OS have the exact same driver version. You can still install it using the system package manager, just be aware of updates – especially if the guest and host OS are different distributions.
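Since the host and guest versions have to match, it can help to confirm which version an installer file carries before running it on either side. A small sketch, assuming the installer keeps the NVIDIA-Linux-x86_64-<version>.run naming seen in the playbook above; installer_version is a hypothetical helper name, not part of any NVIDIA tooling:

```shell
# Print the driver version embedded in an NVIDIA .run installer file name.
# Assumes the name ends in "-<version>.run".
installer_version() {
    iv_name=${1##*/}            # strip any leading directory
    iv_name=${iv_name%.run}     # strip the .run suffix
    printf '%s\n' "${iv_name##*-}"   # keep what follows the last dash
}

installer_version /root/NVIDIA-Linux-x86_64-535.154.05.run
```

The result can be compared with what nvidia-smi --query-gpu=driver_version --format=csv,noheader reports on the host and, later, inside the container.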
Next you’ll need to make sure the drivers are loaded on boot. To do this, add the following to this file:
nano /etc/modules-load.d/modules.conf
# Nvidia modules
nvidia
nvidia_uvm
Once that’s done, you’ll need to update the initramfs with:
update-initramfs -u -k all
The final step is to add a udev rule to create the required device files for the nvidia driver, which for reasons aren’t created automatically. This is done in:
There’s my GPU being detected correctly, using driver version 450.80.02 – we’ll be needing this later.
2. Configure container

Next, create your container. There’s nothing special about this process, just choose the OS and resource requirements that suit you.
Before starting your container, we need to make some changes to the config file directly to pass through the GPU. This config file will probably live in:
nano /etc/pve/lxc/<id>.conf # ID of the LXC
Where id is the id of your container. You need to add the following lines:
# Allow cgroup access
lxc.cgroup2.devices.allow: c 195:* rwm
lxc.cgroup2.devices.allow: c 243:* rwm
# Pass through device files
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-modeset dev/nvidia-modeset none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
These lines allow the container to communicate with the nvidia driver, and pass through the control files needed for the guest to actually communicate with the GPU. These lines probably won’t work out of the box, so we need to compare them to our actual control files:
Note: If you don’t see all 5 files, it probably means the drivers haven’t loaded correctly. Best check the logs.
These files are character devices (as shown by the c at the start of the line), which the kernel module uses to communicate with the hardware. lxc.mount.entry binds these into the container.
The lxc.cgroup2.devices.allow lines tell the container’s device cgroup which devices it may access, identified by their major and minor device numbers. For some of the files we have, 195:* matches their major number, and the uvm files match 243:*. These numbers can differ between systems, so if the config doesn’t match your device files, you’ll need to change it. Note that the order doesn’t matter, so long as the cgroup lines are before the mounts.
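Rather than reading the numbers off by eye, the allow lines can be generated from a device listing. A rough sketch, assuming the usual ls -l layout in which the fifth column of a character device is its major number followed by a comma; allow_lines is a hypothetical helper name:

```shell
# Read "ls -l" output on stdin and print one cgroup allow line per distinct
# major number found among the character devices (lines starting with "c").
allow_lines() {
    awk '$1 ~ /^c/ {
        major = $5
        sub(/,$/, "", major)          # drop the trailing comma, e.g. "195," -> "195"
        if (!seen[major]++)           # print each major number only once
            print "lxc.cgroup2.devices.allow: c " major ":* rwm"
    }'
}
```

On the host this would be run as ls -l /dev/nvidia* | allow_lines, and the output pasted into the container config in place of the two allow lines above.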
3. Install guest drivers

Now that the host is configured, and the control files passed through, the guest needs configuring.
The gist of the configuration is to also install the nvidia drivers, but without the kernel modules. The simplest way to do this is to use the same driver binary downloaded from nvidia.com, and run it with:
This shows the GPU is detected correctly, but doesn’t prove it’s working correctly. The best way to do this is to actually try and use it. For me this involved installing Jellyfin, loading in some content and checking the GPU was doing the transcoding, not the CPU – Which it was!
Because it’s simply passing through the device files rather than the actual PCIe device, you can repeat this process multiple times for multiple containers.