Proxmox home cluster
Proxmox Backup Server

ZFS stuck?
If terminating a stuck zfs send process doesn’t work, terminate the SSH tunnel process carrying the stream instead.
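A minimal sketch for finding and killing the transport process (PIDs are placeholders):

# find the zfs send/receive processes and the ssh tunnel carrying the stream
ps aux | grep -E 'zfs (send|recv)|ssh' | grep -v grep
# if killing the zfs send PID has no effect, kill the ssh PID instead
kill <ssh_pid>
# last resort
kill -9 <ssh_pid>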

Change the VM ID range, i.e. the next free ID used for new VMs.
Datacenter → Options → Next Free VMID Range
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_next_id_range
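It can also be set from the CLI; a sketch, assuming the next-id option described in the admin guide (range values are examples):

# stored as next-id in /etc/pve/datacenter.cfg
pvesh set /cluster/options --next-id lower=1000,upper=2000
# ask the cluster for the next free ID
pvesh get /cluster/nextid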

Helpers and scripts
https://github.com/tteck/Proxmox

Host key verification failed

If migration fails with this error, check that ssh_known_hosts is correct and restart the pveproxy service.
The same issue can also cause ZFS replication errors.
https://forum.proxmox.com/threads/host-key-verification-failed-when-migrate.41666/
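The usual fix, sketched (node name/IP are placeholders):

# regenerate cluster SSH known_hosts entries and certificates, then restart the proxy
pvecm updatecerts
systemctl restart pveproxy
# or re-accept the target node's host key manually
ssh -o 'HostKeyAlias=<target-node-name>' root@<target-node-ip>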

FS freeze thaw

# https://pve.proxmox.com/wiki/Qemu-guest-agent
qm guest cmd VMID fsfreeze-status
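
The freeze/thaw counterparts, e.g. if a backup left the guest filesystem frozen:
qm guest cmd VMID fsfreeze-freeze
qm guest cmd VMID fsfreeze-thaw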

RAM balloon

If RAM usage on the host is below 80%, Proxmox will dynamically add memory to the guest, up to the maximum memory specified.
https://pve.proxmox.com/pve-docs/chapter-qm.html#qm_memory
https://forum.proxmox.com/threads/ballooning-issues.69048/#post-309631
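Example of enabling it on a VM (sizes are arbitrary): 8 GiB maximum, balloon down to 2 GiB; balloon=0 would disable ballooning.

qm set VMID --memory 8192 --balloon 2048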

GPU passthrough

https://pve.proxmox.com/wiki/PCI_Passthrough
https://bobcares.com/blog/proxmox-gpu-passthrough/
https://forum.proxmox.com/threads/audio-is-same-iommu-group-as-ethernet-how-pass-through.128404/
https://forum.proxmox.com/threads/proxmox-8-1-3-gpu-passthrough-soooo-close.139341/

# nano /etc/default/grub

from
GRUB_CMDLINE_LINUX_DEFAULT="quiet"
to
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on"
or
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on"
# Intel VT-d in BIOS should be enabled
# IOMMU should be enabled looking at dmesg
dmesg | grep -i 'IOMMU enabled'
 
# check the current kernel cmdline to verify iommu is enabled
cat /proc/cmdline
 
# add intel_iommu to the kernel cmdline without modifying /etc/default/grub itself
echo 'GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT intel_iommu=on"' > /etc/default/grub.d/iommu.cfg
update-grub
 
# note: on kernel 6.2+ (Proxmox VE 8) vfio_virqfd is built into vfio and can be dropped from this list
echo -e 'vfio\nvfio_iommu_type1\nvfio_pci\nvfio_virqfd' > /etc/modules-load.d/vfio.conf
 
echo "options vfio_iommu_type1 allow_unsafe_interrupts=1" > /etc/modprobe.d/iommu_unsafe_interrupts.conf
 
echo "blacklist radeon" > /etc/modprobe.d/blacklist_gpu.conf
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist_gpu.conf
echo "blacklist nvidia" >> /etc/modprobe.d/blacklist_gpu.conf
 
# check vendor id codes for vga and audio
lspci | grep -i nvidia
lspci -n -s 01:00
# 10de:1cb3
# 10de:0fb9
 
echo "options vfio-pci ids=10de:1cb3,10de:0fb9 disable_vga=1" > /etc/modprobe.d/vfio.conf
 
update-initramfs -u
reboot
 
# TODO check if it's necessary
echo "options kvm ignore_msrs=1" > /etc/modprobe.d/kvm.conf
root@mango:~# lspci | grep -i nvidia
01:00.0 VGA compatible controller: NVIDIA Corporation GP107GL [Quadro P400] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GP107GL High Definition Audio Controller (rev a1)

root@mango:~# lspci -n -s 01:00
01:00.0 0300: 10de:1cb3 (rev a1)
01:00.1 0403: 10de:0fb9 (rev a1)
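
With the card bound to vfio-pci, attach it to the VM; a sketch using my 01:00 device (VMID is a placeholder, pcie=1 assumes a q35 machine type):

# pass all functions of the GPU (VGA + HDMI audio) to the VM as a PCIe device, as primary GPU
qm set VMID -hostpci0 01:00,pcie=1,x-vga=1
qm config VMID | grep hostpci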

Debian 12 bookworm VM with nvidia GPU for Docker

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
 
apt-get update
apt-get install -y nvidia-container-toolkit
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker
 
# test if it works
docker run --rm --gpus all nvidia/cuda:11.5.2-base-ubuntu20.04 nvidia-smi
 
 
# /etc/apt/sources.list.d/debian.sources
# Components: main contrib non-free non-free-firmware
 
apt update
# https://old.reddit.com/r/debian/comments/17vdj3c/install_nvidia_headless_driver/
apt install --no-install-recommends firmware-misc-nonfree nvidia-smi nvidia-driver nvidia-kernel-dkms linux-headers-amd64 libcuda1 nvidia-persistenced
 
# the CUDA toolkit may also be needed for some workloads
# apt install nvidia-cuda-dev nvidia-cuda-toolkit
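
After installing the driver (a reboot may be needed for the dkms module to load), a quick check that it is active before testing Docker:

nvidia-smi
lsmod | grep nvidia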

Ubuntu 22.04 VM with Nvidia GPU for Docker

# install a clean Ubuntu Server 22.04, log in as root and upgrade it to the latest version
apt-get update
apt-get dist-upgrade
 
# blacklist the default nouveau GPU driver (check with lsmod first)
echo "blacklist nouveau" > /etc/modprobe.d/blacklist-nouveau.conf
echo "options nouveau modeset=0" >> /etc/modprobe.d/blacklist-nouveau.conf
 
# search for latest proprietary Nvidia GPU driver in Ubuntu repo and install it
apt search nvidia-driver-5 | grep 'nvidia-driver-.*-server'
apt-get install -y nvidia-driver-535-server
update-initramfs -u
reboot
# check GPU status after rebooting
nvidia-smi
 
# install standard Docker
curl https://get.docker.com | sh && sudo systemctl --now enable docker
# add Nvidia GPU support to Docker using the official Nvidia repo
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/ubuntu22.04/nvidia-docker.list > /etc/apt/sources.list.d/nvidia-docker.list
apt update
apt -y install nvidia-container-toolkit
systemctl restart docker
# test if GPU is visible and working from inside of container
docker run --rm --gpus all nvidia/cuda:11.5.2-base-ubuntu20.04 nvidia-smi
 
# deploy a ready-to-use containerized Stable Diffusion with an easy-to-use web GUI
git clone https://github.com/AbdBarho/stable-diffusion-webui-docker
cd stable-diffusion-webui-docker/
docker compose --profile download up --build
docker compose --profile auto up -d --build
 
#
# visit http://IP:7860 of your Docker host for WebGUI
#
 
# Extras
# limit power to prevent faster wear of card
# setting is not persistent, add this to systemd service after boot if needed
nvidia-smi -pl 180
nvidia-smi -q -d POWER
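
A sketch of a one-shot systemd unit to reapply the limit at boot (unit name and the 180 W value are my choice):

cat > /etc/systemd/system/nvidia-power-limit.service <<'EOF'
[Unit]
Description=Set Nvidia GPU power limit

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pl 180

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable nvidia-power-limit.service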

VM io-error

If you are getting a yellow triangle and “io-error” for a VM in Proxmox, the disk is probably just full.
In my case I made some snapshots of a VM before an upgrade and didn’t notice that ZFS replication would fill up the disk of my smallest node.

Source: https://forum.proxmox.com/threads/crash-vm-with-io-error.118692/
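To confirm and recover, a quick sketch (pool, dataset and VM ID are placeholders):

# check pool/dataset usage; snapshots are usually the culprit
zfs list -o name,used,avail,refer
zfs list -t snapshot -o name,used -s used
# free space by destroying an old snapshot, then resume the paused VM
zfs destroy rpool/data/vm-100-disk-0@<old-snapshot>
qm resume 100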

Misc

Migrate to remote cluster
https://forum.proxmox.com/threads/sync-zfs-to-remote-zfs-datastore.120424/post-561444
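One common approach from that thread is a plain zfs send over SSH; a minimal sketch (dataset, snapshot and host names are placeholders):

# one-off full copy of a VM disk to a remote ZFS datastore
zfs snapshot rpool/data/vm-100-disk-0@migrate
zfs send rpool/data/vm-100-disk-0@migrate | ssh root@remote-node zfs receive -F tank/vm-100-disk-0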