Proxmox home cluster
Proxmox Backup Server
ZFS stuck?
If killing a stuck zfs send process doesn’t work, terminate the SSH tunnel process that carries the stream instead.
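A sketch of finding the stuck stream and its tunnel, assuming replication runs over a plain ssh pipe (the process name patterns are assumptions and may need adjusting):

```shell
# List zfs send streams and the ssh tunnels carrying them; pgrep exits
# nonzero when nothing matches (e.g. an idle node), hence the "|| true".
zfs_pids=$(pgrep -f 'zfs send' || true)
ssh_pids=$(pgrep -f 'ssh .*zfs' || true)
echo "zfs send PIDs: ${zfs_pids:-none}"
echo "ssh tunnel PIDs: ${ssh_pids:-none}"
# If "kill <zfs-pid>" hangs, kill the ssh tunnel PID instead:
# kill <ssh-pid>
```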
Change the VM ID range and the next free ID for new VMs.
Datacenter → Options → Next Free VMID Range
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_next_id_range
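The same setting can also go into the cluster config file directly; a sketch assuming a PVE version where the next-id option exists (the range values are examples):

```
# /etc/pve/datacenter.cfg (fragment), equivalent to the GUI option
next-id: lower=200,upper=299
```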
Helpers and scripts
https://github.com/tteck/Proxmox
Host key verification failed
If migration fails, check that ssh_known_hosts is correct, and restart the pveproxy service.
A stale host key can also cause ZFS replication errors.
https://forum.proxmox.com/threads/host-key-verification-failed-when-migrate.41666/
FS freeze thaw
# https://pve.proxmox.com/wiki/Qemu-guest-agent
qm guest cmd VMID fsfreeze-status
qm guest cmd VMID fsfreeze-freeze
qm guest cmd VMID fsfreeze-thaw
RAM balloon
If RAM usage on the host is below 80%, Proxmox will dynamically add memory to the guest, up to the maximum memory configured for the VM; when the host runs short, memory is reclaimed from the guests again. Requires the balloon driver in the guest.
https://pve.proxmox.com/pve-docs/chapter-qm.html#qm_memory
https://forum.proxmox.com/threads/ballooning-issues.69048/#post-309631
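For reference, the relevant VM config keys (a sketch; the VMID and sizes are examples — balloon is the minimum the guest can be shrunk to, memory the maximum):

```
# /etc/pve/qemu-server/<VMID>.conf (fragment)
memory: 8192
balloon: 2048
# balloon: 0 disables ballooning entirely
```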
GPU passthrough
https://pve.proxmox.com/wiki/PCI_Passthrough
https://bobcares.com/blog/proxmox-gpu-passthrough/
https://forum.proxmox.com/threads/audio-is-same-iommu-group-as-ethernet-how-pass-through.128404/
https://forum.proxmox.com/threads/proxmox-8-1-3-gpu-passthrough-soooo-close.139341/
# nano /etc/default/grub
from
GRUB_CMDLINE_LINUX_DEFAULT="quiet"
to
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on"
or
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on"
# Intel VT-d (or AMD-Vi on AMD) must be enabled in the BIOS/UEFI
# verify that the IOMMU is enabled via dmesg
dmesg | grep -i 'IOMMU enabled'
# check the current kernel command line for the IOMMU flag
cat /proc/cmdline
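A small check that greps the running kernel's command line for either vendor's flag (runs anywhere; on a host without the flag it just reports missing):

```shell
# Look for intel_iommu=on or amd_iommu=on on the running kernel cmdline.
if grep -qE 'intel_iommu=on|amd_iommu=on' /proc/cmdline; then
  iommu_flag=present
else
  iommu_flag=missing
fi
echo "IOMMU kernel flag: $iommu_flag"
```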
# add intel_iommu to the kernel command line without modifying /etc/default/grub directly
echo 'GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT intel_iommu=on"' > /etc/default/grub.d/iommu.cfg
update-grub
echo -e 'vfio\nvfio_iommu_type1\nvfio_pci\nvfio_virqfd' > /etc/modules-load.d/vfio.conf
# note: on kernel 6.2+ (PVE 8) vfio_virqfd was merged into vfio and can be dropped
echo "options vfio_iommu_type1 allow_unsafe_interrupts=1" > /etc/modprobe.d/iommu_unsafe_interrupts.conf
echo "blacklist radeon" > /etc/modprobe.d/blacklist_gpu.conf
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist_gpu.conf
echo "blacklist nvidia" >> /etc/modprobe.d/blacklist_gpu.conf
# check vendor:device ID pairs for the VGA and audio functions
lspci | grep -i nvidia
lspci -n -s 01:00
# 10de:1cb3
# 10de:0fb9
echo "options vfio-pci ids=10de:1cb3,10de:0fb9 disable_vga=1" > /etc/modprobe.d/vfio.conf
update-initramfs -u
reboot
# TODO check if it's necessary
echo "options kvm ignore_msrs=1" > /etc/modprobe.d/kvm.conf
root@mango:~# lspci | grep -i nvidia
01:00.0 VGA compatible controller: NVIDIA Corporation GP107GL [Quadro P400] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GP107GL High Definition Audio Controller (rev a1)
root@mango:~# lspci -n -s 01:00
01:00.0 0300: 10de:1cb3 (rev a1)
01:00.1 0403: 10de:0fb9 (rev a1)
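The ids= line for vfio.conf can also be derived from the lspci output instead of copying the pairs by hand; a sketch using the sample output captured above (on the real host, replace the here-string with the output of `lspci -n -s 01:00`):

```shell
# Sample "lspci -n -s 01:00" output, as captured above:
lspci_out='01:00.0 0300: 10de:1cb3 (rev a1)
01:00.1 0403: 10de:0fb9 (rev a1)'
# Field 3 is the vendor:device pair; join all functions with commas.
ids=$(printf '%s\n' "$lspci_out" | awk '{print $3}' | paste -sd, -)
echo "options vfio-pci ids=$ids disable_vga=1"
```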
Debian 12 Bookworm VM with Nvidia GPU for Docker
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt-get update
apt-get install -y nvidia-container-toolkit
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker
# test if it works
docker run --rm --gpus all nvidia/cuda:11.5.2-base-ubuntu20.04 nvidia-smi
# enable non-free components in /etc/apt/sources.list.d/debian.sources:
# Components: main contrib non-free non-free-firmware
apt update
# https://old.reddit.com/r/debian/comments/17vdj3c/install_nvidia_headless_driver/
apt install --no-install-recommends firmware-misc-nonfree nvidia-smi nvidia-driver nvidia-kernel-dkms linux-headers-amd64 libcuda1 nvidia-persistenced
# the Nvidia container toolkit (installed above) is also required
# apt install nvidia-cuda-dev nvidia-cuda-toolkit
Ubuntu 22.04 VM with Nvidia GPU for Docker
# install a clean Ubuntu Server 22.04, log in as root and upgrade it to the latest version
apt-get update
apt-get dist-upgrade
# blacklist the default nouveau GPU driver (check loaded modules first with lsmod)
echo "blacklist nouveau" > /etc/modprobe.d/blacklist-nouveau.conf
echo "options nouveau modeset=0" >> /etc/modprobe.d/blacklist-nouveau.conf
# search for latest proprietary Nvidia GPU driver in Ubuntu repo and install it
apt search nvidia-driver-5 | grep 'nvidia-driver-.*-server'
apt-get install -y nvidia-driver-535-server
update-initramfs -u
reboot
# check GPU status after rebooting
nvidia-smi
# install standard Docker
curl https://get.docker.com | sh && sudo systemctl --now enable docker
# install Docker with Nvidia GPU support from the official repo
# (apt-key is deprecated on newer releases; the signed-by keyring method
#  from the Debian section above also works here)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/ubuntu22.04/nvidia-docker.list > /etc/apt/sources.list.d/nvidia-docker.list
apt update
apt -y install nvidia-container-toolkit
systemctl restart docker
# test if GPU is visible and working from inside of container
docker run --rm --gpus all nvidia/cuda:11.5.2-base-ubuntu20.04 nvidia-smi
# deploy ready-to-use containerized Stable Diffusion with an easy-to-use Web GUI
git clone https://github.com/AbdBarho/stable-diffusion-webui-docker
cd stable-diffusion-webui-docker/
docker compose --profile download up --build
docker compose --profile auto up -d --build
#
# visit http://IP:7860 of your Docker host for WebGUI
#
# Extras
# limit power draw to reduce wear on the card
# the setting does not persist across reboots; wrap it in a systemd service if needed
nvidia-smi -pl 180
nvidia-smi -q -d POWER
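A sketch of such a oneshot unit (the file name /etc/systemd/system/nvidia-power-limit.service is my own choice; 180 W matches the example above and must be within the card's supported range, and the nvidia-smi path is assumed to be /usr/bin):

```
[Unit]
Description=Set NVIDIA GPU power limit at boot
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pl 180

[Install]
WantedBy=multi-user.target
```

Enable it with systemctl daemon-reload && systemctl enable --now nvidia-power-limit.service.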
VM io-error
If you are getting a yellow triangle and “io-error” for a VM in Proxmox, it’s probably just a full disk.
In my case I took some snapshots of a VM before an upgrade and didn’t notice that ZFS replication would fill up the disk of my smallest node.
Source: https://forum.proxmox.com/threads/crash-vm-with-io-error.118692/
Misc
Migrate to remote cluster
https://forum.proxmox.com/threads/sync-zfs-to-remote-zfs-datastore.120424/post-561444