Proxmox
Proxmox home cluster
Proxmox Backup Server
ZFS stuck?
If terminating a stuck zfs send process doesn’t work, kill the SSH tunnel process it is piped through instead.
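One way to do this (a sketch; the PID is a placeholder you take from the process list):

```shell
# find the stuck zfs send and the ssh tunnel it is piped through
ps aux | grep -E 'zfs send|ssh' | grep -v grep
# kill the ssh tunnel first; zfs send usually exits once its pipe breaks
kill <ssh_pid>
# escalate only if it refuses to die
kill -9 <ssh_pid>
```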
Change the VM ID range used when suggesting the next free VMID:
Datacenter → Options → Next Free VMID Range
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_next_id_range
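The same can be done from the CLI (a sketch based on the admin guide linked above; the range values are examples, and the next-id option needs PVE 7.2+):

```shell
# ask the cluster for the next free VMID
pvesh get /cluster/nextid
# set the VMID range used for suggestions
pvesh set /cluster/options --next-id lower=200,upper=999
```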
Helpers and scripts
https://github.com/tteck/Proxmox
Host key verification failed
If migration fails, check that ssh_known_hosts is OK, and restart the pveproxy service.
Stale host keys can also cause ZFS replication errors.
https://forum.proxmox.com/threads/host-key-verification-failed-when-migrate.41666/
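A typical fix sequence (hedged: `<target-node>` is a placeholder for the node whose key went stale):

```shell
# refresh cluster-wide SSH keys and certificates
pvecm updatecerts
# or remove the stale key for one node manually
ssh-keygen -R <target-node> -f /etc/pve/priv/known_hosts
systemctl restart pveproxy
```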
FS freeze thaw
# https://pve.proxmox.com/wiki/Qemu-guest-agent
qm guest cmd VMID fsfreeze-status
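The freeze/thaw commands themselves, for reference (VMID is a placeholder; the guest needs qemu-guest-agent installed and enabled):

```shell
qm guest cmd VMID fsfreeze-freeze   # freeze guest filesystems
qm guest cmd VMID fsfreeze-thaw     # thaw them again
```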
RAM balloon
If RAM usage on the host is below 80%, Proxmox will dynamically add memory to the guest, up to the maximum memory specified.
https://pve.proxmox.com/pve-docs/chapter-qm.html#qm_memory
https://forum.proxmox.com/threads/ballooning-issues.69048/#post-309631
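Sketch of the relevant VM memory settings (values and VMID are examples):

```shell
# give the VM up to 8 GiB, let ballooning shrink it down to 2 GiB
qm set VMID --memory 8192 --balloon 2048
# disable ballooning entirely (fixed memory)
qm set VMID --balloon 0
```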
GPU passthrough
https://pve.proxmox.com/wiki/PCI_Passthrough
https://bobcares.com/blog/proxmox-gpu-passthrough/
https://forum.proxmox.com/threads/audio-is-same-iommu-group-as-ethernet-how-pass-through.128404/
https://forum.proxmox.com/threads/proxmox-8-1-3-gpu-passthrough-soooo-close.139341/
# nano /etc/default/grub
from
GRUB_CMDLINE_LINUX_DEFAULT="quiet"
to
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on"
or
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on"
# Intel VT-d in BIOS should be enabled
# IOMMU should be enabled looking at dmesg
dmesg | grep -i 'IOMMU enabled'
# check current kernel launch cmd if iommu enabled
cat /proc/cmdline
# add intel_iommu to the kernel cmdline without modification of /etc/default/grub
echo 'GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT intel_iommu=on"' > /etc/default/grub.d/iommu.cfg
update-grub
# note: on kernel 6.2+ (Proxmox VE 8) vfio_virqfd is built into vfio and no longer exists as a separate module
echo -e 'vfio\nvfio_iommu_type1\nvfio_pci\nvfio_virqfd' > /etc/modules-load.d/vfio.conf
echo "options vfio_iommu_type1 allow_unsafe_interrupts=1" > /etc/modprobe.d/iommu_unsafe_interrupts.conf
echo "blacklist radeon" > /etc/modprobe.d/blacklist_gpu.conf
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist_gpu.conf
echo "blacklist nvidia" >> /etc/modprobe.d/blacklist_gpu.conf
# check vendor id codes for vga and audio
lspci | grep -i nvidia
lspci -n -s 01:00
# 10de:1cb3
# 10de:0fb9
echo "options vfio-pci ids=10de:1cb3,10de:0fb9 disable_vga=1" > /etc/modprobe.d/vfio.conf
update-initramfs -u
reboot
# TODO check if it's necessary
echo "options kvm ignore_msrs=1" > /etc/modprobe.d/kvm.conf
root@mango:~# lspci | grep -i nvidia
01:00.0 VGA compatible controller: NVIDIA Corporation GP107GL [Quadro P400] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GP107GL High Definition Audio Controller (rev a1)
root@mango:~# lspci -n -s 01:00
01:00.0 0300: 10de:1cb3 (rev a1)
01:00.1 0403: 10de:0fb9 (rev a1)
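To check that the GPU and its audio function are not grouped with unrelated devices (the problem discussed in the forum threads above), a small loop over sysfs:

```shell
#!/bin/sh
# list every PCI device per IOMMU group
for g in /sys/kernel/iommu_groups/*; do
    echo "IOMMU group ${g##*/}:"
    for d in "$g"/devices/*; do
        lspci -nns "${d##*/}"
    done
done
```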
Debian 12 bookworm VM with nvidia GPU for Docker
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt-get update
apt-get install -y nvidia-container-toolkit
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker
# test if it works
docker run --rm --gpus all nvidia/cuda:11.5.2-base-ubuntu20.04 nvidia-smi
# /etc/apt/sources.list.d/debian.sources
# Components: main contrib non-free non-free-firmware
apt update
# https://old.reddit.com/r/debian/comments/17vdj3c/install_nvidia_headless_driver/
apt install --no-install-recommends firmware-misc-nonfree nvidia-smi nvidia-driver nvidia-kernel-dkms linux-headers-amd64 libcuda1 nvidia-persistenced
# probably we need nvidia container toolkit too
# apt install nvidia-cuda-dev nvidia-cuda-toolkit
Ubuntu 22.04 VM with Nvidia GPU for Docker
# install a clean Ubuntu Server 22.04, log in as root and upgrade it to the latest version
apt-get update
apt-get dist-upgrade
# blacklist the default nouveau GPU driver (check with lsmod which driver is in use first)
echo "blacklist nouveau" > /etc/modprobe.d/blacklist-nouveau.conf
echo "options nouveau modeset=0" >> /etc/modprobe.d/blacklist-nouveau.conf
# search for latest proprietary Nvidia GPU driver in Ubuntu repo and install it
apt search nvidia-driver-5 | grep 'nvidia-driver-.*-server'
apt-get install -y nvidia-driver-535-server
update-initramfs -u
reboot
# check GPU status after rebooting
nvidia-smi
# install standard Docker
curl https://get.docker.com | sh && sudo systemctl --now enable docker
# install Docker with Nvidia GPU support from the official repo
# note: apt-key is deprecated; the signed-by keyring method from the Debian section also works here
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/ubuntu22.04/nvidia-docker.list > /etc/apt/sources.list.d/nvidia-docker.list
apt update
apt -y install nvidia-container-toolkit
systemctl restart docker
# test if the GPU is visible and working from inside a container
docker run --rm --gpus all nvidia/cuda:11.5.2-base-ubuntu20.04 nvidia-smi
# deploy a ready-to-use containerized Stable Diffusion with an easy-to-use Web GUI
git clone https://github.com/AbdBarho/stable-diffusion-webui-docker
cd stable-diffusion-webui-docker/
docker compose --profile download up --build
docker compose --profile auto up -d --build
#
# visit http://IP:7860 of your Docker host for WebGUI
#
# Extras
# limit power draw to reduce wear on the card
# the setting is not persistent; re-apply it after boot via a systemd service if needed
nvidia-smi -pl 180
nvidia-smi -q -d POWER
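A minimal unit for that (a sketch; the unit name and the 180 W limit are examples, adjust for your card):

```ini
# /etc/systemd/system/nvidia-power-limit.service
[Unit]
Description=Limit NVIDIA GPU power draw
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pl 180

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable --now nvidia-power-limit.service`.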
VM io-error
If you see a yellow triangle and “io-error” on a VM in Proxmox, the disk is probably just full.
In my case I made some VM snapshots before an upgrade and didn’t notice that ZFS replication would fill up the disk of my smallest node.
Source: https://forum.proxmox.com/threads/crash-vm-with-io-error.118692/
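Quick checks for that situation (the pool name rpool is an example):

```shell
# pool-level usage; a ZFS pool near 100% full causes VM io-errors
zpool list
# per-dataset usage including space held by snapshots
zfs list -o space -r rpool
# list snapshots sorted by size
zfs list -t snapshot -o name,used -s used
```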
Misc
Migrate to remote cluster
https://forum.proxmox.com/threads/sync-zfs-to-remote-zfs-datastore.120424/post-561444