nvidia-container-cli: ldcache error: process /sbin/ldconfig.real failed with error code: 1: unknown #147
OS is Ubuntu 18.04, the GPU is an NVIDIA GeForce RTX 3090, and the NVIDIA driver version is 470.82.01.
> @Hurricane-eye what is your host configuration (i.e. distribution and version)?
Maybe there are too many driver versions on my server? Output of `dpkg -l | grep nvidia`:
ii libnvidia-cfg1-470-server:amd64 470.103.01-0ubuntu0.18.04.1 amd64 NVIDIA binary OpenGL/GLX configuration library
ii libnvidia-common-470-server 470.103.01-0ubuntu0.18.04.1 all Shared files used by the NVIDIA libraries
rc libnvidia-compute-470:amd64 470.86-0ubuntu0.18.04.1 amd64 NVIDIA libcompute package
ii libnvidia-compute-470-server:amd64 470.103.01-0ubuntu0.18.04.1 amd64 NVIDIA libcompute package
ii libnvidia-container-tools 1.8.1-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.8.1-1 amd64 NVIDIA container runtime library
ii libnvidia-decode-470-server:amd64 470.103.01-0ubuntu0.18.04.1 amd64 NVIDIA Video Decoding runtime libraries
ii libnvidia-encode-470-server:amd64 470.103.01-0ubuntu0.18.04.1 amd64 NVENC Video Encoding runtime library
ii libnvidia-extra-470-server:amd64 470.103.01-0ubuntu0.18.04.1 amd64 Extra libraries for the NVIDIA Server Driver
ii libnvidia-fbc1-470-server:amd64 470.103.01-0ubuntu0.18.04.1 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-gl-470-server:amd64 470.103.01-0ubuntu0.18.04.1 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii libnvidia-ifr1-470-server:amd64 470.103.01-0ubuntu0.18.04.1 amd64 NVIDIA OpenGL-based Inband Frame Readback runtime library
rc nvidia-compute-utils-470 470.86-0ubuntu0.18.04.1 amd64 NVIDIA compute utilities
ii nvidia-compute-utils-470-server 470.103.01-0ubuntu0.18.04.1 amd64 NVIDIA compute utilities
ii nvidia-container-toolkit 1.8.1-1 amd64 NVIDIA container runtime hook
rc nvidia-dkms-470 470.86-0ubuntu0.18.04.1 amd64 NVIDIA DKMS package
ii nvidia-dkms-470-server 470.103.01-0ubuntu0.18.04.1 amd64 NVIDIA DKMS package
ii nvidia-docker2 2.9.1-1 all nvidia-docker CLI wrapper
ii nvidia-driver-470-server 470.103.01-0ubuntu0.18.04.1 amd64 NVIDIA Server Driver metapackage
rc nvidia-kernel-common-470 470.86-0ubuntu0.18.04.1 amd64 Shared files used with the kernel module
ii nvidia-kernel-common-470-server 470.103.01-0ubuntu0.18.04.1 amd64 Shared files used with the kernel module
ii nvidia-kernel-source-470-server 470.103.01-0ubuntu0.18.04.1 amd64 NVIDIA kernel source package
ii nvidia-prime 0.8.16~0.18.04.1 all Tools to enable NVIDIA's Prime
ii nvidia-settings 470.57.01-0ubuntu0.18.04.1 amd64 Tool for configuring the NVIDIA graphics driver
ii nvidia-utils-470-server 470.103.01-0ubuntu0.18.04.1 amd64 NVIDIA Server Driver support binaries
ii xserver-xorg-video-nvidia-470-server 470.103.01-0ubuntu0.18.04.1 amd64 NVIDIA binary Xorg driver
Hi! Have you found a solution for this yet?
I created the symlink manually; I had to do this inside the kind node:
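A minimal sketch of that symlink workaround, assuming the kind node container is named `kind-control-plane` (adjust to your node's actual name):

```bash
# Assumption: the kind node container is named "kind-control-plane"; adjust to your cluster.
# Create the path the toolkit expects, pointing at the node's real ldconfig.
docker exec kind-control-plane ln -s /sbin/ldconfig /sbin/ldconfig.real
```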
What do you mean by validator? Note that on Ubuntu-based distributions, where /sbin/ldconfig.real does exist, this works as expected. The next release of the NVIDIA Container Toolkit should allow these options to be detected in a more stable manner, ensuring that the correct ldconfig path is used on the host.
@elezar Thank you for the information on this. This should help me get to the bottom of this.
I am installing the NVIDIA GPU Operator on Kind. I was looking at some options to get GPUs working with my cluster. The operator's validator pod logs show a failed symlink attempt. Pod status shows the error as well.
Creating a symlink "fixed" the error, but there is obviously more to it than that. Maybe there is an option in the NVIDIA Toolkit that will resolve this.
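For anyone checking the same thing, a rough way to pull the validator logs and pod status, assuming the operator's default `gpu-operator` namespace and the stock `nvidia-operator-validator` labels (adjust if your install differs):

```bash
# Assumes the default "gpu-operator" namespace and the standard validator labels.
kubectl -n gpu-operator get pods -l app=nvidia-operator-validator
kubectl -n gpu-operator logs -l app=nvidia-operator-validator --all-containers
kubectl -n gpu-operator describe pods -l app=nvidia-operator-validator
```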
I'm using version v1.14.3 and experiencing the same issue as reported by others.
From https://github.com/NVIDIA/nvidia-container-toolkit/blob/v1.14.3/internal/config/config.go#L124-L129:

func getLdConfigPath() string {
	if _, err := os.Stat("/sbin/ldconfig.real"); err == nil {
		return "@/sbin/ldconfig.real"
	}
	return "@/sbin/ldconfig"
}

If I ssh into the node and check for the existence of /sbin/ldconfig.real:

stat /sbin/ldconfig.real
stat: cannot statx '/sbin/ldconfig.real': No such file or directory

But when looking at the file:

# ...
[nvidia-container-cli]
environment = []
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
root = "/"
# ...

It seems like the getLdConfigPath function is evaluated somewhere /sbin/ldconfig.real does exist (inside the container image rather than on the host). The only solution I have found to fix this problem is creating a symlink, as stated by others.

@elezar, is there another way to configure the ldconfig path?
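To illustrate the mismatch, here is a rough way to reproduce the same check on the host and inspect the value that ended up in the config; the config path below is an assumption based on the `path` entry above, so adjust it to wherever your config.toml actually lives:

```bash
# Mimic the getLdConfigPath() check, but on the host instead of inside the toolkit image.
if [ -e /sbin/ldconfig.real ]; then echo "@/sbin/ldconfig.real"; else echo "@/sbin/ldconfig"; fi

# Assumed location of the operator-managed config; adjust if yours differs.
grep ldconfig /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
```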
@cmontemuino are you also trying to run the GPU Operator in Kind? If not, what is your host OS on the node where the NVIDIA Container Toolkit is being configured? There may be an issue with how we're generating the config, especially in the context of the GPU Operator, where we are detecting ldconfig.real in the Ubuntu-based container instead of on the host. Note that deleting (or commenting) that option from the config should cause the right value to be detected when running the NVIDIA Container Runtime from the host.
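A minimal sketch of commenting the option out, assuming the standard /etc/nvidia-container-runtime/config.toml location (back the file up first):

```bash
# Assumption: config.toml lives at the default location; adjust the path if needed.
CONFIG=/etc/nvidia-container-runtime/config.toml
sudo cp "$CONFIG" "$CONFIG.bak"
# Comment out the hard-coded ldconfig entry so the runtime auto-detects the path on the host.
sudo sed -i 's|^\([[:space:]]*\)ldconfig = |\1#ldconfig = |' "$CONFIG"
```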
Hi @elezar, this is not KinD, but Oracle Linux.

uname -r
5.14.0-284.30.1.el9_2.x86_64

We install Kubernetes (rancher/rke2) + the NVIDIA driver only. Then the GPU Operator as an Argo CD application.
@cmontemuino other posters here have pointed out that they were using Kind. The symptom is the same though. Any host OS where `/sbin/ldconfig.real` does not exist will show this behavior when using the default ubuntu-based base image. We should definitely make this more resilient, but for now you could consider switching to the `container-toolkit:{{VERSION}}-ubi8` image as a workaround.
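If the toolkit is deployed through the GPU Operator Helm chart, that switch would presumably look something like the following; the value name `toolkit.version` and the exact tag are assumptions, so check the chart version you are using:

```bash
# Assumption: GPU Operator installed via Helm; replace the tag with the ubi8 build of your toolkit version.
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set toolkit.version=v1.14.3-ubi8
```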
Just wanted to pop in to say that `/sbin/ldconfig.real` doesn't exist on Debian 12 either. I have to symlink it for the gpu stuff to work properly.
Yes, most (if not all) non-Ubuntu distributions don't have the `/sbin/ldconfig.real` file.
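On such hosts, the workaround people keep landing on boils down to a single symlink; a sketch, assuming /sbin/ldconfig exists on the host:

```bash
# Point the path the toolkit expects at the distribution's actual ldconfig binary.
sudo ln -s /sbin/ldconfig /sbin/ldconfig.real
```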
Just wanted to comment that I've been fighting all week to get a GPU working in my k3s cluster. The piece that made the entire thing come together was the missing symlink.
Thank you!!! My setup is as follows:
OS: Fedora Linux 38 (Thirty Eight)
@llajas which version of the NVIDIA Container Toolkit are you using?
[root@metal6 ~]# nvidia-ctk --version
NVIDIA Container Toolkit CLI version 1.15.0-rc.3
commit: 93e15bc641896a9dc51f297c856c824bf1f45d86

I installed this using
Good point! I'm on RHEL 8.9, which is supported, and I'm having the same issue. It was fixed by creating the symlink manually.
I have also been having this problem on Rocky 9.3 using MicroK8s. The symlink workaround fixes it.
1. Issue or feature description
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: ldcache error: process /sbin/ldconfig.real failed with error code: 1: unknown.
2. Steps to reproduce the issue
3. Information to attach (optional if deemed irrelevant)
nvidia-container-cli -k -d /dev/tty info
uname -a
dmesg
nvidia-smi -a
docker version
dpkg -l '*nvidia*'
or rpm -qa '*nvidia*'
nvidia-container-cli -V