This repository was archived by the owner on Jan 22, 2024. It is now read-only.

Docker error #1424

Closed
stephcar75020 opened this issue Nov 28, 2020 · 9 comments

Comments

@stephcar75020

Hello, following the installation described I'm experiencing this error.
If someone has any clue, I would appreciate it!

1. Issue or feature description

docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused "process_linux.go:432: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: ldcache error: process /sbin/ldconfig failed with error code: 127\\n\""": unknown.

2. Steps to reproduce the issue

docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

  • Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info

root@NAS:~# nvidia-container-cli -k -d /dev/tty info
-- WARNING, the following logs are for debugging purposes only --
I1128 11:14:19.145391 8012 nvc.c:282] initializing library context (version=1.3.0, build=16315ebdf4b9728e899f615e208b50c41d7a5d15)
I1128 11:14:19.145543 8012 nvc.c:256] using root /
I1128 11:14:19.145564 8012 nvc.c:257] using ldcache /etc/ld.so.cache
I1128 11:14:19.145577 8012 nvc.c:258] using unprivileged user 65534:65534
I1128 11:14:19.145618 8012 nvc.c:299] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I1128 11:14:19.145925 8012 nvc.c:301] dxcore initialization failed, continuing assuming a non-WSL environment
I1128 11:14:19.149435 8013 nvc.c:192] loading kernel module nvidia
I1128 11:14:19.149919 8013 nvc.c:204] loading kernel module nvidia_uvm
I1128 11:14:19.150165 8013 nvc.c:212] loading kernel module nvidia_modeset
I1128 11:14:19.150806 8014 driver.c:101] starting driver service
I1128 11:14:19.154868 8012 nvc_info.c:680] requesting driver information with ''
I1128 11:14:19.157758 8012 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.450.80.02
I1128 11:14:19.158007 8012 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.450.80.02
I1128 11:14:19.158136 8012 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.450.80.02
I1128 11:14:19.158251 8012 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.450.80.02
I1128 11:14:19.158361 8012 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.450.80.02
I1128 11:14:19.158510 8012 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.450.80.02
I1128 11:14:19.158665 8012 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.450.80.02
I1128 11:14:19.158770 8012 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.450.80.02
I1128 11:14:19.158874 8012 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.450.80.02
I1128 11:14:19.159024 8012 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.450.80.02
I1128 11:14:19.159190 8012 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.450.80.02
I1128 11:14:19.159291 8012 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.450.80.02
I1128 11:14:19.159391 8012 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.450.80.02
I1128 11:14:19.159491 8012 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.450.80.02
I1128 11:14:19.159646 8012 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.450.80.02
I1128 11:14:19.159794 8012 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.450.80.02
I1128 11:14:19.159897 8012 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.450.80.02
I1128 11:14:19.160001 8012 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.450.80.02
I1128 11:14:19.160158 8012 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cbl.so.450.80.02
I1128 11:14:19.160261 8012 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.450.80.02
I1128 11:14:19.160412 8012 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.450.80.02
I1128 11:14:19.160775 8012 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.450.80.02
I1128 11:14:19.161000 8012 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.450.80.02
I1128 11:14:19.161106 8012 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.450.80.02
I1128 11:14:19.161210 8012 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.450.80.02
I1128 11:14:19.161320 8012 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.450.80.02
W1128 11:14:19.161375 8012 nvc_info.c:350] missing library libnvidia-fatbinaryloader.so
W1128 11:14:19.161396 8012 nvc_info.c:354] missing compat32 library libnvidia-ml.so
W1128 11:14:19.161412 8012 nvc_info.c:354] missing compat32 library libnvidia-cfg.so
W1128 11:14:19.161432 8012 nvc_info.c:354] missing compat32 library libcuda.so
W1128 11:14:19.161447 8012 nvc_info.c:354] missing compat32 library libnvidia-opencl.so
W1128 11:14:19.161463 8012 nvc_info.c:354] missing compat32 library libnvidia-ptxjitcompiler.so
W1128 11:14:19.161479 8012 nvc_info.c:354] missing compat32 library libnvidia-fatbinaryloader.so
W1128 11:14:19.161495 8012 nvc_info.c:354] missing compat32 library libnvidia-allocator.so
W1128 11:14:19.161512 8012 nvc_info.c:354] missing compat32 library libnvidia-compiler.so
W1128 11:14:19.161528 8012 nvc_info.c:354] missing compat32 library libnvidia-ngx.so
W1128 11:14:19.161544 8012 nvc_info.c:354] missing compat32 library libvdpau_nvidia.so
W1128 11:14:19.161560 8012 nvc_info.c:354] missing compat32 library libnvidia-encode.so
W1128 11:14:19.161576 8012 nvc_info.c:354] missing compat32 library libnvidia-opticalflow.so
W1128 11:14:19.161591 8012 nvc_info.c:354] missing compat32 library libnvcuvid.so
W1128 11:14:19.161607 8012 nvc_info.c:354] missing compat32 library libnvidia-eglcore.so
W1128 11:14:19.161623 8012 nvc_info.c:354] missing compat32 library libnvidia-glcore.so
W1128 11:14:19.161695 8012 nvc_info.c:354] missing compat32 library libnvidia-tls.so
W1128 11:14:19.161711 8012 nvc_info.c:354] missing compat32 library libnvidia-glsi.so
W1128 11:14:19.161728 8012 nvc_info.c:354] missing compat32 library libnvidia-fbc.so
W1128 11:14:19.161744 8012 nvc_info.c:354] missing compat32 library libnvidia-ifr.so
W1128 11:14:19.161761 8012 nvc_info.c:354] missing compat32 library libnvidia-rtcore.so
W1128 11:14:19.161780 8012 nvc_info.c:354] missing compat32 library libnvoptix.so
W1128 11:14:19.161797 8012 nvc_info.c:354] missing compat32 library libGLX_nvidia.so
W1128 11:14:19.161815 8012 nvc_info.c:354] missing compat32 library libEGL_nvidia.so
W1128 11:14:19.161834 8012 nvc_info.c:354] missing compat32 library libGLESv2_nvidia.so
W1128 11:14:19.161851 8012 nvc_info.c:354] missing compat32 library libGLESv1_CM_nvidia.so
W1128 11:14:19.161865 8012 nvc_info.c:354] missing compat32 library libnvidia-glvkspirv.so
W1128 11:14:19.161883 8012 nvc_info.c:354] missing compat32 library libnvidia-cbl.so
I1128 11:14:19.162493 8012 nvc_info.c:276] selecting /usr/bin/nvidia-smi
I1128 11:14:19.162554 8012 nvc_info.c:276] selecting /usr/bin/nvidia-debugdump
I1128 11:14:19.162612 8012 nvc_info.c:276] selecting /usr/bin/nvidia-persistenced
I1128 11:14:19.162666 8012 nvc_info.c:276] selecting /usr/bin/nvidia-cuda-mps-control
I1128 11:14:19.162721 8012 nvc_info.c:276] selecting /usr/bin/nvidia-cuda-mps-server
I1128 11:14:19.162790 8012 nvc_info.c:438] listing device /dev/nvidiactl
I1128 11:14:19.162808 8012 nvc_info.c:438] listing device /dev/nvidia-uvm
I1128 11:14:19.162825 8012 nvc_info.c:438] listing device /dev/nvidia-uvm-tools
I1128 11:14:19.162842 8012 nvc_info.c:438] listing device /dev/nvidia-modeset
W1128 11:14:19.162909 8012 nvc_info.c:321] missing ipc /var/run/nvidia-persistenced/socket
W1128 11:14:19.162961 8012 nvc_info.c:321] missing ipc /tmp/nvidia-mps
I1128 11:14:19.162981 8012 nvc_info.c:745] requesting device information with ''
I1128 11:14:19.171184 8012 nvc_info.c:628] listing device /dev/nvidia0 (GPU-37e266f6-df13-0be2-c23e-db4912d5ac36 at 00000000:01:00.0)
NVRM version: 450.80.02
CUDA version: 11.0

Device Index: 0
Device Minor: 0
Model: GeForce GTX 1660
Brand: GeForce
GPU UUID: GPU-37e266f6-df13-0be2-c23e-db4912d5ac36
Bus Location: 00000000:01:00.0
Architecture: 7.5
I1128 11:14:19.171290 8012 nvc.c:337] shutting down library context
I1128 11:14:19.172039 8014 driver.c:156] terminating driver service
I1128 11:14:19.172544 8012 driver.c:196] driver service terminated successfully

  • Kernel version from uname -a

root@NAS:~# uname -a
Linux NAS.WORKGROUP 5.8.0-0.bpo.2-amd64 #1 SMP Debian 5.8.10-1~bpo10+1 (2020-09-26) x86_64 GNU/Linux

  • Any relevant kernel output lines from dmesg

  • Driver information from nvidia-smi -a

root@NAS:~# nvidia-smi -a
==============NVSMI LOG==============
Timestamp : Sat Nov 28 12:29:24 2020
Driver Version : 450.80.02
CUDA Version : 11.0
Attached GPUs : 1
GPU 00000000:01:00.0
Product Name : GeForce GTX 1660
Product Brand : GeForce
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-37e266f6-df13-0be2-c23e-db4912d5ac36
Minor Number : 0
VBIOS Version : 90.16.34.00.22
MultiGPU Board : No
Board ID : 0x100
GPU Part Number : N/A
Inforom Version
Image Version : G001.0000.02.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x218410DE
Bus Id : 00000000:01:00.0
Sub System Id : 0x218410DE
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 29 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 5944 MiB
Used : 3 MiB
Free : 5941 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 3 MiB
Free : 253 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : N/A
Temperature
GPU Current Temp : 33 C
GPU Shutdown Temp : 96 C
GPU Slowdown Temp : 93 C
GPU Max Operating Temp : 91 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 9.28 W
Power Limit : 130.00 W
Default Power Limit : 130.00 W
Enforced Power Limit : 130.00 W
Min Power Limit : 70.00 W
Max Power Limit : 140.00 W
Clocks
Graphics : 300 MHz
SM : 300 MHz
Memory : 405 MHz
Video : 540 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 2145 MHz
SM : 2145 MHz
Memory : 4001 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None

  • Docker version from docker version

docker version
Client: Docker Engine - Community
Version: 19.03.13
API version: 1.40
Go version: go1.13.15
Git commit: 4484c46d9d
Built: Wed Sep 16 17:02:55 2020
OS/Arch: linux/amd64
Experimental: false

Server: Docker Engine - Community
Engine:
Version: 19.03.13
API version: 1.40 (minimum version 1.12)
Go version: go1.13.15
Git commit: 4484c46d9d
Built: Wed Sep 16 17:01:25 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.3.7
GitCommit: 8fba4e9a7d01810a393d5d25a3621dc101981175
nvidia:
Version: 1.0.0-rc10
GitCommit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
docker-init:
Version: 0.18.0
GitCommit: fec3683

  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'

root@NAS:~# dpkg -l '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-=============================-============-============-=====================================================
ii libnvidia-container-tools 1.3.0-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.3.0-1 amd64 NVIDIA container runtime library
ii nvidia-container-runtime 3.4.0-1 amd64 NVIDIA container runtime
un nvidia-container-runtime-hook (no description available)
ii nvidia-container-toolkit 1.3.0-1 amd64 NVIDIA container runtime hook
ii nvidia-detect 418.152.00-1 amd64 NVIDIA GPU detection utility
un nvidia-docker (no description available)
ii nvidia-docker2 2.5.0-1 all nvidia-docker CLI wrapper

  • NVIDIA container library version from nvidia-container-cli -V

root@NAS:~# nvidia-container-cli -V
version: 1.3.0
build date: 2020-09-16T12:33+00:00
build revision: 16315ebdf4b9728e899f615e208b50c41d7a5d15
build compiler: x86_64-linux-gnu-gcc-8 8.3.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

  • NVIDIA container library logs (see troubleshooting)
  • Docker command, image and tag used
@klueska
Contributor

klueska commented Dec 1, 2020

This seems related to this (which I am unable to reproduce):
#1399

Can you give me more info on your system so I can see if I can reproduce the error? Without a way to reproduce it I will never be able to make progress on it.

@stephcar75020
Author

Hello klueska,
Here is some information:
MB: Asus P9X79 Pro
Intel Core i7-3930K
32 GB RAM
Graphics card: Palit GTX 1660 Dual OC (GTX 1660, 6 GB)
The system was installed using the package provided by OpenMediaVault - https://www.openmediavault.org/
I would be happy to help you more, but I'm a real newbie... this is my first time with Debian Linux,
so feel free to ask me for whatever you need!

What I can say is that when I run
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
I get the following error:
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused "process_linux.go:432: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: ldcache error: process /sbin/ldconfig failed with error code: 127\\n\""": unknown.

But when I run
docker run --rm --gpus all nvidia/cuda:10.0-base nvidia-smi
I get this output:
root@NAS:~# docker run --rm --gpus all nvidia/cuda:10.0-base nvidia-smi
Tue Dec 1 22:45:02 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 1660 Off | 00000000:01:00.0 Off | N/A |
| 29% 35C P0 21W / 130W | 1325MiB / 5944MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|

Hope this is helpful for you.

@klueska
Contributor

klueska commented Dec 2, 2020

Interesting. So it works with a CUDA 10 image, but not with a CUDA 11 image.

@bettodiaz

Hello. I am having the exact same issue as reported by @stephcar75020.
I am also using Debian buster, all up to date, as part of OpenMediaVault.
The error message is the same.

Here is my system setup:

root@OMV:~# nvidia-container-cli -V
version: 1.3.0
build date: 2020-09-16T12:33+00:00
build revision: 16315ebdf4b9728e899f615e208b50c41d7a5d15
build compiler: x86_64-linux-gnu-gcc-8 8.3.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
root@OMV:~# docker version
Client: Docker Engine - Community
Version: 19.03.14
API version: 1.40
Go version: go1.13.15
Git commit: 5eb3275d40
Built: Tue Dec 1 19:20:22 2020
OS/Arch: linux/amd64
Experimental: false

Server: Docker Engine - Community
Engine:
Version: 19.03.14
API version: 1.40 (minimum version 1.12)
Go version: go1.13.15
Git commit: 5eb3275d40
Built: Tue Dec 1 19:18:50 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.3.9
GitCommit: ea765aba0d05254012b0b9e595e995c09186427f
runc:
Version: 1.0.0-rc10
GitCommit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
docker-init:
Version: 0.18.0
GitCommit: fec3683
root@OMV:~# nvidia-smi
Wed Dec 2 19:49:22 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01 Driver Version: 455.45.01 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 165... Off | 00000000:01:00.0 Off | N/A |
| 62% 39C P0 12W / 100W | 0MiB / 3909MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
root@OMV:~#

@klueska
Contributor

klueska commented Dec 7, 2020

The fact that it works with a 10.0 image, but breaks with an 11.0 image suggests that it has something to do with ldconfig failing on execution when running against an 11.0 image (and not against a 10.0 image).

One thing that may be causing this is the following:
NVIDIA/libnvidia-container#117 (comment)

When this bug is triggered, if the compat libraries fail to be detected, then running ldconfig will throw an error (even though it usually sets up the libraries anyway). Maybe for some reason on Debian 10 it fails but also doesn't set up the libraries.

Can you install the RC at the following link to see if it fixes this issue for you?

Link to package:
https://drive.google.com/file/d/1OZAnGMoo9Z6oHLUOAdCv6aDVIcNJvbP_/view?usp=sharing

Command to install:

sudo dpkg -i libnvidia-container1_1.3.1~rc.test-1_amd64.deb
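
If the RC fixes it, the reproduction command from the top of this issue should succeed again:

docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi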

@klueska
Contributor

klueska commented Dec 8, 2020

So I was able to reproduce this with the following setting in /etc/nvidia-container-runtime/config.toml

ldconfig = "/sbin/ldconfig"

By default this should be the following on debian systems:

ldconfig = "@/sbin/ldconfig"

The first will attempt to run /sbin/ldconfig from inside the container, while the second will attempt to run /sbin/ldconfig from the host file system. The second is preferable, because you never know exactly what will be installed on every container you run.
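
As a quick check (a minimal sketch, not part of the original report; the file path and setting names are the ones quoted above), you can inspect the current value and, if needed, switch it to the host variant:

grep ldconfig /etc/nvidia-container-runtime/config.toml
sudo sed -i 's|^ldconfig = "/sbin/ldconfig"|ldconfig = "@/sbin/ldconfig"|' /etc/nvidia-container-runtime/config.toml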

It's unclear exactly why the first one is erroring out on cuda:11.0-base (because the contents of /sbin/ldconfig are identical on both cuda:10.0-base and cuda:11.0-base), i.e.:

$ docker run -it nvidia/cuda:10.0-base
root@d774a6075a9e:/# cat /sbin/ldconfig
#!/bin/sh

if  test $# = 0							\
    && test x"$LDCONFIG_NOTRIGGER" = x				\
 && test x"$DPKG_MAINTSCRIPT_PACKAGE" != x			\
 && dpkg-trigger --check-supported 2>/dev/null
then
	if dpkg-trigger --no-await ldconfig; then
		if test x"$LDCONFIG_TRIGGER_DEBUG" != x; then
			echo "ldconfig: wrapper deferring update (trigger activated)"
		fi
		exit 0
	fi
fi

exec /sbin/ldconfig.real "$@"

$ docker run -it nvidia/cuda:11.0-base
root@708e3064cccc:/# cat /sbin/ldconfig
#!/bin/sh

if  test $# = 0							\
    && test x"$LDCONFIG_NOTRIGGER" = x				\
 && test x"$DPKG_MAINTSCRIPT_PACKAGE" != x			\
 && dpkg-trigger --check-supported 2>/dev/null
then
	if dpkg-trigger --no-await ldconfig; then
		if test x"$LDCONFIG_TRIGGER_DEBUG" != x; then
			echo "ldconfig: wrapper deferring update (trigger activated)"
		fi
		exit 0
	fi
fi

exec /sbin/ldconfig.real "$@"
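
Exit status 127 from a shell conventionally means "command not found", so one hypothetical way to narrow this down (my own suggestion, not something verified in this thread) is to run the wrapper directly inside the failing image; since it is invoked with an argument, it should exec straight through to /sbin/ldconfig.real and print cache entries:

docker run --rm nvidia/cuda:11.0-base /sbin/ldconfig -p | head -n 3

If that prints entries, the script itself works under a normal container start, which would point at the environment the prestart hook runs it in rather than at the script.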

That said, can you double check what your settings for this in /etc/nvidia-container-runtime/config.toml are?

@stephcar75020
Author

stephcar75020 commented Dec 8, 2020 via email

@klueska
Contributor

klueska commented Mar 22, 2022

The newest version of nvidia-docker should resolve these issues with ldconfig not properly setting up the library search path on Debian systems before a container gets launched.

Specifically this change in libnvidia-container fixes the issue and is included as part of the latest release:
https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/141

The latest release packages for the full nvidia-docker stack:

libnvidia-container1-1.9.0
libnvidia-container-tools-1.9.0
nvidia-container-toolkit-1.9.0
nvidia-container-runtime-3.9.0
nvidia-docker-2.10.0
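
Assuming the NVIDIA apt repository is already configured (as it was for the original installs in this thread), a typical upgrade on Debian would be:

sudo apt-get update
sudo apt-get install nvidia-docker2

Installing nvidia-docker2 should pull in the matching nvidia-container-toolkit and libnvidia-container packages through its dependency chain.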

@elezar
Member

elezar commented Nov 3, 2023

This should have been resolved. If not, please open a new issue against https://github.com/NVIDIA/nvidia-container-toolkit

@elezar elezar closed this as completed Nov 3, 2023