This repository was archived by the owner on Apr 24, 2022. It is now read-only.

GPU crash after Ctrl-C on Linux #349

Closed
kkkrackpot opened this issue Oct 16, 2017 · 21 comments

@kkkrackpot

kkkrackpot commented Oct 16, 2017

Hi,

I have a GTX 1060 3GB with the latest ethminer from git.
The GPU is overclocked: -200 core offset, +950 memory offset, 70W power limit, and 65C average temperature.
In general the miner seems to work OK, but when I stop it with Ctrl-C, the whole GPU sometimes falls off the bus:

[ 4815.817349] NVRM: GPU at PCI:0000:07:00: GPU-87381000-727a-fe61-c21a-
[ 4815.817351] NVRM: GPU Board Serial Number: 
[ 4815.817352] NVRM: Xid (PCI:0000:07:00): 62, 1d6e(35f8) 00000000 00000000
...
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.12                 Driver Version: 387.12                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   45C    P8     9W / 200W |    163MiB /  3013MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 106...  Off  | 00000000:07:00.0 Off |                  N/A |
|ERR!   51C    P0   ERR! /  70W |     18MiB /  3013MiB |    100%   E. Process |
+-------------------------------+----------------------+----------------------+
               


According to http://docs.nvidia.com/deploy/xid-errors/index.html Xid 62 is an internal microcontroller halt.

I wonder whether it's a problem with the hardware, the driver, the miner, or something else?

It seems related to overclocking, but I'm not sure.

Regards,
Alex
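
For anyone triaging similar logs, the Xid line quoted above can be picked out of `dmesg` output mechanically. The following is only a minimal Python sketch: the names `parse_xid` and `describe` and the one-entry description table are illustrative assumptions, and the regular expression is modelled on the single log line quoted in this report, so other driver versions may format the line differently.

```python
import re

# Pattern modelled on the line quoted in this issue, e.g.
#   NVRM: Xid (PCI:0000:07:00): 62, 1d6e(35f8) 00000000 00000000
XID_PATTERN = re.compile(r"NVRM: Xid \(PCI:(?P<bus>[0-9A-Fa-f:.]+)\): (?P<code>\d+)")

# Deliberately partial: only the code discussed in this thread.
XID_DESCRIPTIONS = {62: "Internal micro-controller halt"}


def parse_xid(line):
    """Return (pci_bus_id, xid_code) for an NVRM Xid line, or None otherwise."""
    match = XID_PATTERN.search(line)
    if match is None:
        return None
    return match.group("bus"), int(match.group("code"))


def describe(line):
    """Return a one-line summary for an Xid log line, or None if it is not one."""
    parsed = parse_xid(line)
    if parsed is None:
        return None
    bus, code = parsed
    meaning = XID_DESCRIPTIONS.get(code, "see NVIDIA's Xid documentation")
    return f"GPU at PCI {bus}: Xid {code} ({meaning})"
```

Running `describe` over each line of the kernel log and keeping the non-`None` results would list every Xid event together with its PCI bus ID, which helps tell which card failed (here the one at 07:00).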

@bmatthewshea
Contributor

Does it happen when NOT overclocked? Need to troubleshoot/narrow it down as you mentioned..

@mnik247

mnik247 commented Oct 17, 2017

I get the same result with Ctrl-C on an overclocked 1060 3GB (Ubuntu 16).
The only fix I have found: before Ctrl-C I always reset the memory speed to normal.
Then, after starting the miner and loading the DAG, I restore the memory speed to MAX.
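
The workaround described here could be scripted. The comment does not say which tool was used to change the memory clock; the sketch below assumes `nvidia-settings` with overclocking enabled through the Coolbits option, and assumes performance level index 3 applies to this card. The helper names and the GPU index are hypothetical, and the index `nvidia-settings` uses may not match the one `nvidia-smi` shows.

```python
import signal
import subprocess


def memory_offset_args(gpu_index, offset, perf_level=3):
    """Build the nvidia-settings command line that sets a memory clock offset.

    perf_level=3 is an assumption; the valid index depends on the card.
    """
    assignment = f"[gpu:{gpu_index}]/GPUMemoryTransferRateOffset[{perf_level}]={offset}"
    return ["nvidia-settings", "-a", assignment]


def set_memory_offset(gpu_index, offset, perf_level=3):
    """Apply the offset; raises CalledProcessError if nvidia-settings fails."""
    subprocess.run(memory_offset_args(gpu_index, offset, perf_level), check=True)


def stop_miner_safely(miner_process, gpu_index):
    """Reset the memory offset to stock first, then send the equivalent of Ctrl-C.

    miner_process is expected to be a subprocess.Popen object (Linux only).
    """
    set_memory_offset(gpu_index, 0)
    miner_process.send_signal(signal.SIGINT)
    miner_process.wait()
```

After starting ethminer with `subprocess.Popen` and waiting for the DAG to load, calling `set_memory_offset` again with the desired offset (for example the +950 from the original report) would restore the overclock.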

@smurfy
Collaborator

smurfy commented Oct 17, 2017

Probably related to #305 + #331

I have tested my fix (#331) on Windows for a while now and did not have any exit-related graphics resets.

@fhlfibh is your error reproducible, or did it only happen once on Linux?

@kkkrackpot
Author

kkkrackpot commented Oct 17, 2017

Was away, sorry.
@smurfy Yes, it is reproducible for me with that very card. I also have another overclocked card that hasn't shown this problem so far. The model is identical (though I am not sure about the memory manufacturer), but it runs in a headless server with a different chipset. I will try to swap the cards later.
PS. That other card is overclocked -200/+1200 at 70W. Both are Gigabyte Mini.
UPD. Just now it happened even without Ctrl-C. I just started nvidia-smi in another terminal, and bang: the same Xid 62 on the problem card...

@dennis97519

Is it possible to add some command inside the program to shut down the process gracefully? I am also experiencing the problem on Windows 7 with 1080 Ti

@jean-m-cyr
Contributor

jean-m-cyr commented Dec 30, 2017

@fhlfibh Those are high overclocks for 1060s. I experience similar lockups when I push the clocks. Sometimes it will appear to be working great and a couple of hours later it'll freeze. Not sure the -200 GPU clock delta does anything... doesn't seem to affect power or performance on my 1060s. I overclock the memory transfer by a modest 404. I value reliability over a few percent extra performance.

@bmatthewshea
Contributor

@dennis97519 Yeah, it would be nice, or simply have the binary 'watch' for Ctrl-C. Even Ctrl-C can gracefully exit a process, but you have to watch for it. I know a few other miners allow a quit etc. to exit gracefully and clean up.
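
What 'watching' for Ctrl-C could look like, in general terms: the signal handler only records that a stop was requested, the work loop finishes its current batch, and device cleanup runs before the process exits. This is a Python sketch of the pattern only, not ethminer's code (ethminer is written in C++); `GracefulWorker`, `run_batch`, and `release_device` are hypothetical placeholders.

```python
import signal
import threading


class GracefulWorker:
    """Hypothetical stand-in for a miner's GPU work loop."""

    def __init__(self):
        self.stop_requested = threading.Event()
        self.batches_done = 0
        self.released = False

    def handle_signal(self, signum, frame):
        # Keep the handler tiny: record the request and return.
        self.stop_requested.set()

    def install(self):
        # Python requires signal handlers to be installed from the main thread.
        signal.signal(signal.SIGINT, self.handle_signal)
        signal.signal(signal.SIGTERM, self.handle_signal)

    def run_batch(self):
        # Placeholder for launching one search batch and waiting for it to finish.
        self.batches_done += 1

    def release_device(self):
        # Placeholder for freeing buffers and tearing down the GPU context.
        self.released = True

    def run(self, max_batches=None):
        # The stop flag is checked only between batches, so a batch is never
        # abandoned halfway; cleanup runs even if a batch raises.
        try:
            while not self.stop_requested.is_set():
                if max_batches is not None and self.batches_done >= max_batches:
                    break
                self.run_batch()
        finally:
            self.release_device()
```

In a real program `install()` would be called once from the main thread at startup, followed by `run()`. Whether such a shutdown sequence avoids the driver reset reported here is not something this sketch can show.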

@EoD
Contributor

EoD commented Jan 2, 2018

@dennis97519 @bmatthewshea What exactly would you expect on a "graceful" exit?

@dennis97519

@EoD No "Graphics driver crashed and recovered" message when exiting.

@EoD
Contributor

EoD commented Jan 2, 2018

@dennis97519 Well, this sounds more like a driver issue than an issue with ethminer. I am not sure how ethminer can do a "graceful" exit when the driver is actually the one crashing.

@bmatthewshea
Contributor

bmatthewshea commented Jan 4, 2018

@EoD The driver isn't crashing arbitrarily, obviously. It's crashing when exiting ethminer. Hence it was reported here. I have had this happen when not over-clocked at all. Doesn't seem to matter. Doesn't happen all the time.
Other miners I've used do not crash when exiting. 'stable' OC or not.
Is the driver 'okay' in that case because it doesn't crash? No, it's the same driver, obviously. It's about the binary that is running.
I reported it here because I thought it might be the way ethminer exits. I don't know; that's why I raised it as an issue, to try to help.
I'm just reporting what I see: "Driver crashed. It has recovered." in the tray (rarely, and only with CUDA on a 1060 for me). AMD is fine on exits.
To answer your question above:
I would expect a "graceful exit" to not crash the driver on exit.

@EoD
Contributor

EoD commented Jan 6, 2018

@bmatthewshea I am unable to reproduce the crash at all and never experienced it, hence I assume this crash is Nvidia-only. And this would indicate even more that it is a driver problem.

Mitigating driver issues in a userland program (like ethminer) might be possible, but our "fix" is then just a workaround for a driver bug and never a proper fix.

Did any of you try reporting this upstream to Nvidia?

@bmatthewshea
Contributor

bmatthewshea commented Jan 10, 2018

@EoD Understood & no I have not reported the driver crash upstream as it has never happened with anything but ethminer. As I said, it's hit and miss at that & not a huge issue. I -do- use the machine w/ nv 1060 occasionally when it's mining (web browser / etc , but nothing gpu intensive) whereas others who don't see it may be solely mining at all times. Maybe that is a factor. Maybe not..
I do know it's done it on two different machines and two different driver versions through time. (both Win7x64-current / 'full' prod. nv driver installed)

@EoD
Contributor

EoD commented Jan 10, 2018

The demands on drivers are of course higher when they are under heavy load, and especially when they switch from low to high load and back. Hence I recommend reporting it upstream.

@inprosys

So, let's see: The CTRL-C crashing the NVIDIA drivers must be an NVIDIA driver problem when it NEVER occurs using Claymore and it ALWAYS occurs when I try to close ethminer on a system with more than two GPU cards installed.

It seems to be a strange stance for @EoD to take when there seems to be at least one fix (threads #305 and #331) that appears to be able to correct the CTRL-C problem.

Is @EoD testing this on more than one system? More than one video card? I think that what user @bmatthewshea is saying -- and what I'm joining in on -- is that it doesn't seem to be asking a lot for ethminer to ensure that all video processing is halted before it terminates and exits. If all GPU processing is halted, there would be no argument about P0 state versus P2 state and overclocking, etc. All the GPU cards would settle back down to minimum states and the program could gracefully exit. That would be nice.

@jean-m-cyr
Contributor

jean-m-cyr commented Jan 24, 2018

@inprosys Windows or Linux? Ah, never mind... Windows only it seems.

@EoD
Contributor

EoD commented Jan 25, 2018

@inprosys Yes, of course. Two completely different systems with two different kinds of cards (both AMD, but different generations and different drivers), and on both systems both Linux and Windows. The issue never happened in any configuration.

I am not against the idea of #331, the idea is good in general. I just wanted to point out that we are working around a driver issue. As I already tried to say above, working around a driver issue might just be a temporary fix and not a permanent fix.

@inprosys

Yes, @jean-m-cyr, it is Windows in my case.

OK, @EoD, I can see how when you have only tested with AMD, and cannot recreate the problem, that it seems as if the CTRL-C problem can be dismissed as being strictly an NVIDIA driver problem. But, it could also be improper program cleanup (loose ends, loose threads, memory leaks, hanging semaphores, etc.) before termination that AMD happens to ignore. (One man's bug is another man's feature.)

All I'm asking is "Is ethminer doing proper cleanup before exiting?" Should the NVIDIA device driver be immune to all possible program abuses? Maybe. But, given what a pain-in-the-ass it is to recover GPU cards disappearing off the PCIe bus, it would be great to avoid this problem if all it took was careful programming exit procedures -- that's not a work-around -- it's good programming standards.

Just for the record, I really appreciate all the time and effort that people are contributing to support this project.

@bmatthewshea
Contributor

Maybe it's just good luck so far, but I haven't been noticing it as much (or at all?) on the last few dev builds and the 13 release.

@DeadManWalkingTO
Contributor

I think this issue can be closed.

@smurfy
Collaborator

smurfy commented Feb 19, 2018

Should be fixed or at least improved by #331

9 participants