GPU crash after Ctrl-C on Linux #349

kkkrackpot · 2017-10-16T22:05:30Z

Hi,

I have a GTX 1060 3GB running the latest ethminer from git.
The GPU runs with a -200 core offset, a +950 memory offset, and a 70W power limit, and averages 65C.
In general the miner seems to work OK, but when I stop it with Ctrl-C, the whole GPU sometimes falls off the bus:

[ 4815.817349] NVRM: GPU at PCI:0000:07:00: GPU-87381000-727a-fe61-c21a-
[ 4815.817351] NVRM: GPU Board Serial Number: 
[ 4815.817352] NVRM: Xid (PCI:0000:07:00): 62, 1d6e(35f8) 00000000 00000000
...
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.12                 Driver Version: 387.12                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   45C    P8     9W / 200W |    163MiB /  3013MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 106...  Off  | 00000000:07:00.0 Off |                  N/A |
|ERR!   51C    P0   ERR! /  70W |     18MiB /  3013MiB |    100%   E. Process |
+-------------------------------+----------------------+----------------------+

According to http://docs.nvidia.com/deploy/xid-errors/index.html Xid 62 is an internal microcontroller halt.
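
For anyone triaging similar logs, here is a minimal sketch of pulling the PCI address and Xid code out of an NVRM kernel log line. The names `XID_PATTERN`, `XidEvent` and `parse_xid` are hypothetical helpers, and the pattern is only assumed to fit lines shaped like the one quoted above; other driver versions may format the message differently.

```python
import re
from typing import NamedTuple, Optional

# Assumed line shape, taken from the log quoted in this report:
#   NVRM: Xid (PCI:0000:07:00): 62, 1d6e(35f8) 00000000 00000000
XID_PATTERN = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")


class XidEvent(NamedTuple):
    pci_address: str
    xid: int


def parse_xid(line: str) -> Optional[XidEvent]:
    """Return the PCI address and Xid code from one kernel log line, or None."""
    match = XID_PATTERN.search(line)
    if match is None:
        return None
    return XidEvent(pci_address=match.group(1), xid=int(match.group(2)))


line = "[ 4815.817352] NVRM: Xid (PCI:0000:07:00): 62, 1d6e(35f8) 00000000 00000000"
print(parse_xid(line))
```

The extracted code can then be looked up in the Xid table linked above, and the PCI address matched against the Bus-Id column of nvidia-smi.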

I wonder whether this is a hardware problem, a driver problem, a miner problem, or something else.

It seems related to overclocking, but I'm not sure.

Regards,
Alex


bmatthewshea · 2017-10-17T14:54:54Z

Does it happen when NOT overclocked? We need to troubleshoot and narrow it down, as you mentioned.

mnik247 · 2017-10-17T15:07:15Z

I get the same result on Ctrl-C with an overclocked 1060 3GB (Ubuntu 16).
The only fix I have found: before pressing Ctrl-C, I always reset the memory speed to normal.
After starting the miner and loading the DAG, I restore the memory speed to max.
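
The workaround above could be scripted roughly as follows. This is a sketch under several assumptions, not a tested recipe: it assumes the memory offset is applied with `nvidia-settings` through the `GPUMemoryTransferRateOffset` attribute (which needs Coolbits enabled and a running X session), that performance level 3 is the right index for the card, and that the GPU index used by `nvidia-settings` matches the card you mean. `memory_offset_command` and `stop_miner_safely` are hypothetical helper names.

```python
import os
import signal
import subprocess
from typing import List


def memory_offset_command(gpu_index: int, offset_mhz: int, perf_level: int = 3) -> List[str]:
    """Build an nvidia-settings call that sets the memory transfer rate offset.

    perf_level=3 is an assumption; the valid index depends on the card.
    """
    attribute = f"[gpu:{gpu_index}]/GPUMemoryTransferRateOffset[{perf_level}]={offset_mhz}"
    return ["nvidia-settings", "-a", attribute]


def stop_miner_safely(miner_pid: int, gpu_indices: List[int]) -> None:
    """Return every listed GPU to stock memory clocks, then send the miner SIGINT."""
    for gpu in gpu_indices:
        subprocess.run(memory_offset_command(gpu, 0), check=True)
    os.kill(miner_pid, signal.SIGINT)  # the same signal Ctrl-C delivers


# Show the command that would reset GPU 1 to a zero memory offset.
print(memory_offset_command(1, 0))
```

Restoring the overclock after the DAG has loaded would reuse `memory_offset_command` with the desired offset, for example `memory_offset_command(1, 950)`.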

smurfy · 2017-10-17T15:50:19Z

Probably related to #305 + #331

I have tested my fix (#331) on Windows for a while now and have not had any exit-related graphics resets.

@fhlfibh is your error reproducible, or did it only happen once on Linux?

kkkrackpot · 2017-10-17T19:46:59Z

Was away, sorry.
@smurfy Yes, it is reproducible for me with that very card... I also have another overclocked card that hasn't shown this problem so far. The model is identical (though I am not sure about the memory manufacturer), but it runs in a headless server with a different chipset. I will try to swap the cards later.
PS. The other card is overclocked -200/+1200 at 70W. Both are Gigabyte Mini.
UPD. Right now it happened even without Ctrl-C. I just started nvidia-smi in another terminal, and bang: the same Xid 62 on the problem card...

dennis97519 · 2017-12-30T04:45:49Z

Is it possible to add some command inside the program to shut down the process gracefully? I am also experiencing the problem on Windows 7 with a 1080 Ti.

jean-m-cyr · 2017-12-30T06:31:36Z

@fhlfibh Those are high overclocks for 1060s. I experience similar lockups when I push the clocks. Sometimes it will appear to be working great and a couple of hours later it'll freeze. Not sure the -200 GPU clock delta does anything... doesn't seem to affect power or performance on my 1060s. I overclock the memory transfer by a modest 404. I value reliability over a few percent extra performance.

bmatthewshea · 2017-12-30T15:05:52Z

@dennis97519 Yeah, it would be nice, or simply have the binary 'watch' for Ctrl-C. Even Ctrl-C can exit a process gracefully, but the program has to watch for it. I know a few other miners offer a quit command or similar to exit gracefully and clean up.
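
To make "watching for Ctrl-C" concrete, here is a generic sketch of the pattern in Python. It is not ethminer's code and not the change made in #331: the signal handler only records the request, the worker threads finish their current batch and release their resources, and the main thread waits for all of them before returning. All names (`stop_requested`, `handle_stop`, `worker`, `install_handlers`, `run_workers`) are hypothetical, and the sleep and append calls are placeholders for real GPU work and device cleanup.

```python
import signal
import threading
import time

stop_requested = threading.Event()


def handle_stop(signum, frame):
    """Signal handler: only record the request; cleanup happens in the workers."""
    stop_requested.set()


def worker(results: list) -> None:
    """Stand-in for a mining thread: loop until asked to stop, then tidy up."""
    while not stop_requested.is_set():
        time.sleep(0.01)  # placeholder for one batch of GPU work
    results.append("released")  # placeholder for freeing device resources


def install_handlers() -> None:
    """Route Ctrl-C (SIGINT) and kill (SIGTERM) to handle_stop; main thread only."""
    signal.signal(signal.SIGINT, handle_stop)
    signal.signal(signal.SIGTERM, handle_stop)


def run_workers(num_workers: int = 2) -> list:
    """Start the workers, idle until a stop is requested, then join them all."""
    results: list = []
    threads = [threading.Thread(target=worker, args=(results,)) for _ in range(num_workers)]
    for t in threads:
        t.start()
    while not stop_requested.wait(timeout=0.1):
        pass  # a short timeout keeps the main thread responsive to signals
    for t in threads:
        t.join()  # do not return until every worker has cleaned up
    return results
```

Calling `install_handlers()` and then `run_workers()` from the main thread blocks until Ctrl-C or kill arrives, and returns only after every worker has run its cleanup step.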

EoD · 2018-01-02T00:43:55Z

@dennis97519 @bmatthewshea What exactly would you expect on a "graceful" exit?

dennis97519 · 2018-01-02T12:52:35Z

@EoD No "Graphics driver crashed and recovered" message when exiting.

EoD · 2018-01-02T13:22:17Z

@dennis97519 Well, this sounds more like a driver issue than an issue with ethminer. I am not sure how ethminer can do a "graceful" exit when the driver is actually the one crashing.

bmatthewshea · 2018-01-04T01:12:55Z

@EoD The driver obviously isn't crashing arbitrarily. It crashes when exiting ethminer, hence it was reported here. I have had this happen when not overclocked at all, so that doesn't seem to matter. It doesn't happen every time.
Other miners I've used do not crash on exit, 'stable' OC or not.
Is the driver 'okay' in that case because it doesn't crash? No, it's obviously the same driver. It's about the binary that is running.
I reported it here because I thought it might be the way ethminer exits. I don't know; that's why I raised it as an issue, to try to help.
I'm just reporting what I see: "Driver crashed. It has recovered." in the tray (rarely, and only with CUDA on a 1060 for me). AMD is fine on exit.
To answer your question above:
I would expect a "graceful exit" to not crash the driver on exit.

EoD · 2018-01-06T15:31:08Z

@bmatthewshea I am unable to reproduce the crash at all and never experienced it, hence I assume this crash is Nvidia-only. And this would indicate even more that it is a driver problem.

Mitigating driver issues in a userland program (like ethminer) might be possible, but our "fix" is then just a workaround for a driver bug and never a proper fix.

Did any of you try reporting this upstream to Nvidia?

bmatthewshea · 2018-01-10T04:02:38Z

@EoD Understood, and no, I have not reported the driver crash upstream, as it has never happened with anything but ethminer. As I said, it's hit-and-miss and not a huge issue. I do occasionally use the machine with the NVIDIA 1060 while it's mining (web browser etc., nothing GPU-intensive), whereas others who don't see the crash may be solely mining at all times. Maybe that is a factor, maybe not.
I do know it has happened on two different machines and two different driver versions over time (both Win7 x64, current, with the 'full' production NVIDIA driver installed).

EoD · 2018-01-10T11:07:40Z

The demands on drivers are of course higher under heavy load, and especially when switching between low and high load. Hence I recommend reporting it upstream.

inprosys · 2018-01-24T23:37:30Z

So, let's see: The CTRL-C crashing the NVIDIA drivers must be an NVIDIA driver problem when it NEVER occurs using Claymore and it ALWAYS occurs when I try to close ethminer on a system with more than two GPU cards installed.

It seems to be a strange stance for @EoD to take when there seems to be at least one fix (threads #305 and #331) that appears to be able to correct the CTRL-C problem.

Is @EoD testing this on more than one system? More than one video card? I think that what user @bmatthewshea is saying, and what I'm joining in on, is that it doesn't seem to be asking a lot for ethminer to ensure that all video processing is halted before it terminates and exits. If all GPU processing is halted, there would be no argument about P0 state versus P2 state and overclocking, etc. All the GPU cards would settle back down to minimum states and the program could exit gracefully. That would be nice.

jean-m-cyr · 2018-01-24T23:45:36Z

@inprosys Windows or Linux? Ah, never mind... Windows only it seems.

EoD · 2018-01-25T00:27:56Z

@inprosys Yes, of course. Two completely different systems with two different kinds of cards (both AMD, but different generations and different drivers), and both Linux and Windows on each system. The issue never happened in any configuration.

I am not against the idea of #331, the idea is good in general. I just wanted to point out that we are working around a driver issue. As I already tried to say above, working around a driver issue might just be a temporary fix and not a permanent fix.

inprosys · 2018-01-25T04:01:35Z

Yes, @jean-m-cyr, it is Windows in my case.

OK, @EoD, I can see how when you have only tested with AMD, and cannot recreate the problem, that it seems as if the CTRL-C problem can be dismissed as being strictly an NVIDIA driver problem. But, it could also be improper program cleanup (loose ends, loose threads, memory leaks, hanging semaphores, etc.) before termination that AMD happens to ignore. (One man's bug is another man's feature.)

All I'm asking is "Is ethminer doing proper cleanup before exiting?" Should the NVIDIA device driver be immune to all possible program abuses? Maybe. But, given what a pain-in-the-ass it is to recover GPU cards disappearing off the PCIe bus, it would be great to avoid this problem if all it took was careful programming exit procedures -- that's not a work-around -- it's good programming standards.

Just for the record, I really appreciate all the time and effort that people are contributing to support this project.

bmatthewshea · 2018-01-25T19:41:25Z

Maybe it's just good luck so far, but I haven't been noticing it as much (or at all?) on the last few dev builds and the 13 release.

DeadManWalkingTO · 2018-02-19T02:26:07Z

I think this issue can be closed.

smurfy · 2018-02-19T13:08:54Z

Should be fixed or at least improved by #331

smurfy mentioned this issue Feb 5, 2018

Clean shutdown on ctrl+c or kill #331

Merged

kkkrackpot closed this as completed Feb 19, 2018

DeadManWalkingTO mentioned this issue Feb 19, 2018

Issues that can be closed (cleanup) #764

Closed
