Opened 11 years ago

Last modified 11 years ago

#1236 new Defect

Upgrading during NVIDIA tasks crashes driver and interferes with hardware detection

Reported by: JacobKlein Owned by: romw
Priority: Undetermined Milestone: Undetermined
Component: Client - Setup Version: 7.0.28
Keywords: Upgrade NVIDIA driver stopped responding recovered Cc: Jacob_W_Klein@…

Description

From: jacob_w_klein@…
To: boinc_alpha@…
Subject: Upgrading during NVIDIA tasks crashes driver and interferes with hardware detection
Date: Tue, 2 Apr 2013 13:48:41 -0400

When I perform an upgrade (say from 7.0.59 x64, to 7.0.60 x64), while the older version is running...
If there are NVIDIA tasks running (I usually have 4 tasks across 2 GPUs), the driver crashes 4 times.

The error is a balloon in the Windows system tray that says:
Display driver stopped responding and has recovered
Display driver NVIDIA Windows Kernel Mode Driver, Version 314.22 stopped
responding and has successfully recovered.

Each of the balloons takes something like 4 seconds to show and fade, and sometimes Windows flickers a bit while this happens.
But the installation wizard is already on the page saying "Launch the BOINC Manager", with the Finish button available.

If I click that, while the drivers are crashing/recovering, the end result is that a GPU is not detected properly by OpenCL detection in the new version.

Normally, my detection sequence looks like:
4/2/2013 1:18:57 PM |  | CUDA: NVIDIA GPU 0: GeForce GTX 660 Ti (driver version 314.22, CUDA version 5.0, compute capability 3.0, 3072MB, 2859MB available, 3021 GFLOPS peak)
4/2/2013 1:18:57 PM |  | CUDA: NVIDIA GPU 1: GeForce GTX 460 (driver version 314.22, CUDA version 5.0, compute capability 2.1, 1024MB, 951MB available, 1025 GFLOPS peak)
4/2/2013 1:18:57 PM |  | OpenCL: NVIDIA GPU 0: GeForce GTX 660 Ti (driver version 314.22, device version OpenCL 1.1 CUDA, 3072MB, 2859MB available, 3021 GFLOPS peak)
4/2/2013 1:18:57 PM |  | OpenCL: NVIDIA GPU 1: GeForce GTX 460 (driver version 314.22, device version OpenCL 1.1 CUDA, 1024MB, 951MB available, 1025 GFLOPS peak)
4/2/2013 1:18:57 PM | Poem@Home | Found app_config.xml
4/2/2013 1:18:57 PM | GPUGRID | Found app_config.xml
4/2/2013 1:18:57 PM | World Community Grid | Found app_config.xml
4/2/2013 1:18:57 PM |  | Config: use all coprocessors
4/2/2013 1:18:57 PM | World Community Grid | Config: excluded GPU.  Type: all.  App: hcc1.  Device: 0
4/2/2013 1:18:57 PM | Poem@Home | Config: excluded GPU.  Type: all.  App: poemcl.  Device: 1

But, if I click "Finish" with "Launch the BOINC Manager" checked, while the drivers are crashing/recovering, I get:
4/2/2013 1:24:32 PM |  | CUDA: NVIDIA GPU 0: GeForce GTX 660 Ti (driver version 314.22, CUDA version 5.0, compute capability 3.0, 3072MB, 2775MB available, 3021 GFLOPS peak)
4/2/2013 1:24:32 PM |  | CUDA: NVIDIA GPU 1: GeForce GTX 460 (driver version 314.22, CUDA version 5.0, compute capability 2.1, 1024MB, 1024MB available, 1025 GFLOPS peak)
4/2/2013 1:24:32 PM |  | OpenCL: NVIDIA GPU 0: GeForce GTX 660 Ti (driver version 314.22, device version OpenCL 1.1 CUDA, 3072MB, 2775MB available, 3021 GFLOPS peak)
4/2/2013 1:24:32 PM | Poem@Home | Found app_config.xml
4/2/2013 1:24:32 PM | GPUGRID | Found app_config.xml
4/2/2013 1:24:32 PM | World Community Grid | Found app_config.xml
4/2/2013 1:24:32 PM |  | Config: use all coprocessors
4/2/2013 1:24:32 PM | World Community Grid | Config: excluded GPU.  Type: all.  App: hcc1.  Device: 0
4/2/2013 1:24:32 PM | Poem@Home | Config: excluded GPU.  Type: all.  App: poemcl.  Device: 1

Notice that one of the GPUs was not properly detected for OpenCL.
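
For anyone wanting to check a startup log for this symptom automatically, here is a minimal sketch. It assumes the message format shown in the log excerpts above; the function names and the regular expression are illustrative and are not part of BOINC. It collects the GPUs reported by the CUDA and OpenCL detection lines and lists any device that CUDA saw but OpenCL did not.

```python
import re

# Matches the GPU detection lines quoted in this ticket, e.g.
# "... |  | OpenCL: NVIDIA GPU 1: GeForce GTX 460 (driver version ..."
# The pattern is an assumption based only on those excerpts.
GPU_LINE = re.compile(r"\|\s*(CUDA|OpenCL): NVIDIA GPU (\d+): (.+?) \(")


def detected_gpus(log_text):
    """Return {"CUDA": {index: name}, "OpenCL": {index: name}} for a startup log."""
    found = {"CUDA": {}, "OpenCL": {}}
    for line in log_text.splitlines():
        match = GPU_LINE.search(line)
        if match:
            api, index, name = match.groups()
            found[api][int(index)] = name
    return found


def missing_from_opencl(log_text):
    """Return the GPUs that CUDA detection reported but OpenCL detection did not."""
    found = detected_gpus(log_text)
    return {i: n for i, n in found["CUDA"].items() if i not in found["OpenCL"]}


# The three detection lines from the failing startup quoted above.
FAILED_LOG = """\
4/2/2013 1:24:32 PM |  | CUDA: NVIDIA GPU 0: GeForce GTX 660 Ti (driver version 314.22, CUDA version 5.0, compute capability 3.0, 3072MB, 2775MB available, 3021 GFLOPS peak)
4/2/2013 1:24:32 PM |  | CUDA: NVIDIA GPU 1: GeForce GTX 460 (driver version 314.22, CUDA version 5.0, compute capability 2.1, 1024MB, 1024MB available, 1025 GFLOPS peak)
4/2/2013 1:24:32 PM |  | OpenCL: NVIDIA GPU 0: GeForce GTX 660 Ti (driver version 314.22, device version OpenCL 1.1 CUDA, 3072MB, 2775MB available, 3021 GFLOPS peak)
"""

if __name__ == "__main__":
    print(missing_from_opencl(FAILED_LOG))
```

Run against the failing log, this reports GPU 1 (the GeForce GTX 460) as the device missing from OpenCL detection.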

Normally, when I exit BOINC (by right-clicking it in the System Tray, choosing Exit, and making sure the "Stop running tasks" box is checked),
it closes just fine without any driver crashes.
Closing it this way closes the Manager and the Client nicely.

So why is the installer not closing things nicely? Is it somehow closing things differently than normal?
We shouldn't see driver crashes when performing an upgrade, especially if they interfere with the new version's hardware detection.

Any ideas?

Note: I tested an upgrade from 7.0.27 to 7.0.28 (using the 7.0.28 installer), and it too exhibited driver-crashing behavior.
So, should I create a ticket for this?

Thanks,
Jacob

Change History (3)

comment:1 Changed 11 years ago by JacobKlein

Richard also added this:

Date: Tue, 2 Apr 2013 19:16:47 +0100
From: r.haselgrove@…
Subject: Re: [boinc_alpha] Upgrading during NVIDIA tasks crashes driver and interferes with hardware detection
To: jacob_w_klein@…; boinc_alpha@…

It seems to me that there are two ways of closing the BOINC client/manager/tray local system: let's call them 'manual' and 'automatic'.

'Manual' can be invoked by right-click|exit from the system tray icon (Manager with main window closed), or by File|Exit BOINC from the main Manager window. Either way displays the Exit Confirmation dialog, if enabled.

'Automatic' is invoked when either the BOINC installer, or a Windows shutdown, needs to close BOINC. This closedown mode never displays the Exit confirmation dialog.

I have another couple of gripes with 'automatic' closedown mode:

1) Pending registry changes (e.g. 'Run Manager at login?') aren't flushed during an automatic closedown, only by a manual close.
2) Some applications have a tendency to cause running tasks to error if closed by the 'automatic' route, but not if closed by the 'manual' route. From my current project set, NumberFieldsatHome seem vulnerable to this. CPDN moderators have long advised volunteers running CPDN tasks to shut down BOINC manually before shutting down their computers.

I suspect the NVIDIA driver restart issue arises similarly from a difference between manual and automatic shutdown of BOINC. Whether BOINC is interacting directly with the driver at this point, or whether BOINC closes the applications in an unusual way and they in turn trigger the driver problem, needs further investigation. 

Last edited 11 years ago by JacobKlein

comment:2 Changed 11 years ago by JacobKlein

Rom added:

Date: Wed, 3 Apr 2013 00:44:44 -0400
> From: romw@…
> To: mavilar@…; r.haselgrove@…
> CC: boinc_alpha@…
> Subject: Re: [boinc_alpha] Upgrading during NVIDIA tasks crashes driver and interferes with hardware detection
>
> This breaks out into four different categories:
> 1. Setup - The installer just forcefully kills the BOINC related
> processes and lets the science applications shutdown with the no
> heartbeat error. This should allow the installer to update the BOINC
> files without requiring a reboot.
>
> 2. Shutdown/Reboot - We should be handling this case correctly. The
> BOINC client should gracefully shutdown after shutting down the science
> applications. If it isn't working correctly we should investigate and
> fix if we can.
>
> There might be one case where we might not be able to do anything
> about. Starting with Vista, Microsoft set something of a hard limit on
> how long a process can take to shutdown. If it doesn't shutdown fast
> enough Windows just kills it. IIRC, there was a registry setting to
> control the timeout, but I think they were planning on removing it for
> some future version of Windows.
>
> 3. Manager Exit - gracefully exits after save state.
>
> 4. Saving State - We have plans for a better way to save state. It
> won't make it in before the public release though.
>
> ----- Rom
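
To illustrate the "no heartbeat" behavior mentioned in category 1, here is a minimal sketch of the general watchdog idea: a worker records when it last heard from its parent and treats a long silence as a signal to exit. This is not BOINC's actual implementation. The class name, the injectable clock, and the timeout value are all assumptions made for illustration.

```python
import time


class HeartbeatWatchdog:
    """Tracks the last heartbeat received from a parent process (illustrative only)."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        # timeout_seconds is a hypothetical value supplied by the caller;
        # it is not taken from BOINC.
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.last_beat = clock()

    def beat(self):
        """Record that a heartbeat message just arrived."""
        self.last_beat = self.clock()

    def expired(self):
        """True once more than timeout_seconds have passed since the last heartbeat."""
        return self.clock() - self.last_beat > self.timeout_seconds


if __name__ == "__main__":
    watchdog = HeartbeatWatchdog(timeout_seconds=1.0)
    watchdog.beat()
    time.sleep(1.2)  # simulate the parent being killed: no more heartbeats arrive
    if watchdog.expired():
        print("no heartbeat from parent; exiting")
```

If the parent is killed outright, as described for the installer, the worker only notices after the timeout elapses and then exits on its own, instead of being told to stop first.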

comment:3 Changed 11 years ago by JacobKlein

I then replied:

From: jacob_w_klein@…
To: romw@…; mavilar@…; r.haselgrove@…
CC: boinc_alpha@…
Subject: RE: [boinc_alpha] Upgrading during NVIDIA tasks crashes driver and interferes with hardware detection
Date: Wed, 3 Apr 2013 06:43:33 -0400

Just to be clear, I am requesting that we look at ways to improve category (1), Setup.
Essentially, I am requesting that the "Setup shutdown procedure" be changed to be the same as the "Manager Exit shutdown procedure".
I created Defect #1236 in Trac for this request.
http://boinc.berkeley.edu/trac/ticket/1236
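
The requested behavior can be sketched as "ask for a clean exit first, and only kill after a timeout". The code below is a generic illustration, not the installer's code. The function name `stop_client`, its callbacks, and the timeout values are all hypothetical, and the caller supplies the callbacks that actually talk to the client.

```python
import time


def stop_client(request_quit, is_running, force_kill,
                timeout_seconds=30.0, poll_seconds=0.5,
                clock=time.monotonic, sleep=time.sleep):
    """Request a graceful exit, then force-kill only if the timeout expires.

    request_quit, is_running and force_kill are caller-supplied callables
    (hypothetical hooks). Returns "graceful" or "killed".
    """
    request_quit()
    deadline = clock() + timeout_seconds
    while is_running():
        if clock() >= deadline:
            # The client did not exit in time; fall back to a forced kill.
            force_kill()
            return "killed"
        sleep(poll_seconds)
    return "graceful"


if __name__ == "__main__":
    # Simulated client that needs three polls before it has exited.
    remaining = [3]

    def still_running():
        remaining[0] -= 1
        return remaining[0] > 0

    result = stop_client(
        request_quit=lambda: print("asking client to quit"),
        is_running=still_running,
        force_kill=lambda: print("killing client"),
        poll_seconds=0.01,
    )
    print(result)
```

On a real system, `request_quit` might wrap something like `boinccmd --quit`, though that is an assumption about how an installer could do it. The forced kill remains as a fallback so the installer can never hang indefinitely.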

Thanks,
Jacob
