Opened 17 years ago

Closed 11 years ago

#336 closed Defect (fixed)

replace heartbeat mechanism

Reported by: davea
Owned by: davea
Priority: Critical
Milestone: Undetermined
Component: BOINC - API
Version:
Keywords:
Cc: Pepo

Description

Problem with the heartbeat mechanism: if the client does something that blocks for > 30 secs (e.g. a synchronous DNS lookup, a disk-space scan, a debugger break) then all apps quit, producing confusing messages and possibly wasting CPU time.

Proposed solution: remove the heartbeat mechanism. Include the client process ID in the app_init_data file. The API periodically checks whether that process is still alive, and exits if not.
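A minimal sketch of the liveness check described above, written in Python for brevity rather than in the language of the BOINC API. It assumes a POSIX system; the name client_alive is hypothetical, and how the PID would be read from app_init_data is left out because the ticket does not specify it.

```python
import os


def client_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    if os.name != "posix":
        # On Windows, os.kill() with signal 0 would terminate the target process,
        # so this sketch refuses to run there.
        raise NotImplementedError("this sketch only supports POSIX systems")
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check, nothing is delivered
    except ProcessLookupError:
        return False     # no such process
    except PermissionError:
        return True      # the process exists but belongs to another user
    return True
```

An app would call client_alive(client_pid) every few seconds and exit when it returns False. One caveat of any PID-based check is that the operating system may reuse the PID for an unrelated process after the client exits.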

Change History (12)

comment:1 Changed 17 years ago by Nicolas

Disagreed. What if the app has a strange deadlock and CPU time is being wasted, but the process is still alive?

The actual "bug" is that the client should NEVER block for > 30 seconds.

Lack of heartbeats causing apps to quit is only one of the many problems that show up (it's just a symptom). For example, the manager also blocks waiting for RPCs. That means if the client blocks doing something, the manager will hang too (= unresponsive GUI, bad for the user; this is another symptom caused by two or three different problems).

The client shouldn't stop responding to RPCs because it's doing something else. The manager shouldn't stop responding to user input because it's waiting for an RPC.

comment:2 Changed 16 years ago by Didactylos

How about improving the heartbeat mechanism? I haven't studied it in depth, but two thoughts come to mind.

  1. Blocking functions in the client or app should be asynchronous.
  2. Would it be possible perhaps to temporarily suspend the heartbeat for an app when it is about to block? Either for a set period or until it stops blocking? There are dangers with this, but the messages would be a lot more informative.
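As an illustration of the first suggestion, here is a small Python sketch of moving a blocking call onto a worker thread so that the main loop can keep running and poll for the result. The names start_blocking and poll are hypothetical and are not part of BOINC.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# Small pool of worker threads reserved for calls that may block for a long time.
_pool = ThreadPoolExecutor(max_workers=2)


def start_blocking(fn, *args) -> Future:
    """Run fn(*args) on a worker thread and return immediately."""
    return _pool.submit(fn, *args)


def poll(task: Future):
    """Return (done, result) without blocking the caller.

    If fn raised an exception, it is re-raised here once the task is done.
    """
    if not task.done():
        return False, None
    return True, task.result()
```

A main loop could start a DNS lookup with start_blocking(socket.gethostbyname, host) (after importing socket) and call poll() on each iteration, continuing to send heartbeats in the meantime.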

comment:3 in reply to:  description ; Changed 16 years ago by Ananas

Replying to davea:

... Proposed solution: remove heartbeat mechanism. ...

If the heartbeat were redefined to be expected within 30 CPU seconds instead of 30 wall-clock seconds, heartbeats would be expected less often when the host itself is unresponsive (7-zipping a huge file with maximum compression has that effect). As high load affects both the core client and the project application, using CPU time would probably be more appropriate. The project application would expect fewer heartbeats when it gets less CPU time itself.

The process CPU time should (hopefully) not be influenced by adjusting the PC clock.

There should not even be any compatibility issues as elapsed CPU time and elapsed wallclock time are not too different most of the time.

So the (unmodified) core clients would still try to send a heartbeat at least every 30 seconds, but the API would be more patient on overloaded systems.

That might also fix the heartbeat problem that occurs when the PC time is adjusted.
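The idea in this comment could be sketched as follows, in Python for brevity. The class name HeartbeatMonitor and its methods are hypothetical. The default clock, time.process_time, returns the CPU time consumed by the current process, so the timeout advances only while the app is actually running.

```python
import time


class HeartbeatMonitor:
    """Track heartbeats against a timeout measured on a pluggable clock."""

    def __init__(self, timeout=30.0, clock=time.process_time):
        self.timeout = timeout
        self.clock = clock            # process CPU time by default, not wall-clock time
        self.last_beat = clock()

    def beat(self):
        """Record that a heartbeat from the client was just received."""
        self.last_beat = self.clock()

    def expired(self):
        """True if more than `timeout` clock seconds passed since the last heartbeat."""
        return self.clock() - self.last_beat > self.timeout
```

Passing clock=time.monotonic instead would give a wall-clock based timeout, which makes it easy to compare the two behaviors.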

comment:4 Changed 16 years ago by duanra

This ticket well deserves its critical priority, because it is a cross-project problem. It happens with some WCG apps and boincsimap, for example. Each time my computer comes out of hibernation mode, it says:

Task 8011101.024330_1 exited with a DLL initialization error. If this happens repeatedly you may need to reboot your computer.

(No effect on rebooting, of course)

But in fact, the stderr.txt file says: No heartbeat from core client for 31 sec - exiting

comment:5 Changed 16 years ago by Pepo

Cc: Pepo added

comment:6 in reply to:  3 ; Changed 16 years ago by Nicolas

Replying to Ananas:

If the heartbeat were redefined to be expected within 30 CPU seconds instead of 30 wall-clock seconds, heartbeats would be expected less often when the host itself is unresponsive (7-zipping a huge file with maximum compression has that effect). As high load affects both the core client and the project application, using CPU time would probably be more appropriate. The project application would expect fewer heartbeats when it gets less CPU time itself.

Heartbeats in the science app are handled by a separate thread that is not low-priority, so other computer load shouldn't affect it.

comment:7 in reply to:  6 Changed 16 years ago by Pepo

Replying to Nicolas:

Replying to Ananas:

As high load affects both the core client and the project application, using CPU time would probably be more appropriate. The project application would expect fewer heartbeats when it gets less CPU time itself.

Heartbeats in the science app are handled by a separate thread that is not low-priority, so other computer load shouldn't affect it.

In my experience, upon resuming from hibernation the apps often seem to be crunching heavily before deciding to disappear, so switching from wall-clock time to CPU time would not change the behavior.

Maybe the client will finally also get a separate heartbeat thread, that is not low-priority...

comment:8 in reply to:  1 Changed 16 years ago by jbk

Replying to Nicolas:

Disagreed. What if the app has a strange deadlock and CPU time is being wasted, but the process is still alive?

The proposal is to include the process ID of the core client so that the science app can check it every now and then. The situation where the science app enters a deadlock is irrelevant to both the current and the proposed system. It's true that the core client could enter a busy-waiting state and keep science apps artificially running. But in that case we've pretty much failed anyway, so it probably doesn't matter.

To keep track of runaway science apps you could add an additional system that checks on the reports coming from science apps:

  • You could limit how long an app is allowed to run without producing any progress reports
  • You could limit how long an app is allowed to run without producing any forward progress
  • Both of the above limits could apply to both CPU time and wall time (with tests for time skips caused by hibernation or summertime/wintertime kicking in).
  • The limits could be part of the workunit XML to allow projects to configure them for their apps.
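The first two limits in the list above could be sketched like this, in Python for brevity. The class ProgressWatchdog, its parameters and the verdict strings are all hypothetical. Time-skip detection and the workunit XML configuration are not covered.

```python
import time


class ProgressWatchdog:
    """Flag an app that stops reporting, or that reports without advancing."""

    def __init__(self, max_silence, max_stall, clock=time.monotonic):
        self.max_silence = max_silence  # longest allowed gap between any two reports
        self.max_stall = max_stall      # longest allowed time without forward progress
        self.clock = clock
        now = clock()
        self.last_report = now
        self.last_advance = now
        self.best_fraction = 0.0

    def report(self, fraction_done):
        """Record a progress report from the app (fraction_done in 0.0 to 1.0)."""
        now = self.clock()
        self.last_report = now
        if fraction_done > self.best_fraction:
            self.best_fraction = fraction_done
            self.last_advance = now

    def verdict(self):
        """Return "ok", "no_reports" or "no_progress" for the current moment."""
        now = self.clock()
        if now - self.last_report > self.max_silence:
            return "no_reports"
        if now - self.last_advance > self.max_stall:
            return "no_progress"
        return "ok"
```

Running one watchdog on a wall-clock source and a second one on the app's CPU time would cover the third bullet, under the assumption that both time sources are available to whoever does the checking.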

comment:9 Changed 15 years ago by romw

Milestone: 6.6 → Undetermined

comment:10 Changed 12 years ago by Christian Beer

I want to revive this ticket so it can be discussed at the 2012 BOINC workshop in London. At the Rechenkraft forum we also discussed switching to the process ID solution. For now we have turned the heartbeat off for some applications because it produces lots of errors when big files are zipped or unzipped.

Maybe we can find a small group at the hackfest that can build a working replacement for the current heartbeat mechanism based on this proposal.

comment:11 Changed 12 years ago by Nicolas

Unzipping large files shouldn't block the client to begin with...

From what I saw in previous comments, handling the case where the core client hangs (e.g. enters an infinite loop) is not a goal, or at least it's not solved by any of the proposals so far. So the problem is making science apps quit if the core client quits.

What I would do is open a pipe between the client and the science app. If the client dies, the science app can immediately notice that the pipe was closed on the other end and quit too. It would be great to do client-app communication over pipes or sockets instead of the current shared memory mess, but that's a much bigger change, and independent of this issue. To detect client death, it's not even necessary to write anything into the pipe.
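A rough sketch of the pipe idea, in Python for brevity and assuming a POSIX system (select does not work on pipes on Windows). The function name peer_gone is hypothetical. It relies on the fact that a read from a pipe returns an empty result once every copy of the write end has been closed, which also happens when the process holding it dies.

```python
import os
import select


def peer_gone(read_fd: int) -> bool:
    """Return True if the write end of the pipe has been closed (POSIX only).

    Assumes nothing is ever written into the pipe, so becoming readable
    can only mean end-of-file.
    """
    readable, _, _ = select.select([read_fd], [], [], 0)  # timeout 0: just poll
    if not readable:
        return False                    # the peer still holds the write end open
    return os.read(read_fd, 1) == b""   # empty read means EOF: the peer is gone
```

In a real client/app pair the client would keep the write end and the app would inherit only the read end. The app must close its own inherited copy of the write end, otherwise end-of-file is never signaled.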

comment:12 Changed 11 years ago by davea

Resolution: fixed
Status: new → closed