Opened 18 years ago
Last modified 15 years ago
#113 reopened Defect
Network connection and No heartbeat message
Reported by: | KSMarksPsych | Owned by: | davea |
---|---|---|---|
Priority: | Minor | Milestone: | Undetermined |
Component: | Client - Daemon | Version: | |
Keywords: | | Cc: | |
Description
As reported at
http://boinc.berkeley.edu/dev/forum_thread.php?id=1561
The behavior appears with 5.8.x, but not with 5.4.11.
"According to the message log, the repeated exit-and-reset stopped when SETI@Home started asking to connect to the net to fetch work and report results. Einstein@Home had been requesting a connection for a couple of minutes before that. It was probably at about that time that I first "woke up" the computer this morning and dialed in to my ISP. I don't know which of these "state changes" might relate to the exit-and-reset stopping."
"It looks rather like the bogus error-and-reset is associated with unsuccessful attempts of the manager to connect with the net, i.e. when I'm not dialed in. It doesn't *always* happen with a connection attempt, but it *only* appears to be happening when an attempt is made, and at that precise time. The "defer for 1 minute" behaviour seems to be what was causing that 63-second reset."
"Hmm, no. For me, the problem recurred repeatedly all the time that I *wasn't* dialled in, roughly once per minute (since the manager repeated its connection attempt with a 1-minute delay). The problem *stopped* as soon as the connection was established. I don't recall seeing any occurrences of the problem between when I connected and when I logged off."
"Okay, I've been doing this with 5.8.15 for the last week and a half or so. Having to enable/disable the network connectivity manually is indeed an annoyance, but I haven't had that error occur once in that time. It looks like you may have found the culprit."
Change History (17)
comment:1 follow-up: 2 Changed 18 years ago by
comment:2 follow-up: 3 Changed 18 years ago by
Replying to MikeMarsUK:
For 'Nickolas' please read Nicolas.
I feel this is potentially a big bug for the following categories of users:
- Anyone with an iffy connection, since it'll cause the Boinc manager to lock up
- Anyone with dial-up, since if Boinc makes a DNS query during the establishment or disestablishment of their connection, it could freeze
- Anyone running a long-duration work unit, since it makes their work unit vulnerable to transient network issues
Not much of an issue for ADSL users running small work units, since it doesn't matter if a few get trashed sometimes.
comment:3 follow-up: 4 Changed 18 years ago by
ARGH!!! Until just now this problem has been purely theoretical for me, but I've just lost three climate models after a network issue (firewall crashed taking down comms, including localhost traffic, for 12 hours). There was approx 2 hours between the firewall crashing and the climate models crashing.
Boinc manager: 5.8.16
Project: Seasonal Attribution Project
PC: AMD X2 4600, 2GB RAM
OS: XP SP2
Network: ADSL via router
ZoneAlarm security suite: 7.0.337.000
Work units:
http:**attribution.cpdn.org*result.php?resultid=61997
http:**attribution.cpdn.org*result.php?resultid=61960
http:**attribution.cpdn.org*result.php?resultid=62318
ARGH!!! again, Trac thinks the above links and links in the logs are spam. Can project sites be whitelisted? To fix, replace * with /
vsmon.exe (ZoneAlarm's virus checker) crashed at 8:17UTC, while the PC was unattended, and as a result took down ZoneAlarm itself, blocking comms to the seasonal attribution project servers as well as localhost traffic. Came home 12 hours later, and all 3 of my SAP models on the PC were killed.
SAP doesn't upload its climate as it goes, so that's a loss of 300 CPU hours as far as the project goes.
vsmon has crashed a few times in the past (once every few months), but it's always been with older versions of the boinc manager (5.4.x) rather than 5.8.x, and it's never affected any work units before.
Approx time that VSMon crashed would have been 9:17 GMT in the log (8:17 UTC), so it was just over two hours before the models crashed.
2007-04-23 11:28:17 [CPDN Seasonal Attribution Project] [task_debug] Process for hadam3h_n_167s1_006b_006b_1_1 exited
2007-04-23 11:28:17 [CPDN Seasonal Attribution Project] Task hadam3h_n_167s1_006b_006b_1_1 exited with zero status but no 'finished' file
2007-04-23 11:28:17 [CPDN Seasonal Attribution Project] If this happens repeatedly you may need to reset the project.
2007-04-23 11:28:17 [CPDN Seasonal Attribution Project] [task_debug] task_state=UNINITIALIZED for hadam3h_n_167s1_006b_006b_1_1 from handle_exited_app
2007-04-23 11:28:17 [CPDN Seasonal Attribution Project] [task_debug] Process for hadam3h_a_111s4_2000_2000_1_1 exited
2007-04-23 11:28:17 [CPDN Seasonal Attribution Project] Task hadam3h_a_111s4_2000_2000_1_1 exited with zero status but no 'finished' file
2007-04-23 11:28:17 [CPDN Seasonal Attribution Project] If this happens repeatedly you may need to reset the project.
2007-04-23 11:28:17 [CPDN Seasonal Attribution Project] [task_debug] task_state=UNINITIALIZED for hadam3h_a_111s4_2000_2000_1_1 from handle_exited_app
2007-04-23 11:28:17 [CPDN Seasonal Attribution Project] [task_debug] task_state=EXECUTING for hadam3h_n_167s1_006b_006b_1_1 from start
2007-04-23 11:28:17 [CPDN Seasonal Attribution Project] Restarting task hadam3h_n_167s1_006b_006b_1_1 using hadam3 version 407
2007-04-23 11:28:17 [CPDN Seasonal Attribution Project] [task_debug] task_state=EXECUTING for hadam3h_a_111s4_2000_2000_1_1 from start
2007-04-23 11:28:17 [CPDN Seasonal Attribution Project] Restarting task hadam3h_a_111s4_2000_2000_1_1 using hadam3 version 407
2007-04-23 11:28:17 [CPDN Seasonal Attribution Project] Scheduler request failed: a timeout was reached
2007-04-23 11:28:17 [CPDN Seasonal Attribution Project] Deferring communication for 1 min 0 sec
2007-04-23 11:28:17 [CPDN Seasonal Attribution Project] Reason: scheduler request failed
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project] [task_debug] Process for hadam3h_n_167s1_006b_006b_1_1 exited
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project] [task_debug] task_state=EXITED for hadam3h_n_167s1_006b_006b_1_1 from handle_exited_app
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project] Deferring communication for 1 min 0 sec
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project] Reason: Unrecoverable error for result hadam3h_n_167s1_006b_006b_1_1 ( - exit code -1073741502 (0xc0000142))
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project] [task_debug] result state=COMPUTE_ERROR for hadam3h_n_167s1_006b_006b_1_1 from CS::report_result_error
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project] [task_debug] Process for hadam3h_n_167s1_006b_006b_1_1 exited
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project] [task_debug] exit code -1073741502 (0xc0000142):
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project] [task_debug] Process for hadam3h_a_111s4_2000_2000_1_1 exited
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project] [task_debug] task_state=EXITED for hadam3h_a_111s4_2000_2000_1_1 from handle_exited_app
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project] [task_debug] result state=COMPUTE_ERROR for hadam3h_a_111s4_2000_2000_1_1 from CS::report_result_error
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project] [task_debug] Process for hadam3h_a_111s4_2000_2000_1_1 exited
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project] [task_debug] exit code -1073741502 (0xc0000142):
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project] Computation for task hadam3h_n_167s1_006b_006b_1_1 finished
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project] Output file hadam3h_n_167s1_006b_006b_1_1_1.zip for task hadam3h_n_167s1_006b_006b_1_1 absent
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project] Output file hadam3h_n_167s1_006b_006b_1_1_2.zip for task hadam3h_n_167s1_006b_006b_1_1 absent
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project] Output file hadam3h_n_167s1_006b_006b_1_1_3.zip for task hadam3h_n_167s1_006b_006b_1_1 absent
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project] Output file hadam3h_n_167s1_006b_006b_1_1_4.zip for task hadam3h_n_167s1_006b_006b_1_1 absent
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project] Output file hadam3h_n_167s1_006b_006b_1_1_5.zip for task hadam3h_n_167s1_006b_006b_1_1 absent
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project] [task_debug] result state=COMPUTE_ERROR for hadam3h_n_167s1_006b_006b_1_1 from CS::app_finished
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project] Starting hadam3h_n_014s1_003b_003b_2_1
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project] [task_debug] task_state=EXECUTING for hadam3h_n_014s1_003b_003b_2_1 from start
2007-04-23 11:28:18 [CPDN Seasonal Attribution Project] Starting task hadam3h_n_014s1_003b_003b_2_1 using hadam3 version 407
2007-04-23 11:28:19 [CPDN Seasonal Attribution Project] [task_debug] Process for hadam3h_n_014s1_003b_003b_2_1 exited
2007-04-23 11:28:19 [CPDN Seasonal Attribution Project] [task_debug] task_state=EXITED for hadam3h_n_014s1_003b_003b_2_1 from handle_exited_app
2007-04-23 11:28:19 [CPDN Seasonal Attribution Project] Reason: Unrecoverable error for result hadam3h_n_014s1_003b_003b_2_1 ( - exit code -1073741502 (0xc0000142))
2007-04-23 11:28:19 [CPDN Seasonal Attribution Project] [task_debug] result state=COMPUTE_ERROR for hadam3h_n_014s1_003b_003b_2_1 from CS::report_result_error
2007-04-23 11:28:19 [CPDN Seasonal Attribution Project] [task_debug] Process for hadam3h_n_014s1_003b_003b_2_1 exited
2007-04-23 11:28:19 [CPDN Seasonal Attribution Project] [task_debug] exit code -1073741502 (0xc0000142):
2007-04-23 11:28:19 [CPDN Seasonal Attribution Project] Computation for task hadam3h_a_111s4_2000_2000_1_1 finished
2007-04-23 11:28:19 [CPDN Seasonal Attribution Project] Output file hadam3h_a_111s4_2000_2000_1_1_1.zip for task hadam3h_a_111s4_2000_2000_1_1 absent
2007-04-23 11:28:19 [CPDN Seasonal Attribution Project] Output file hadam3h_a_111s4_2000_2000_1_1_2.zip for task hadam3h_a_111s4_2000_2000_1_1 absent
2007-04-23 11:28:19 [CPDN Seasonal Attribution Project] Output file hadam3h_a_111s4_2000_2000_1_1_3.zip for task hadam3h_a_111s4_2000_2000_1_1 absent
2007-04-23 11:28:19 [CPDN Seasonal Attribution Project] Output file hadam3h_a_111s4_2000_2000_1_1_4.zip for task hadam3h_a_111s4_2000_2000_1_1 absent
2007-04-23 11:28:19 [CPDN Seasonal Attribution Project] Output file hadam3h_a_111s4_2000_2000_1_1_5.zip for task hadam3h_a_111s4_2000_2000_1_1 absent
2007-04-23 11:28:19 [CPDN Seasonal Attribution Project] [task_debug] result state=COMPUTE_ERROR for hadam3h_a_111s4_2000_2000_1_1 from CS::app_finished
2007-04-23 11:28:20 [CPDN Seasonal Attribution Project] Computation for task hadam3h_n_014s1_003b_003b_2_1 finished
2007-04-23 11:28:20 [CPDN Seasonal Attribution Project] Output file hadam3h_n_014s1_003b_003b_2_1_1.zip for task hadam3h_n_014s1_003b_003b_2_1 absent
2007-04-23 11:28:20 [CPDN Seasonal Attribution Project] Output file hadam3h_n_014s1_003b_003b_2_1_2.zip for task hadam3h_n_014s1_003b_003b_2_1 absent
2007-04-23 11:28:20 [CPDN Seasonal Attribution Project] Output file hadam3h_n_014s1_003b_003b_2_1_3.zip for task hadam3h_n_014s1_003b_003b_2_1 absent
2007-04-23 11:28:20 [CPDN Seasonal Attribution Project] Output file hadam3h_n_014s1_003b_003b_2_1_4.zip for task hadam3h_n_014s1_003b_003b_2_1 absent
2007-04-23 11:28:20 [CPDN Seasonal Attribution Project] Output file hadam3h_n_014s1_003b_003b_2_1_5.zip for task hadam3h_n_014s1_003b_003b_2_1 absent
2007-04-23 11:28:20 [CPDN Seasonal Attribution Project] [task_debug] result state=COMPUTE_ERROR for hadam3h_n_014s1_003b_003b_2_1 from CS::app_finished
_____________________________________________________________________________
23/04/2007 20:48:26||Starting BOINC client version 5.8.16 for windows_intelx86
23/04/2007 20:48:26||log flags: task, file_xfer, sched_ops, task_debug, sched_op_debug, checkpoint_debug
23/04/2007 20:48:26||Libraries: libcurl/7.16.0 OpenSSL/0.9.8a zlib/1.2.3
23/04/2007 20:48:26||Data directory: E:\Program Files\BOINC
23/04/2007 20:48:27||[task_debug] result state=FILES_UPLOADED for hadam3h_n_167s1_006b_006b_1_1 from RESULT::parse_state
23/04/2007 20:48:27||[task_debug] result state=FILES_UPLOADED for hadam3h_a_111s4_2000_2000_1_1 from RESULT::parse_state
23/04/2007 20:48:27||[task_debug] result state=FILES_UPLOADED for hadam3h_n_014s1_003b_003b_2_1 from RESULT::parse_state
23/04/2007 20:48:27||Processor: 2 AuthenticAMD AMD Athlon(tm) 64 X2 Dual Core Processor 4600+ [x86 Family 15 Model 43 Stepping 1] [fpu tsc pae nx sse sse2 3dnow mmx]
23/04/2007 20:48:27||Memory: 2.00 GB physical, 3.85 GB virtual
23/04/2007 20:48:27||Disk: 71.56 GB total, 41.05 GB free
23/04/2007 20:48:28|CPDN Seasonal Attribution Project|URL: http:**attribution.cpdn.org*; Computer ID: 17357; location: home; project prefs: default
23/04/2007 20:48:28|BBC Climate Change Experiment|URL: http:**bbc.cpdn.org*; Computer ID: 102303; location: home; project prefs: default
23/04/2007 20:48:28|rosetta@home|URL: http:**boinc.bakerlab.org*rosetta*; Computer ID: 138077; location: home; project prefs: default
23/04/2007 20:48:28|Climateprediction.net Beta|URL: http:**climateapps1.oucs.ox.ac.uk*beta*; Computer ID: 118; location: home; project prefs: default
23/04/2007 20:48:28|climateprediction.net|URL: http:**climateprediction.net*; Computer ID: 525881; location: home; project prefs: default
23/04/2007 20:48:28|Zivis|URL: http:**zivis.bifi.unizar.es*; Computer ID: 1086; location: (none); project prefs: default
'''Above mangled to try to get round Trac's refusal to include logs in the post'''
23/04/2007 20:48:28||General prefs: from climateprediction.net (last modified 2007-04-21 12:37:49)
23/04/2007 20:48:28||Host location: home
23/04/2007 20:48:28||General prefs: no separate prefs for home; using your defaults
23/04/2007 20:48:28||[sched_op_debug] SCHEDULER_OP::init_op_project(): starting op for http://attribution.cpdn.org/
23/04/2007 20:48:31|CPDN Seasonal Attribution Project|Sending scheduler request: Requested by user
23/04/2007 20:48:31|CPDN Seasonal Attribution Project|Requesting 17280000 seconds of new work, and reporting 3 completed tasks
23/04/2007 20:48:33|CPDN Seasonal Attribution Project|Scheduler RPC succeeded [server version 505]
23/04/2007 20:48:33||[sched_op_debug] handle_scheduler_reply(): got ack for result hadam3h_n_167s1_006b_006b_1_1
23/04/2007 20:48:33||[sched_op_debug] handle_scheduler_reply(): got ack for result hadam3h_a_111s4_2000_2000_1_1
23/04/2007 20:48:33||[sched_op_debug] handle_scheduler_reply(): got ack for result hadam3h_n_014s1_003b_003b_2_1
23/04/2007 20:48:33|CPDN Seasonal Attribution Project|Deferring communication for 1 min 0 sec
23/04/2007 20:48:33|CPDN Seasonal Attribution Project|Reason: no work from project
23/04/2007 20:49:25||Suspending computation - user request
23/04/2007 20:50:03||Resuming computation
23/04/2007 20:50:03|BBC Climate Change Experiment|Starting hadcm3ohf_cd6l_00977177_1
23/04/2007 20:50:07|BBC Climate Change Experiment|[task_debug] task_state=EXECUTING for hadcm3ohf_cd6l_00977177_1 from start
23/04/2007 20:50:07|BBC Climate Change Experiment|Starting task hadcm3ohf_cd6l_00977177_1 using hadcm3 version 515
23/04/2007 20:50:07||[sched_op_debug] SCHEDULER_OP::init_op_project(): starting op for http:**attribution.cpdn.org* '''mangled'''
23/04/2007 20:50:07|CPDN Seasonal Attribution Project|Sending scheduler request: To fetch work
23/04/2007 20:50:07|CPDN Seasonal Attribution Project|Requesting 4320000 seconds of new work
23/04/2007 20:50:12|CPDN Seasonal Attribution Project|Scheduler RPC succeeded [server version 505]
23/04/2007 20:50:12|CPDN Seasonal Attribution Project|Deferring communication for 1 min 56 sec
23/04/2007 20:50:12|CPDN Seasonal Attribution Project|Reason: no work from project
23/04/2007 20:52:12||[sched_op_debug] SCHEDULER_OP::init_op_project(): starting op for http:**attribution.cpdn.org* '''mangled'''
23/04/2007 20:52:13|CPDN Seasonal Attribution Project|Sending scheduler request: To fetch work
23/04/2007 20:52:13|CPDN Seasonal Attribution Project|Requesting 4320000 seconds of new work
23/04/2007 20:52:18|CPDN Seasonal Attribution Project|Scheduler RPC succeeded [server version 505]
23/04/2007 20:52:18|CPDN Seasonal Attribution Project|Deferring communication for 6 min 1 sec
23/04/2007 20:52:18|CPDN Seasonal Attribution Project|Reason: no work from project
23/04/2007 20:58:23||[sched_op_debug] SCHEDULER_OP::init_op_project(): starting op for http:**attribution.cpdn.org* '''mangled'''
23/04/2007 20:58:23|CPDN Seasonal Attribution Project|Sending scheduler request: To fetch work
23/04/2007 20:58:23|CPDN Seasonal Attribution Project|Requesting 4320000 seconds of new work
23/04/2007 20:58:28|CPDN Seasonal Attribution Project|Scheduler RPC succeeded [server version 505]
23/04/2007 20:58:28|CPDN Seasonal Attribution Project|Deferring communication for 16 min 10 sec
23/04/2007 20:58:28|CPDN Seasonal Attribution Project|Reason: no work from project
23/04/2007 21:03:32|BBC Climate Change Experiment|[task_debug] result hadcm3ohf_cd6l_00977177_1 checkpointed
comment:4 Changed 18 years ago by
Found a very similar log at the following location:
http://www.climateprediction.net/board/viewtopic.php?p=62746#62746
This one shows both a malariacontrol and a climate model crashing simultaneously with an 0xc0000142 exception. The log also shows 'communications deferred 1 minute' type messages, although the user (Haraldo) is not aware of any network issues.
Boinc manager version is 5.8.15.
0xc0000142 is said to be a problem with initialising applications, in particular not being able to load DLLs.
I'm guessing that the DNS issue causes the science application(s) to continually drop out and restart. If there is some sort of resource leak due to the crash then this would only happen a finite number of times, followed by all non-suspended work units on the PC crashing in quick succession?
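For anyone matching up the two forms of the exit code in the logs: the negative number and the hex value are the same 32-bit quantity, just printed signed versus unsigned. A small Python sketch of the conversion follows. The helper name is made up for illustration, and the status name in the comment is the standard Windows NTSTATUS symbol for that value, which is consistent with the "can't load DLLs" description above.

```python
def as_ntstatus_hex(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as an unsigned hex string.

    Hypothetical helper for reading BOINC logs; not part of BOINC itself.
    """
    # Masking with 0xFFFFFFFF yields the two's-complement bit pattern.
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


# -1073741502 is the value logged for the crashed tasks.
# 0xC0000142 is STATUS_DLL_INIT_FAILED in the Windows NTSTATUS table.
print(as_ntstatus_hex(-1073741502))  # prints 0xc0000142
```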
comment:5 Changed 17 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
Possibly fixed in 5.9.12. Please reopen if not. -- David
comment:6 Changed 17 years ago by
As someone who isn't participating in the development process but has had some of the problems resulting from this bug... should I try this "MAY BE UNSTABLE - USE ONLY FOR TESTING" version, or wait for release of a later "recommended" version?
comment:8 Changed 17 years ago by
See also:
http://boinc.berkeley.edu/trac/ticket/282 (don't know the version)
http://boinc.berkeley.edu/trac/ticket/171 (duplicate of 113)
comment:9 Changed 17 years ago by
Resolution: | fixed |
---|---|
Status: | closed → reopened |
Reopened. Possibly related to #282
2nd part of problem as defined by Nicolas:
- BOINC Manager hangs and doesn't respond to keyboard/mouse if the core client doesn't reply; which could be caused by many reasons, including a hanged client, or a network problem (if the manager is connected to a remote client).
Technical details for the second problem: this is because the manager uses blocking I/O to communicate with the client.
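To illustrate the difference a timeout makes, here is a minimal Python sketch of a request/reply call that gives up after a bounded wait instead of blocking forever. This is not the BOINC Manager's actual GUI RPC code: the function name, the "ping"/"pong" payloads, and the 0.2 second timeout are all assumptions made for the example, and a local socket pair stands in for the manager-to-client connection.

```python
import socket


def rpc_with_timeout(sock, request, timeout_s=1.0):
    """Send a request and wait at most timeout_s for a reply.

    Returns the reply bytes, or None if the peer stays silent, so a
    GUI caller could keep handling mouse and keyboard events.
    """
    sock.settimeout(timeout_s)
    sock.sendall(request)
    try:
        return sock.recv(4096)
    except socket.timeout:
        return None


def demo():
    # A connected socket pair stands in for manager <-> core client.
    manager, client = socket.socketpair()
    try:
        # Case 1: the "client" is hung and never answers.
        silent = rpc_with_timeout(manager, b"ping", timeout_s=0.2)
        # Case 2: a reply is already waiting when the manager asks.
        client.sendall(b"pong")
        answered = rpc_with_timeout(manager, b"ping", timeout_s=0.2)
        return silent, answered
    finally:
        manager.close()
        client.close()


if __name__ == "__main__":
    print(demo())
```

With a plain blocking recv() and no timeout, the first case would never return, which is the hang described above.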
comment:11 follow-up: 12 Changed 17 years ago by
The problem appears to still be in 5.10.28. I'm still seeing the "Exit 0 status no finished file" message. I've reverted again to 5.4.11.
comment:12 follow-up: 13 Changed 17 years ago by
Replying to Bunsen:
The problem appears to still be in 5.10.28. I'm still seeing the "Exit 0 status no finished file" message.
Not necessarily! "Exit 0 status with no finished file" can have many causes. The problems described in #113 (and related tickets) are tied to DNS problems.
(I suffered from this issue for nearly a year, often (but not only) when moving a running notebook between various networks, but not anymore since... this summer? (northern hemisphere))
comment:13 Changed 17 years ago by
Replying to Pepo:
Replying to Bunsen:
The problem appears to still be in 5.10.28. I'm still seeing the "Exit 0 status no finished file" message.
Not necessarily! "Exit 0 status with no finished file" can have many causes. The problems described in #113 (and related tickets) are tied to DNS problems.
I'm a chemist, not a software engineer. (Insert a Star Trek joke here if you like.) I don't understand a lot about network operations and such-like, but I'm a reasonably competent observer. I noted a recurring problem with my projects resetting every minute or so while they weren't able to communicate with the servers, and was told that the probable cause was this DNS thing. My response is: "If you say so... can you fix it?" It doesn't ever happen with 5.4.11; it's happened with all the later versions that I've tried.
And now I'm reporting that I'm seeing the same behaviour in 5.10.28: the manager gets into a state such that the projects reset every minute, and make no progress as a result. I'm hoping that the observation might be useful in diagnosing the problem... and of course, perhaps to nudge the problem back into someone's attention. I'd prefer to be using a current version of the manager, but for people like me who don't have a highly-reliable, always-on network connection, this problem is a serious disincentive to doing so.
comment:14 Changed 17 years ago by
As far as I can remember, 5.4.11 had asynchronous DNS, whereas the later versions went to synchronous DNS. That might explain why 5.4.11 works, but the more recent versions fail.
Is there any chance of reverting to async DNS? I feel this problem is more severe than the stale-DNS-cache issue which prompted the move to synchronous DNS in the first place.
Perhaps the timeout on the sync DNS query should be shortened to a few seconds, falling back to an async call if it times out.
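A rough Python sketch of the "bounded wait" idea, for illustration only: the lookup runs in a worker thread and the caller stops waiting after a few seconds. The function name, the 3 second default, and the injectable resolver parameter are assumptions for the example; the BOINC client is written in C++ and uses libcurl, so this shows the shape of the approach rather than a patch.

```python
import concurrent.futures
import socket
import time


def resolve_with_timeout(host, timeout_s=3.0, resolver=socket.gethostbyname):
    """Resolve host in a worker thread, waiting at most timeout_s.

    Returns the address string, or None on timeout or lookup failure.
    The caller is never blocked for longer than timeout_s.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(resolver, host)
    try:
        return future.result(timeout=timeout_s)
    except (concurrent.futures.TimeoutError, OSError):
        # Timed out, or the resolver raised (socket.gaierror is an OSError).
        return None
    finally:
        # Do not wait for a stuck lookup; let the worker finish on its own.
        pool.shutdown(wait=False)


if __name__ == "__main__":
    # Simulate a DNS server that does not answer for a while.
    def slow_resolver(_host):
        time.sleep(0.5)
        return "192.0.2.1"

    print(resolve_with_timeout("example.invalid", 0.05, slow_resolver))
```

The abandoned lookup still runs to completion in the background, so this bounds how long the caller waits rather than how long the query takes.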
comment:15 follow-up: 16 Changed 17 years ago by
Replying to KSMarksPsych:
As reported at
I am also having this problem over at ABC:
If you look at computer 27978, you will see that work units 5342497 and 5342483 are both in error with "No heartbeat from core client for 31 sec - exiting".
These two work units were running on a Linux PC with BOINC version 5.10.21 for 64-bit Linux when the network went down here at home. When it went down, BOINC froze, probably while trying to connect, and the four cores (Q6600) were not crunching. I rebooted the PC, and when it restarted the ABC work units above were both in error with the no-heartbeat error. You can see when my network stops, as there are between two and four work units in error together.
This is happening on a few machines, all of them 64-bit Linux.
BOINC freezes up and stops processing work whenever the network goes down. I have the Activity set to "always available", as it usually is. I have 12 PCs here at home and I do not want to have to manage the network activity manually by "suspending activity" and restarting it when the network is up again.
I am returning work units to multiple projects that are in error due to this problem.
comment:16 follow-up: 17 Changed 17 years ago by
Replying to Dingo:
I have 10 PCs on a network, 3 are 64-bit Linux running BOINC 5.8.17 and the rest are 32-bit Windows running BOINC 5.10.13. In the past week I have had several hundred work units error out with this problem. Furthermore, when the problem develops (usually on one of the Linux machines first, particularly if running ABC at the time) this then brings the entire network down, freezing BOINC on all PCs and requiring me to shut all PCs down and bring them back up one by one.
It has taken quite a lot of debugging to find the source of the problem, including having replaced all network components in case of a hardware issue. But it has now been traced back to the BOINC DNS issue. Just 30 seconds without a viable internet connection is enough to start the cascade.
comment:17 Changed 17 years ago by
Replying to Wang Solutions:
Perhaps as a temporary work-around you could install a reliable DNS server on your local network (rather than using your ISP's DNS?). Personally I'm using a router which has DNS included.
This was Nicolas's analysis of the bug:
There is a network-related problem that can cause this.
BOINC recently switched to using synchronous DNS resolving, in an attempt to work around a DNS cache bug. That means the core client can't do anything while it's waiting for the DNS to respond; it's essentially hanged until it gets a reply. If the DNS server is not replying, for example, if your internet connection has problems, it takes a relatively long time (say 30 seconds) for it to finally give up. During this period, the science app can't communicate with the core client (as the core client is "hanged", it can't reply). It may quit with the error "No heartbeat from core client for 30 seconds, exiting".
When the core client finally gets either a reply from the DNS server, or a timeout, and starts being able to do other things, it notices the science applications had suddenly disappeared. So it gives the error "Task [name] exited with zero status but no 'finished' file. If this happens repeatedly you may need to reset the project." That's the part where the clueless user follows instructions, resets project, and makes the project lose a climate model, all because of a slow or non-working Internet connection!
Another problem this DNS thing causes is unresponsive manager. BOINC Manager has always used blocking I/O for GUI RPCs. That means the BOINC Manager can't do anything while it's waiting for the core client to respond; it's essentially hanged until it gets a reply. If the core client is hanged waiting for DNS, it can't respond to the manager, so the manager can't respond to mouseclicks. It all ends in getting a completely unresponsive GUI, all because of a slow or non-working Internet connection!
Summary: A chain of nasty events. To solve everything I point out in this message, a lot of fixes would be needed.
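The first link in that chain can be sketched as a toy simulation in Python. It is only a model of the description above, not BOINC code: the function name is made up, the one-heartbeat-per-second rate is an assumption, and the 30 second limit is taken from the "No heartbeat from core client for 30 seconds" message.

```python
def simulate(dns_block_start, dns_block_len, heartbeat_timeout=30, run_for=120):
    """Return the second at which the science app gives up, or None.

    Toy model: the core client sends one heartbeat per simulated second,
    except while it is stuck inside a blocking DNS lookup.
    """
    last_heartbeat = 0
    for t in range(run_for + 1):
        blocked = dns_block_start <= t < dns_block_start + dns_block_len
        if not blocked:
            last_heartbeat = t  # client main loop ran and pinged the app
        if t - last_heartbeat > heartbeat_timeout:
            return t  # app exits: no heartbeat for more than the limit
    return None


if __name__ == "__main__":
    print(simulate(10, 5))   # short stall, the app survives
    print(simulate(10, 60))  # long stall, the app exits mid-lookup
```

In this model a stall of 30 seconds or less is survivable and anything longer kills every running app at the same moment, which would fit the simultaneous crashes reported earlier in the ticket.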