Opened 15 years ago

Last modified 15 years ago

#903 new Defect

BOINC Manager Freeze

Reported by: The Gas Giant
Owned by: romw
Priority: Minor
Milestone: Undetermined
Component: Manager
Version: 6.6.28
Keywords: Freeze
Cc:

Description

I’m seeing my computer (Q9450, WIN XP x86, 2GB RAM) totally freeze up for a minute or so when BOINC tries to make a network connection while the computer thinks a network is available but the router has actually frozen. This has occurred for nearly all BOINC revisions that I can remember, even 6.6.28. Most of the time it’s not a problem, but recently my router (or ISP) has been flaky and it’s becoming a major annoyance. The computer will come back briefly, then freeze again as soon as BOINC tries to connect again. The “communicating with client please wait” window may or may not come up when this occurs. If BOINC is not running, I do not see this problem.

Change History (4)

comment:1 Changed 15 years ago by Nicolas

This is actually three different problems that, together, cause the freeze. I'm sure they are reported already. But I think I will mark all of them as duplicate and open my own tickets explaining the underlying causes...

comment:2 Changed 15 years ago by romw

Owner: changed from romw to charlief

Seems like more Async-RPC issues.

comment:3 Changed 15 years ago by charlief

Please clarify: the title says "BOINC Manager Freeze" but your description says that your computer totally freezes. Is it the entire computer, or just BOINC Manager?

Does trying to connect from another application (such as Internet Explorer) also cause the computer to freeze, or is it only BOINC?

When the "Communicating with client please wait" window does _not_ appear, does BOINC respond to menu commands such as Advanced / Options?

comment:4 Changed 15 years ago by charlief

Owner: changed from charlief to romw

I asked Nicolas for more info, and he wrote:

The client may hang for many reasons, sometimes briefly, sometimes for as long as a minute. Examples include doing a DNS request (if libcurl isn't compiled with async name lookups and the DNS server is down), cleaning a slot with lots of files when a workunit finishes, calculating disk usage when there are lots of files in project or slot directories, checking the MD5 of a giant file, and copying a giant file from project to slot or vice versa (if there is <copy_file/>).

Those are possible causes for client hangs. Now for the consequences:

The manager used to hang immediately after the client hung, because it sent GUI RPC requests to the client and then *blocked* waiting for a reply. Async GUI RPCs may have fixed this particular problem.
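To illustrate the difference between a blocking request and an asynchronous one, here is a minimal Python sketch. It is not BOINC Manager's actual code: `call_with_timeout`, `fast_client` and `hung_client` are hypothetical names, and the timings are shortened so the example runs quickly. The idea is only that the caller waits a bounded time for the reply instead of blocking until the client answers.

```python
import queue
import threading
import time


def call_with_timeout(request_fn, timeout_s):
    """Run a blocking request on a worker thread and wait at most timeout_s.

    Returns ("ok", result) if the reply arrived in time, or ("pending", None)
    if it did not. The worker keeps running in the background either way.
    """
    replies = queue.Queue(maxsize=1)

    def worker():
        # The blocking call happens here, off the caller's thread.
        replies.put(request_fn())

    threading.Thread(target=worker, daemon=True).start()
    try:
        return ("ok", replies.get(timeout=timeout_s))
    except queue.Empty:
        return ("pending", None)


def fast_client():
    # Hypothetical stand-in for a client that answers immediately.
    return "state: 3 tasks running"


def hung_client():
    # Hypothetical stand-in for a client stuck in a blocking operation.
    time.sleep(0.5)
    return "state: 3 tasks running"


if __name__ == "__main__":
    print(call_with_timeout(fast_client, timeout_s=1.0))
    print(call_with_timeout(hung_client, timeout_s=0.05))
```

With a hung client the caller gets control back after the timeout and can keep its user interface responsive, for example by showing a "please wait" dialog, rather than freezing along with the client.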

But there is another consequence that remains: if the client hangs for more than 30 seconds, science apps will think the client quit (because it's not sending "heartbeat" messages), and will quit too. When the client unblocks, it will notice the apps disappeared, and restart them.
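The heartbeat check can be pictured with a small Python sketch. This is an illustration only, not the logic used by the BOINC API: the class and method names are hypothetical, and the only figure taken from this ticket is the 30-second limit. Time is passed in explicitly so the behavior is easy to follow.

```python
HEARTBEAT_TIMEOUT_S = 30.0  # figure quoted in this ticket


class HeartbeatWatchdog:
    """Tracks when a science app last heard from the client (hypothetical)."""

    def __init__(self, now, timeout_s=HEARTBEAT_TIMEOUT_S):
        self.timeout_s = timeout_s
        self.last_heartbeat = now

    def on_heartbeat(self, now):
        # Called whenever a heartbeat message arrives from the client.
        self.last_heartbeat = now

    def client_presumed_dead(self, now):
        # True once more than timeout_s has passed without a heartbeat.
        return (now - self.last_heartbeat) > self.timeout_s


if __name__ == "__main__":
    watchdog = HeartbeatWatchdog(now=0.0)
    watchdog.on_heartbeat(now=10.0)  # last heartbeat before the client blocks
    for t in (35.0, 45.0):
        print(t, watchdog.client_presumed_dead(now=t))
```

In this sketch a client that blocks for a full minute after its last heartbeat crosses the limit halfway through the hang, which is why the apps are already gone by the time the client recovers.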

The user-visible behavior of this set of problems is strange: the user's Internet connection goes down and BOINC Manager hangs, sometimes coming back to show the bad news that WUs are giving errors: "app quit with zero status but no finished file, if this happens repeatedly you may want to reset". The symptoms seem completely unrelated.

The full problem chain for that situation is: Internet is down, client tries to contact project, gets blocked for a whole minute before timing out on the DNS request; meanwhile, the manager is blocked waiting for a GUI RPC reply from the client, and science apps kill themselves thinking the core client quit because it's not sending heartbeats. When the client finally times out the DNS request, it notices science apps are gone, restarts them, and starts answering GUI RPCs again. But since a whole minute passed, it's possible the backoff for another project reached zero by now, so the process repeats. This continues until the Internet connection is back working, or all projects get large-ish backoffs, or the user quickly disables network activity before it hangs again.
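The repeating cycle can be made concrete with a toy simulation in Python. Everything here is an assumption made for illustration: the function name, the two-project setup, and the backoff values are hypothetical, and the real client's scheduling is more involved. The only number borrowed from the description is the one-minute block per failed contact.

```python
def simulate_hangs(next_contact_s, block_s, retry_backoff_s, horizon_s):
    """Return the (start, end) intervals during which the client is blocked.

    next_contact_s:   initial times at which each project becomes due
    block_s:          how long one failed contact blocks the client
    retry_backoff_s:  backoff applied to a project after a failed contact
    horizon_s:        stop simulating at this time (network stays down)
    """
    due = list(next_contact_s)
    hangs = []
    now = 0.0
    while True:
        # Pick the project whose backoff expires first.
        i = min(range(len(due)), key=lambda k: due[k])
        start = max(now, due[i])
        if start >= horizon_s:
            break
        end = start + block_s  # client is hung for the whole attempt
        hangs.append((start, end))
        due[i] = end + retry_backoff_s
        now = end
    return hangs


if __name__ == "__main__":
    # Two projects, one-minute block, short backoff: hangs run back to back.
    print(simulate_hangs([0, 30], block_s=60, retry_backoff_s=60, horizon_s=300))
    # Same setup with a large backoff: the cycle stops after two hangs.
    print(simulate_hangs([0, 30], block_s=60, retry_backoff_s=600, horizon_s=300))
```

With short backoffs, some project is always due by the time the previous attempt times out, so the client is blocked almost continuously. Larger backoffs break the cycle, which matches the "all projects get large-ish backoffs" exit condition.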

A lot of this description may be outdated by now. I'm saying what used to happen back when I saw this problem on my own machines. Now the situation has surely changed. For example, BOINC Manager has async RPCs, and I think the current Windows binaries have async DNS enabled (they only lacked it for a short period of time). But since I don't use Windows on this computer anymore (Ubuntu's libcurl definitely has async DNS and I use self-compiled BOINC), and I have never used 6.4/6.6, I don't know what the current situation really is on the most common platform.

End of info from Nicolas

I asked Rom if the Windows client now uses async DNS, and he answered: "Nope, last time we tried libcurl was still randomly crashing."

So I think this is now a Windows-only issue, and I am transferring it to Rom.
