#705 closed Defect (fixed)
Broadband fault causes BOINC 6.2.14 crash
Reported by: | Thyme Lawn | Owned by: | davea |
---|---|---|---|
Priority: | Critical | Milestone: | 6.2 |
Component: | Client - Daemon | Version: | 6.2.14 |
Keywords: | network | Cc: |
Description
An exchange hardware fault caused my (always on) broadband connection to drop last night. BOINC 6.2.14 (protected install on XP and Vista) had stopped running on both systems when I checked them this morning.
Looking at stdoutdae.txt it's clear that BOINC didn't detect the network failure and everything was fine as long as it was only attempting scheduler requests. As soon as an upload was added into the equation it crashed.
Here's a scheduler request from stdoutdae.txt after the connection had failed:
29-Jul-2008 00:50:08 [CPDN Beta] Sending scheduler request: To send trickle-up message. Requesting 0 seconds of work, reporting 0 completed tasks
29-Jul-2008 00:50:11 [---] Project communication failed: attempting access to reference site
29-Jul-2008 00:50:12 [---] Internet access OK - project servers may be temporarily down.
29-Jul-2008 00:50:13 [CPDN Beta] Scheduler request failed: Server returned nothing (no headers, no data)
Note that the reference site check is being made before the scheduler request has failed and is being marked as successful.
The trickle-up request and reference site check were retried 9 times before the following sequence, at which point boinc.exe crashed ('normal' scheduler requests take priority over trickle-ups):
29-Jul-2008 02:51:10 [malariacontrol.net] Computation for task wu_133_524_149170_0_1217280246_0 finished
29-Jul-2008 02:51:10 [malariacontrol.net] Sending scheduler request: To fetch work. Requesting 818 seconds of work, reporting 1 completed tasks
29-Jul-2008 02:51:12 [---] Project communication failed: attempting access to reference site
29-Jul-2008 02:51:12 [malariacontrol.net] Started upload of wu_133_524_149170_0_1217280246_0_0
BOINC Windows Runtime Debugger didn't generate any stack traces on the XP system but on the Vista system the trace in stderrdae.txt indicates that the crash was in the libcurl function curl_multi_remove_handle():
BOINC Windows Runtime Debugger Version 6.2.14
Dump Timestamp : 07/29/08 02:51:13
Debugger Engine : 4.0.5.0

*** Dump of thread ID 44492 (state: Waiting): ***
- Information -
Status: Wait Reason: UserRequest, , Kernel Time: 87828560.000000, User Time: 71604456.000000, Wait Time: 19143612.000000
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0016D9FC read attempt to address 0x27273D84
- Registers -
eax=01e40278 ebx=00d3fe00 ecx=00d3fe00 edx=00001caa esi=27273d74 edi=00000000
eip=0016d9fc esp=0129fda0 ebp=0129fe6c
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010206
- Callstack -
ChildEBP RetAddr Args to Child
0129fe6c 0040c86f 00468c18 00000000 3fc68730 7554eab9 libcurl!curl_multi_remove_handle+0x0
0129fef0 00431e51 00000000 3fc68730 76cae0c5 00000000 boinc!+0x0
0129ff68 0043b467 00000000 001d19a0 76cad1da 00000001 boinc!+0x0
0129ff88 75854911 001d19a0 0129ffd4 76fce4b6 001d19a0 boinc!+0x0
0129ff94 76fce4b6 001d19a0 7dc5be09 00000000 00000000 kernel32!BaseThreadInitThunk+0x0
0129ffd4 76fce489 76cad1b9 001d19a0 00000000 00000000 ntdll!RtlInitializeExceptionChain+0x0
0129ffec 00000000 76cad1b9 001d19a0 00000000 43534552 ntdll!RtlInitializeExceptionChain+0x0
While I was waiting for the faulty line card to be replaced I tried to get BOINC running again on both systems. Shortly after network operations started all tasks stopped running with heartbeat failures:
No heartbeat from core client for 31 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
Tasks could only be kept running by suspending networking until the exchange problem was fixed.
Change History (5)
comment:1 Changed 16 years ago by
Milestone: | Undetermined → 6.2 |
---|
comment:2 Changed 16 years ago by
comment:3 Changed 16 years ago by
After intensive investigation I've worked out what's going wrong; it's a buffer overrun in HTTP_OP_SET::got_select(). File upload uses HTTP_OP::init_post2(), with the working buffer FILE_XFER::header[4096] being used to store the HTTP messages in memory.
When the broadband link goes down my router responds to every HTTP request with HTTP/1.1 303 See Other, redirecting the request to the router's link fault recovery instructions. cURL writes that page to a temporary file, which line 1049 of http_curl.C reads into the working buffer. The number of bytes read is the size of the temporary file. In this case that's 5887 bytes, which overruns the 4096-byte buffer by 1791 bytes.
The req1 parameter passed to fread() is the buffer pointer passed to HTTP_OP::init_post2(). Although that implies that the buffer size is variable, in practice the buffer is always FILE_XFER::header, so changing the maximum number of bytes read to 4095 would prevent the overrun. I've tested the patch below, but a more correct fix would be to use the buffer size passed as an additional parameter to init_post2().
Index: http_curl.C
===================================================================
--- http_curl.C (revision 15671)
+++ http_curl.C (working copy)
@@ -1036,6 +1036,10 @@
 // read in the temp file into req1 memory
 //
 size_t dSize = ftell(hop->fileOut);
+if (dSize > 4095) {
+    // ensure the fread() below doesn't overrun the buffer.
+    dSize = 4095;
+}
 retval = fseek(hop->fileOut, 0, SEEK_SET);
 if (retval) {
     // flag as a bad response for a possible retry later
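For illustration, the "more correct" approach mentioned in this comment (passing the buffer size along with the buffer pointer) might look roughly like the sketch below. It is not the code committed to BOINC. The helper name read_file_bounded and its signature are hypothetical, and it assumes the caller wants a NUL-terminated result, which is why one byte is held back, as with the 4095 limit in the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical helper (not a BOINC function): copy the contents of an open
// file into a caller-supplied buffer without ever writing past its end.
// Returns the number of bytes stored, excluding the terminating NUL,
// or -1 on an I/O error.
long read_file_bounded(std::FILE* f, char* buf, std::size_t buf_size) {
    assert(f != nullptr && buf != nullptr && buf_size > 0);

    // Find out how much data the file holds, then rewind.
    if (std::fseek(f, 0, SEEK_END) != 0) return -1;
    long file_size = std::ftell(f);
    if (file_size < 0) return -1;
    if (std::fseek(f, 0, SEEK_SET) != 0) return -1;

    // Clamp the read to the buffer capacity, keeping one byte for the NUL.
    std::size_t to_read = static_cast<std::size_t>(file_size);
    if (to_read > buf_size - 1) to_read = buf_size - 1;

    std::size_t n = std::fread(buf, 1, to_read, f);
    if (n != to_read && std::ferror(f)) return -1;

    buf[n] = '\0';
    return static_cast<long>(n);
}
```

A caller holding a fixed array such as FILE_XFER::header would pass sizeof on that array as buf_size, so a later change to the array size would not need a matching edit to a hard-coded 4095.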
comment:4 Changed 16 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
(In [15744]) - client: fix crash in this scenario:
A file upload sends a request. The network is down, and something (e.g. a router) sends a long (> 4KB) error page. This overruns the 4KB buffer of HTTP_OP::req1.
Solution: keep track of the size of the buffer, and don't overrun it.
Also move the body of a huge for loop into a separate function.
From Ian Hay.
Fixes #705
comment:5 Changed 16 years ago by
(In [15862]) - client: fix crash in this scenario:
A file upload sends a request. The network is down, and something (e.g. a router) sends a long (> 4KB) error page. This overruns the 4KB buffer of HTTP_OP::req1.
Solution: keep track of the size of the buffer, and don't overrun it.
Also move the body of a huge for loop into a separate function.
From Ian Hay.
Fixes #705
client/
file_xfer.C http_curl.C,h
Maybe we need to upgrade to the newest libcurl version.
curl changelog