#512 closed Defect (fixed)
CPU Time not updated under linux
Reported by: | tstrunk | Owned by: | Bruce Allen |
---|---|---|---|
Priority: | Minor | Milestone: | Undetermined |
Component: | BOINC - API | Version: | |
Keywords: | cpu_time | Cc: | |
Description
Since the beginning of November (when I first noticed it), the CPU time update in the client no longer works under Linux with our app. Downgrading the BOINC libs of our client app to, for example, rev. 13231 makes it work again.
I added this line to boinc_checkpoint_completed in boinc_api.C:1010, just before the call to update_app_progress(last_checkpoint_cpu_time, last_checkpoint_cpu_time):
fprintf(stderr,"in Checkpoint complete: cur_cpu = %g, last_wu = %g, last checkp = %g\n",cur_cpu,last_wu_cpu_time,last_checkpoint_cpu_time);
and this one to timer_handler (line 852), also just before the call to update_app_progress(last_wu_cpu_time, last_checkpoint_cpu_time):
fprintf(stderr,"cur cpu = %g, initial_wu_cpu_time=%g , last_wu = %g , last_checkpoint = %g\n",cur_cpu,initial_wu_cpu_time,last_wu_cpu_time,last_checkpoint_cpu_time);
From this I got the output:
cur cpu = 0, initial_wu_cpu_time=0 , last_wu = 0 , last_checkpoint = 0
in Checkpoint complete: cur_cpu = 58.4437, last_wu = 58.4437, last checkp = 58.4437
cur cpu = 0, initial_wu_cpu_time=0 , last_wu = 0 , last_checkpoint = 58.4437
So to me it looks like boinc_worker_thread_cpu_time() sometimes works and sometimes doesn't. I think this is because boinc_checkpoint_completed is called from the real worker thread, while timer_handler is called from "Somewhere Else (TM)".
A slight guess at what could have produced this behaviour is this changeset:
[13880/trunk/boinc/api/boinc_api.C]
Change History (6)
comment:1 Changed 17 years ago by
comment:2 Changed 17 years ago by
I haven't built yet, but I found something interesting here:
http://nptl.bullopensource.org/ml_nptl/nptl-200410/msg00005.html
"With NPTL, the timing is thread-wise only. That is, the timing is given only for the thread that calls getrusage(), resp. times(). This deviates from SUSv3."
Also here:
http://www.ussg.iu.edu/hypermail/linux/kernel/0406.1/0929.html
"getrusage dosn't work (and didn't do so in pre-NPTL-times) as the time spent in threads is not taken into account."
I think this identifies the culprit: NPTL. Under NPTL, getrusage only returns the CPU time used by the calling thread. I did a test build on my laptop with glibc 2.7 and it still didn't update the CPU time correctly. So basically this is no longer a BOINC bug and this report can be closed, as building with linuxthreads might fix it. A workaround for NPTL would be to restore the behaviour from before changeset 13855.
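One way to check which behaviour a given build actually shows is to sample getrusage(RUSAGE_SELF) once from inside a worker thread and once from another thread, then compare the two values. The following is only a minimal sketch under that assumption; the names rusage_self_seconds, worker and compare_cpu_time are hypothetical and are not part of the BOINC API.

```c
#include <pthread.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Total (user + system) CPU seconds that getrusage(RUSAGE_SELF) reports
   to whichever thread calls this function. Returns -1.0 on error. */
static double rusage_self_seconds(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) return -1.0;
    return (double)ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6
         + (double)ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
}

/* Hypothetical worker: burns some CPU, then records what getrusage
   reports when it is called from inside the worker thread. */
static void *worker(void *arg) {
    double *seen_by_worker = arg;
    volatile double x = 0.0;
    for (long i = 0; i < 50000000L; i++) x += i * 0.5;
    *seen_by_worker = rusage_self_seconds();
    return NULL;
}

/* Runs the worker to completion, then samples again from the calling
   (non-worker) thread. Returns 0 on success, -1 if the thread could
   not be started. */
static int compare_cpu_time(double *seen_by_worker, double *seen_by_caller) {
    pthread_t tid;
    if (pthread_create(&tid, NULL, worker, seen_by_worker) != 0) return -1;
    pthread_join(tid, NULL);
    *seen_by_caller = rusage_self_seconds();
    return 0;
}
```

If the value seen by the caller is close to zero while the worker's value is not, the CPU time is being accounted per thread on that build; if the caller's value is at least as large as the worker's, the accounting is process-wide. Link with -pthread.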
comment:4 Changed 17 years ago by
I think I got it the wrong way round. Up until now I have statically linked the linuxthreads pthread library to support kernel 2.4, which is why I see three processes in the ps output. To try another configuration, I dynamically linked our application once; with that build the CPU time display was no problem (with NPTL on my machine here).
As there are already too many ifs and whens in my comments, I think it's better to take this problem to the mailing list, which I will do now.
Apart from that: Thanks for the formatting Nicolas!
comment:5 Changed 17 years ago by
I think I got it:
getrusage(RUSAGE_SELF) gets the CPU time of the process itself and all its child processes. With linuxthreads, the threads are separate processes: the worker thread has the two other threads as child processes, so invoking getrusage from it succeeds (which is why updating on checkpoint works). Invoking getrusage from the other two thread processes fails, because the worker thread is not a child process of theirs.
With NPTL everything is one process, so it works from every thread. Because of this, and because static linking against linuxthreads is mandatory for kernel 2.4 support, I'd vote for a rollback to the old behaviour (calling from the worker thread).
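The "call it from the worker thread" approach could be sketched roughly as follows: the worker thread samples its own CPU time and stores it in a shared variable, and the timer code only reads that stored value. This is an illustration, not the code from the BOINC tree; cpu_lock, cached_worker_cpu, worker_publish_cpu_time and timer_read_cpu_time are hypothetical names, and the sketch assumes the timer runs in its own thread rather than in a signal handler (a mutex must not be taken from a signal handler).

```c
#include <pthread.h>
#include <sys/resource.h>

/* Hypothetical shared state, not BOINC's actual variables. */
static pthread_mutex_t cpu_lock = PTHREAD_MUTEX_INITIALIZER;
static double cached_worker_cpu = 0.0;

/* Call from the worker thread only, e.g. at each checkpoint.
   Samples getrusage in the worker's context and publishes the result.
   Returns 0 on success, -1 if getrusage failed. */
static int worker_publish_cpu_time(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) return -1;
    double t = (double)ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6
             + (double)ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
    pthread_mutex_lock(&cpu_lock);
    cached_worker_cpu = t;
    pthread_mutex_unlock(&cpu_lock);
    return 0;
}

/* Safe to call from any thread, e.g. a timer thread. It never calls
   getrusage itself, so it does not depend on which thread it runs in. */
static double timer_read_cpu_time(void) {
    pthread_mutex_lock(&cpu_lock);
    double t = cached_worker_cpu;
    pthread_mutex_unlock(&cpu_lock);
    return t;
}
```

The trade-off is that the timer only sees a value as fresh as the worker's last sample, so the reported CPU time advances in steps rather than continuously.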
comment:6 Changed 17 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
So, a bit more information: I build on an i386 machine (Debian 4.0).
Two boinc worker processes show up as six different processes each looking like this in ps axjf:
getconf GNU_LIBPTHREAD_VERSION gives: NPTL 2.3.6
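The same query can be made from inside a program, which may help when the build machine and the host running the app differ. On glibc, confstr() with _CS_GNU_LIBPTHREAD_VERSION returns the string that getconf prints; the wrapper name pthread_impl_version below is hypothetical, and the sketch assumes a libc that defines that constant (it reports failure otherwise).

```c
#include <stddef.h>
#include <unistd.h>

/* Copies the threading library version string (the value printed by
   `getconf GNU_LIBPTHREAD_VERSION`) into buf.
   Returns 0 on success, -1 if the value is unavailable or buf is too small. */
static int pthread_impl_version(char *buf, size_t len) {
#ifdef _CS_GNU_LIBPTHREAD_VERSION
    size_t n = confstr(_CS_GNU_LIBPTHREAD_VERSION, buf, len);
    if (n == 0 || n > len) return -1;  /* no value, or string truncated */
    return 0;
#else
    (void)buf;
    (void)len;
    return -1;                         /* constant not provided by this libc */
#endif
}
```

Note that for a statically linked binary this reflects the C library the program was linked against, so it can be used to tell a linuxthreads build from an NPTL build at run time.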
And now that I think of it, changeset 13855 seems to fit my problem better:
http://boinc.berkeley.edu/trac/changeset/13855/trunk/boinc/api/boinc_api.C
I will now try to build with revisions 13854 and 13855 and see if this causes my problem.