Opened 16 years ago

Closed 16 years ago

Last modified 15 years ago

#512 closed Defect (fixed)

CPU Time not updated under linux

Reported by: tstrunk
Owned by: Bruce Allen
Priority: Minor
Milestone: Undetermined
Component: BOINC - API
Version:
Keywords: cpu_time
Cc:

Description (last modified by Nicolas)

Since the beginning of November (when we first noticed it), the CPU time update in the client no longer works under Linux with our app. When we downgrade the BOINC libraries of our client app to, for example, rev. 13231, it works again.

I added this line to boinc_checkpoint_completed in boinc_api.C:1010

fprintf(stderr,"in Checkpoint complete: cur_cpu = %g, last_wu = %g, last checkp = %g\n",cur_cpu,last_wu_cpu_time,last_checkpoint_cpu_time);
(before) update_app_progress(last_checkpoint_cpu_time, last_checkpoint_cpu_time);

and this one to timer_handler: 852

fprintf(stderr,"cur cpu = %g, initial_wu_cpu_time=%g , last_wu = %g , last_checkpoint = %g\n",cur_cpu,initial_wu_cpu_time,last_wu_cpu_time,last_checkpoint_cpu_time);
(also before) update_app_progress(last_wu_cpu_time, last_checkpoint_cpu_time);

From this I got the output:

cur cpu = 0, initial_wu_cpu_time=0 , last_wu = 0 , last_checkpoint = 0
in Checkpoint complete: cur_cpu = 58.4437, last_wu = 58.4437, last checkp = 58.4437
cur cpu = 0, initial_wu_cpu_time=0 , last_wu = 0 , last_checkpoint = 58.4437

So to me it looks as if boinc_worker_thread_cpu_time() sometimes works and sometimes doesn't. I think this is because boinc_checkpoint_completed is called from the real worker thread, while timer_handler is called from "Somewhere Else (TM)".
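The suspicion above can be checked outside BOINC with a small test program. The following is only a sketch, not code from boinc_api.C: the names rusage_self_seconds, burn_cpu, observer and compare_views are made up for illustration. It burns CPU in the calling thread (standing in for the worker), then reads getrusage(RUSAGE_SELF) once from that thread and once from a second thread.

```c
#include <pthread.h>
#include <sys/resource.h>
#include <sys/time.h>

/* CPU seconds (user + system) that getrusage(RUSAGE_SELF) reports
   to whichever thread calls this function; -1.0 on error. */
double rusage_self_seconds(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) return -1.0;
    return (double)ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6
         + (double)ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
}

/* Hypothetical stand-in for the science code: burns some CPU time. */
void burn_cpu(void) {
    volatile double x = 0.0;
    for (long i = 0; i < 20000000L; i++) x += i * 0.5;
}

/* Thread body: stores this thread's view of the CPU time in *out. */
void *observer(void *out) {
    *(double *)out = rusage_self_seconds();
    return NULL;
}

/* Burn CPU in the calling ("worker") thread, then sample the CPU time
   from that thread and from a newly created observer thread.
   Returns 0 on success, -1 if the observer thread could not be run. */
int compare_views(double *from_worker, double *from_other) {
    pthread_t t;
    burn_cpu();
    *from_worker = rusage_self_seconds();
    if (pthread_create(&t, NULL, observer, from_other) != 0) return -1;
    if (pthread_join(t, NULL) != 0) return -1;
    return 0;
}
```

Printing both values from a small main() (link with -lpthread) shows which view each thread gets. If the second thread reports a value near zero while the worker reports the burned time, that would match the output quoted above.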

A slight guess at what could have produced this behaviour is this changeset:
[13880/trunk/boinc/api/boinc_api.C]

Change History (6)

comment:1 Changed 16 years ago by tstrunk

So a bit more information: I build on an i386 machine (Debian 4.0).
Two BOINC worker processes show up as six processes in total; each worker looks like this in ps axjf:

3194 3195 3193 2291 pts/3 3193 RN+ 1005 0:37 | \_ poem_0.8_i686-pc-linux-gnu
3195 3199 3193 2291 pts/3 3193 SN+ 1005 0:00 | | \_ poem_0.8_i686-pc-linux-gnu
3199 3200 3193 2291 pts/3 3193 SN+ 1005 0:00 | | \_ poem_0.8_i686-pc-linux-gnu

getconf GNU_LIBPTHREAD_VERSION gives: NPTL 2.3.6
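The same information can also be queried from inside a program, which may help when a statically linked binary uses a different thread library than the one the system's getconf reports. This is a minimal sketch assuming glibc, where confstr() accepts _CS_GNU_LIBPTHREAD_VERSION; the function name thread_lib_version is hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <unistd.h>

/* Copies the thread library version string (the text that
   `getconf GNU_LIBPTHREAD_VERSION` prints) into buf.
   Returns 1 on success, 0 if the value is unavailable or
   does not fit; buf is left as an empty string in that case. */
int thread_lib_version(char *buf, size_t len) {
    if (len == 0) return 0;
    buf[0] = '\0';
#ifdef _CS_GNU_LIBPTHREAD_VERSION
    /* confstr returns the size needed including the terminating NUL,
       or 0 if the name has no configured value. */
    size_t n = confstr(_CS_GNU_LIBPTHREAD_VERSION, buf, len);
    if (n == 0 || n > len) {
        buf[0] = '\0';
        return 0;
    }
    return 1;
#else
    /* Not a glibc system (assumption): the constant is not defined. */
    return 0;
#endif
}
```

Calling it with a 64-byte buffer and printing the result at application startup would record which thread library the running binary actually uses.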

And now that I think of it changeset 13855 seems to fit my problem more:
http://boinc.berkeley.edu/trac/changeset/13855/trunk/boinc/api/boinc_api.C

I will now try to build with revisions 13854 and 13855 and see if this causes my problem.

comment:2 Changed 16 years ago by tstrunk

I didn't build yet, but I found something interesting here:

http://nptl.bullopensource.org/ml_nptl/nptl-200410/msg00005.html

"With NPTL, the timing is thread-wise only. That is, the timing is given only for the thread that calls getrusage(), resp. times(). This deviates from SUSv3."

Also here:

http://www.ussg.iu.edu/hypermail/linux/kernel/0406.1/0929.html

"getrusage dosn't work (and didn't do so in pre-NPTL-times) as the time spent in threads is not taken into account."

I think with this I found the culprit: NPTL. getrusage only returns the CPU time used by the calling thread. I did a test build on my laptop with glibc 2.7 and it still didn't update the CPU time correctly. So basically this is not a BOINC bug and this report can be closed, as building with linuxthreads might fix it. A workaround for NPTL would be to restore the behaviour from before changeset 13855.

comment:3 Changed 16 years ago by Nicolas

Description: modified (diff)

Fix some formatting.

comment:4 Changed 16 years ago by tstrunk

I think I got it the wrong way round. Up until now, I have statically linked the linuxthreads pthread library to support kernel 2.4, which is why I see three processes in the ps output. To try a few other configurations, I dynamically linked our application once; the CPU time display then worked without problems (with NPTL on my machine here).

As there are already too many ifs and whens in my comments, I think it's better to take this problem to the mailing list, which I will do now.

Apart from that: Thanks for the formatting Nicolas!

comment:5 Changed 16 years ago by tstrunk

I think I got it:

getrusage(RUSAGE_SELF) gets the CPU time of the process itself and all the child processes. With linuxthreads, the threads are separate processes: the worker thread has the two other threads as child processes, so invoking getrusage from it succeeds (which is why updating on checkpoint works). Invoking getrusage from the other two thread processes fails, because the worker thread is not a child process of them.

With NPTL everything is one process, so it works from every thread. Because of this and because linuxthreads static-linking is mandatory for kernel 2.4 support, I'd vote for a rollback to the old behaviour (calling from the worker thread).
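The proposed rollback amounts to sampling the CPU time only in the worker thread and letting the timer thread read a stored value. The sketch below illustrates that pattern under stated assumptions: it is not the code from boinc_api.C, and publish_worker_cpu_time and read_published_cpu_time are hypothetical names. As a side note, POSIX defines RUSAGE_SELF as the usage of the calling process and RUSAGE_CHILDREN as that of terminated, waited-for children. With one process per thread, a call from the timer thread would therefore be expected to see only that thread's own time, which the pattern avoids.

```c
#include <pthread.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Last CPU time sampled by the worker thread, guarded by cpu_lock. */
static pthread_mutex_t cpu_lock = PTHREAD_MUTEX_INITIALIZER;
static double published_cpu_time = 0.0;

/* Call from the worker thread only (for example at each checkpoint).
   Samples the CPU time in the worker's context and stores it.
   Returns 0 on success, -1 if getrusage failed. */
int publish_worker_cpu_time(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) return -1;
    double t = (double)ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6
             + (double)ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
    pthread_mutex_lock(&cpu_lock);
    published_cpu_time = t;
    pthread_mutex_unlock(&cpu_lock);
    return 0;
}

/* Call from the timer thread: it never calls getrusage itself and
   only reads the value the worker thread stored last. */
double read_published_cpu_time(void) {
    pthread_mutex_lock(&cpu_lock);
    double t = published_cpu_time;
    pthread_mutex_unlock(&cpu_lock);
    return t;
}
```

The trade-off is freshness: the timer thread only sees the CPU time as of the worker's last call, so the reported value lags behind until the worker publishes again.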

comment:6 Changed 16 years ago by davea

Resolution: fixed
Status: new → closed