Opened 17 years ago
Closed 14 years ago
#347 closed Defect (fixed)
large amount of tasks = boinc and boincmgr hogging all CPU time
Reported by: | Aaron Finney | Owned by: | romw |
---|---|---|---|
Priority: | Major | Milestone: | 6.12 |
Component: | Manager | Version: | |
Keywords: | patch | Cc: |
Description
For some reason, the more tasks you have in your queue, the more CPU time BOINC.exe and boincmgr.exe need, presumably to update the task grid in the BOINC Manager. At some point the CPU time required to parse the list of tasks in the queue reaches a 'critical mass', and boinc.exe and boincmgr.exe use up to 100% of the CPU just updating themselves. No scientific work can get through, because boinc.exe and boincmgr.exe run at higher priority levels.
Attachments (4)
Change History (23)
Changed 17 years ago by
Attachment: | boinc50%.JPG added |
---|
Changed 17 years ago by
Attachment: | boinc&boincmgrcpuhog.JPG added |
---|
Boinc manager with list of large tasks taking up 50% of CPU time.
Changed 17 years ago by
Attachment: | boincwithsmalltasks.JPG added |
---|
Task manager after removing 90% of tasks.
comment:1 Changed 17 years ago by
This also can cause BOINC and the BOINC manager to become completely unresponsive.
comment:2 Changed 17 years ago by
Component: | Undetermined → Client - Daemon |
---|---|
Milestone: | Undetermined → 5.10 |
Owner: | set to davea |
comment:3 Changed 17 years ago by
Well, that's a huge list of tasks! It would take some profiling to find out which piece of code is the bottleneck.
comment:4 Changed 17 years ago by
Owner: | changed from davea to romw |
---|
comment:5 Changed 17 years ago by
An idea:
- Add a GUI RPC to get details of a single result (or better: a specific list of results).
- Make the manager update only the results in "Running" state every second, and refresh the whole list using get_results at longer intervals, such as every 10 seconds.
If a new manager connects to an old client that doesn't support the new RPC, it should notice the RPC is missing and avoid it: the first time the manager refreshes the result list after connecting, it tries the new RPC. If that fails with "RPC unrecognized", it falls back to the current method of getting all results every second, and does not retry the new RPC for the rest of that connection.
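The per-connection fallback described above can be sketched as a small state machine in the manager. Everything here is illustrative: the names (Connection, partial_rpc_supported) are hypothetical and not from the BOINC source, and the peer's answer is simulated rather than sent over a socket.

```python
# Hypothetical sketch of the proposed fallback. None = not yet probed;
# True/False = outcome of the first attempt on this connection.

class Connection:
    def __init__(self, client_supports_partial_rpc):
        # Simulates the peer; a real manager learns this by trying the RPC.
        self._client_ok = client_supports_partial_rpc
        self.partial_rpc_supported = None  # unknown until first refresh

    def refresh_running_results(self):
        """Return which refresh strategy was used for this cycle."""
        if self.partial_rpc_supported is not False:
            # Try the new single-result RPC (simulated here).
            if self._client_ok:
                self.partial_rpc_supported = True
                return "partial"
            # Old client answered "RPC unrecognized": fall back for good.
            self.partial_rpc_supported = False
        # Current behaviour: fetch all results every cycle.
        return "full"
```

Once `partial_rpc_supported` is False, the new RPC is never retried for the remainder of that connection, matching the suggestion above.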
comment:6 Changed 17 years ago by
An addition to my previous suggestion: because of the way the <get_results> tag is parsed, we don't really need a new RPC, just more parameters to the existing one. That would be even better from a backwards-compatibility standpoint. Sending an RPC like the following:
<boinc_gui_rpc_request>
    <get_results>
        <filter>scheduler_state eq 2</filter>
    </get_results>
</boinc_gui_rpc_request>
currently sends back the whole list of results without any error. A future version of the client could actually parse that tag and send only the results matching the expression, and a future manager could take advantage of it, without any compatibility problem. If the manager sends a filter but gets the full list anyway, that's fine; it just updates the whole list in the GUI as it does now.
A generic <filter> expression like that is probably too complex to implement; it's just an example, and any filter syntax would work.
comment:7 Changed 17 years ago by
I just did some profiling of the manager. Unfortunately I don't currently have access to a client with more than 5 results, so I could not profile the exact situation described here.
However, my first results show that a lot of CPU time is spent on memory management and string manipulation, most of it inside internal wxWidgets functions. One manager function that seems particularly expensive is RESULT::parse in gui_rpc_client_ops.C, which is consistent with the observations in this ticket. Unfortunately I don't see any easy way to optimise this function other than Nicolas's suggestion, but that would be quite a big change to the manager and the core client.
I will run another profiling session as soon as I can persuade my core client to fetch a larger amount of work.
comment:8 Changed 17 years ago by
Another solution came to mind: what about turning the static refresh frequency into a dynamic one? The delay between two refresh cycles could be proportional to the number of results (or, more generally, to the number of elements in the current view). That should keep the CPU usage of the core client and the manager at a constant low level.
If there are no objections against this proposal I would like to write a patch for it.
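The dynamic refresh rate proposed above could look roughly like this. The constants are illustrative assumptions, not values from the attached patch: a fixed base period plus a per-row penalty, capped so the view never goes completely stale.

```python
# A minimal sketch of a dynamic refresh interval. All constants are
# assumed for illustration; they are not taken from boincmgr_347.patch.
BASE_INTERVAL_MS = 1000   # current fixed refresh period (1 second)
PER_ROW_MS = 10           # assumed extra delay per visible element
MAX_INTERVAL_MS = 10000   # cap so the view is refreshed at least every 10 s

def refresh_interval_ms(num_rows):
    """Delay between two refresh cycles, growing with the row count."""
    return min(BASE_INTERVAL_MS + PER_ROW_MS * num_rows, MAX_INTERVAL_MS)
```

With these numbers, an empty view keeps the current 1-second refresh, while a view with thousands of rows settles at the 10-second cap, bounding the work done per unit of time.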
Changed 17 years ago by
Attachment: | boincmgr_347.patch added |
---|
Patch implementing the dynamic refresh rate
comment:9 Changed 16 years ago by
Keywords: | patch added |
---|
comment:10 Changed 16 years ago by
Milestone: | 6.4 → 6.2 |
---|
This was originally a blocker for 5.10, and a patch has been provided.
Resetting the milestone to 6.2. Major bugs should not be version-bumped without explanation.
comment:12 Changed 16 years ago by
Component: | Client - Daemon → Manager |
---|---|
Milestone: | 6.6 → 6.8 |
comment:13 Changed 16 years ago by
Priority: | Blocker → Major |
---|
comment:14 Changed 15 years ago by
Milestone: | 6.8 → 6.10 |
---|
Moving this to 6.10.
The patch cannot be applied as we are now using the async RPC mechanism.
The decided solution to the problem is to spin up a new thread that monitors the async thread and throttles it back so that it cannot use more than 5% of the CPU. The async thread may need to be broken up into two different threads so that on-demand RPCs are not subject to throttling.
comment:15 Changed 15 years ago by
No more threads.
1) show only active tasks; button to show all
2) reduce refresh rate if CPU usage is above threshold
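Point 2 above can be sketched as a simple feedback rule: back off when recent CPU usage is over a threshold, and creep back toward the normal rate otherwise. The threshold and bounds here are assumptions for illustration (the 5% figure echoes the earlier comment), not a decided design.

```python
# Illustrative sketch of CPU-based refresh throttling; all constants
# are assumptions, not from the BOINC source.
CPU_THRESHOLD = 0.05      # back off above 5% of one CPU
MIN_INTERVAL_S = 1.0      # normal refresh period
MAX_INTERVAL_S = 10.0     # never refresh less often than this

def next_interval(current_interval, cpu_usage):
    """Double the delay while over threshold, halve it otherwise."""
    if cpu_usage > CPU_THRESHOLD:
        return min(current_interval * 2, MAX_INTERVAL_S)
    return max(current_interval / 2, MIN_INTERVAL_S)
```

Unlike a row-count formula, this reacts to the actual cost of a refresh, so it also covers cases where parsing is slow for reasons other than list length.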
comment:17 Changed 15 years ago by
If you plan to add a new RPC to get only active tasks, remember what I mentioned earlier in this ticket: adding tags inside <get_results> is fully backwards compatible, since older clients will simply return the full result list with no problems. So don't make a new RPC; use something like:
<get_results><only_active/></get_results>
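Because an old client ignores the unknown <only_active/> tag and returns everything, the manager has to be prepared for either reply. A small sketch, with an assumed definition of "active" (scheduler_state == 2, as in the earlier filter example) and sample dictionaries standing in for parsed RESULT structs:

```python
# Sketch of the backwards-compatible request and reply handling.
# The "active" criterion below is an assumption for illustration.
REQUEST = (
    "<boinc_gui_rpc_request>"
    "<get_results><only_active/></get_results>"
    "</boinc_gui_rpc_request>"
)

def active_results(results):
    # Assumed meaning of "active": currently scheduled to run.
    return [r for r in results if r.get("scheduler_state") == 2]

def handle_reply(results, client_filtered):
    """New client already filtered; old client sent the full list."""
    return results if client_filtered else active_results(results)
```

Either way the manager ends up with only the active tasks, so the same request works against both old and new clients with no version negotiation.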
comment:19 Changed 14 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
boinc taking up 50% of resource share