Opened 11 years ago

Last modified 11 years ago

#1239 new Defect

Work Fetch - Leaves part of a GPU unused, when it should instead fetch work

Reported by: JacobKlein Owned by: davea
Priority: Undetermined Milestone: Undetermined
Component: Client - Work Fetch Policy Version: 7.0.60
Keywords: work fetch portion unused GPU Cc: Jacob_W_Klein@…

Description

If a project's GPU apps are set up to use only part of a GPU (e.g., via app_config.xml), then when the last remaining task(s) for that project are running and not using the full GPU, work fetch should request more work, but it doesn't.

This issue was confirmed both with 7.0.60 and, on 4/8/2013, in the simulator (which includes several unreleased work fetch changes).

The prerequisites for reproducing the bug appear to be:

  • use an app_config.xml file (to set an app to use part of a GPU, so multiple tasks could run at the same time on the same device).
  • use a small buffer setting

I'm not certain if GPU Exclusions are necessary to create the issue, but I believe that using GPU Exclusions makes this problem worse.

As a workaround, I had to increase my buffer settings way above what I would normally expect. It feels like, in addition to work fetch not realizing a portion of the GPU is idle, it might also not be realizing that the tasks run 2-at-a-time.

Details, including examples in a simulation, are in the email below:


From: jacob_w_klein@…
To: davea@…
Subject: RE: job scheduling
Date: Mon, 8 Apr 2013 09:51:16 -0400

Thank you.  I really appreciate you looking at these issues, and I'll try to verify they work.
Your WCG project sounds interesting; maybe they're going to support Android?
I wish we had a Windows Phone platform, I'd love to test on it.

Do you remember Ed (Beyond) reporting a GPU Exclusion Work Fetch issue?
I might have found examples of what he was trying to explain...

I'm noticing an issue, both on my computer (7.0.60's work fetch algorithm), as well as the simulator (new work fetch algorithm).
If a GPU is only partially loaded (e.g., 0.5 GPU) by the last remaining task(s) for a project that has GPU Exclusions,
we get into a scenario where GPUs are left part-idle, and work fetch won't fetch more.

The task scheduler (correctly) schedules the workload in a way that leaves a GPU part-idle,
but work fetch thinks we have plenty of work and sees no fully idle instances, so it doesn't ask for any.
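
To illustrate the mismatch described above, here is a minimal Python sketch. It is not BOINC's actual work-fetch code: the function names, and the idea that the fetch decision counts only whole idle instances, are assumptions made for illustration, based on the behavior reported in this ticket.

```python
import math

EPS = 1e-9  # tolerance for floating-point sums of fractional GPU usage


def idle_gpu_fraction(running_usages, n_gpus=1):
    """Fractional view: GPU capacity not claimed by any running task.

    running_usages holds the per-task GPU usage (e.g. 0.5 for a task
    configured to use half a GPU). Hypothetical helper, not a BOINC API.
    """
    return max(0.0, n_gpus - sum(running_usages))


def fully_idle_instances(running_usages, n_gpus=1):
    """Whole-instance view: how many complete GPUs are idle.

    This models (as an assumption) a policy that only reacts when an
    entire instance is idle, so a half-idle GPU counts as zero.
    """
    return math.floor(idle_gpu_fraction(running_usages, n_gpus) + EPS)


# One GPU, and the project's last remaining task uses 0.5 of it.
last_task = [0.5]
half_idle = idle_gpu_fraction(last_task)      # half the GPU is unused
whole_idle = fully_idle_instances(last_task)  # but no whole instance is idle
```

With a single 0.5-GPU task running, the fractional view reports half the device unused while the whole-instance view reports nothing idle, which would match the symptom: no work request is issued even though another 0.5-GPU task could start immediately.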

Here are some examples where that occurred, even with our work fetch changes:

http://boinc.berkeley.edu/dev/sim_web.php?action=show_simulation&scen=86&sim=26
2 days 17:03:00
3 days 14:33:00
6 days 06:13:00
8 days 16:43:00
9 days 16:07:00

The fix might involve evaluating the project's GPU apps to see if it has any that use only part of a GPU
... or maybe checking that all of its GPU apps use <= the amount of currently idle GPU (to ensure we don't keep asking for and getting work we cannot immediately use)
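
The two candidate checks suggested above could be sketched roughly as follows. This is only an illustration in Python under stated assumptions, not the client's implementation: the function names are hypothetical, and each app is reduced to a single number giving the fraction of a GPU it uses.

```python
EPS = 1e-9  # tolerance for comparing fractional GPU usages


def has_partial_gpu_app(app_gpu_usages):
    """First idea: does the project have any GPU app using less than a full GPU?"""
    return any(usage < 1.0 - EPS for usage in app_gpu_usages)


def all_apps_fit_idle_gpu(app_gpu_usages, idle_fraction):
    """Second idea: would every GPU app of the project fit in the idle portion?

    Returning False when even one app is too large avoids asking for
    work that could not start immediately. An empty app list also
    returns False, since there is nothing to fetch for.
    """
    if not app_gpu_usages:
        return False
    return all(usage <= idle_fraction + EPS for usage in app_gpu_usages)


# Project whose only GPU app uses half a GPU, with half a GPU idle.
fetch_ok = all_apps_fit_idle_gpu([0.5], idle_fraction=0.5)

# Project that also has a full-GPU app: fetched work might not be runnable now.
fetch_risky = all_apps_fit_idle_gpu([0.5, 1.0], idle_fraction=0.5)
```

The second check is the stricter of the two: it allows a fetch only when any task the project might send could run in the currently idle portion of the GPU.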

It sounds to me like the fix for this one might be tricky rather than straightforward, though I'm not sure.
Do you plan on tackling this soon (fixed in short term), or should I create a ticket (fixed eventually, maybe months/years)?

Regards,
Jacob

Change History (1)

comment:1 Changed 11 years ago by JacobKlein

Another example where this happened in the simulator, which is in fact much simpler, can be found here:
http://boinc.berkeley.edu/dev/sim_web.php?action=show_scenario&name=94

At the beginning, we should have fetched more work from GPUGrid.net, but we didn't, and erroneously left half the GPU idle.
