Client CPU/GPU scheduling
Prior to version 6.3, the BOINC client assumed that each running application uses 1 CPU. Starting with version 6.3, this is generalized.
- Apps may use coprocessors (such as GPUs).
- The number of CPUs used by an app may be more or less than one, and it need not be an integer.
For example, an app might use 2 CUDA GPUs and 0.5 CPUs. This information is visible in the BOINC Manager.
The client's scheduler (i.e., the decision of which apps to run) has been modified to accommodate this diversity of apps.
The way things used to work
The old scheduling policy is:
- Order runnable jobs by "importance" (determined by whether the job is in danger of missing its deadline, and the long-term debt of its project).
- Run jobs in order of decreasing importance. Skip those that would exceed RAM limits. Keep going until we're running NCPUS jobs.
There's a bit more to it than that - e.g., we avoid preempting jobs that haven't checkpointed recently - but that's the basic idea.
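As a rough illustration, the old policy can be sketched as follows. This is a simplified model, not the client's actual code; the `Job` struct and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical job record; field names are illustrative, not BOINC's actual structs.
struct Job {
    double importance;    // higher = more urgent (deadline danger, long-term debt)
    double ram_needed;    // working-set size
    bool   running = false;
};

// Old policy: run jobs in decreasing importance until NCPUS jobs are running,
// skipping any job that would exceed available RAM.
void schedule_old(std::vector<Job>& jobs, int ncpus, double ram_free) {
    std::sort(jobs.begin(), jobs.end(),
              [](const Job& a, const Job& b) { return a.importance > b.importance; });
    int nrunning = 0;
    for (Job& j : jobs) {
        if (nrunning >= ncpus) break;            // all CPUs accounted for
        if (j.ram_needed > ram_free) continue;   // would exceed RAM limits
        j.running = true;
        ram_free -= j.ram_needed;
        nrunning++;
    }
}
```

Note that this model has no notion of coprocessors at all: every running job implicitly occupies exactly one CPU.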
How things work in 6.3
The main design goal of the new scheduler is to use all resources. In particular, we try to always use the GPU even if that means overcommitting the CPU. "Overcommitting" means running a set of apps whose demand for CPUs exceeds the actual number of CPUs.
The new policy is:
- Scan the set of runnable jobs in decreasing order of importance.
- If a job uses a resource that's not already fully utilized, and fits in RAM, run it.
Example: suppose we're on a machine with 1 CPU and 1 GPU, and that we have the following runnable jobs (in order of decreasing importance):
1) 1 CPU, 0 GPU
2) 1 CPU, 0 GPU
3) 0.5 CPU, 1 GPU
What should we run? If we use the old policy we'll just run 1), and the GPU will be idle. This is bad - the GPU typically is 50X faster than the CPU, and it seems like we should use it if at all possible.
The new policy will do the following:
- Run job 1.
- Skip job 2 because the CPU is already fully utilized.
- Run job 3 because the GPU is not fully utilized.
So we end up running jobs whose CPU demand is 1.5. That's OK - they just run slower than if running alone.
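The new policy and the example above can be sketched as follows. Again this is a simplified model under assumed data structures, not the client's actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical job record; field names are illustrative, not BOINC's actual structs.
struct Job {
    double importance;    // higher = more urgent
    double ncpus;         // CPU demand; need not be an integer
    double ngpus;         // coprocessor (GPU) demand
    double ram_needed;
    bool   running = false;
};

// New policy: scan runnable jobs in decreasing importance; run a job if it
// fits in RAM and uses some resource that isn't already fully utilized.
// The CPU may end up overcommitted as a result.
void schedule_new(std::vector<Job>& jobs, double ncpus, double ngpus,
                  double ram_free) {
    std::sort(jobs.begin(), jobs.end(),
              [](const Job& a, const Job& b) { return a.importance > b.importance; });
    double cpu_used = 0, gpu_used = 0;
    for (Job& j : jobs) {
        if (j.ram_needed > ram_free) continue;   // doesn't fit in RAM
        bool wants_idle_cpu = j.ncpus > 0 && cpu_used < ncpus;
        bool wants_idle_gpu = j.ngpus > 0 && gpu_used < ngpus;
        if (!wants_idle_cpu && !wants_idle_gpu) continue;  // no idle resource
        j.running = true;
        cpu_used += j.ncpus;   // may exceed ncpus: that's overcommitment
        gpu_used += j.ngpus;
        ram_free -= j.ram_needed;
    }
}
```

Running the example through this sketch (1 CPU, 1 GPU; jobs of 1 CPU, 1 CPU, and 0.5 CPU + 1 GPU) selects jobs 1 and 3 and skips job 2, for a total CPU demand of 1.5.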
Unresolved issues
Apps that use GPUs use the CPU as well. The CPU part is typically a polling loop: it starts a "kernel" on the GPU, waits for it to finish (checking once per 0.01 seconds, say), then starts another kernel.
If there's a delay between when the kernel finishes and when the CPU starts another one, the GPU sits idle and the entire program runs slowly.
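A minimal sketch of such a polling loop. The "kernel" functions here are stubs standing in for real GPU calls (a kernel simply "finishes" after a fixed number of polls), just to make the shape of the loop concrete:

```cpp
#include <chrono>
#include <thread>

// Hypothetical stand-ins for real GPU calls: here a kernel "finishes"
// after a fixed number of polls.
static int polls_left = 0;
void start_kernel()    { polls_left = 3; }          // launch the next kernel
bool kernel_finished() { return --polls_left <= 0; }

// CPU side of a GPU app: launch a kernel, poll every 10 ms until it's done,
// then launch the next one. Returns the total number of polls performed.
// Any scheduling delay inside this loop leaves the GPU idle between kernels.
int run_kernels(int nkernels) {
    int total_polls = 0;
    for (int i = 0; i < nkernels; i++) {
        start_kernel();
        while (!kernel_finished()) {
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
            total_polls++;
        }
    }
    return total_polls;
}
```

The CPU work per poll is tiny, but the loop must be scheduled promptly each time it wakes; this is exactly what breaks down when the CPU is overcommitted, as described below.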
The CPU scheduler on Windows doesn't handle this situation well: when the CPU is overcommitted, the CPU part of a GPU application doesn't run as often as it needs to in order to keep the GPU "fed". As a result the GPU is underutilized and the program runs slowly. (This seems to happen even if the GPU app runs at high priority while the other apps run at low priority.)
If we can't resolve this we'll have to change the scheduling policy to avoid overcommitting the CPU in the presence of GPU apps.