Work fetch and GPUs
This document describes changes to BOINC's work fetch mechanism in the 6.6 client and the scheduler as of [17024].
Problems with the old work fetch policy
The old work-fetch policy is essentially:
- Do a weighted round-robin simulation, computing the CPU shortfall (i.e., the idle CPU time we expect during the work-buffering period).
- If there's a CPU shortfall, request work from the project with highest long-term debt (LTD).
The scheduler request has a single "work_req_seconds" indicating the total duration of jobs being requested.
This policy has some problems:
- There's no way for the client to say "I have N idle CPUs; send me enough jobs to use them all".
And various problems related to GPUs:
- If there is no CPU shortfall, no work will be fetched even if GPUs are idle.
- If a GPU is idle, we should get work from a project that potentially has jobs for it.
- If a project has both CPU and GPU jobs, the client should be able to tell it to send only GPU (or only CPU) jobs.
- LTD is computed solely on the basis of CPU time used, so it doesn't provide a meaningful comparison between projects that use only GPUs, or between GPU and CPU projects.
Examples
In the following examples, the host is attached to projects A and B with equal resource shares.
Example 1
Suppose that:
- A has only GPU jobs, and B has both GPU and CPU jobs.
- The host's GPU is twice as fast as its CPU.
The target behavior is:
- the CPU is used 100% by B
- the GPU is used 75% by A and 25% by B
This provides equal total processing to A and B: if the CPU delivers X FLOPS, the GPU delivers 2X, so A gets 1.5X (75% of the GPU) and B gets 1.5X (all of the CPU plus 25% of the GPU).
Example 2
A has a 1-year CPU job with no slack, so it runs in high-priority mode. B has jobs available.
Goal: after A's job finishes, B gets the CPU for a year.
Variation: a new project C is attached when A's job finishes. It should immediately share the CPU 50/50 with B.
Example 3
A has GPU jobs but B doesn't. After a year, B gets a GPU app.
Goal: A and B immediately share the GPU 50/50.
The new policy
Resource types
New abstraction: processing resource type or just "resource type". Examples of resource types:
- CPU
- A coprocessor type (a kind of GPU, or the SPE processors in a Cell)
Currently there are two resource types: CPU and NVIDIA GPUs.
Summary of the new policy: it's like the old policy, but with a separate copy for each resource type, and scheduler requests can now ask for work for particular resource types.
Per-resource-type backoff
We need to keep track of whether projects have work for particular resource types, so that we don't keep asking them for types of work they don't have.
To do this, we maintain a separate backoff timer per (project, resource type). The backoff interval is doubled, up to a limit of one day, whenever we ask for work of that type and don't get any; it's cleared whenever we get a job of that type. Note: if we decide to ask a project for work for resource A, we may ask it for resource B as well, even if it's backed off for B.
This is independent of the overall backoff timer for each project, which is triggered by requests from the project, RPC failures, job errors and so on.
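As a concrete illustration, the doubling-with-cap timer for one (project, resource type) pair might look like the following sketch; the struct and member names are assumptions, not the actual client identifiers.

```cpp
#include <algorithm>

// Hypothetical per-(project, resource type) backoff state.
struct RSC_PROJECT_BACKOFF {
    double backoff_interval = 0;   // seconds; 0 means "not backed off"
    double backoff_end = 0;        // wall-clock time when backoff expires

    static constexpr double MIN_INTERVAL = 60;      // assumed starting value
    static constexpr double MAX_INTERVAL = 86400;   // 1-day cap

    // Called when we asked for work of this type and got none:
    // double the interval, up to the 1-day limit.
    void request_failed(double now) {
        backoff_interval = backoff_interval
            ? std::min(backoff_interval * 2, MAX_INTERVAL)
            : MIN_INTERVAL;
        backoff_end = now + backoff_interval;
    }

    // Called when we receive a job of this type: clear the backoff.
    void got_job() {
        backoff_interval = 0;
        backoff_end = 0;
    }

    bool backed_off(double now) const { return now < backoff_end; }
};
```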
Long-term debt
We continue to use the idea of long-term debt (LTD), representing how much work (measured in device instance-seconds) is "owed" to each project P. This increases over time in proportion to P's resource share, and decreases as P uses resources. Simplified summary of the new policy: when we need work for a resource R, we ask the project that is not backed off for R and whose LTD is greatest.
The notion of LTD needs to span resources; otherwise, in Example 1 above, projects A and B would each get 50% of the GPU.
On the other hand, if there's a single cross-resource LTD, and only one project has GPU jobs, then its LTD would go unboundedly negative, and the others would go unboundedly positive. This is undesirable. It could be fixed by limiting the LTD to a finite range, but this would lose information.
In the new model:
- There is a separate LTD for each resource type
- The "overall LTD", used in the work-fetch decision, is the sum of the resource LTDs, weighted by the speed of the resource (FLOPs per instance-second).
Per-resource LTD is maintained as follows:
A project P is "debt eligible" for a resource R if:
- P is not backed off for R, and its backoff interval for R is not at the maximum.
- P is not suspended via the GUI, and "no more tasks" is not set for P.
Debt is adjusted as follows:
- For each debt-eligible project P, the debt is increased by the amount it's owed (delta T times its resource share relative to other debt-eligible projects) minus the amount it got (the number of instance-seconds).
- An offset is added to debt-eligible projects so that the net change is zero. This prevents debt-eligible projects from drifting away from other projects.
- An offset is added so that the maximum debt across all projects is zero (this ensures that when a new project is attached, it starts out debt-free).
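A minimal sketch of this per-resource debt update, under the simplifying description above; the names here are hypothetical rather than the client's actual identifiers.

```cpp
#include <algorithm>
#include <vector>

struct ProjectDebt {
    double resource_share;
    double debt;              // LTD for this resource, in instance-seconds
    double secs_this_period;  // instance-seconds used since last update
    bool eligible;            // "debt eligible" for this resource
};

void update_debts(std::vector<ProjectDebt>& projects, double dt) {
    if (projects.empty()) return;

    // Total share among debt-eligible projects.
    double total_share = 0;
    int n_eligible = 0;
    for (auto& p : projects) {
        if (p.eligible) { total_share += p.resource_share; n_eligible++; }
    }
    if (!n_eligible) return;

    // 1) For each eligible project: owed minus got.
    double sum_delta = 0;
    for (auto& p : projects) {
        if (!p.eligible) continue;
        double delta = dt * (p.resource_share / total_share) - p.secs_this_period;
        p.debt += delta;
        sum_delta += delta;
        p.secs_this_period = 0;
    }

    // 2) Offset eligible projects so the net change is zero.
    double offset = sum_delta / n_eligible;
    for (auto& p : projects) {
        if (p.eligible) p.debt -= offset;
    }

    // 3) Offset everyone so the maximum debt is zero; a newly attached
    // project (debt 0) then starts at the top of the range.
    double max_debt = projects[0].debt;
    for (auto& p : projects) max_debt = std::max(max_debt, p.debt);
    for (auto& p : projects) p.debt -= max_debt;
}
```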
Summary of the new policy
Every 60 seconds, and when various events happen (e.g. jobs finish), the following is done. CI is the "connect interval" preference; AW is the "additional work" preference.
Auxiliary functions:
get_major_shortfall(resource)
If the resource will have an idle instance before CI, return the greatest-overall-debt non-backed-off project P (P may be overworked). Otherwise return NULL.
get_minor_shortfall(resource)
If the resource will have an idle instance between CI and CI+AW, return the greatest-overall-debt non-backed-off non-overworked project P. Otherwise return NULL.
get_starved_project(resource)
If any project is not overworked, not backed off, and has no runnable jobs for any resource, return the one with greatest overall debt. Otherwise return NULL.
Main logic:
- Do a round-robin simulation of currently queued jobs.
- p = get_major_shortfall(NVIDIA GPU); if p <> NULL, ask it for work and return
- ... same for other coprocessor types (we assume that coprocessors are faster, hence more important, than the CPU)
- ... same for CPU
- p = get_minor_shortfall(NVIDIA GPU); if p <> NULL, ask it for work and return
- ... same for other coprocessor types, then CPU
- p = get_starved_project(NVIDIA GPU); if p <> NULL, ask it for work and return
- ... same for other coprocessor types, then CPU
In the get_major_shortfall() case, ask only for work of that resource type. Otherwise ask for all types of work for which there is a shortfall.
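Putting the pieces together, the decision order might be sketched as follows; the helper names follow the pseudocode above, and the rest is assumed scaffolding, not the literal client code.

```cpp
// Assumed declarations, for illustration only.
struct RSC;       // a resource type (NVIDIA GPU, other coprocessors, CPU)
struct PROJECT;

void rr_simulation();                            // round-robin simulation
PROJECT* get_major_shortfall(RSC*);
PROJECT* get_minor_shortfall(RSC*);
PROJECT* get_starved_project(RSC*);
void request_work(PROJECT*, RSC*);               // ask for this type only
void request_work_all_shortfall_types(PROJECT*); // ask for all short types

extern RSC* rsc_types[];   // ordered: coprocessors first, then CPU
extern int n_rsc_types;

PROJECT* choose_project() {
    rr_simulation();  // simulate currently queued jobs

    // 1) Urgent: an instance will be idle before the connect interval.
    for (int i = 0; i < n_rsc_types; i++) {
        if (PROJECT* p = get_major_shortfall(rsc_types[i])) {
            request_work(p, rsc_types[i]);   // this resource type only
            return p;
        }
    }
    // 2) An instance will be idle between CI and CI+AW.
    for (int i = 0; i < n_rsc_types; i++) {
        if (PROJECT* p = get_minor_shortfall(rsc_types[i])) {
            request_work_all_shortfall_types(p);
            return p;
        }
    }
    // 3) Keep starved projects supplied with work.
    for (int i = 0; i < n_rsc_types; i++) {
        if (PROJECT* p = get_starved_project(rsc_types[i])) {
            request_work_all_shortfall_types(p);
            return p;
        }
    }
    return nullptr;   // no work fetch this time around
}
```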
Implementation notes
A job sent to a client is associated with an app version, which uses some number (possibly fractional) of CPUs, and some number of instances of a particular coprocessor type.
Scheduler request and reply message
New fields in the scheduler request message:
- double cpu_req_secs: number of CPU seconds requested
- double cpu_req_instances: send enough jobs to occupy this many CPUs
And for each coprocessor type:
- double req_secs: number of instance-seconds requested
- double req_instances: send enough jobs to occupy this many instances
The semantics: a scheduler should send jobs for a resource type only if the request for that type is nonzero.
For compatibility with old servers, the message still includes work_req_seconds, which is set to the maximum of the per-resource req_secs values.
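For illustration, the per-resource request fields could be represented as the struct below; the names here are hypothetical, and the actual message is XML whose tag names may differ.

```cpp
// Per-coprocessor-type request block (one per coprocessor type).
struct COPROC_REQUEST {
    double req_secs;        // instance-seconds requested
    double req_instances;   // occupy this many instances
};

struct REQUEST_FIELDS {
    double work_req_seconds;    // for old servers: max of the req_secs values
    double cpu_req_secs;        // CPU seconds requested
    double cpu_req_instances;   // occupy this many CPUs
    COPROC_REQUEST cuda;        // e.g., the NVIDIA GPU request
};
```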
Client data structures
- RSC_WORK_FETCH: the work-fetch state for a particular resource type. There are instances for CPU (cpu_work_fetch) and NVIDIA GPUs (cuda_work_fetch).
- RSC_PROJECT_WORK_FETCH: the work-fetch state for a (resource type, project) pair.
- PROJECT_WORK_FETCH: per-project work-fetch state.
- WORK_FETCH: overall work-fetch state.
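A rough outline of how these structures fit together; the type names come from the list above, but the members shown are illustrative assumptions.

```cpp
struct RSC_PROJECT_WORK_FETCH {  // one per (resource type, project)
    double debt;                 // LTD for this resource
    double backoff_interval;     // per-resource-type backoff (see above)
    double backoff_end;
};

struct PROJECT_WORK_FETCH {      // one per project
    double overall_debt;         // speed-weighted sum of resource debts
};

struct RSC_WORK_FETCH {          // one per resource type:
    double shortfall;            // cpu_work_fetch, cuda_work_fetch
    double nidle_now;            // instances idle right now
};

struct WORK_FETCH {              // one instance overall; drives the
    // work-fetch decision described in the summary above
};
```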
Scheduler changes
- WORK_REQ has fields for requests (secs, instances) of the various resource types
- WORK_REQ has a field no_gpus indicating that user prefs don't allow using GPUs
- WORK_REQ has a field rsc_spec_request indicating whether it's a new-style request
- for new-style requests, work_needed() returns true if either additional seconds or instances are still needed
- add_result_to_reply() decrements the fields in WORK_REQ
- get_app_version(): if we first chose a GPU version but don't need more GPU work, clear the record so that we can pick another version
- get_app_version(): skip app versions for resources for which we don't need more work.
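A hedged sketch of this scheduler-side bookkeeping, with field and function names assumed to mirror the request fields above:

```cpp
// Remaining request for one resource type.
struct RESOURCE_REQ {
    double req_secs;        // remaining seconds requested
    double req_instances;   // remaining instances requested
};

struct WORK_REQ_SKETCH {
    bool rsc_spec_request;  // new-style request?
    bool no_gpus;           // user prefs forbid GPU use
    RESOURCE_REQ cpu, cuda; // one entry per resource type
};

// New-style: true while either additional seconds or additional
// instances are still needed for some resource type.
bool work_needed(const WORK_REQ_SKETCH& w) {
    return w.cpu.req_secs > 0 || w.cpu.req_instances > 0
        || w.cuda.req_secs > 0 || w.cuda.req_instances > 0;
}

// add_result_to_reply() would decrement the corresponding fields
// as each job is added to the reply.
void charge_request(RESOURCE_REQ& r, double est_secs, double ninstances) {
    r.req_secs -= est_secs;
    r.req_instances -= ninstances;
}
```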
Notes
The idea of using recent average credit (RAC) as a surrogate for LTD was discussed and set aside for various reasons.
This design does not accommodate:
- jobs that use more than one coprocessor type
- jobs that change their resource usage dynamically (e.g. coprocessor jobs that decide to use the CPU instead).