Work fetch and GPUs
Problems with the current work fetch policy
The current work-fetch policy is essentially:
- Do a weighted round-robin simulation, computing the CPU shortfall (i.e., the idle CPU time we expect during the work-buffering period).
- If there's a CPU shortfall, request work from the project with highest long-term debt (LTD).
The scheduler request has a single "work_req_seconds" indicating the total duration of jobs being requested.
This policy has some problems:
- There's no way for the client to say "I have N idle CPUs; send me enough jobs to use them all".
And various problems related to GPUs:
- If there is no CPU shortfall, no work will be fetched even if GPUs are idle.
- If a GPU is idle, we should get work from a project that potentially has jobs for it.
- If a project has both CPU and GPU jobs, the client should be able to tell it to send only GPU (or only CPU) jobs.
- LTD is computed solely on the basis of CPU time used, so it doesn't provide a meaningful comparison between projects that use only GPUs, or between GPU and CPU projects.
This document proposes a modification to the work-fetch system that solves these problems.
Example
Suppose that:
- Project A has only GPU jobs and project B has both GPU and CPU jobs.
- A host is attached to projects A and B with equal resource shares.
- The host's GPU is twice as fast as its CPU.
In this case, the target behavior is:
- the CPU is used 100% by project B
- the GPU is used 75% by project A and 25% by project B
This provides equal processing to the two projects.
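As a check, measure throughput in CPU-equivalents (the GPU counts double since it's twice as fast): project A gets 0.75 × 2 = 1.5, and project B gets 1.0 + 0.25 × 2 = 1.5, so the two projects indeed get equal processing.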
Terminology
New abstraction: processing resource, or PRSC. The CPU and each coprocessor type are PRSCs.
A job sent to a client is associated with an app version, which uses some number (possibly fractional) of CPUs, and some number of instances of a particular coprocessor type.
This design does not accommodate:
- jobs that use more than one coprocessor type
- jobs that change their resource usage dynamically (e.g. coprocessor jobs that decide to use the CPU instead).
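As a rough illustration of the per-app-version resource usage described above, a client-side representation might look like the following sketch (the names are illustrative, not the actual BOINC structures):

    // Illustrative sketch only; field names are assumptions.
    struct APP_VERSION_USAGE {
        double avg_ncpus;     // number of CPUs used, possibly fractional (e.g. 0.5)
        int coproc_type;      // index of the coprocessor type used, or -1 for CPU-only
        double coproc_count;  // number of instances of that coprocessor type
    };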
Scheduler request and reply message
New fields in the scheduler request message:
- double cpu_req_secs: number of CPU seconds requested
- double cpu_req_instances: send enough jobs to occupy this many CPUs

And for each coprocessor type:

- double req_secs: number of instance-seconds requested
- double req_instances: send enough jobs to occupy this many instances
For compatibility with old servers, the message still includes work_req_seconds, which is the maximum of the per-resource req_secs values.
The semantics are: a scheduler should send jobs for a resource type only if the request for that type is nonzero.
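A minimal sketch of how these request fields might be represented, assuming hypothetical names and one entry per coprocessor type:

    #include <algorithm>
    #include <vector>

    // Hypothetical sketch of the per-resource request fields; not the real
    // BOINC message layout.
    struct COPROC_REQUEST {
        double req_secs;        // instance-seconds of work requested
        double req_instances;   // send enough jobs to occupy this many instances
    };

    struct SCHEDULER_REQUEST {
        double cpu_req_secs;
        double cpu_req_instances;
        std::vector<COPROC_REQUEST> coproc_requests;  // one per coprocessor type

        // Backwards compatibility: old servers look only at work_req_seconds,
        // defined here as the max of the per-resource req_secs.
        double work_req_seconds() const {
            double m = cpu_req_secs;
            for (const auto& c : coproc_requests) m = std::max(m, c.req_secs);
            return m;
        }
    };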
Client
Long-term debt
We'll continue to use the idea of long-term debt (LTD), which represents how much work (measured in device instance-seconds) is "owed" to each project P. This increases over time in proportion to P's resource share, and decreases as P uses resources. Simplified summary of the new policy: when we need work for a resource, we ask the project that may have that type of job and whose LTD is greatest.
The idea of using RAC (recent average credit) as a surrogate for LTD was discussed and set aside for various reasons.
The notion of LTD needs to span resources; otherwise, in the above example, projects A and B would each get 50% of the GPU.
On the other hand, if there's a single cross-resource LTD, and only one project has GPU jobs, then its LTD would go unboundedly negative, and the others would go unboundedly positive. This is undesirable. It could be fixed by limiting the LTD to a finite range, but this would lose information.
The current plan is:
- There is a separate LTD for each resource type
- The "overall LTD", which is used in the work-fetch decision, is the sum of the resource LTDs, weighted by the speed of the resource (FLOPs per instance-second).
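For example, the overall LTD might be computed along these lines (a sketch; the names and the per-instance speed estimate are assumptions):

    #include <vector>

    // Overall LTD = sum over resource types of (per-resource LTD, in
    // instance-seconds) * (peak FLOPs per instance-second of that resource).
    double overall_ltd(const std::vector<double>& resource_ltd,
                       const std::vector<double>& flops_per_instance_sec) {
        double sum = 0;
        for (size_t r = 0; r < resource_ltd.size(); r++) {
            sum += resource_ltd[r] * flops_per_instance_sec[r];
        }
        return sum;
    }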
Next we need to specify exactly how LTD is maintained. It's clear how it decreases; the question is, how is it increased? We need to avoid situations where LTD increases without bound. We propose the following:
- For each project P and resource R there is a boolean flag D(P, R) indicating whether P should accumulate debt for R. The idea is that if D(P,R) is true, then it's likely that P would supply a job for R if we asked it.
- D(P, R) is initially false.
- If P supplies a job for R, D(P,R) is set to true.
- If we send P a request that doesn't return any jobs, then for each resource R for which req_seconds(R)>0, D(P,R) is set to false.
* Proposed change: this could be too sensitive to temporary outages. Why not have the project report whether the resource type is supported at all, separately from whether it currently has work? This would mean a three-state response: "work is returned", "no work, but the resource is supported", and "the resource is not supported".
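A minimal sketch of the two-state bookkeeping proposed above (all names are hypothetical; the three-state variant suggested in the note would replace the boolean with an enum):

    // Sketch of maintaining D(P, R); names are illustrative.
    const int NUM_RSC_TYPES = 2;   // e.g. CPU and one coprocessor type

    struct PROJECT {
        bool may_have_work[NUM_RSC_TYPES];   // D(P, R); initially all false
        double req_secs[NUM_RSC_TYPES];      // what the last request asked for
    };

    // Called when a scheduler reply arrives; njobs[r] = jobs returned for resource r.
    void update_debt_flags(PROJECT& p, const int njobs[NUM_RSC_TYPES]) {
        for (int r = 0; r < NUM_RSC_TYPES; r++) {
            if (njobs[r] > 0) {
                p.may_have_work[r] = true;    // P supplied a job for R
            } else if (p.req_secs[r] > 0) {
                p.may_have_work[r] = false;   // asked for R, got nothing
            }
        }
    }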
Per-resource-type backoff
We need to handle the situation where e.g. there's a GPU shortfall but no projects are supplying GPU work (for either permanent or transient reasons). We don't want an overall work-fetch backoff from those projects.
Instead, we maintain a separate backoff timer per (project, PRSC). The backoff interval is doubled up to a limit whenever we ask for work of that type and don't get any work; it's cleared whenever we get a job of that type.
* Proposed clarification: the overall contact backoff would be the minimum of the backoffs for each resource type.
* Question: if the project asks for a communications backoff, and one of the resource-type backoffs would expire within the project-requested backoff, how do we handle that?
* Question: if we need to contact a project for tasks of two different types, and only one of the type backoffs has expired, do we ask for both types?
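A sketch of the per-(project, PRSC) backoff timer described above, assuming a one-minute initial interval (the 24-hour cap comes from the state description below; all names are illustrative):

    #include <algorithm>

    // Per-(project, PRSC) backoff state.
    struct RSC_BACKOFF {
        double interval = 0;        // current backoff interval, seconds
        double deferred_until = 0;  // don't ask for this resource before this time

        // Asked for work of this type and got none: double the interval, up to a limit.
        void request_failed(double now) {
            interval = std::min(std::max(interval * 2, 60.0), 86400.0);
            deferred_until = now + interval;
        }
        // Got a job of this type: clear the backoff.
        void got_job() {
            interval = 0;
            deferred_until = 0;
        }
        bool ok_to_ask(double now) const { return now >= deferred_until; }
    };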
Work-fetch state
Each PRSC has its own set of data related to work fetch. This is stored in an object of class PRSC_WORK_FETCH.
Data members of PRSC_WORK_FETCH:
- ninstances
  - number of instances of this resource type

Used/set by rr_simulation():

- double shortfall
  - shortfall for this resource
- double nidle
  - number of currently idle instances
Member functions of PRSC_WORK_FETCH:
- rr_init()
  - called at the start of RR simulation. Compute project shares for this PRSC, and clear the overall and per-project shortfalls.
- set_nidle()
  - called by RR sim after the initial job assignment. Set nidle to the number of idle instances.
- accumulate_shortfall(dt)
  - called by RR sim for each time interval during the work-buffering period:

        shortfall += dt*(ninstances - instances in use)
        for each project p not backed off for this PRSC:
            p->PRSC_PROJECT_DATA.accumulate_shortfall(dt)

- select_project()
  - select the best project to request this type of work from. It's the project that is not backed off for this PRSC and for which LTD + p->shortfall is largest, also taking into consideration overworked projects etc.
- accumulate_debt(dt)

        for each project p:
            x = instances of this device used by p's running jobs
            y = p's share of this device
            update p's LTD
Each PRSC also needs to have some per-project data. This is stored in an object of class PRSC_PROJECT_DATA. It has the following "persistent" members (i.e., saved in state file):
- backoff timer*
  - how long to wait before asking this project for work specifically for this PRSC. Double it any time we ask for work for this PRSC and get none (maximum 24 hours); clear it when we ask for work for this PRSC and get a job.
And the following transient members (used by rr_simulation()):
- double share
  - # of instances this project should get based on resource share, relative to the set of projects not backed off for this PRSC.
- instances_used
  - # of instances currently being used
- double shortfall
- accumulate_shortfall(dt)

        shortfall += dt*(share - instances_used)
Each project has the following work-fetch-related state:
- double long_term_debt*
  - the amount of processing (including GPU, but expressed in terms of CPU seconds) owed to this project.
Debt accounting

    for each resource type R:
        for each project P:
            if P is not backed off for R:
                P.R.LTD += share
    for each running job J, project P:
        for each resource R used by J:
            P.R.LTD -= share*dt
RR simulation
    cpu_work_fetch.rr_init()
    cuda_work_fetch.rr_init()
    compute initial assignment of jobs
    cpu_work_fetch.set_nidle()
    cuda_work_fetch.set_nidle()
    do simulation as current
    on completion of an interval dt:
        cpu_work_fetch.accumulate_shortfall(dt)
        cuda_work_fetch.accumulate_shortfall(dt)
Work fetch
    rr_simulation()

    if cuda_work_fetch.nidle:
        cpu_work_fetch.shortfall = 0
        p = cuda_work_fetch.select_project()
        if p:
            send_req(p)
            return
    if cpu_work_fetch.nidle:
        cuda_work_fetch.shortfall = 0
        p = cpu_work_fetch.select_project()
        if p:
            send_req(p)
            return
    if cuda_work_fetch.shortfall:
        p = cuda_work_fetch.select_project()
        if p:
            send_req(p)
            return
    if cpu_work_fetch.shortfall:
        p = cpu_work_fetch.select_project()
        if p:
            send_req(p)
            return

    void send_req(p):
        req.cpu_req_seconds = cpu_work_fetch.shortfall
        req.cpu_req_ninstances = cpu_work_fetch.nidle
        req.cuda_req_seconds = cuda_work_fetch.shortfall
        req.cuda_req_ninstances = cuda_work_fetch.nidle
        req.work_req_seconds = max(req.cpu_req_seconds, req.cuda_req_seconds)
Handling scheduler reply
    if no jobs returned:
        double the backoff for each requested PRSC
    else:
        clear the backoff for the PRSC of each returned job
Scheduler changes
    global vars:
        have_cpu_app_versions
        have_cuda_app_versions

    per-request vars:
        bool coproc_request
        ncpu_jobs_sending
        ncuda_jobs_sending
        ncpu_seconds_to_fill
        ncuda_seconds_to_fill
        seconds_to_fill   (backwards compat; used if !coproc_request)

    overall startup:
        scan app versions, set have_* vars

    request startup:
        if send_only_cpu and no CPU app versions, don't send work
        if send_only_cuda and no CUDA app versions, don't send work

    work_needed():
        need_more_cpu_jobs = ncpu_jobs_sending < ninstances_cpu or cpu_seconds_to_fill > 0
        same for cuda
        return false if we don't need more CPU or more CUDA work

    get_app_version():
        if send_only_cpu, ignore CUDA versions
        if send_only_cuda, ignore CPU versions

    when committing a job:
        update n*_jobs_sending, n*_seconds_to_fill, seconds_to_fill