Work fetch and GPUs
Problems with the current work fetch policy
The current work-fetch policy is essentially:
- Do a weighted round-robin simulation, computing overall CPU shortfall
- If there's a shortfall, request work from the project with highest LTD
The scheduler request has a scalar "work_req_seconds" indicating the total duration of jobs being requested.
This policy has various problems.
- There's no way for the client to say "I have N idle CPUs; send me enough jobs to use them all".
And many problems related to GPUs:
- There may be no CPU shortfall, but GPUs are idle; no work will be fetched.
- If a GPU is idle, we should get work from a project that potentially has jobs for it.
- If a project has both CPU and GPU jobs, we may need to tell it to send only GPU (or only CPU) jobs.
- LTD is computed solely on the basis of CPU time used, so it doesn't provide a meaningful comparison between projects that use only GPUs, or between a GPU project and a CPU project.
This document proposes a work-fetch system that solves these problems.
For simplicity, the design assumes that there is only one GPU type (CUDA). It is straightforward to extend the design to handle additional GPU types.
Terminology
A job sent to a client is associated with an app version, which uses some number (possibly fractional) of CPUs and CUDA devices.
- A CPU job is one that uses only CPU.
- A CUDA job is one that uses CUDA (and may use CPU as well).
Scheduler request
New fields in scheduler request message:
double cpu_req_seconds: number of CPU seconds requested
double cuda_req_seconds: number of CUDA seconds requested
double ninstances_cpu: send enough jobs to occupy this many CPUs
double ninstances_cuda: send enough jobs to occupy this many CUDA devs
For compatibility with old servers, the message still has work_req_seconds; this is the max of (cpu,cuda)_req_seconds.
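The relationship between the new fields and the legacy field can be sketched as follows. This is a minimal illustration, not the actual client code: the struct name WorkRequest and the helper make_request are hypothetical, and only the field names come from the list above.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical container for the request fields listed above.
struct WorkRequest {
    double cpu_req_seconds = 0;    // CPU seconds requested
    double cuda_req_seconds = 0;   // CUDA seconds requested
    double ninstances_cpu = 0;     // send enough jobs to occupy this many CPUs
    double ninstances_cuda = 0;    // send enough jobs to occupy this many CUDA devices
    double work_req_seconds = 0;   // legacy field, read by old servers
};

// Hypothetical helper: fills in the per-resource fields and derives the
// legacy field as the max of the two per-resource requests.
inline WorkRequest make_request(double cpu_secs, double cuda_secs,
                                double ncpu, double ncuda) {
    assert(cpu_secs >= 0 && cuda_secs >= 0);
    WorkRequest r;
    r.cpu_req_seconds = cpu_secs;
    r.cuda_req_seconds = cuda_secs;
    r.ninstances_cpu = ncpu;
    r.ninstances_cuda = ncuda;
    r.work_req_seconds = std::max(cpu_secs, cuda_secs);
    return r;
}
```

With this sketch, a request for 100 CPU seconds and 3600 CUDA seconds would carry work_req_seconds = 3600, so an old server still sees a nonzero request.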
Client
New abstraction: processing resource or PRSC. There are two processing resource types: CPU and CUDA.
Each PRSC has its own set of data related to work fetch. This is stored in an object of class PRSC_WORK_FETCH.
Its data members are:
double shortfall: shortfall for this resource
double max_nidle: number of idle instances
Its member functions are:
clear(): called at the start of RR simulation
prepare(): called before exists_fetchable_project(). Checks whether there is a project to request work from for this resource, and caches it.
bool exists_fetchable_project(): whether there is a project we can ask for work for this resource
select_project(priority, char buf): if the importance of getting work for this resource is equal to priority, chooses and returns a PROJECT to request work from, and a string to put in the request message. Chooses the project for which LTD + expected payoff is largest.
Values for priority:
- DONT_NEED: no shortfalls
- NEED: a shortfall, but no idle devices right now
- NEED_NOW: idle devices right now
runnable_resource_share(): total resource share of projects with runnable jobs for this resource.
get_priority()
bool count_towards_share(PROJECT p): whether to count p's resource share in the total for this resource, i.e. whether we've had a job of this type in the last 30 days
add_shortfall(PROJECT, dt): add x to this project's shortfall, where x = dt*(share - instances used)
double total_share(): total resource share of projects we're counting
accumulate_debt(dt): for each project p:
  x = instances of this device used by p's running jobs
  y = p's share of this device
  update p's LTD
accumulate_shortfall(dt, i, n):
  i = instances in use, n = total instances
  nidle = n - i
  max_nidle = max(max_nidle, nidle)
  shortfall += dt*(nidle)
  for each project p for which count_towards_share(p)
    add_proj_shortfall(p, dt)
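The per-resource part of accumulate_shortfall(), together with the three priority values, can be sketched as below. This is only an illustration under stated assumptions: the struct name ResourceWorkFetch is hypothetical, the per-project loop is omitted, and the rule used in get_priority() (including its nidle_now argument) is a guess based on the descriptions of DONT_NEED, NEED and NEED_NOW, since the text does not spell it out.

```cpp
#include <algorithm>
#include <cassert>

enum Priority { DONT_NEED, NEED, NEED_NOW };

// Hypothetical stand-in for the per-resource work-fetch object.
struct ResourceWorkFetch {
    double shortfall = 0;   // idle instance-seconds accumulated over the simulation
    double max_nidle = 0;   // largest number of idle instances seen in any interval

    // Called at the start of the RR simulation.
    void clear() { shortfall = 0; max_nidle = 0; }

    // dt: interval length, i: instances in use, n: total instances.
    // The per-project shortfall update is left out of this sketch.
    void accumulate_shortfall(double dt, double i, double n) {
        assert(dt >= 0 && i >= 0 && i <= n);
        double nidle = n - i;
        max_nidle = std::max(max_nidle, nidle);
        shortfall += dt * nidle;
    }

    // Assumed rule: idle instances right now -> NEED_NOW,
    // otherwise any accumulated shortfall -> NEED, otherwise DONT_NEED.
    Priority get_priority(double nidle_now) const {
        if (nidle_now > 0) return NEED_NOW;
        if (shortfall > 0) return NEED;
        return DONT_NEED;
    }
};
```

For example, with 4 instances, an interval of 10 seconds with 3 in use followed by 5 seconds with all 4 in use gives shortfall = 10 and max_nidle = 1.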
Each PRSC also needs to have some per-project data. This is stored in an object of class PRSC_PROJECT_DATA. Its members include (* means save in state file):
double shortfall
int last_job*: last time we had a job from this project using this resource. If the time is within the last N days (30?), we assume that the project may have jobs of that type.
bool runnable
max deficit
backoff timer*: how long to wait before asking the project for work for this resource only. Double this any time we ask only for work for this resource and get none (maximum 24 hours); clear it when we have a job that uses the resource.
double share: # of instances this project should get based on RS
double long_term_debt*
debt accounting
for each resource type R:
  R.accumulate_debt(dt)
RR simulation
do simulation as current
on completion of an interval dt:
  cpu_work_fetch.accumulate_shortfall(dt)
  cuda_work_fetch.accumulate_shortfall(dt)
scheduler request msg
  double work_req_seconds
  double cuda_req_seconds
  bool send_only_cpu
  bool send_only_cuda
  double ninstances_cpu
  double ninstances_cuda
work fetch
We need to deal with the situation where there is a GPU shortfall but no projects are supplying GPU work. We don't want an overall backoff from those projects. Solution: maintain a separate backoff timer per resource.
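A per-resource, per-project backoff timer of this kind might look like the sketch below. The doubling, the 24-hour cap, and clearing on a job that uses the resource follow the description of the backoff timer; the struct name ResourceBackoff, the member names, and the 60-second starting interval are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Assumed starting interval; the text only specifies the 24-hour maximum.
constexpr double kMinBackoff = 60.0;
constexpr double kMaxBackoff = 24 * 3600.0;

// Hypothetical backoff state kept per (project, resource) pair.
struct ResourceBackoff {
    double interval = 0;     // current backoff length in seconds (0 = none)
    double retry_time = 0;   // earliest time we may ask for this resource again

    // We asked only for this resource and got no work: double the interval.
    void request_failed(double now) {
        interval = (interval == 0) ? kMinBackoff
                                   : std::min(interval * 2, kMaxBackoff);
        retry_time = now + interval;
    }

    // We have a job that uses this resource: clear the backoff.
    void got_job() {
        interval = 0;
        retry_time = 0;
    }

    bool can_ask(double now) const { return now >= retry_time; }
};
```

Because the state is kept per resource, a project that never supplies CUDA jobs ends up backed off for CUDA requests only, and can still be asked for CPU work.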
send_req(p)
  switch cpu_work_fetch.priority
    case DONT_NEED
      set no_cpu in req message
    case NEED, NEED_NOW:
      work_req_sec = p.cpu_shortfall
      ncpus_idle = p.max_idle_cpus
  switch cuda_work_fetch.priority
    case DONT_NEED
      set no_cuda in the req message
    case NEED, NEED_NOW:

for prior = NEED_NOW, NEED
  for each coproc C (in decreasing order of importance)
    p = C.work_fetch.select_proj(prior, msg);
    if p
      put msg in req message
      send_req(p)
      return
    else
      p = cpu_work_fetch(prior)
      if p
        send_req(p)
        return
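The two-pass selection (first resources that need work now, then resources with only a shortfall) can be sketched as follows. This is a simplified illustration, not the proposed implementation: the types Project, Resource and Choice and the function choose_fetch are hypothetical, the CPU is treated as the last entry in the resource list rather than as a separate fallback branch, and projects are ranked by LTD alone because the text does not define how the expected payoff is computed.

```cpp
#include <cassert>
#include <initializer_list>
#include <string>
#include <vector>

enum Priority { DONT_NEED, NEED, NEED_NOW };

// Hypothetical per-resource view of a project.
struct Project {
    std::string name;
    double long_term_debt;
    bool fetchable;   // e.g. not backed off for this resource
};

struct Resource {
    std::string name;
    Priority priority;
    std::vector<Project> projects;

    // Returns a project only if this resource's priority equals `wanted`.
    // Picks the fetchable project with the largest LTD (payoff term omitted).
    const Project* select_project(Priority wanted) const {
        if (priority != wanted) return nullptr;
        const Project* best = nullptr;
        for (const Project& p : projects) {
            if (!p.fetchable) continue;
            if (!best || p.long_term_debt > best->long_term_debt) best = &p;
        }
        return best;
    }
};

struct Choice {
    const Resource* rsc;
    const Project* project;
};

// `resources` is assumed to be ordered by decreasing importance
// (coprocessors first, CPU last).
inline Choice choose_fetch(const std::vector<Resource>& resources) {
    for (Priority wanted : {NEED_NOW, NEED}) {
        for (const Resource& r : resources) {
            if (const Project* p = r.select_project(wanted)) return {&r, p};
        }
    }
    return {nullptr, nullptr};
}
```

In this sketch a CPU with idle instances (NEED_NOW) wins over a CUDA device that merely has a shortfall (NEED), even though CUDA comes first in the list, because the outer loop runs over priorities.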
When we get a scheduler reply
if request.
scheduler