Changes between Version 6 and Version 7 of GpuWorkFetch


Timestamp: Dec 26, 2008, 10:32:43 AM
Author: davea

  • GpuWorkFetch

Unmodified:

= Work fetch and GPUs =

Removed (v6):

== Current policy ==

 * Weighted round-robin simulation
  * get per-project and overall CPU shortfalls
  * see what misses deadline
 * If overall shortfall, get work from project with highest LTD
 * Scheduler request includes just "work_req_seconds".

Problems:

There may be no CPU shortfall, but GPU is idle

If GPU is idle, we should get work from a project that potentially has jobs for it.

If the project has both CPU and GPU jobs, we may need to tell to send only GPU jobs.

LTD isn't meaningful with GPUs

== New policy ==

{{{
A CPU job is one that uses only CPU time
A CUDA job is one that uses CUDA (and may use CPU as well)

Added (v7):

== Problems with the current work fetch policy ==

The current work-fetch policy is essentially:
 * Do a weighted round-robin simulation, computing the overall CPU shortfall.
 * If there's a shortfall, request work from the project with the highest long-term debt (LTD).

The scheduler request has a single number, "work_req_seconds",
indicating the total duration of the jobs being requested.

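The following is an illustrative sketch of that decision logic, not the actual BOINC client code; the structure and function names here are hypothetical.

{{{
#include <vector>

// Hypothetical per-project state; in the real client these values are
// maintained by the round-robin simulation and debt accounting.
struct Project {
    double long_term_debt;    // LTD
};

// Old policy: if the simulation found an overall CPU shortfall, ask the
// project with the highest LTD for that many seconds of work.
Project* choose_project_old_policy(
    std::vector<Project>& projects, double overall_cpu_shortfall,
    double& work_req_seconds
) {
    if (overall_cpu_shortfall <= 0) return nullptr;    // nothing to fetch
    Project* best = nullptr;
    for (auto& p: projects) {
        if (!best || p.long_term_debt > best->long_term_debt) best = &p;
    }
    work_req_seconds = overall_cpu_shortfall;          // the single request field
    return best;
}
}}}
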
This policy has various problems.

 * There's no way for the client to say "I have N idle CPUs, so send me enough jobs to use them all".

And many problems related to GPUs:

 * There may be no CPU shortfall, but GPUs are idle; no work will be fetched.

 * If a GPU is idle, we should get work from a project that potentially has jobs for it.

 * If a project has both CPU and GPU jobs, we may need to tell it to send only GPU (or only CPU) jobs.

 * LTD is computed solely on the basis of CPU time used, so it doesn't provide a meaningful comparison between projects that use only GPUs, or between a GPU project and a CPU project.

This document proposes a work-fetch system that solves these problems.

For simplicity, the design assumes that there is only one GPU type (CUDA).
It is straightforward to extend the design to handle additional GPU types.

== Terminology ==

A job sent to a client is associated with an app version,
which uses some number (possibly fractional) of CPUs and CUDA devices.

 * A '''CPU job''' is one that uses only CPU.
 * A '''CUDA job''' is one that uses CUDA (and may use CPU as well).

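As a rough sketch, this distinction can be expressed as a predicate on the app version's resource usage; the structure and field names below are hypothetical, not the client's actual data structures.

{{{
// Hypothetical structure and field names, for illustration only.
struct AppVersionUsage {
    double avg_ncpus;    // CPUs used, possibly fractional
    double ncudas;       // CUDA devices used, possibly fractional
};

// A CUDA job uses some (possibly fractional) number of CUDA devices;
// a CPU job uses none.
inline bool is_cuda_job(const AppVersionUsage& av) {
    return av.ncudas > 0;
}
}}}
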
== Scheduler request ==

New fields in the scheduler request message:

'''double cpu_req_seconds''': number of CPU seconds requested

'''double cuda_req_seconds''': number of CUDA seconds requested

'''double ninstances_cpu''': send enough jobs to occupy this many CPUs

'''double ninstances_cuda''': send enough jobs to occupy this many CUDA devices

For compatibility with old servers, the message still has '''work_req_seconds''';
this is the max of cpu_req_seconds and cuda_req_seconds.

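A minimal sketch of how these fields fit together on the client side; the field names are those listed above, but the surrounding structure and function are hypothetical.

{{{
#include <algorithm>

// Hypothetical request structure; only the field names come from the list above.
struct WorkFetchRequest {
    double cpu_req_seconds;     // CPU seconds requested
    double cuda_req_seconds;    // CUDA seconds requested
    double ninstances_cpu;      // send enough jobs to occupy this many CPUs
    double ninstances_cuda;     // send enough jobs to occupy this many CUDA devices
    double work_req_seconds;    // kept for old servers

    // Old servers look only at work_req_seconds, so set it to the larger
    // of the two per-resource requests.
    void set_compat_field() {
        work_req_seconds = std::max(cpu_req_seconds, cuda_req_seconds);
    }
};
}}}
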
== Client ==

New abstraction: '''processing resource''' or PRSC.
There are two processing resource types: CPU and CUDA.

Each PRSC has its own

Unmodified:

----------------------
RESOURCE_WORK_FETCH