Changes between Version 10 and Version 11 of GpuWorkFetch


Timestamp: Dec 26, 2008, 11:45:40 AM
Author: davea
Comment: --

Legend:

Unmodified
Added
Removed
Modified
  • GpuWorkFetch

v10  v11
4    4    
5    5    The current work-fetch policy is essentially:
6         * Do a weighted round-robin simulation, computing overall CPU shortfall
7         * If there's a shortfall, request work from the project with highest LTD
     6    * Do a weighted round-robin simulation, computing the CPU shortfall (i.e., the idle CPU time we expect during the work-buffering period).
     7    * If there's a CPU shortfall, request work from the project with highest long-term debt (LTD).
8    8    
9    9    The scheduler request has a scalar "work_req_seconds"
10   10   indicating the total duration of jobs being requested.
11   11   
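A minimal sketch of this existing policy, using illustrative types and names rather than the actual client code:

{{{
#include <vector>

struct Project {
    double long_term_debt;    // CPU-time-based LTD
    double work_request;      // becomes work_req_seconds in the scheduler request
};

// After the weighted round-robin simulation has computed cpu_shortfall
// (expected idle CPU seconds over the buffering period), ask the
// project with the highest LTD for that much work.
void set_work_request(std::vector<Project*>& projects, double cpu_shortfall) {
    if (cpu_shortfall <= 0) return;           // no shortfall: fetch nothing
    Project* best = 0;
    for (size_t i = 0; i < projects.size(); i++) {
        if (!best || projects[i]->long_term_debt > best->long_term_debt) {
            best = projects[i];
        }
    }
    if (best) best->work_request = cpu_shortfall;
}
}}}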
12        This policy has various problems.
     12   This policy has various problems.  First:
13   13   
14   14    * There's no way for the client to say "I have N idle CPUs; send me enough jobs to use them all".
15   15   
16        And many problems related to GPUs:
17        
18         * There may be no CPU shortfall, but GPUs are idle; no work will be fetched.
     16   Problems related to GPUs:
     17   
     18    * If there is no CPU shortfall, no work will be fetched even if GPUs are idle.
19   19   
20   20    * If a GPU is idle, we should get work from a project that potentially has jobs for it.
21   21   
22         * If a project has both CPU and GPU jobs, we may need to tell it to send only GPU (or only CPU) jobs.
23        
24         * LTD is computed solely on the basis of CPU time used, so it doesn't provide a meaningly comparison between projects that use only GPUs, or between a GPU project and a CPU project.
25        
26        This document proposes a work-fetch system that solves these problems.
27        
28        For simplicity, the design assumes that there is only one GPU time (CUDA).
     22    * If a project has both CPU and GPU jobs, the client should be able to tell it to send only GPU (or only CPU) jobs.
     23   
     24    * LTD is computed solely on the basis of CPU time used, so it doesn't provide a meaningful comparison between projects that use only GPUs, or between a GPU project and a CPU project.
     25   
     26   This document proposes a modification to the work-fetch system that solves these problems.
     27   
     28   For simplicity, the design assumes that there is only one GPU type (CUDA).
29   29   It is straightforward to extend the design to handle additional GPU types.
30   30   

39   39   == Scheduler request ==
40   40   
41        New fields in scheduler request message:
     41   New fields in the scheduler request message:
42   42   
43   43   '''double cpu_req_seconds''': number of CPU seconds requested
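Only cpu_req_seconds appears in this excerpt; as a rough sketch, the per-resource request fields might look like the following (the struct name and the CUDA field are assumptions, not taken from the page):

{{{
// Sketch only: cpu_req_seconds is the field quoted above; the CUDA
// counterpart is assumed to exist and its name is illustrative.
struct WORK_FETCH_REQUEST {
    double cpu_req_seconds;    // number of CPU seconds requested
    double cuda_req_seconds;   // assumed: number of GPU (CUDA) seconds requested
};
}}}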
     
54   54   == Client ==
55   55   
     56   
56   57   New abstraction: '''processing resource''' or PRSC.
57   58   There are two processing resource types: CPU and CUDA.
58   59   
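In code, this abstraction could be modeled as a simple enumeration (names are illustrative only):

{{{
// Illustrative names only: the two processing resource (PRSC) types.
enum PRSC_TYPE {
    PRSC_TYPE_CPU,
    PRSC_TYPE_CUDA
};
}}}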
     60   === Per-resource-type backoff ===
     61   
     62   We need to handle the situation where there's a GPU shortfall
     63   but no projects are supplying GPU work
     64   (for either permanent or transient reasons).
     65   We don't want an overall work-fetch backoff from those projects.
     66   Instead, we maintain a separate backoff timer per (project, PRSC).
     67   This is doubled whenever we ask for only work of that type and don't get any;
     68   it's cleared whenever we get a job of that type.
     69   
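A sketch of this backoff scheme (field names and bounds are illustrative, not the actual client code):

{{{
#include <algorithm>

// One of these per (project, resource type).
struct RSC_PROJECT_BACKOFF {
    double backoff_interval;   // current backoff length, in seconds
    double backoff_until;      // don't ask this project for this resource before this time

    // Called when we asked for only this type of work and got none:
    // double the interval (within illustrative bounds).
    void request_failed(double now) {
        const double min_interval = 60;
        const double max_interval = 86400;
        backoff_interval = std::min(max_interval,
            std::max(min_interval, backoff_interval*2));
        backoff_until = now + backoff_interval;
    }

    // Called when we get a job of this type: clear the backoff.
    void got_job() {
        backoff_interval = 0;
        backoff_until = 0;
    }
};
}}}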
     70   === Work-fetch state ===
     71   
59   72   Each PRSC has its own set of data related to work fetch.
60   73   This is stored in an object of class PRSC_WORK_FETCH.
61   74   
62        Its data members are:
     75   Data members of PRSC_WORK_FETCH:
63   76   
64   77   '''double shortfall''': shortfall for this resource
65   78   '''double max_nidle''': number of idle instances
66   79   
67        Its member functions are:
     80   Member functions of PRSC_WORK_FETCH:
68   81   
69   82   '''clear()''': called at the start of RR simulation

109  122  update P's LTD
110  123  }}}
111       
112  124  
113  125  '''accumulate_shortfall(dt, i, n)''':

121  133  }}}
122  134  
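The bodies of these functions are elided in this excerpt. The following is a rough sketch of the data members and of what accumulate_shortfall() might do, under the assumption that dt is the length of the simulated time interval, i the number of instances busy during it, and n the total number of instances; everything beyond the names quoted above is hypothetical:

{{{
#include <algorithm>

struct PRSC_WORK_FETCH {
    double shortfall;    // idle instance-seconds expected during the buffering period
    double max_nidle;    // max number of simultaneously idle instances seen in the simulation

    // called at the start of the round-robin simulation
    void clear() {
        shortfall = 0;
        max_nidle = 0;
    }

    // dt: length of the simulated interval
    // i:  instances of this resource busy during the interval (assumed meaning)
    // n:  total instances of this resource (assumed meaning)
    void accumulate_shortfall(double dt, double i, double n) {
        double nidle = n - i;
        if (nidle <= 0) return;
        shortfall += nidle*dt;
        max_nidle = std::max(max_nidle, nidle);
    }
};
}}}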
123       
124  135  Each PRSC also needs to have some per-project data.
125  136  This is stored in an object of class PRSC_PROJECT_DATA.

145  156  '''double long_term_debt*'''
146  157  
147       === Per-resource-type backoff
148       
149       We need to handle the situation where there's a GPU shortfall
150       but no projects are supplying GPU work
151       (for either permanent or transient reasons).
152       We don't want an overall backoff from those projects.
153       Insteac, we maintain separate backoff timer per PRSC.
154  158  
155  159  === debt accounting ===

167  171     cuda_work_fetch.accumulate_shortfall(dt)
168  172  }}}
169       
170  173  
171  174  === Work fetch ===