Changes between Version 10 and Version 11 of GpuWorkFetch


Timestamp: Dec 26, 2008, 11:45:40 AM
Author: davea
Comment: --

Legend:

Unmodified
Added
Removed
Modified
  • GpuWorkFetch

v10  v11
4    4    
5    5    The current work-fetch policy is essentially:
6         * Do a weighted round-robin simulation, computing overall CPU shortfall
7         * If there's a shortfall, request work from the project with highest LTD
     6    * Do a weighted round-robin simulation, computing the CPU shortfall (i.e., the idle CPU time we expect during the work-buffering period).
     7    * If there's a CPU shortfall, request work from the project with highest long-term debt (LTD).
8    8    
9    9    The scheduler request has a scalar "work_req_seconds"
10   10   indicating the total duration of jobs being requested.
11   11   
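A minimal sketch of this existing policy, using illustrative types and names rather than the actual client code:

{{{
#include <vector>

struct Project {
    double long_term_debt;    // CPU-time-based LTD
    double work_request;      // becomes work_req_seconds in the scheduler request
};

// After the weighted round-robin simulation has computed cpu_shortfall
// (expected idle CPU seconds over the buffering period), ask the
// project with the highest LTD for that much work.
void set_work_request(std::vector<Project*>& projects, double cpu_shortfall) {
    if (cpu_shortfall <= 0) return;           // no shortfall: fetch nothing
    Project* best = 0;
    for (size_t i = 0; i < projects.size(); i++) {
        if (!best || projects[i]->long_term_debt > best->long_term_debt) {
            best = projects[i];
        }
    }
    if (best) best->work_request = cpu_shortfall;
}
}}}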
12        This policy has various problems.
     12   This policy has various problems.  First:
13   13   
14   14    * There's no way for the client to say "I have N idle CPUs; send me enough jobs to use them all".
15   15   
16        And many problems related to GPUs:
17        
18         * There may be no CPU shortfall, but GPUs are idle; no work will be fetched.
     16   Problems related to GPUs:
     17   
     18    * If there is no CPU shortfall, no work will be fetched even if GPUs are idle.
19   19   
20   20    * If a GPU is idle, we should get work from a project that potentially has jobs for it.
21   21   
22         * If a project has both CPU and GPU jobs, we may need to tell it to send only GPU (or only CPU) jobs.
23        
24         * LTD is computed solely on the basis of CPU time used, so it doesn't provide a meaningly comparison between projects that use only GPUs, or between a GPU project and a CPU project.
25        
26        This document proposes a work-fetch system that solves these problems.
27        
28        For simplicity, the design assumes that there is only one GPU time (CUDA).
     22    * If a project has both CPU and GPU jobs, the client should be able to tell it to send only GPU (or only CPU) jobs.
     23   
     24    * LTD is computed solely on the basis of CPU time used, so it doesn't provide a meaningful comparison between projects that use only GPUs, or between a GPU project and a CPU project.
     25   
     26   This document proposes a modification to the work-fetch system that solves these problems.
     27   
     28   For simplicity, the design assumes that there is only one GPU type (CUDA).
29   29   It is straightforward to extend the design to handle additional GPU types.
30   30   

39   39   == Scheduler request ==
40   40   
41        New fields in scheduler request message:
     41   New fields in the scheduler request message:
42   42   
43   43   '''double cpu_req_seconds''': number of CPU seconds requested
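Only cpu_req_seconds appears in this excerpt; as a rough sketch, the per-resource request fields might look like the following (the struct name and the CUDA field are assumptions, not taken from the page):

{{{
// Sketch only: cpu_req_seconds is the field quoted above; the CUDA
// counterpart is assumed to exist and its name is illustrative.
struct WORK_FETCH_REQUEST {
    double cpu_req_seconds;    // number of CPU seconds requested
    double cuda_req_seconds;   // assumed: number of GPU (CUDA) seconds requested
};
}}}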
     
54   54   == Client ==
55   55   
     56   
56   57   New abstraction: '''processing resource''' or PRSC.
57   58   There are two processing resource types: CPU and CUDA.
58   59   
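In code, this abstraction could be modeled as a simple enumeration (names are illustrative only):

{{{
// Illustrative names only: the two processing resource (PRSC) types.
enum PRSC_TYPE {
    PRSC_TYPE_CPU,
    PRSC_TYPE_CUDA
};
}}}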
     60   === Per-resource-type backoff ===
     61   
     62   We need to handle the situation where there's a GPU shortfall
     63   but no projects are supplying GPU work
     64   (for either permanent or transient reasons).
     65   We don't want an overall work-fetch backoff from those projects.
     66   Instead, we maintain a separate backoff timer per (project, PRSC).
     67   This is doubled whenever we ask for only work of that type and don't get any;
     68   it's cleared whenever we get a job of that type.
     69   
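A sketch of this backoff scheme (field names and bounds are illustrative, not the actual client code):

{{{
#include <algorithm>

// One of these per (project, resource type).
struct RSC_PROJECT_BACKOFF {
    double backoff_interval;   // current backoff length, in seconds
    double backoff_until;      // don't ask this project for this resource before this time

    // Called when we asked for only this type of work and got none:
    // double the interval (within illustrative bounds).
    void request_failed(double now) {
        const double min_interval = 60;
        const double max_interval = 86400;
        backoff_interval = std::min(max_interval,
            std::max(min_interval, backoff_interval*2));
        backoff_until = now + backoff_interval;
    }

    // Called when we get a job of this type: clear the backoff.
    void got_job() {
        backoff_interval = 0;
        backoff_until = 0;
    }
};
}}}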
     70   === Work-fetch state ===
     71   
59   72   Each PRSC has its own set of data related to work fetch.
60   73   This is stored in an object of class PRSC_WORK_FETCH.
61   74   
62        Its data members are:
     75   Data members of PRSC_WORK_FETCH:
63   76   
64   77   '''double shortfall''': shortfall for this resource
65   78   '''double max_nidle''': number of idle instances
66   79   
67        Its member functions are:
     80   Member functions of PRSC_WORK_FETCH:
68   81   
69   82   '''clear()''': called at the start of RR simulation

109  122  update P's LTD
110  123  }}}
111       
112  124  
113  125  '''accumulate_shortfall(dt, i, n)''':

121  133  }}}
122  134  
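The bodies of these functions are elided in this excerpt. The following is a rough sketch of the data members and of what accumulate_shortfall() might do, under the assumption that dt is the length of the simulated time interval, i the number of instances busy during it, and n the total number of instances; everything beyond the names quoted above is hypothetical:

{{{
#include <algorithm>

struct PRSC_WORK_FETCH {
    double shortfall;    // idle instance-seconds expected during the buffering period
    double max_nidle;    // max number of simultaneously idle instances seen in the simulation

    // called at the start of the round-robin simulation
    void clear() {
        shortfall = 0;
        max_nidle = 0;
    }

    // dt: length of the simulated interval
    // i:  instances of this resource busy during the interval (assumed meaning)
    // n:  total instances of this resource (assumed meaning)
    void accumulate_shortfall(double dt, double i, double n) {
        double nidle = n - i;
        if (nidle <= 0) return;
        shortfall += nidle*dt;
        max_nidle = std::max(max_nidle, nidle);
    }
};
}}}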
123       
124  135  Each PRSC also needs to have some per-project data.
125  136  This is stored in an object of class PRSC_PROJECT_DATA.

145  156  '''double long_term_debt*'''
146  157  
147       === Per-resource-type backoff
148       
149       We need to handle the situation where there's a GPU shortfall
150       but no projects are supplying GPU work
151       (for either permanent or transient reasons).
152       We don't want an overall backoff from those projects.
153       Insteac, we maintain separate backoff timer per PRSC.
154  158  
155  159  === debt accounting ===

167  171     cuda_work_fetch.accumulate_shortfall(dt)
168  172  }}}
169       
170  173  
171  174  === Work fetch ===