Changes between Version 10 and Version 11 of GpuWorkFetch
Timestamp: Dec 26, 2008, 11:45:40 AM
GpuWorkFetch
The current work-fetch policy is essentially:

 * Do a weighted round-robin simulation, computing the CPU shortfall (i.e., the idle CPU time we expect during the work-buffering period).
 * If there's a CPU shortfall, request work from the project with the highest long-term debt (LTD).

The scheduler request has a scalar "work_req_seconds" indicating the total duration of jobs being requested.

This policy has various problems. First:

 * There's no way for the client to say "I have N idle CPUs; send me enough jobs to use them all".

Problems related to GPUs:

 * If there is no CPU shortfall, no work will be fetched even if GPUs are idle.

 * If a GPU is idle, we should get work from a project that potentially has jobs for it.

 * If a project has both CPU and GPU jobs, the client should be able to tell it to send only GPU (or only CPU) jobs.

 * LTD is computed solely on the basis of CPU time used, so it doesn't provide a meaningful comparison between projects that use only GPUs, or between a GPU project and a CPU project.

This document proposes a modification to the work-fetch system that solves these problems.
For simplicity, the design assumes that there is only one GPU type (CUDA). It is straightforward to extend the design to handle additional GPU types.

...

== Scheduler request ==

New fields in the scheduler request message:

'''double cpu_req_seconds''': number of CPU seconds requested

...

== Client ==

New abstraction: '''processing resource''' or PRSC. There are two processing resource types: CPU and CUDA.

=== Per-resource-type backoff ===

We need to handle the situation where there's a GPU shortfall but no projects are supplying GPU work (for either permanent or transient reasons). We don't want an overall work-fetch backoff from those projects. Instead, we maintain a separate backoff timer per (project, PRSC). This is doubled whenever we ask for only work of that type and don't get any; it's cleared whenever we get a job of that type.

=== Work-fetch state ===

Each PRSC has its own set of data related to work fetch. This is stored in an object of class PRSC_WORK_FETCH.

Data members of PRSC_WORK_FETCH:

'''double shortfall''': shortfall for this resource

'''double max_nidle''': number of idle instances

Member functions of PRSC_WORK_FETCH:

'''clear()''': called at the start of RR simulation

...

{{{
...
update P's LTD
}}}

'''accumulate_shortfall(dt, i, n)''':

{{{
...
}}}

Each PRSC also needs to have some per-project data. This is stored in an object of class PRSC_PROJECT_DATA.
...

'''double long_term_debt*'''

=== debt accounting ===

...

{{{
...
cuda_work_fetch.accumulate_shortfall(dt)
}}}

=== Work fetch ===