Changes between Version 6 and Version 7 of ClientSchedOctTen
Timestamp: Oct 28, 2010, 2:22:09 PM
…
and typically work is requested for all processor types.

=== Resource shares not enforced ===

These policies may fail to enforce resource shares.
Here are two scenarios that illustrate the underlying problems:

==== Example 1 ====

A host has a fast GPU and a slow CPU.
…
and project A gets 100% of the GPU.

==== Example 2 ====

Same host.
…
All information about the relative debt of B and C is lost.

=== Too many scheduler requests ===

The work fetch mechanism has two buffer sizes: min and max.
The current work fetch policy requests work whenever the buffer goes below max.
This can cause the client to issue frequent small work requests.

The original intention of min and max is that they are hysteresis limits:
when the buffer goes below min, the client tries to fetch enough work
to increase the buffer to max, and then issues no further requests
until the buffer falls below min again.
In this way, the interval between scheduler RPCs to a given project
is at least (max - min).

== Proposal: credit-driven scheduling ==

The idea is to make resource share apply to overall credit,
not to individual resource types.
If two projects have the same resource share, they should have the same RAC.
Scheduling decisions should give preference to projects
…
 * Jobs may fail to get credit, e.g. because they don't validate.

Hence we will use a surrogate called '''estimated credit''',
maintained by the client.
If projects grant credit fairly, and if all jobs validate,
then estimated credit is roughly equal to granted credit over the long term.

Note: there is a potential advantage to using granted credit:
it penalizes projects that grant inflated credit.
The more credit a project grants, the less work a given host
will do for it, assuming the host is attached to multiple projects.
…
with an averaging half-life of, say, a month.

The '''scheduling priority''' of a project P is then
{{{
SP(P) = share(P) - REC(P)
}}}
where REC is normalized so that it sums to 1.

In Example 1 above, SP(A) will always be negative
and SP(B) will always be positive.
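
To make the bookkeeping concrete, here is a minimal sketch of REC
maintenance and priority computation. It is illustrative C++, not the
actual client code; the PROJECT fields, the update function, and the
30-day half-life constant are assumptions for the example.

{{{
#include <cmath>

// Illustrative sketch only -- not the actual client code.
// REC decays exponentially with (here) a one-month half-life;
// credit estimated for completed work is added on top.

const double REC_HALF_LIFE = 30*86400;    // seconds ("say, a month")

struct PROJECT {
    double resource_share;    // user-set share (arbitrary units)
    double rec;               // recent estimated credit
    double rec_time;          // when rec was last updated
};

// Decay REC to time "now", then add credit estimated since the last update.
void update_rec(PROJECT& p, double now, double estimated_credit) {
    double dt = now - p.rec_time;
    p.rec *= exp(-dt*log(2.0)/REC_HALF_LIFE);
    p.rec += estimated_credit;
    p.rec_time = now;
}

// SP(P) = share(P) - REC(P), with shares and REC each normalized
// across projects so that they sum to 1.
double sched_priority(const PROJECT& p, double share_sum, double rec_sum) {
    double share = p.resource_share/share_sum;
    double rec = rec_sum ? p.rec/rec_sum : 0;
    return share - rec;
}
}}}

Because both terms are normalized, the priorities sum to zero across
projects: a project with positive SP(P) has received less credit than
its share calls for, and should be favored.
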
== Job scheduling ==

As the job scheduling policy picks jobs to run (e.g. on a multiprocessor),
it needs to take into account the jobs already scheduled,
so that it doesn't always schedule multiple jobs from the same project.
To accomplish this, as each job is scheduled we increment a copy of REC(P)
as if the job had run for one scheduling period (sketched below).

== Work fetch ==

The proposed work fetch policy:

 * Work fetch for a given processor type is initiated whenever the saturated period is less than min.
 * The client asks the fetchable project with the greatest SP(P) for "shortfall" seconds of work.
 * Whenever a scheduler RPC to project P is done anyway (e.g. to report results) and SP(P) is greatest among fetchable projects for a given processor type, the client also requests "shortfall" seconds of that type (see the second sketch below).

Notes:

 * This will tend to fetch large (max - min) clumps of work from a single project, and job variety will be lower than under the current policy.
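
The two policies can be sketched in code. First, the job-scheduling
pass described above: a hypothetical C++ sketch, not the client's
actual data structures; rec_delta and the greedy loop are invented
here to stand in for "increment a copy of REC(P) as if the job had
run for one scheduling period".

{{{
#include <cstdio>
#include <vector>
using std::vector;

// Hypothetical sketch of the job-scheduling pass (not the client's code).
// Each scheduled job charges a working copy of the project's REC as if
// the job had run for one scheduling period, lowering that project's
// priority for the next pick, so one project doesn't fill every CPU.

struct PROJECT {
    const char* name;
    double share;         // resource share, normalized to sum to 1
    double rec_temp;      // working copy of normalized REC for this pass
    double rec_delta;     // normalized REC one scheduling period is worth
    int runnable_jobs;    // count of runnable jobs (details elided)
};

double sched_priority(const PROJECT& p) { return p.share - p.rec_temp; }

// Pick a job for each of ncpus processors, highest priority first.
void schedule_cpus(vector<PROJECT>& projects, int ncpus) {
    for (int i = 0; i < ncpus; i++) {
        PROJECT* best = nullptr;
        for (auto& p : projects) {
            if (!p.runnable_jobs) continue;
            if (!best || sched_priority(p) > sched_priority(*best)) best = &p;
        }
        if (!best) break;                    // nothing left to run
        printf("CPU %d: job from %s\n", i, best->name);
        best->runnable_jobs--;
        best->rec_temp += best->rec_delta;   // charge one scheduling period
    }
}
}}}
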
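And a matching sketch of the proposed work fetch rule, with min and max
acting as hysteresis limits. Again hypothetical: the fetchable flag
stands in for the client's various reasons not to ask a project
(suspended, "no new tasks", backed off, and so on).

{{{
#include <vector>
using std::vector;

// Hypothetical sketch of work fetch for one processor type (not the
// client's code). Nothing is requested while the saturated period is
// at or above min; once it drops below min, the fetchable project with
// the greatest SP(P) is asked for enough work to refill the buffer to max.

struct PROJECT {
    double share;      // normalized resource share
    double rec;        // normalized REC
    bool fetchable;    // false if suspended, "no new tasks", backed off, ...
};

double sched_priority(const PROJECT& p) { return p.share - p.rec; }

// If a fetch is due, return the project to ask and set shortfall to the
// number of seconds of work to request; otherwise return nullptr.
PROJECT* choose_fetch(vector<PROJECT>& projects,
                      double saturated_period,   // seconds the device stays busy
                      double min_buf, double max_buf,
                      double& shortfall) {
    if (saturated_period >= min_buf) return nullptr;   // hysteresis: wait for min
    shortfall = max_buf - saturated_period;            // refill all the way to max
    PROJECT* best = nullptr;
    for (auto& p : projects) {
        if (!p.fetchable) continue;
        if (!best || sched_priority(p) > sched_priority(*best)) best = &p;
    }
    return best;
}
}}}

Because each successful fetch refills the buffer to max, the interval
between scheduler RPCs to a given project is at least (max - min),
which is what removes the frequent small requests described above.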