Changes between Version 24 and Version 25 of CreditNew
Timestamp: Mar 10, 2010, 9:02:47 AM
Legend:
- Unmodified: no prefix
- Added (present only in v25): prefixed with "+"
- Removed (present only in v24): prefixed with "-"
- Modified: shown as the removed v24 line(s) followed by the added v25 line(s)
- "…" marks unchanged lines omitted for brevity
CreditNew
When a job J is issued to a host,
-the scheduler specifies flops_est(J),
-a FLOPS estimate based on the resources used by the job
-and their peak speeds.
+the scheduler computes peak_flops(J)
+based on the resources used by the job and their peak speeds.

If the job is finished in elapsed time T,
we define peak_flop_count(J), or PFC(J), as
{{{
-PFC(J) = T * (sum over devices D (usage(J, D) * peak_flop_rate(D)))
+PFC(J) = T * peak_flops(J)
}}}

Notes:

-* PFC(J) is
* We use elapsed time instead of actual device time (e.g., CPU time).
If a job uses a resource inefficiently
…
The key thing is that BOINC reserved the device for the job,
whether or not the job used it efficiently.
-* usage(J, D) may not be accurate; e.g., a GPU job may take
+* peak_flops(J) may not be accurate; e.g., a GPU job may take
more or less CPU than the scheduler thinks it will.
Eventually we may switch to a scheme where the client
…
* For projects (CPDN) that grant partial credit via
trickle-up messages, substitute "partial job" for "job".
-These projects must include elapsed time,
-app version ID, and FLOPS estimate in the trickle message.
+These projects must include elapsed time and result ID
+in the trickle message.

-The granted credit for a job J is proportional to PFC(J),
+The credit for a job J is proportional to PFC(J),
but is normalized in the following ways:

+== ''A priori'' job size estimates ==
+
+If we have an ''a priori'' estimate of job size,
+we can normalize by this to reduce the variance
+of various distributions (see below).
+This makes estimates of the means converge more quickly.
+
+We'll use workunit.rsc_fpops_est as this a priori estimate,
+and we'll denote it E(J).
+
+''A posteriori'' estimates of job size may exist also
+(e.g., an iteration count reported by the app)
+but using this for anything introduces a new cheating risk,
+so it's probably better not to.

== Cross-version normalization ==

If a given application has multiple versions (e.g., CPU and GPU versions),
-the granted credit per job is adjusted
+the credit per job is adjusted
so that the average is the same for each version.

-We maintain the average PFC^mean^(V) of PFC() for each app version V.
+We maintain the average PFC^mean^(V) of PFC(J)/E(J) for each app version V.
We periodically compute PFC^mean^(CPU) and PFC^mean^(GPU),
and let X be the min of these.
…
S(V) = (X/PFC^mean^(V))

-The result for a given job J
-is called "Version-Normalized Peak FLOP Count", or VNPFC(J):
+The "Version-Normalized Peak FLOP Count", or VNPFC(J), is

VNPFC(J) = S(V) * PFC(J)

Notes:
-* This addresses the common situation
+* Version normalization addresses the common situation
where an app's GPU version is much less efficient than the CPU version
(i.e. the ratio of actual FLOPs to peak FLOPs is much less).
…
It's not exactly "Actual FLOPs", since the most efficient
version may not be 100% efficient.
-* There are two sources of variance in PFC(V):
-the variation in host efficiency,
-and possibly the variation in job size.
-If we have an ''a priori'' estimate of job size
-(e.g., workunit.rsc_fpops_est)
-we can normalize by this to reduce the variance,
-and make PFC^mean^(V) converge more quickly.
-* ''a posteriori'' estimates of job size may exist also
-(e.g., an iteration count reported by the app)
-but using this for anything introduces a new cheating risk,
-so it's probably better not to.
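To make the quantities in the new (v25) text concrete, here is a minimal C++ sketch of PFC(J), the E(J) normalization, and the cross-version scaling factor. The struct and function names are placeholders chosen for this sketch, not the actual BOINC server types.

{{{
#include <algorithm>

// Illustrative stand-ins for the quantities named above (not real server structs).
struct JOB {
    double elapsed_time;     // T: elapsed (wall-clock) time
    double peak_flops;       // peak_flops(J), computed by the scheduler at dispatch
    double rsc_fpops_est;    // E(J): a priori size estimate (workunit.rsc_fpops_est)
};

// PFC(J) = T * peak_flops(J)
double peak_flop_count(const JOB& j) {
    return j.elapsed_time * j.peak_flops;
}

// The sample contributed to PFC^mean^(V): PFC(J)/E(J)
double pfc_sample(const JOB& j) {
    return peak_flop_count(j) / j.rsc_fpops_est;
}

// S(V) = X / PFC^mean^(V), where X is the min of the per-version means
double version_scale(double pfc_mean_v, double pfc_mean_cpu, double pfc_mean_gpu) {
    double x = std::min(pfc_mean_cpu, pfc_mean_gpu);
    return x / pfc_mean_v;
}

// VNPFC(J) = S(V) * PFC(J)
double vnpfc(const JOB& j, double pfc_mean_v, double pfc_mean_cpu, double pfc_mean_gpu) {
    return version_scale(pfc_mean_v, pfc_mean_cpu, pfc_mean_gpu) * peak_flop_count(j);
}
}}}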

== Cross-project normalization ==

If an application has both CPU and GPU versions,
-then the version normalization mechanism uses the CPU
-version as a "sanity check" to limit the credit granted to GPU jobs.
+the version normalization mechanism uses the CPU
+version as a "sanity check" to limit the credit granted to GPU jobs
+(or vice versa).

Suppose a project has an app with only a GPU version,
…
then for each version V we let
S(V) be the average scaling factor
-for that plan class among projects that do have both CPU and GPU versions.
+for that resource type among projects that have both CPU and GPU versions.
This factor is obtained from a central BOINC server.
V's jobs are then scaled by S(V) as above.
…
Notes:

-* We use plan class,
-since e.g. the average efficiency of CUDA 2.3 apps may be different
-than that of CUDA 2.1 apps.
-* Initially we'll obtain scaling factors from large projects
-that have both GPU and CPU apps (e.g., SETI@home).
-Eventually we'll use an average (weighted by work done)
-over multiple projects (see below).
+* The "average scaling factor" is weighted by work done.

== Host normalization ==
…
Then the average credit per job should be the same for all hosts.
To ensure this, for each app version V and host H
-we maintain PFC^mean^(H, A).
+we maintain PFC^mean^(H, A),
+the average of PFC(J)/E(J) for jobs completed by H using A.
The '''claimed FLOPS''' for a given job J is then

…
* GPUGrid.net's scheme for sending some (presumably larger)
jobs to GPUs with more processors.
-In these cases average credit per job must differ between hosts,
-according to the types of jobs that are sent to them.
-
-This can be done by dividing
-each sample in the computation of PFC^mean^ by WU.rsc_fpops_est
-(in fact, there's no reason not to always do this).
+The normalization by E(J) handles this
+(assuming that wu.fpops_est is set appropriately).

Notes:
…
and increases the claimed credit of hosts that are more efficient
than average.
-* PFC^mean^ is averaged over jobs, not hosts.

== Computing averages ==
…

* The quantities being averaged may gradually change over time
-(e.g. average job size may change,
-app version efficiency may change as new versions are deployed)
+(e.g. average job size may change)
and we need to track this.
* A given sample may be wildly off,
and we can't let this mess up the average.

-In addition, we may as well maintain the variance of the quantities,
-although the current system doesn't use it.
-
-The code that does all this is
+The code that does this is
[http://boinc.berkeley.edu/trac/browser/trunk/boinc/lib/average.h here].
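The two averaging requirements above (follow gradual change, resist wildly-off samples) can be met with an exponentially weighted average whose samples are clamped against the current mean. The following is only an illustration of that idea under assumed parameter values (alpha, clamp, warm-up count); the actual implementation is the average.h linked above.

{{{
#include <algorithm>

// Exponentially weighted running average with outlier clamping (illustrative).
struct RUNNING_AVERAGE {
    double mean = 0;
    long nsamples = 0;

    void update(double sample, double alpha = 0.01, double clamp = 10.0) {
        if (nsamples < 10) {
            // plain arithmetic mean while there are few samples
            mean = (mean * nsamples + sample) / (nsamples + 1);
        } else {
            // a wildly-off sample can't move the mean by more than a bounded amount
            sample = std::min(sample, clamp * mean);
            // exponential decay lets the average track gradual change over time
            mean += alpha * (sample - mean);
        }
        nsamples++;
    }
};
}}}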
…
and sets their scaling factor based on the above.

+== Anonymous platform ==
+
+For anonymous platform apps, since we don't reliably
+know anything about the devices involved,
+we don't try to estimate PFC.
+
+For each app, we maintain claimed_credit^mean^(A),
+the average of claimed_credit(J)/E(J).
+
+The claimed credit for anonymous platform jobs is
+
+ claimed_credit^mean^(A)*E(J)
+
+The server maintains host_app_version records for anonymous platform,
+and it keeps track of elapsed time statistics there.
+These have app_version_id = -1 for CPU, -2 for NVIDIA GPU, -3 for ATI.
+
+== Replication ==
+
+We take the set of hosts that
+are not anon platform and not on scale probation (see below).
+If this set is nonempty, we grant the average of their claimed credit.
+Otherwise we grant
+
+ claimed_credit^mean^(A)*E(J)

== Cheat prevention ==

…
by claiming excessive credit
(i.e., by falsifying benchmark scores or elapsed time).
-An exaggerated claim will increase VNPFC*(H,A),
-causing subsequent claimed credit to be scaled down proportionately.
+An exaggerated claim will increase PFC^mean^(H,A),
+causing subsequent credit to be scaled down proportionately.

This means that no special cheat-prevention scheme
…
in this case, granted credit = claimed credit.

-For jobs that are replicated,
-granted credit is set to:
-* if the larger host is on scale probation, the smaller
-* if larger > 2*smaller, granted = 1.5*smaller
-* else granted = (larger+smaller)/2
-
However, two kinds of cheating still have to be dealt with:

…
For example, claiming a PFC of 1e304.
-This can be minimized by
-capping VNPFC(J) at some multiple (say, 20) of VNPFC^mean^(A).
-If this is enforced, the host's error rate is set to the initial value,
+If PFC(J) exceeds some multiple (say, 20) of PFC^mean^(V),
+the host's error rate is set to the initial value,
so it won't do single replication for a while,
and scale_probation (see below) is set to true.
…
In addition, to limit the extent of cheating
(in case the above mechanism is defeated somehow)
-the host scaling factor will be min'd with a
-project-wide config parameter (default, say, 3).
-
-== Trickle credit ==
-
-CPDN breaks jobs into segments,
-has the client send a trickle-up message on completion of each segment,
-and grants credit in the trickle-up handler.
-
-In this case, the trickle-up message should include
-the incremental elapsed time of the segment.
-The trickle-up handler should then call {{{compute_claimed_credit()}}}
-(see below) to determine the claimed credit.
-In this case segments play the role of jobs in the credit-related DB fields.
+the host scaling factor will be min'd with a constant (say, 3).

== Error rate, host punishment, and turnaround time estimation ==
…
so we'll move "max_results_day" from the host table to host_app_version.
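A C++ sketch of the replication-granting and cheat-limiting rules in the new (v25) text. RESULT_INFO and the function signatures are placeholders for this sketch; the constants 20 and 3 are the "say" values from the text.

{{{
#include <algorithm>
#include <vector>

// Illustrative per-result summary (not the actual server struct).
struct RESULT_INFO {
    double claimed_credit;
    bool anonymous_platform;
    bool scale_probation;
};

// Replication: grant the average claimed credit of hosts that are neither
// anonymous platform nor on scale probation; otherwise fall back to
// claimed_credit^mean^(A) * E(J).
double granted_credit(
    const std::vector<RESULT_INFO>& results,
    double claimed_credit_mean_app,   // claimed_credit^mean^(A)
    double e_j                        // E(J)
) {
    double sum = 0;
    int n = 0;
    for (const RESULT_INFO& r : results) {
        if (r.anonymous_platform || r.scale_probation) continue;
        sum += r.claimed_credit;
        n++;
    }
    if (n) return sum / n;
    return claimed_credit_mean_app * e_j;
}

// One-time cheat check: a PFC far above the app-version mean puts the host
// back on replication and on scale probation.
void check_one_time_cheat(
    double pfc_j, double pfc_mean_v,
    double initial_error_rate,
    double& host_error_rate, bool& scale_probation
) {
    if (pfc_j > 20 * pfc_mean_v) {
        host_error_rate = initial_error_rate;  // no single replication for a while
        scale_probation = true;
    }
}

// The host scaling factor is min'd with a constant (say, 3).
double capped_host_scale(double host_scale) {
    return std::min(host_scale, 3.0);
}
}}}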
-== Anonymous platform ==
-
-For anonymous platform apps, since we don't necessarily
-know anything about the devices involved,
-we don't try to estimate PFC.
-Instead, we give the average credit for the app,
-scaled by the job size.
-
-The server maintains host_app_version records for anonymous platform,
-and it keeps track of elapsed time statistics there.
-These have app_version_id = -1 for CPU, -2 for NVIDIA GPU, -3 for ATI.

== App plan functions ==

…
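For the anonymous-platform handling retained in the new "Anonymous platform" section above, a small sketch of the pseudo app_version_id convention and the claimed-credit formula. The enum and function names are illustrative, not the actual server API.

{{{
// Pseudo app_version_id values used for anonymous-platform host_app_version
// records, as listed in the text.
enum ANON_APP_VERSION_ID {
    ANON_CPU        = -1,
    ANON_NVIDIA_GPU = -2,
    ANON_ATI_GPU    = -3
};

// Claimed credit for an anonymous-platform job: claimed_credit^mean^(A) * E(J)
double anon_claimed_credit(
    double claimed_credit_mean_app,   // claimed_credit^mean^(A)
    double e_j                        // E(J)
) {
    return claimed_credit_mean_app * e_j;
}
}}}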