Changes between Version 32 and Version 33 of CreditNew
- Timestamp: Mar 26, 2010, 3:23:57 PM
CreditNew
 v32  v33
  20   20   Notes:
  21   21   * For our purposes, the peak FLOPS of a device
  22      -   uses single or double precision, whichever is higher.
       22 +   is based on single or double precision, whichever is higher.
  23   23
  24   24   == Credit system goals ==
   …    …
 113  113   They aren't cheat-proof, and we don't use them.
 114  114
 115      - == Peak FLOP Count (PFC) ==
      115 + == Peak FLOP Count ==
 116  116
 117  117   This system uses the Peak-FLOPS-based approach,
   …    …
 127  127   PFC(J) = T * peak_flops(J)
 128  128
      129 + The credit for a job J is typically proportional to PFC(J),
      130 + but is limited and normalized in various ways.
      131 +
 129  132   Notes:
 130  133
 131      - * PFC(J) is not cheat-proof;
      134 + * PFC(J) is not reliable;
 132  135     cheaters can falsify elapsed time or device attributes.
 133  136   * We use elapsed time instead of actual device time (e.g., CPU time).
   …    …
 147  150   in the trickle message.
 148  151
 149      - By default, the credit for a job J is proportional to PFC(J),
 150      - but is limited and normalized in the following ways:
 151      -
 152  152   == Computing averages ==
 153  153
   …    …
 160  160     and we need to track this.
 161  161     This done as follows: for the first N samples
 162      -   (N = ~100 for app versions, ~10 for hosts)
 163  162     we take the straight average.
 164      -   After that we use an exponentially-weighted average
 165      -   (with appropriate parameter for app version and host)
 166      - * A given sample may be wildly off,
 167      -   and we can't let this mess up the average.
 168      -   Samples after the first are capped at 10 times the current average.
      163 +   After that we use an exponentially-weighted average with parameter A.
      164 +   The choice of N and A depends on the entity involved;
      165 +   for app versions (which typically get thousands of jobs per day)
      166 +   we might use N=100 and A=.001.
      167 +   For hosts (which typically get a few jobs per day)
      168 +   we might use N=10 and A=.01.
      169 + * To reduce the effect of erroneously huge samples,
      170 +   samples after the first are capped at X times the current average.
      171 +   X depends on the entity:
      172 +   maybe 10 for hosts, 100 for app versions.
 169  173   * We keep track of the number of samples,
 170  174     and use an average only if its number of samples
   …    …
 175  179   We maintain the following estimates:
 176  180
 177      - app.min_avg_pfc:: an estimate of the average actual FLOPS for an app
      181 + app.min_avg_pfc:: an estimate of the average actual FLOPS for the app
 178  182     (normalized by wu.fpops_est)
 179  183   app_version.pfc_avg:: the average of PFC(J)/wu.fpops_est for an app version.
      184 + app_version.pfc_scale:: a PFC scale factor for the app version
 180  185   host_app_version.pfc_avg:: for each app version V and host H,
 181  186     the average of PFC(J)/wu.fpops_est for jobs completed by H using A.
      187 + host_app_version.scale_probation::
      188 +   if set, the host is suspected of cherry-picking (see below)
      189 +   and we don't use host normalization
 182  190
 183  191   == Sanity check ==
 184  192
 185  193   If PFC(J) is infinite or is > wu.fpops_bound,
 186      - J is assigned a "default PFC" and other processing is skipped.
 187      - Default PFC is determined as follows:
      194 + J is assigned a "default PFC" D and other processing is skipped.
      195 + D is determined as follows:
 188  196
 189  197   * If app.min_avg_pfc is defined then
   …    …
 194  202
 195  203   D = wu.fpops_est
      204 +
      205 + We also set host_app_version.scale_probation to true
      206 + (ensuring that the host scale factor isn't used for a while)
      207 + and host_app_version.error_rate to an initial value
      208 + (ensuring that jobs sent to this host are replicated for a while).
 196  209
 197  210   == Cross-version normalization ==
   …    …
 200  213   (e.g., CPU, multi-thread, and GPU versions).
 201  214   If jobs are distributed uniformly to versions,
 202      - all versions should get the same average credit.
 203      - We adjust the credit per job
 204      - so that the average is the same for each version.
      215 + all versions should get the same average granted credit.
      216 + To make this so, we scale PFC as follows.
 205  217
 206  218   For each app, we periodically compute cpu_pfc
   …    …
 228  240
 229  241   app.min_avg_pfc = app_version.pfc_avg
      242 + app_version.pfc_scale = 1
 230  243
 231  244   Notes:
   …    …
 246  259     then this mechanism doesn't work as intended.
 247  260     One solution is to create separate apps for separate types of jobs.
 248      - * Cheating or erroneous hosts can influence PFC^mean^(V) to some extent.
      261 + * Cheating or erroneous hosts can influence app_version.pfc_avg to some extent.
 249  262     This is limited by the Sanity Check mechanism,
 250  263     and by the fact that only validated jobs are used.
 251  264     The effect on credit will be negated by host normalization
 252  265     (see below).
 253      -   There may be an effect on cross-version normalization.
 254      -   This could be eliminated by computing PFC^mean^(V)
 255      -   as the sample-median value of PFC^mean^(H, V) (see below).
      266 +   There may be an adverse effect on cross-version normalization.
      267 +   This could be eliminated by computing app_version.pfc_avg
      268 +   as the sample-median value of host_app_version.pfc_avg
 256  269
 257  270   == Host normalization ==
   …    …
 261  274   Then the average credit per job should be the same for all hosts.
 262  275
 263      - We scale PFC by the factor
      276 + To achieve this, we scale PFC by the factor
 264  277
 265  278   app_version.pfc_avg / host_app_version.pfc_avg
   …    …
 271  284   jobs to GPUs with more processors.
 272  285
 273      - The normalization by wu.fpops_est handles this.
      286 + The normalization by wu.fpops_est handles this
      287 + (assuming that it's set correctly).
 274  288
 275  289   Notes:
 276  290   * For apps with large variance of job sizes,
 277      -   the host normalization mechanism is prone to
      291 +   the host normalization mechanism is vulnerable to
 278  292     a type of cheating called "cherry picking".
 279  293     A mechanism for defeating this is described below.
   …    …
 290  304   and it keeps track of PFC and elapsed time statistics there.
 291  305   There are separate records per resource type.
 292      - The app_version_id encodes the app ID and the resource type
      306 + The record's app_version_id encodes the app ID and the resource type
 293  307   (-2 for CPU, -3 for NVIDIA GPU, -4 for ATI).
 294  308
 295      - If app.min_avg_pfc is defined and
      309 + If app.min_avg_pfc is defined,
 296  310   host_app_version.pfc_avg is above sample threshold,
      311 + and host_app_version.scale_probation is not set,
 297  312   we normalize PFC by the factor
 298  313
   …    …
 309  324   Notes:
 310  325
 311      - * We don't assume that anonymous platform apps on
 312      -   different hosts but with the same platform and resource type
 313      -   are comparable.
      326 + * In the current design, anonymous platform jobs don't
      327 +   contributed to app.min_avg_pfc,
      328 +   but it may be used to determine their credit.
      329 +   This may cause problems:
      330 +   e.g., suppose a project offers an inefficient version
      331 +   and volunteers make a much more efficient version
      332 +   and run it anonymous platform.
      333 +   They'd get an unfair amount of credit.
      334 +   This could be fixed by creating app_version records
      335 +   representing all anonymous platform apps of a given
      336 +   platform and resource type.
 314  337
 315  338   == Summary ==
   …    …
 327  350   approx = true;
 328  351   if pfc > wu.fpops_bound
      352 +     host_app_version.scale_probation = true
      353 +     host_app_version.error_rate = initial value // replicate for a while
 329  354       if app.min_avg_pfc is defined
 330  355           F = app.min_avg_pfc * wu.fpops_est
   …    …
 333  358   else
 334  359       if job is anonymous platform
 335  360           if app.min_avg_pfc is defined
 336  361               if host_app_version.pfc_avg is above sample threshold
 337      -                 approx = false
 338      -                 F = app.min_avg_pfc / host_app_version.pfc_avg
 339      -             else
 340      -                 F = app.min_avg_pfc * wu.fpops_est
      362 +               and not host_app_version.scale_probation
      363 +                 F = app.min_avg_pfc / host_app_version.pfc_avg
      364 +                 approx = false
      365 +             else
      366 +                 F = app.min_avg_pfc * wu.fpops_est
 341  367           else
 342  368               F = wu.fpops_est
 343  369       else
 344  370           F = pfc;
 345      -         if Scale(V) is defined
 346      -             F *= Scale(V)
 347      -         if Scale(H, V) is defined and (H,V) is not on scale probation
 348      -             F *= Scale(H, V)
      371 +         host_scale = 0
      372 +         if host_app_version.pfc_avg is above sample threshold
      373 +           and not host_app_version.scale_probation
      374 +             host_scale = min(10, app_version.pfc_avg / host_app_version.pfc_avg)
      375 +         if app_version.pfc_scale is defined
      376 +             F *= app_version.pfc_scale
      377 +             if host_scale
      378 +                 F *= host_scale
      379 +                 approx = false
      380 +         else
      381 +             if host_scale
      382 +                 F *= host_scale
      383 +         app_version.pfc_avg.update(F)
      384 +         host_app_version.pfc_avg.update(F)
 349  385   }}}
 350  386
   …    …
 353  389   The claimed credit of a job (in Cobblestones) is
 354  390
 355      - C = F * 200/86400e9
 356      -
      391 + C = F * cobblestone_scale
      392 +
      393 + where cobblestone_scale is 200/86400e9.
 357  394   If replication is not used, this is the granted credit.
 358  395
   …    …
 364  401   {{{
 365  402   if app.min_avg_pfc is defined
 366      -     C = app.min_avg_pfc*wu.fpops_est
      403 +     C = app.min_avg_pfc*wu.fpops_est*cobblestone_scale
 367  404   else
 368      -     C = wu.fpops_est * 200/86400e9
      405 +     C = wu.fpops_est * cobblestone_scale
 369  406   }}}
 370  407
   …    …
 421  458   by claiming excessive credit
 422  459   (i.e., by falsifying benchmark scores or elapsed time).
 423      - An exaggerated claim will increase PFC^mean^(H,A),
      460 + An exaggerated claim will increase host_app_version.pfc_avg,
 424  461   causing subsequent credit to be scaled down proportionately.
 425  462
   …    …
 434  471   For example, claiming a PFC of 1e304.
 435  472
 436      - If PFC(J) exceeds some multiple (say, 20) of PFC^mean^(V),
 437      - the host's error rate is set to the initial value,
 438      - so it won't do single replication for a while,
 439      - and scale_probation (see below) is set to true.
 440      -
 441      - == Cherry picking ==
      473 + This is handled by the sanity check mechanism,
      474 + which grants a default amount of credit
      475 + and treats the host with suspicion for a while.
      476 +
      477 + === Cherry picking ===
 442  478
 443  479   Suppose an application has a mix of long and short jobs.
   …    …
 471  507   and now < host_scale_time, don't use the host scale factor
 472  508
 473      - The idea is to apply the host scaling factor
      509 + The idea is to use the host scaling factor
 474  510   only if there's solid evidence that the host is NOT cherry picking.
 475  511
   …    …
 534  570   {{{
 535  571   int host_id;
 536      - int app_version_id;
      572 + int app_version_id; // generalized for anon platform
 537  573   AVERAGE pfc;
 538      - AVERAGE_VAR et;
      574 + AVERAGE_VAR et; // elapsed time / wu.fpops_est
 539  575   double host_scale_time;
 540  576   bool scale_probation;
   …    …
 556  592   {{{
 557  593   double min_avg_pfc;
 558      - bool host_scale_check;
      594 + bool host_scale_check; // whether to do scale probation
 559  595   int max_jobs_in_progress;
 560  596   int max_gpu_jobs_in_progress;
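
The averaging rule that version 33 introduces under "Computing averages" (a straight average for the first N samples, then an exponentially-weighted average with parameter A, with samples after the first capped at X times the current average) can be sketched in Python roughly as follows. This is a minimal illustration and not BOINC's implementation: the class name `Average`, its field names, and the `min_samples` argument are hypothetical, the exponential step is assumed to be `avg += A * (sample - avg)`, and samples are assumed to be positive.

```python
from dataclasses import dataclass


@dataclass
class Average:
    """Hypothetical running average following the 'Computing averages' text."""
    n_straight: int    # N: number of samples averaged straight
    weight: float      # A: exponential weighting parameter
    cap_factor: float  # X: cap on a sample, as a multiple of the current average
    n: int = 0         # number of samples seen so far
    avg: float = 0.0   # current average

    def update(self, sample: float) -> None:
        # Samples after the first are capped at X times the current average.
        if self.n > 0:
            sample = min(sample, self.cap_factor * self.avg)
        self.n += 1
        if self.n <= self.n_straight:
            # Straight (arithmetic) average for the first N samples.
            self.avg += (sample - self.avg) / self.n
        else:
            # Assumed form of the exponentially-weighted average.
            self.avg += self.weight * (sample - self.avg)

    def usable(self, min_samples: int) -> bool:
        # The average is used only once enough samples have been seen;
        # min_samples is a hypothetical stand-in for the sample threshold.
        return self.n >= min_samples


# Host-style parameters suggested in the text: N=10, A=.01, X=10.
host = Average(n_straight=10, weight=0.01, cap_factor=10)
host.update(1.0)
host.update(1000.0)  # capped at 10 * 1.0 before being averaged
print(host.avg)      # 5.5
```

With the app-version parameters mentioned in the text (N=100, A=.001, X=100), the same class would react more slowly to any single sample, which fits an entity that sees thousands of jobs per day.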