Changes between Version 36 and Version 37 of CreditNew
Timestamp: May 11, 2012, 9:16:19 AM (13 years ago)
CreditNew
--- CreditNew (version 36)
+++ CreditNew (version 37)
@@ -2,4 +2,9 @@
 
 == Terminology ==
+
+* The '''runtime''' (or '''elapsed time''') of a job is the
+amount of time it runs.
+* '''FLOPs''' (lower case s) means number of floating-point operations.
+* '''FLOPS''' (upper case S) means FLOPs per second.
 
 BOINC estimates the '''peak FLOPS''' of each processor.
@@ -21,14 +26,6 @@
 * For our purposes, the peak FLOPS of a device
 is based on single or double precision, whichever is higher.
-
-== Credit system goals ==
-
-Some goals in designing a credit system:
-* Device neutrality: similar jobs should get similar credit
-regardless of what processor or GPU they run on.
-* Project neutrality: different projects should grant
-about the same amount of credit per host, averaged over all hosts.
-* Gaming-resistance: there should be a bound on the
-impact of faulty or malicious hosts.
+* BOINC's estimate of the peak FLOPS of a device may be wrong,
+e.g. because the manufacturer's formula is incomplete or wrong.
 
 == The first credit system ==
@@ -90,4 +87,5 @@
 (This means that projects with efficient GPU apps will
 grant more credit than projects with inefficient apps. That's OK).
+* Cheat-resistance.
 
 == ''A priori'' job size estimates and bounds ==
@@ -95,5 +93,5 @@
 For each job, the project supplies
 * an estimate of the FLOPs used by a job (wu.fpops_est)
-* a limit on FLOPS, after which the job will be aborted
+* a limit on FLOPs, after which the job will be aborted
 (wu.fpops_bound).
 
@@ -104,5 +102,5 @@
 Averages of FLOP count and elapsed time
 are normalized by fpops_est (see below),
-and if fpops_est is correlated with actual size,
+and if fpops_est is correlated with runtime,
 these averages will converge more quickly.
 
@@ -122,5 +120,5 @@
 based on the resources used by the job and their peak speeds.
 
-When the job is finished in elapsed time T,
+When a client finishes a job and reports its elapsed time T,
 we define peak_flop_count(J), or PFC(J) as
 
@@ -175,22 +173,17 @@
 is above a '''sample threshold'''.
 
-== Data ==
-
-We maintain the following estimates:
-
-app.min_avg_pfc:: an estimate of the average actual FLOPS for the app
-(normalized by wu.fpops_est)
-app_version.pfc_avg:: the average of PFC(J)/wu.fpops_est for an app version.
-app_version.pfc_scale:: a PFC scale factor for the app version
+== Statistics maintained by the server ==
+
+The server maintains the following statistics:
+
 host_app_version.pfc_avg:: for each app version V and host H,
 the average of PFC(J)/wu.fpops_est for jobs completed by H using A.
-host_app_version.scale_probation::
-if set, the host is suspected of cherry-picking (see below)
-and we don't use host normalization
+app_version.pfc_avg:: the average of PFC(J)/wu.fpops_est for all jobs
+completed by the app version.
 
 == Sanity check ==
 
-If PFC(J) is infinite or is > wu.fpops_bound,
-J is assigned a "default PFC" D and other processing is skipped.
+If PFC(J) is > wu.fpops_bound,
+J is assigned a "default PFC" D and it's not used to update statistics.
 D is determined as follows:
 
@@ -202,9 +195,4 @@
 
 D = wu.fpops_est
-
-We also set host_app_version.scale_probation to true
-(ensuring that the host scale factor isn't used for a while)
-and host_app_version.error_rate to an initial value
-(ensuring that jobs sent to this host are replicated for a while).
 
 == Cross-version normalization ==
@@ -243,6 +231,4 @@
 
 Notes:
-* Doesn't host normalization (see below) subsume version normalization?
-Not if there are both CPU and GPU versions, because of the "min".
 * Version normalization is only applied if at least two
 versions are above sample threshold.
@@ -260,5 +246,5 @@
 One solution is to create separate apps for separate types of jobs.
 * Cheating or erroneous hosts can influence app_version.pfc_avg to some extent.
-This is limited by the Sanity Check mechanism,
+This is limited by the "sanity check" mechanism,
 and by the fact that only validated jobs are used.
 The effect on credit will be negated by host normalization
@@ -277,4 +263,6 @@
 
 app_version.pfc_avg / host_app_version.pfc_avg
+
+This scaling is only done if both statistics are above sample threshold.
 
 There are some cases where hosts are not sent jobs uniformly:
@@ -309,5 +297,4 @@
 If app.min_avg_pfc is defined,
 host_app_version.pfc_avg is above sample threshold,
-and host_app_version.scale_probation is not set,
 we normalize PFC by the factor
 
@@ -562,16 +549,4 @@
 (from which job durations are estimated).
 
-== Job runtime estimates ==
-
-Unrelated to the credit proposal, but in a similar spirit.
-The server will maintain host_app_version.et,
-the statistics (mean and variance) of
-job runtimes (normalized by wu.fpops_est) per
-host and application version.
-
-The server's estimate of a job's runtime is then
-
-R(J, H) = wu.fpops_est * host_app_version.et.avg
-
 == Implementation ==
 
@@ -653,2 +628,3 @@
 * If we're the "main feeder" (mod = 0, or mod not used),
 update app_version.pfc_scale and app.min_avg_pfc every 10 minutes.
+
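As a rough illustration of two rules as they read in version 37 (the sanity check, and host scaling that applies only when both statistics are above sample threshold), here is a minimal Python sketch. It is not BOINC's server code. The RunningAverage class, the function names, and the threshold value of 10 are hypothetical; the page does not state the threshold or show how statistics are stored. The default PFC D is passed in by the caller because the rule for choosing it is only partly visible in this diff.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed value for illustration only; the page does not give the threshold.
SAMPLE_THRESHOLD = 10


@dataclass
class RunningAverage:
    """Hypothetical holder for one pfc_avg statistic."""
    n: int = 0        # number of samples folded in so far
    avg: float = 0.0  # mean of PFC(J) / wu.fpops_est

    def update(self, sample: float) -> None:
        """Fold one sample into the incremental mean."""
        self.n += 1
        self.avg += (sample - self.avg) / self.n

    def above_threshold(self) -> bool:
        # "Above" is read here as strictly greater than; that is an assumption.
        return self.n > SAMPLE_THRESHOLD


def sanity_check(pfc: float, fpops_bound: float,
                 default_pfc: float) -> Tuple[float, bool]:
    """Return (PFC to use, whether the job may update statistics)."""
    if pfc > fpops_bound:
        # Out-of-range job: use the default PFC D and skip statistics.
        return default_pfc, False
    return pfc, True


def host_scale(app_version: RunningAverage,
               host_app_version: RunningAverage) -> Optional[float]:
    """Return app_version.pfc_avg / host_app_version.pfc_avg, or None
    when either statistic is not above sample threshold."""
    if not (app_version.above_threshold() and host_app_version.above_threshold()):
        return None
    if host_app_version.avg <= 0:
        return None  # guard against division by zero (added for the sketch)
    return app_version.avg / host_app_version.avg
```

With these definitions, a host whose average normalized PFC is twice the app version's average gets a factor of 0.5, and a host with too few samples gets no factor at all.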