Changes between Version 29 and Version 30 of CreditNew
Timestamp: Mar 26, 2010, 12:01:12 PM
CreditNew
= A new system for runtime estimation and credit =

== Terminology ==

BOINC estimates the '''peak FLOPS''' of each processor.
For CPUs, this is measured by the Whetstone benchmark.
For GPUs, it's given by a manufacturer-supplied formula.

Other factors, such as the speed of a host's memory system,
affect application performance.
So a given job might take the same amount of CPU time
on 1 GFLOPS and 10 GFLOPS hosts.
The '''efficiency''' of an application running on a given host
is the ratio of actual FLOPS to peak FLOPS.

GPUs typically have a higher (10-100X) peak FLOPS than CPUs.
However, application efficiency is typically lower
(very roughly, 10% for GPUs, 50% for CPUs).

[…]

about the same amount of credit per host, averaged over all hosts.

* Gaming-resistance: there should be a bound on the
  impact of faulty or malicious hosts.

== The first credit system ==

In the first iteration of BOINC's credit system,
"claimed credit" C of job J on host H was defined as

 C = H.whetstone * J.cpu_time

There were then various schemes for taking the
[…]
it's based on the CPU's peak performance.

The problem with this system is that,
for a given app version, efficiency can vary widely between hosts.
In the above example, the 10 GFLOPS host would claim 10X as much credit,
and its owner would be upset when it was granted only a tenth of that.

[…]

We then switched to the philosophy that
credit should be proportional to the FLOPs actually performed
by the application.
We added API calls to let applications report this.

[…]

* Projects that can't count FLOPs still have device neutrality problems.
* It doesn't prevent credit cheating when single replication is used.

== Goals of the new (third) credit system ==

[…]

  grant more credit than projects with inefficient apps. That's OK).

== ''A priori'' job size estimates and bounds ==

Projects supply estimates of the FLOPs used by a job
(wu.rsc_fpops_est)
and a limit on FLOPs, after which the job will be aborted
(wu.rsc_fpops_bound).

Previously, inaccuracy of rsc_fpops_est caused problems.
The new system still uses rsc_fpops_est,
but its primary purpose is now to indicate the relative size of jobs.
Averages of job sizes are normalized by rsc_fpops_est,
and if rsc_fpops_est is correlated with actual size,
these averages will converge more quickly.

We'll denote workunit.rsc_fpops_est as E(J).

Notes:

* ''A posteriori'' estimates of job size may exist also,
  e.g., an iteration count reported by the app.
  They aren't cheat-proof, and we don't use them.
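To make the role of E(J) concrete, here is a minimal C++ sketch of an average that is normalized by rsc_fpops_est in the way described above. It is illustrative only: the type and field names are invented for this example, not taken from BOINC's server code.

{{{
// Sketch only: an average in which each per-job sample is divided by the
// job's a priori size estimate E(J) = wu.rsc_fpops_est before being
// accumulated. If E(J) tracks actual job size, small and large jobs then
// contribute comparable values and the mean stabilizes after fewer samples.
// The names below are invented for illustration.
#include <cstdio>

struct NormalizedAvg {
    double sum = 0;
    int n = 0;

    // sample: a per-job quantity (for example, the job's peak FLOP count)
    // est_fpops: the job's wu.rsc_fpops_est, i.e. E(J)
    void update(double sample, double est_fpops) {
        sum += sample / est_fpops;
        n++;
    }

    double avg() const { return n ? sum / n : 0; }
};

int main() {
    NormalizedAvg a;
    a.update(4.0e13, 1.0e13);   // small job:       normalized sample = 4.0
    a.update(4.1e14, 1.0e14);   // 10x larger job:  normalized sample = 4.1
    printf("E(J)-normalized average: %.2f\n", a.avg());   // prints 4.05
    return 0;
}
}}}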
== Peak FLOP Count (PFC) ==

[…]

== Cross-version normalization ==

A given application may have multiple versions
(e.g., CPU, multi-thread, and GPU versions).
If jobs are distributed uniformly to versions,
all versions should get the same average credit.

[…]

threshold, let X be the min of the averages.

If X is defined, then for each version V we set

 Scale(V) = (X/PFC^mean^(V))

An app version V's jobs are scaled by this factor.

For each app, we maintain min_avg_pfc(A),
the average PFC for the most efficient version of A.
This is an estimate of the app's average actual FLOPS.

If X is defined, then we set

 min_avg_pfc(A) = X

Otherwise, if a version V is above sample threshold, we set

 min_avg_pfc(A) = PFC^mean^(V)
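As a concrete reading of the scaling rules above, here is a short C++ sketch. It is illustrative only: the types, field names, and threshold value are assumptions rather than BOINC's actual server code, and it assumes (consistent with the "Otherwise" rule) that X is defined only when at least two versions are above the sample threshold.

{{{
// Sketch only: compute Scale(V) for each version of an app and return
// min_avg_pfc(A) if it is defined. Types, names, and the threshold value
// are assumptions made for this illustration.
#include <cstdio>
#include <optional>
#include <vector>

struct AppVersion {
    double pfc_mean;   // PFC^mean^(V): mean PFC of this version's validated jobs
    int pfc_n;         // number of samples behind that mean
    double scale;      // Scale(V), applied to this version's jobs
};

const int SAMPLE_THRESHOLD = 100;   // assumed; the text says only "a sample threshold"

std::optional<double> update_av_scales(std::vector<AppVersion>& versions) {
    std::optional<double> X;        // min of the means over well-sampled versions
    int n_above = 0;
    for (const AppVersion& v : versions) {
        if (v.pfc_n < SAMPLE_THRESHOLD) continue;
        n_above++;
        if (!X || v.pfc_mean < *X) X = v.pfc_mean;
    }
    if (n_above >= 2) {             // X is defined
        for (AppVersion& v : versions) {
            if (v.pfc_mean > 0) v.scale = *X / v.pfc_mean;   // Scale(V) = X/PFC^mean^(V)
        }
        return X;                   // min_avg_pfc(A) = X
    }
    if (n_above == 1) return X;     // min_avg_pfc(A) = PFC^mean^(V) of that version
    return std::nullopt;            // not enough data yet
}

int main() {
    // For example: a CPU version and a GPU version, both well sampled.
    std::vector<AppVersion> versions = {
        {2.0e13, 500, 1},           // CPU version: lower mean PFC (more efficient)
        {8.0e13, 500, 1},           // GPU version: higher mean PFC
    };
    if (auto m = update_av_scales(versions)) {
        printf("min_avg_pfc(A) = %.3g\n", *m);                    // 2e+13
        printf("Scale(CPU) = %.2f  Scale(GPU) = %.2f\n",
               versions[0].scale, versions[1].scale);             // 1.00  0.25
    }
    return 0;
}
}}}

Under this reading, the most efficient version (lowest PFC^mean^) keeps Scale(V) = 1 and less efficient versions are scaled down toward it, which is what makes all versions of the app grant the same average credit.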
Notes:

[…]

  then this mechanism doesn't work as intended.
  One solution is to create separate apps for separate types of jobs.
* Cheating or erroneous hosts can influence PFC^mean^(V) to some extent.
  This is limited by the Sanity Check mechanism,
  and by the fact that only validated jobs are used.

[…]

== Anonymous platform ==

For jobs done by anonymous platform apps,
the server knows the devices involved and can estimate PFC.
It maintains host_app_version records for anonymous platform,
and it keeps track of PFC and elapsed time statistics there.
There are separate records per resource type.
The app_version_id encodes the app ID and the resource type
(-2 for CPU, -3 for NVIDIA GPU, -4 for ATI).

If min_avg_pfc(A) is defined and
PFC^mean^(H, V) is above a sample threshold,
we normalize PFC by the factor

 min_avg_pfc(A)/PFC^mean^(H, V)

Otherwise the claimed PFC is

 min_avg_pfc(A)*E(J)

If min_avg_pfc(A) is not defined, the claimed PFC is

 wu.rsc_fpops_est

== Summary ==

Given a validated job J, we compute

* the "claimed PFC" F
* a flag "approx" that is true if F
  is an approximation and may not be comparable
  with other instances of the job

The algorithm:

 pfc = peak FLOP count(J)
 approx = true
 if pfc > wu.rsc_fpops_bound
     if min_avg_pfc(A) is defined
         F = min_avg_pfc(A) * E(J)
     else
         F = wu.rsc_fpops_est
 else
     if job is anonymous platform
         hav = host_app_version record
         if min_avg_pfc(A) is defined
             if hav.pfc.n > threshold
                 approx = false
                 F = min_avg_pfc(A) / hav.pfc.avg
             else
                 F = min_avg_pfc(A) * E(J)
         else
             F = wu.rsc_fpops_est
     else
         F = pfc
         if Scale(V) is defined
             F *= Scale(V)
         if Scale(H, V) is defined and (H, V) is not on scale probation
             F *= Scale(H, V)

== Claimed and granted credit ==

The claimed credit of a job (in Cobblestones) is

 C = F*200/86400e9

If replication is not used, this is the granted credit.

If replication is used,
we take the set of instances for which approx is false.
If this set is nonempty, we grant the average of their claimed credit.
Otherwise:

 if min_avg_pfc(A) is defined
     C = min_avg_pfc(A)*E(J)
 else
     C = wu.rsc_fpops_est * 200/86400e9

== Cross-project version normalization ==

If an application has both CPU and GPU versions,
the version normalization mechanism figures out
which version is most efficient and uses that to reduce
the credit granted to less-efficient versions.

If a project has an app with only a GPU version,
there's no CPU version for comparison.
If we grant credit based only on GPU peak speed,
the project will grant much more credit per GPU hour than other projects,
[…]