= New credit system design =

== Peak FLOPS and efficiency ==

BOINC estimates the peak FLOPS of each processor.
For CPUs, this is the Whetstone benchmark score.
For GPUs, it's given by a manufacturer-supplied formula.

However, other factors affect application performance.
For example, applications access memory,
and the speed of a host's memory system is not reflected
in its Whetstone score.
So a given job might take the same amount of CPU time
on a 1 GFLOPS host as on a 10 GFLOPS host.
The "efficiency" of an application running on a given host
is the ratio of actual FLOPS to peak FLOPS.

GPUs typically have a much higher (50-100X) peak FLOPS than CPUs.
However, application efficiency is typically lower
(very roughly, 10% for GPUs, 50% for CPUs).

Notes:

* The peak FLOPS of a device is single or double precision,
whichever is higher.
Differentiating between single and double would unnecessarily
complicate things, and the distinction will disappear soon anyway.


== Credit system goals ==

Some possible goals in designing a credit system:

* Device neutrality: similar jobs should get similar credit
regardless of what processor or GPU they run on.

* Project neutrality: different projects should grant
about the same amount of credit per day for a given processor.

It's easy to show that these two goals can't both be satisfied simultaneously.

== The first credit system ==

In the first iteration of BOINC's credit system,
"claimed credit" was defined as
{{{
C1 = H.whetstone * J.cpu_time
}}}
There were then various schemes for taking the
average or min claimed credit of the replicas of a job,
and using that as the "granted credit".

We call this system "Peak-FLOPS-based" because
it's based on the CPU's peak performance.
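
For concreteness, here's a minimal sketch of this system
(hypothetical names and types, not the actual BOINC code),
using the min-across-replicas granting scheme mentioned above:
{{{
#include <algorithm>
#include <vector>

struct Replica {
    double whetstone;   // host's Whetstone benchmark score
    double cpu_time;    // CPU time used by the job, in seconds
};

// claimed credit: C1 = H.whetstone * J.cpu_time
double claimed_credit(const Replica& r) {
    return r.whetstone * r.cpu_time;
}

// granted credit: the min claimed credit over a job's replicas
// (assumes at least one valid replica)
double granted_credit(const std::vector<Replica>& replicas) {
    double g = claimed_credit(replicas[0]);
    for (const auto& r : replicas) {
        g = std::min(g, claimed_credit(r));
    }
    return g;
}
}}}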

The problem with this system is that, for a given app version,
efficiency can vary widely between hosts.
In the above example,
the 10 GFLOPS host would claim 10X as much credit,
and its owner would be upset when it was granted only a tenth of that.

Furthermore, the credit granted to a given host for a
series of identical jobs could vary widely,
depending on the hosts it was paired with by replication.
This seemed arbitrary and unfair to users.

== The second credit system ==

We then switched to the philosophy that
credit should be proportional to the number of FLOPs actually performed
by the application.
We added API calls to let applications report this.
We call this approach "Actual-FLOPs-based".

SETI@home's application allowed counting of FLOPs,
and they adopted this system,
adding a scaling factor so that average credit per job
was the same as in the first credit system.

Not all projects could count FLOPs, however.
So SETI@home published their average credit per CPU second,
and other projects continued to use benchmark-based credit,
but multiplied it by a scaling factor to match SETI@home's average.

This system had several problems:

* It didn't address GPUs.
* Projects that couldn't count FLOPs still had device neutrality problems.
* It didn't prevent credit cheating when single replication was used.


== Goals of the new (third) credit system ==

* Completely automated - projects don't have to
change code, settings, etc.

* Device neutrality

* Limited project neutrality: different projects should grant
about the same amount of credit per CPU hour, averaged over hosts.
Projects with GPU apps should grant credit in proportion
to the efficiency of the apps.
(This means that projects with efficient GPU apps will
grant more credit on average. That's OK.)

== Peak FLOP Count (PFC) ==

This system uses the Peak-FLOPS-based approach,
but addresses its problems in a new way.

When a job is issued to a host, the scheduler specifies usage(J,D),
J's usage of processing resource D:
how many CPUs and how many GPUs (possibly fractional).

If the job is finished in elapsed time T,
we define peak_flop_count(J), or PFC(J), as
{{{
PFC(J) = T * sum over devices D (usage(J, D) * peak_flop_rate(D))
}}}

Notes:

* We use elapsed time instead of actual device time (e.g., CPU time).
If a job uses a resource inefficiently
(e.g., a CPU job that does lots of disk I/O),
PFC() won't reflect this. That's OK.
The key thing is that BOINC reserved the device for the job,
whether or not the job used it efficiently.
* usage(J,D) may not be accurate; e.g., a GPU job may take
more or less CPU time than the scheduler thinks it will.
Eventually we may switch to a scheme where the client
dynamically determines the CPU usage.
For now, though, we'll just use the scheduler's estimate.
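
A minimal sketch of this computation (illustrative names only):
{{{
#include <vector>

struct DeviceUsage {
    double usage;            // usage(J, D); possibly fractional
    double peak_flop_rate;   // peak_flop_rate(D)
};

// PFC(J) = T * sum over devices D (usage(J, D) * peak_flop_rate(D))
double peak_flop_count(double elapsed_time,
                       const std::vector<DeviceUsage>& devices) {
    double peak_rate = 0;
    for (const auto& d : devices) {
        peak_rate += d.usage * d.peak_flop_rate;
    }
    return elapsed_time * peak_rate;
}
}}}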

The granted credit for a job J is proportional to PFC(J),
but is normalized in the following ways:

== Cross-version normalization ==

If a given application has multiple versions (e.g., CPU and GPU versions),
the granted credit per job is adjusted
so that the average is the same for each version.
The adjustment is always downwards:
we maintain the average PFC^mean^(V) of PFC() for each app version V,
and let X be the minimum of these averages.
An app version V's jobs are then scaled by the factor

S(V) = (X/PFC^mean^(V))

The result for a given job J
is called "Version-Normalized Peak FLOP Count", or VNPFC(J):

VNPFC(J) = PFC(J) * (X/PFC^mean^(V))

Notes:
* This addresses the common situation
where an app's GPU version is much less efficient than the CPU version
(i.e. the ratio of actual FLOPs to peak FLOPs is much less).
To a certain extent, this mechanism shifts the system
towards the "Actual FLOPs" philosophy,
since credit is granted based on the most efficient app version.
It's not exactly "Actual FLOPs", since the most efficient
version may not be 100% efficient.
* There are two sources of variance in PFC(V):
the variation in host efficiency,
and possibly the variation in job size.
If we have an ''a priori'' estimate of job size
(e.g., workunit.rsc_fpops_est)
we can normalize by this to reduce the variance,
and make PFC^mean^(V) converge more quickly.
* ''a posteriori'' estimates of job size may also exist
(e.g., an iteration count reported by the app),
but using these for anything introduces a new cheating risk,
so it's probably better not to.
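
A minimal sketch of the version normalization, assuming the per-version
means PFC^mean^(V) are already maintained (illustrative names only):
{{{
#include <algorithm>
#include <vector>

// returns S(V) = X / PFC_mean(V) for each version, in input order,
// where X is the minimum of the per-version means
// (assumes at least one version)
std::vector<double> version_scales(const std::vector<double>& pfc_mean) {
    double x = *std::min_element(pfc_mean.begin(), pfc_mean.end());
    std::vector<double> scales;
    for (double m : pfc_mean) {
        scales.push_back(x / m);   // always <= 1: adjustment is downwards
    }
    return scales;
}

// VNPFC(J) = PFC(J) * S(V)
double vnpfc(double pfc_j, double s_v) {
    return pfc_j * s_v;
}
}}}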

== Cross-project normalization ==

If an application has both CPU and GPU versions,
then the version normalization mechanism uses the CPU
version as a "sanity check" to limit the credit granted to GPU jobs.

Suppose a project has an app with only a GPU version,
so there's no CPU version to act as a sanity check.
If we grant credit based only on GPU peak speed,
the project will grant much more credit per GPU hour than other projects,
violating limited project neutrality.

A solution to this: if an app has only GPU versions,
then for each version V we let
S(V) be the average scaling factor
for that GPU type among projects that do have both CPU and GPU versions.
This factor is obtained from a central BOINC server.
V's jobs are then scaled by S(V) as above.

Notes:

* Projects will run a periodic script to update the scaling factors.
* Rather than GPU type, we'll probably use plan class,
since e.g. the average efficiency of CUDA 2.3 apps may differ
from that of CUDA 2.1 apps.
* Initially we'll obtain scaling factors from large projects
that have both GPU and CPU apps (e.g., SETI@home).
Eventually we'll use an average (weighted by work done) over multiple projects
(see below).
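
As a sketch, the work-weighted average of published per-project scaling
factors might look like this (hypothetical names; see "Cross-project
scaling factors" below for how the data would be collected):
{{{
#include <vector>

struct ProjectSample {
    double scale;       // a project's S(V) for this plan class
    double work_done;   // weight: work done under that app version
};

// work-weighted average scaling factor across projects
double cross_project_scale(const std::vector<ProjectSample>& samples) {
    double num = 0, denom = 0;
    for (const auto& s : samples) {
        num += s.scale * s.work_done;
        denom += s.work_done;
    }
    return denom > 0 ? num / denom : 1;   // 1 = no scaling if no data
}
}}}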

== Host normalization ==

If hosts are sent jobs for a given app uniformly,
then for that app
hosts should get the same average granted credit per job.
To ensure this, for each application A we maintain the average VNPFC^mean^(A),
and for each host H we maintain VNPFC^mean^(H, A).
The '''claimed FLOPS''' for a given job J is then

F = VNPFC(J) * (VNPFC^mean^(A)/VNPFC^mean^(H, A))

and the claimed credit (in Cobblestones) is

C = F*100/86400e9

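A minimal sketch of this step (illustrative names only):
{{{
// claimed FLOPs: F = VNPFC(J) * (VNPFC_mean(A) / VNPFC_mean(H, A)).
// A host that is less efficient than average has a larger
// VNPFC_mean(H, A), so its claims are scaled down, and vice versa.
double claimed_flops(double vnpfc_j, double app_mean, double host_app_mean) {
    return vnpfc_j * (app_mean / host_app_mean);
}

// claimed credit in Cobblestones: 100 units per day at 1 GFLOPS
double claimed_credit(double flops) {
    return flops * 100 / 86400e9;
}
}}}
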
There are some cases where hosts are not sent jobs uniformly:
* job-size matching (smaller jobs sent to slower hosts)
* GPUGrid.net's scheme for sending some (presumably larger)
jobs to GPUs with more processors.
In these cases average credit per job must differ between hosts,
according to the types of jobs that are sent to them.

This can be handled by dividing
each sample in the computation of VNPFC^mean^ by wu.rsc_fpops_est
(in fact, there's no reason not to always do this).

Notes:
* The host normalization mechanism reduces the claimed credit of hosts
that are less efficient than average,
and increases the claimed credit of hosts that are more efficient
than average.
* VNPFC^mean^ is averaged over jobs, not hosts.

== Computing averages ==

We need to compute averages carefully because

* The quantities being averaged may gradually change over time
(e.g. average job size may change,
app version efficiency may change as new versions are deployed)
and we need to track this.
* A given sample may be wildly off,
and we can't let this mess up the average.
* Averages should be weighted by job size.

In addition, we may as well maintain the variance of the quantities,
although the current system doesn't use it.

The code that does all this is
[http://boinc.berkeley.edu/trac/browser/trunk/boinc/lib/average.h here].
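
In this spirit, here's a minimal sketch of such an average (illustrative
only; the real implementation is in the average.h file linked above).
It tracks gradual change with an exponential moving average and clamps
wildly-off samples; job-size weighting and variance are omitted for brevity:
{{{
#include <algorithm>

struct RunningAverage {
    double avg = 0;
    int n = 0;

    void update(double sample, double alpha = 0.01) {
        if (n > 0 && avg > 0) {
            // don't let one wild sample mess up the average
            sample = std::min(sample, 10 * avg);
        }
        n++;
        if (n == 1) {
            avg = sample;   // first sample initializes the average
        } else {
            avg += alpha * (sample - avg);   // exponential moving average
        }
    }
};
}}}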

== Cross-project scaling factors ==

We'll have a script that publishes a project's
accounting data (see Implementation).
The BOINC web site will collect these from a few big projects
and publish the averages.

== Replication and cheating ==

Host normalization mostly eliminates the incentive to cheat
by claiming excessive credit
(i.e., by falsifying benchmark scores or elapsed time).
An exaggerated claim will increase VNPFC^mean^(H, A),
causing subsequent claimed credit to be scaled down proportionately.
This means that no special cheat-prevention scheme
is needed for single replication;
granted credit = claimed credit.

For jobs that are replicated, granted credit should be
set to the min of the valid results
(min is used instead of average to remove the incentive
for cherry-picking; see below).

However, there are still some possible forms of cheating.

* One-time cheats (like claiming 1e304) can be prevented by
capping VNPFC(J) at some multiple (say, 10) of VNPFC^mean^(A).
* Cherry-picking: suppose an application has two types of jobs,
which run for 1 second and 1 hour respectively.
Clients can figure out which is which, e.g. by running a job for 2 seconds
and seeing if it has exited.
Suppose a client systematically refuses the 1-hour jobs
(e.g., by reporting a crash or never reporting them).
Its VNPFC^mean^(H, A) will quickly decrease,
and soon it will be getting several thousand times more credit
per unit of actual work than other hosts!
Countermeasure:
whenever a job errors out, times out, or fails to validate,
set the host's error rate back to the initial default,
and set its VNPFC^mean^(H, A) to VNPFC^mean^(A) for all apps A,
as sketched after this list.
This puts the host in a state where several dozen of its
subsequent jobs will be replicated.
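
A minimal sketch of these two countermeasures (illustrative names only):
{{{
#include <algorithm>

struct HostAppVersion {
    double error_rate;
    double vnpfc_mean;   // VNPFC_mean(H, A)
};

// cap one-time cheats: limit VNPFC(J) to a multiple of VNPFC_mean(A)
double capped_vnpfc(double vnpfc_j, double app_mean, double cap = 10) {
    return std::min(vnpfc_j, cap * app_mean);
}

// on error, timeout, or validation failure: reset the host's stats so
// that several dozen of its subsequent jobs will be replicated
void punish_host(HostAppVersion& hav, double app_mean,
                 double initial_error_rate) {
    hav.error_rate = initial_error_rate;
    hav.vnpfc_mean = app_mean;
}
}}}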

== Trickle credit ==

== Job runtime estimates ==

Unrelated to the credit proposal, but in a similar spirit.
The server will maintain ET^mean^(H, V), the statistics of
job runtimes (normalized by wu.rsc_fpops_est) per
host and application version.

The server's estimate of a job's runtime is then

R(J, H) = wu.rsc_fpops_est * ET^mean^(H, V)

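Since ET^mean^ is kept per unit of wu.rsc_fpops_est, the estimate is just
a product; a one-line sketch (illustrative names only):
{{{
// et_mean: mean elapsed seconds per unit of wu.rsc_fpops_est
double estimated_runtime(double rsc_fpops_est, double et_mean) {
    return rsc_fpops_est * et_mean;   // R(J, H), in seconds
}
}}}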

== Error rate, host punishment, and turnaround time estimation ==

Unrelated to the credit proposal, but in a similar spirit.

Due to hardware problems (e.g. a malfunctioning GPU)
a host may have a 100% error rate for one app version
and a 0% error rate for another.
The same goes for turnaround time.

The host punishment mechanism is designed to
deal with malfunctioning hosts.
For each host the server maintains '''max_results_day'''.
This is initialized to a project-specified value (e.g. 200)
and scaled by the number of CPUs and/or GPUs.
It's decremented if the client reports a crash
(but not if the job was aborted).
It's doubled when a successful (but not necessarily valid)
result is received.
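
A minimal sketch of these updates (illustrative names; capping the doubling
at the scaled project default is an assumption, since unbounded doubling
wouldn't be useful):
{{{
struct HostAppVersion {
    double max_results_day;
};

// client reported a crash (not an abort)
void on_crash(HostAppVersion& hav) {
    if (hav.max_results_day > 1) {
        hav.max_results_day -= 1;
    }
}

// a successful (not necessarily valid) result was received
void on_success(HostAppVersion& hav, double scaled_default) {
    hav.max_results_day *= 2;
    if (hav.max_results_day > scaled_default) {
        hav.max_results_day = scaled_default;   // assumed cap
    }
}
}}}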

So we'll move the "error_rate", "turnaround_time",
and "max_results_day" fields from the host table to host_app_version.

== Cherry picking ==

Suppose an application has a mix of long and short jobs.
If a client intentionally discards
(or aborts, or reports errors from) the long jobs,
but completes the short jobs,
its host scaling factor will become large,
and it will get excessive credit for the short jobs.
This is called "cherry picking".

Note: the host punishment mechanism
doesn't deal effectively with cherry picking.

We propose the following mechanism to deal with cherry picking:

* For each (host, app version), maintain "host_scale_time".
This is the earliest time at which host scaling will be applied.
This is set initially, and each time a job times out
or errors out, to now+X, where X is the app's delay bound.

The idea is to apply the host scaling factor
only if there's solid evidence that the host is NOT cherry picking.
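
A minimal sketch of this mechanism (illustrative names only):
{{{
#include <ctime>

struct HostAppVersion {
    std::time_t host_scale_time;   // earliest time at which scaling applies
};

// called initially, and whenever one of the host's jobs
// times out or errors out
void defer_host_scaling(HostAppVersion& hav, double delay_bound) {
    hav.host_scale_time = std::time(nullptr) + (std::time_t)delay_bound;
}

// apply host scaling only if the host has gone a full delay bound
// without a timeout or error; i.e., no evidence of cherry picking
bool use_host_scaling(const HostAppVersion& hav) {
    return std::time(nullptr) >= hav.host_scale_time;
}
}}}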

== Implementation ==

=== Database changes ===

New table '''host_app''':
{{{
int host_id;
int app_id;
int vnpfc_n;
double vnpfc_sum;
double vnpfc_exp_avg;
}}}

New table '''host_app_version''':
{{{
int host_id;
int app_version_id;
int et_n;
double et_sum;
double et_exp_avg;
// some variable for recent error rate,
// replacing host.error_rate and host.max_results_day;
// make sure that a host with 1 good and 1 bad GPU gets few GPU jobs
}}}

New fields in '''app_version''':
{{{