= New credit system design =

== Peak FLOPS and efficiency ==

BOINC estimates the peak FLOPS of each processor.
For CPUs, this is the Whetstone benchmark score.
For GPUs, it's given by a manufacturer-supplied formula.

However, other factors affect application performance.
For example, applications access memory,
and the speed of a host's memory system is not reflected
in its Whetstone score.
So a given job might take the same amount of CPU time
on a 1 GFLOPS host as on a 10 GFLOPS host.
The "efficiency" of an application running on a given host
is the ratio of actual FLOPS to peak FLOPS.

GPUs typically have a much higher (50-100X) peak FLOPS than CPUs.
However, application efficiency is typically lower
(very roughly, 10% for GPUs, 50% for CPUs).

Notes:

* The peak FLOPS of a device is its single- or double-precision rate,
whichever is higher.
Differentiating between single and double precision would unnecessarily
complicate things, and the distinction will disappear soon anyway.

== Credit system goals ==

Some possible goals in designing a credit system:

* Device neutrality: similar jobs should get similar credit
regardless of what processor or GPU they run on.

* Project neutrality: different projects should grant
about the same amount of credit per day for a given processor.

It's easy to show that these two goals can't both be satisfied simultaneously.

== The first credit system ==

In the first iteration of BOINC's credit system,
"claimed credit" was defined as
{{{
C1 = H.whetstone * J.cpu_time
}}}
There were then various schemes for taking the
average or minimum claimed credit of the replicas of a job,
and using that as the "granted credit".
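
As an illustrative sketch (function names are assumptions, and min-over-replicas is just one of the several granting schemes the text mentions), the first system can be expressed as:

```python
# Sketch of the first (Peak-FLOPS-based) credit system.
# Names and the min-over-replicas granting scheme are illustrative.

def claimed_credit_v1(whetstone_gflops: float, cpu_time_secs: float) -> float:
    """Claimed credit: C1 = H.whetstone * J.cpu_time."""
    return whetstone_gflops * cpu_time_secs

def granted_credit_v1(replica_claims: list[float]) -> float:
    """One possible scheme: grant the minimum claim among a job's replicas."""
    return min(replica_claims)

# A 10 GFLOPS host and a 1 GFLOPS host finishing the same job in the
# same CPU time claim amounts differing by 10x:
fast = claimed_credit_v1(10.0, 3600)       # 36000.0
slow = claimed_credit_v1(1.0, 3600)        # 3600.0
grant = granted_credit_v1([fast, slow])    # both hosts are granted 3600.0
```

This makes the unfairness described below concrete: the fast host's claim is ten times what it is eventually granted.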

We call this system "Peak-FLOPS-based" because
it's based on the CPU's peak performance.

The problem with this system is that, for a given app version,
efficiency can vary widely between hosts.
In the above example,
the 10 GFLOPS host would claim 10X as much credit,
and its owner would be upset when it was granted only a tenth of that.

Furthermore, the credit granted to a given host for a
series of identical jobs could vary widely,
depending on the hosts it was paired with by replication.
This seemed arbitrary and unfair to users.

== The second credit system ==

We then switched to the philosophy that
credit should be proportional to the number of FLOPs actually performed
by the application.
We added API calls to let applications report this.
We call this approach "Actual-FLOPs-based".

SETI@home's application allowed counting of FLOPs,
and they adopted this system,
adding a scaling factor so that average credit per job
was the same as under the first credit system.

Not all projects could count FLOPs, however.
So SETI@home published their average credit per CPU second,
and other projects continued to use benchmark-based credit,
but multiplied it by a scaling factor to match SETI@home's average.

This system had several problems:

* It didn't address GPUs.
* Projects that couldn't count FLOPs still had device neutrality problems.
* It didn't prevent credit cheating when single replication was used.


== Goals of the new (third) credit system ==

* Completely automated - projects don't have to
change code, settings, etc.

* Device neutrality

* Limited project neutrality: different projects should grant
about the same amount of credit per CPU hour, averaged over hosts.
Projects with GPU apps should grant credit in proportion
to the efficiency of the apps.
(This means that projects with efficient GPU apps will
grant more credit on average. That's OK.)

== Peak FLOP Count (PFC) ==

This system uses the Peak-FLOPS-based approach,
but addresses its problems in a new way.

When a job is issued to a host, the scheduler specifies usage(J,D),
J's usage of processing resource D:
how many CPUs and how many GPUs (possibly fractional).

If the job finishes in elapsed time T,
we define peak_flop_count(J), or PFC(J), as
{{{
PFC(J) = T * (sum over devices D of usage(J, D) * peak_flop_rate(D))
}}}

Notes:

* We use elapsed time instead of actual device time (e.g., CPU time).
If a job uses a resource inefficiently
(e.g., a CPU job that does lots of disk I/O),
PFC() won't reflect this. That's OK.
The key thing is that BOINC reserved the device for the job,
whether or not the job used it efficiently.
* usage(J,D) may not be accurate; e.g., a GPU job may take
more or less CPU than the scheduler thinks it will.
Eventually we may switch to a scheme where the client
dynamically determines the CPU usage.
For now, though, we'll just use the scheduler's estimate.
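
A minimal sketch of this definition (the dictionary-based signature is illustrative, not BOINC's actual C++ API):

```python
# Sketch of peak_flop_count(J). Field names are illustrative.

def peak_flop_count(elapsed_secs: float,
                    usage: dict[str, float],
                    peak_flop_rate: dict[str, float]) -> float:
    """PFC(J) = T * sum over devices D of usage(J, D) * peak_flop_rate(D).

    Elapsed (wall-clock) time is used, not device time: BOINC reserved
    the devices for the job whether or not it used them efficiently.
    """
    return elapsed_secs * sum(usage[d] * peak_flop_rate[d] for d in usage)

# A job using 0.5 CPU plus 1 GPU for 1000 seconds:
pfc = peak_flop_count(
    1000.0,
    {"cpu": 0.5, "gpu": 1.0},
    {"cpu": 4e9, "gpu": 200e9},   # peak FLOPS per device instance
)
# pfc = 1000 * (0.5 * 4e9 + 1.0 * 200e9) = 2.02e14
```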

The granted credit for a job J is proportional to PFC(J),
but is normalized in the following ways:

== Cross-version normalization ==

If a given application has multiple versions (e.g., CPU and GPU versions),
the granted credit per job is adjusted
so that the average is the same for each version.
The adjustment is always downwards:
we maintain the average PFC^mean^(V) of PFC() for each app version V,
and find the minimum X across versions.
An app version V's jobs are then scaled by the factor

 S(V) = (X/PFC^mean^(V))

The result for a given job J
is called "Version-Normalized Peak FLOP Count", or VNPFC(J):

 VNPFC(J) = PFC(J) * (X/PFC^mean^(V))

Notes:
* This addresses the common situation
where an app's GPU version is much less efficient than the CPU version
(i.e., the ratio of actual FLOPs to peak FLOPs is much lower).
To a certain extent, this mechanism shifts the system
towards the "Actual FLOPs" philosophy,
since credit is granted based on the most efficient app version.
It's not exactly "Actual FLOPs", since the most efficient
version may not be 100% efficient.
* There are two sources of variance in PFC(V):
the variation in host efficiency,
and possibly the variation in job size.
If we have an ''a priori'' estimate of job size
(e.g., workunit.rsc_fpops_est),
we can normalize by this to reduce the variance,
and make PFC^mean^(V) converge more quickly.
* ''A posteriori'' estimates of job size may exist also
(e.g., an iteration count reported by the app),
but using these for anything introduces a new cheating risk,
so it's probably better not to.
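
The scaling rule above can be sketched as follows (names are illustrative):

```python
# Sketch of cross-version normalization. Names are illustrative.

def version_scales(pfc_mean: dict[str, float]) -> dict[str, float]:
    """S(V) = X / PFC_mean(V), where X is the minimum mean across versions.

    The adjustment is always downwards: the most efficient version
    gets S(V) = 1, and all others get S(V) < 1.
    """
    x = min(pfc_mean.values())
    return {v: x / m for v, m in pfc_mean.items()}

def vnpfc(pfc_j: float, scale_v: float) -> float:
    """Version-Normalized Peak FLOP Count: VNPFC(J) = PFC(J) * S(V)."""
    return pfc_j * scale_v

# A GPU version whose mean PFC is 8x the CPU version's (i.e., it is
# much less efficient relative to peak) has its jobs scaled down:
scales = version_scales({"cpu": 1e13, "cuda": 8e13})
# scales == {"cpu": 1.0, "cuda": 0.125}
```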
== Cross-project normalization ==

If an application has both CPU and GPU versions,
the version normalization mechanism uses the CPU
version as a "sanity check" to limit the credit granted to GPU jobs.

Suppose a project has an app with only a GPU version,
so there's no CPU version to act as a sanity check.
If we grant credit based only on GPU peak speed,
the project will grant much more credit per GPU hour than other projects,
violating limited project neutrality.

A solution to this: if an app has only GPU versions,
then for each version V we let
S(V) be the average scaling factor
for that GPU type among projects that have both CPU and GPU versions.
This factor is obtained from a central BOINC server.
V's jobs are then scaled by S(V) as above.

Notes:

* Projects will run a periodic script to update the scaling factors.
* Rather than GPU type, we'll probably use plan class,
since e.g. the average efficiency of CUDA 2.3 apps may differ
from that of CUDA 2.1 apps.
* Initially we'll obtain scaling factors from large projects
that have both GPU and CPU apps (e.g., SETI@home).
Eventually we'll use an average (weighted by work done) over multiple projects
(see below).

== Host normalization ==

Assuming that hosts are sent jobs for a given app uniformly,
then, for that app,
hosts should get the same average granted credit per job.
To ensure this, for each application A we maintain the average VNPFC^mean^(A),
and for each host H we maintain VNPFC^mean^(H, A).
The '''claimed FLOPS''' for a given job J is then

 F = VNPFC(J) * (VNPFC^mean^(A)/VNPFC^mean^(H, A))

and the claimed credit (in Cobblestones) is

 C = F*100/86400e9

There are some cases where hosts are not sent jobs uniformly:
* job-size matching (smaller jobs are sent to slower hosts)
* GPUGrid.net's scheme of sending some (presumably larger)
jobs to GPUs with more processors.
In these cases average credit per job must differ between hosts,
according to the types of jobs that are sent to them.

This can be done by dividing
each sample in the computation of VNPFC^mean^ by WU.rsc_fpops_est
(in fact, there's no reason not to always do this).

Notes:
* The host normalization mechanism reduces the claimed credit of hosts
that are less efficient than average,
and increases the claimed credit of hosts that are more efficient
than average.
* VNPFC^mean^ is averaged over jobs, not hosts.
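
The two formulas above can be sketched as follows (names are illustrative; the constant 100/86400e9 converts FLOPs to Cobblestones at 100 credits per day of sustained 1 GFLOPS):

```python
# Sketch of host normalization and the Cobblestone conversion.
# Names are illustrative.

COBBLESTONES_PER_GFLOP_DAY = 100.0
SECS_PER_DAY = 86400.0

def claimed_flops(vnpfc_j: float, app_mean: float,
                  host_app_mean: float) -> float:
    """F = VNPFC(J) * (VNPFC_mean(A) / VNPFC_mean(H, A))."""
    return vnpfc_j * (app_mean / host_app_mean)

def claimed_credit(flops: float) -> float:
    """C = F * 100 / 86400e9."""
    return flops * COBBLESTONES_PER_GFLOP_DAY / (SECS_PER_DAY * 1e9)

# A host whose mean VNPFC is twice the app average (i.e., half as
# efficient as the average host) has its claim halved:
f = claimed_flops(2e13, app_mean=1e13, host_app_mean=2e13)   # 1e13
c = claimed_credit(f)
```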

== Computing averages ==

We need to compute averages carefully because:

* The quantities being averaged may gradually change over time
(e.g., average job size may change, and
app version efficiency may change as new versions are deployed),
and we need to track this.
* A given sample may be wildly off,
and we can't let this mess up the average.
* Averages should be weighted by job size.

In addition, we may as well maintain the variance of these quantities,
although the current system doesn't use it.

The code that does all this is
[http://boinc.berkeley.edu/trac/browser/trunk/boinc/lib/average.h here].
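
A toy sketch with these properties (the real code in lib/average.h differs in detail; the alpha and cap-factor values here are arbitrary assumptions):

```python
# Sketch of a running average with the properties listed above:
# exponential weighting to track gradual change, a cap on wildly-off
# samples, and per-sample weights (e.g., by job size).

class RunningAverage:
    def __init__(self, alpha: float = 0.01, cap_factor: float = 10.0):
        self.alpha = alpha            # responsiveness to new samples
        self.cap_factor = cap_factor  # max allowed sample/average ratio
        self.n = 0
        self.avg = 0.0

    def update(self, sample: float, weight: float = 1.0) -> None:
        """Fold in one sample, optionally weighted by job size."""
        if self.n > 0 and self.avg > 0:
            # Clamp outliers so one bad sample can't wreck the average.
            sample = min(sample, self.cap_factor * self.avg)
        self.n += 1
        if self.n == 1:
            self.avg = sample
        else:
            a = min(1.0, self.alpha * weight)
            self.avg += a * (sample - self.avg)

ra = RunningAverage()
ra.update(100.0)
ra.update(1e9)   # clamped to 10 * 100 = 1000 before averaging
# ra.avg is 100 + 0.01 * (1000 - 100) = 109.0, not a billion
```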

== Cross-project scaling factors ==

We'll have a script that publishes a project's
accounting data (see Implementation).
The BOINC web site will collect these data from a few big projects
and publish the averages.

== Replication and cheating ==

Host normalization mostly eliminates the incentive to cheat
by claiming excessive credit
(i.e., by falsifying benchmark scores or elapsed time).
An exaggerated claim will increase VNPFC^mean^(H, A),
causing subsequent claimed credit to be scaled down proportionately.
This means that no special cheat-prevention scheme
is needed for single replication;
granted credit = claimed credit.

For jobs that are replicated, granted credit should be
set to the minimum of the valid results
(the minimum is used instead of the average to remove the incentive
for cherry-picking; see below).

However, there are still some possible forms of cheating:

* One-time cheats (like claiming 1e304) can be prevented by
capping VNPFC(J) at some multiple (say, 10) of VNPFC^mean^(A).
* Cherry-picking: suppose an application has two types of jobs,
which run for 1 second and 1 hour respectively.
Clients can figure out which is which, e.g. by running a job for 2 seconds
and seeing if it has exited.
Suppose a client systematically refuses the 1-hour jobs
(e.g., by reporting a crash or never reporting them).
Its VNPFC^mean^(H, A) will quickly decrease,
and soon it will be getting several thousand times more credit
per unit of actual work than other hosts!
Countermeasure:
whenever a job errors out, times out, or fails to validate,
set the host's error rate back to the initial default,
and set its VNPFC^mean^(H, A) to VNPFC^mean^(A) for all apps A.
This puts the host in a state where several dozen of its
subsequent jobs will be replicated.

== Trickle credit ==

CPDN breaks jobs into segments,
sends a trickle-up message for each segment,
and grants credit for each completed segment.
In this case,
the trickle message handlers should not grant a fixed amount of credit.
Instead, the trickle-up messages should contain
an "incremental elapsed time" field.

== Job runtime estimates ==

This is unrelated to the credit proposal, but in a similar spirit.
The server will maintain ET^mean^(H, V), statistics of
job runtimes (normalized by wu.rsc_fpops_est) per
host and application version.

The server's estimate of a job's runtime is then

 R(J, H) = wu.rsc_fpops_est * ET^mean^(H, V)
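
A sketch of this estimate (assuming ET^mean^(H, V) is kept in seconds per estimated FLOP, so the product comes out in seconds; names are illustrative):

```python
# Sketch of the per-(host, app version) runtime estimate.
# Names and units are illustrative assumptions.

def estimated_runtime(rsc_fpops_est: float, et_mean_h_v: float) -> float:
    """R(J, H) = wu.rsc_fpops_est * ET_mean(H, V).

    et_mean_h_v: mean elapsed seconds per estimated FLOP for host H
    running app version V.
    """
    return rsc_fpops_est * et_mean_h_v

# A 1e12-FLOP job on a host averaging 2e-10 s per estimated FLOP:
r = estimated_runtime(1e12, 2e-10)   # 200.0 seconds
```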

== Error rate, host punishment, and turnaround time estimation ==

This is unrelated to the credit proposal, but in a similar spirit.

Due to hardware problems (e.g., a malfunctioning GPU),
a host may have a 100% error rate for one app version
and a 0% error rate for another.
The same holds for turnaround time.

So we'll move the "error_rate" and "turnaround_time"
fields from the host table to host_app_version.

The host punishment mechanism is designed to deal with malfunctioning hosts.
For each host the server maintains '''max_results_day'''.
This is initialized to a project-specified value (e.g., 200)
and scaled by the number of CPUs and/or GPUs.
It's decremented if the client reports a crash
(but not if the job was aborted).
It's doubled when a successful (but not necessarily valid)
result is received.

This should also be per-app-version,
so we'll move "max_results_day" from the host table to host_app_version.
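
A sketch of this quota logic (names are illustrative, and capping the doubling at the initial default is an assumption; the text doesn't state an upper bound):

```python
# Sketch of the per-(host, app version) max_results_day mechanism.
# The cap at the initial default on doubling is an assumption.

class HostAppVersionQuota:
    def __init__(self, project_default: int = 200, ndevices: int = 1):
        # Initialized to a project-specified value, scaled by device count.
        self.default = project_default * ndevices
        self.max_results_day = self.default

    def on_crash(self) -> None:
        """Decrement on a reported crash (but not on a user abort)."""
        self.max_results_day = max(1, self.max_results_day - 1)

    def on_success(self) -> None:
        """Double on a successful (not necessarily valid) result."""
        self.max_results_day = min(self.default, self.max_results_day * 2)
```

A malfunctioning device thus quickly throttles its own daily quota, while a single success restores most of it.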

== Cherry picking ==

Suppose an application has a mix of long and short jobs.
If a client intentionally discards
(or aborts, or reports errors from) the long jobs,
but completes the short jobs,
its host scaling factor will become large,
and it will get excessive credit for the short jobs.
This is called "cherry picking".

The host punishment mechanism described above
doesn't deal effectively with cherry picking.

We propose the following mechanism to deal with cherry picking:

* For each (host, app version), maintain "host_scale_time".
This is the earliest time at which host scaling will be applied.
* For each (host, app version), maintain "scale_probation"
(initially true).
* When sending a job to a host,
if scale_probation is true,
set host_scale_time to now+X, where X is the app's delay bound.
* When a job is successfully validated
and now > host_scale_time,
set scale_probation to false.
* If a job times out or errors out,
set scale_probation to true,
max the scale factor with 1,
and set host_scale_time to now+X.
* When computing claimed credit for a job,
if now < host_scale_time, don't use the host scale factor.

The idea is to apply the host scaling factor
only if there's solid evidence that the host is NOT cherry picking.
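
The rules above can be sketched as a small state machine (names are illustrative, and the "max the scale factor with 1" step follows the text literally):

```python
# Sketch of the cherry-picking probation rules. Names are illustrative.

class HostScaleState:
    def __init__(self, delay_bound_secs: float):
        self.delay_bound = delay_bound_secs
        self.scale_probation = True   # initially true
        self.host_scale_time = 0.0    # earliest time scaling applies
        self.scale_factor = 1.0       # host scale factor

    def on_send_job(self, now: float) -> None:
        # When sending a job, extend the window if on probation.
        if self.scale_probation:
            self.host_scale_time = now + self.delay_bound

    def on_validated(self, now: float) -> None:
        # A validated job after the window ends probation.
        if now > self.host_scale_time:
            self.scale_probation = False

    def on_failure(self, now: float) -> None:
        # A timeout or error restarts probation;
        # "max the scale factor with 1" per the text.
        self.scale_probation = True
        self.scale_factor = max(self.scale_factor, 1.0)
        self.host_scale_time = now + self.delay_bound

    def effective_scale(self, now: float) -> float:
        # Don't use the host scale factor inside the probation window.
        return self.scale_factor if now >= self.host_scale_time else 1.0
```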

Because this mechanism is punitive to hosts
that experience actual failures,
we'll make it selectable on a per-application basis (default off).

In addition, to limit the extent of cheating
(in case the above mechanism is somehow defeated),
the host scaling factor will be capped at a
project-wide config parameter (default, say, 3).

== Implementation ==

=== Database changes ===

New table '''host_app''':
{{{
int host_id;
int app_id;
int vnpfc_n;
double vnpfc_sum;
double vnpfc_exp_avg;
}}}

New table '''host_app_version''':
{{{
int host_id;
int app_version_id;
int et_n;
double et_sum;
double et_exp_avg;
// some variable for recent error rate,
// replacing host.error_rate and host.max_results_day
// make sure that a host w/ 1 good and 1 bad GPU gets few GPU jobs
}}}

New fields in '''app_version''':
{{{