= New credit system design =

== Peak FLOPS and efficiency ==

BOINC estimates the peak FLOPS of each processor.
For CPUs, this is the Whetstone benchmark score.
For GPUs, it's given by a manufacturer-supplied formula.

However, other factors affect application performance.
For example, applications access memory,
and the speed of a host's memory system is not reflected
in its Whetstone score.
So a given job might take the same amount of CPU time
on a 1 GFLOPS host as on a 10 GFLOPS host.
The "efficiency" of an application running on a given host
is the ratio of actual FLOPS to peak FLOPS.

GPUs typically have a much higher (50-100X) peak FLOPS than CPUs.
However, application efficiency is typically lower
(very roughly, 10% for GPUs, 50% for CPUs).
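
To make the definition concrete, here is a minimal C++ sketch of how efficiency relates actual to peak FLOPS; the names and numbers are illustrative, not BOINC code:
{{{
// Hypothetical example: efficiency = actual FLOPS / peak FLOPS.
#include <cstdio>

int main() {
    double peak_flops   = 10e9;  // 10 GFLOPS peak (e.g. Whetstone-derived)
    double actual_flops = 1e9;   // what the app actually achieves on this host

    double efficiency = actual_flops / peak_flops;
    printf("efficiency = %.0f%%\n", efficiency * 100);  // prints "efficiency = 10%"
    return 0;
}
}}}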

Notes:

* The peak FLOPS of a device is single or double precision,
whichever is higher.
Differentiating between single and double would unnecessarily
complicate things, and the distinction will disappear soon anyway.

== Credit system goals ==

Some possible goals in designing a credit system:

* Device neutrality: similar jobs should get similar credit
regardless of what processor or GPU they run on.

* Project neutrality: different projects should grant
about the same amount of credit per day for a given processor.

It's easy to show that both goals can't be satisfied simultaneously.

== The first credit system ==

In the first iteration of BOINC's credit system,
"claimed credit" was defined as
{{{
C1 = H.whetstone * J.cpu_time
}}}
There were then various schemes for taking the
average or min claimed credit of the replicas of a job,
and using that as the "granted credit".

We call this system "Peak-FLOPS-based" because
it's based on the CPU's peak performance.

The problem with this system is that, for a given app version,
efficiency can vary widely between hosts.
In the above example,
the 10 GFLOPS host would claim 10X as much credit,
and its owner would be upset when it was granted only a tenth of that.
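
The discrepancy can be seen in a small numerical sketch (C++; the values are hypothetical, not actual BOINC code):
{{{
// First credit system: claimed credit ~ benchmark * CPU time.
// Two hosts finish the same job in the same CPU time, but their
// Whetstone scores differ 10X, so their claims differ 10X.
#include <cstdio>

int main() {
    double cpu_time = 3600;        // seconds; same on both hosts
    double whetstone_slow = 1e9;   // 1 GFLOPS host
    double whetstone_fast = 10e9;  // 10 GFLOPS host

    double claim_slow = whetstone_slow * cpu_time;
    double claim_fast = whetstone_fast * cpu_time;
    printf("claim ratio = %.0fX\n", claim_fast / claim_slow);  // prints 10X
    return 0;
}
}}}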

Furthermore, the credits granted to a given host for a
series of identical jobs could vary widely,
depending on the host it was paired with by replication.
This seemed arbitrary and unfair to users.

== The second credit system ==

We then switched to the philosophy that
credit should be proportional to the number of FLOPs actually performed
by the application.
We added API calls to let applications report this.
We call this approach "Actual-FLOPs-based".

SETI@home's application allowed counting of FLOPs,
and they adopted this system,
adding a scaling factor so that average credit per job
was the same as in the first credit system.

Not all projects could count FLOPs, however.
So SETI@home published their average credit per CPU second,
and other projects continued to use benchmark-based credit,
but multiplied it by a scaling factor to match SETI@home's average.

This system had several problems:

* It didn't address GPUs.
* Projects that couldn't count FLOPs still had device neutrality problems.
* It didn't prevent credit cheating when single replication was used.

== Goals of the new (third) credit system ==

* Completely automated - projects don't have to
change code, settings, etc.

* Device neutrality

* Limited project neutrality: different projects should grant
about the same amount of credit per CPU hour, averaged over hosts.
Projects with GPU apps should grant credit in proportion
to the efficiency of the apps.
(This means that projects with efficient GPU apps will
grant more credit on average. That's OK.)

== Peak FLOP Count (PFC) ==

This system uses the Peak-FLOPS-based approach,
but addresses its problems in a new way.

When a job is issued to a host, the scheduler specifies usage(J,D),
J's usage of processing resource D:
how many CPUs and how many GPUs (possibly fractional).

If the job is finished in elapsed time T,
we define peak_flop_count(J), or PFC(J), as
{{{
PFC(J) = T * sum over devices D of (usage(J, D) * peak_flop_rate(D))
}}}

Notes:

* We use elapsed time instead of actual device time (e.g., CPU time).
If a job uses a resource inefficiently
(e.g., a CPU job that does lots of disk I/O)
PFC() won't reflect this. That's OK.
The key thing is that BOINC reserved the device for the job,
whether or not the job used it efficiently.
* usage(J,D) may not be accurate; e.g., a GPU job may take
more or less CPU than the scheduler thinks it will.
Eventually we may switch to a scheme where the client
dynamically determines the CPU usage.
For now, though, we'll just use the scheduler's estimate.
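
As a rough illustration, here is a hedged C++ sketch of the PFC computation; the struct and field names are made up for this example and are not BOINC's actual data structures:
{{{
#include <vector>

// Hypothetical description of one processing resource used by a job.
struct ResourceUsage {
    double usage;            // device count used by the job (possibly fractional)
    double peak_flop_rate;   // device's peak FLOPS
};

// PFC(J) = T * sum over devices D of (usage(J, D) * peak_flop_rate(D))
double peak_flop_count(double elapsed_time, const std::vector<ResourceUsage>& devices) {
    double rate = 0;
    for (const auto& d : devices) {
        rate += d.usage * d.peak_flop_rate;
    }
    return elapsed_time * rate;
}

// Example: 1 CPU at 5 GFLOPS plus 0.5 of a GPU at 500 GFLOPS for 1000 seconds:
// double pfc = peak_flop_count(1000, {{1.0, 5e9}, {0.5, 500e9}});
}}}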

The granted credit for a job J is proportional to PFC(J),
but is normalized in the following ways:

== Cross-version normalization ==

If a given application has multiple versions (e.g., CPU and GPU versions),
the granted credit per job is adjusted
so that the average is the same for each version.
The adjustment is always downwards:
we maintain the average PFC^mean^(V) of PFC() for each app version V,
and let X be the minimum of these averages.
An app version V's jobs are then scaled by the factor

S(V) = (X/PFC^mean^(V))

The result for a given job J
is called "Version-Normalized Peak FLOP Count", or VNPFC(J):

VNPFC(J) = PFC(J) * (X/PFC^mean^(V))

Notes:
* This addresses the common situation
where an app's GPU version is much less efficient than the CPU version
(i.e. the ratio of actual FLOPs to peak FLOPs is much less).
To a certain extent, this mechanism shifts the system
towards the "Actual FLOPs" philosophy,
since credit is granted based on the most efficient app version.
It's not exactly "Actual FLOPs", since the most efficient
version may not be 100% efficient.
* There are two sources of variance in PFC(V):
the variation in host efficiency,
and possibly the variation in job size.
If we have an ''a priori'' estimate of job size
(e.g., workunit.rsc_fpops_est)
we can normalize by this to reduce the variance,
and make PFC^mean^(V) converge more quickly.
* ''A posteriori'' estimates of job size may also exist
(e.g., an iteration count reported by the app),
but using them for anything introduces a new cheating risk,
so it's probably better not to.
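
A hedged C++ sketch of this scaling; the names are hypothetical, and the real averages would come from the database:
{{{
#include <algorithm>
#include <vector>

// Hypothetical per-app-version record.
struct AppVersion {
    double pfc_mean;   // running average of PFC() for this version's jobs
};

// S(V) = X / PFC_mean(V), where X is the smallest PFC_mean among the app's versions.
// Assumes all_versions is non-empty and includes v.
double version_scale(const AppVersion& v, const std::vector<AppVersion>& all_versions) {
    double x = all_versions[0].pfc_mean;
    for (const auto& w : all_versions) x = std::min(x, w.pfc_mean);
    return x / v.pfc_mean;   // always <= 1: the adjustment is downwards
}

// VNPFC(J) = PFC(J) * S(V)
double vnpfc(double pfc_j, const AppVersion& v, const std::vector<AppVersion>& all_versions) {
    return pfc_j * version_scale(v, all_versions);
}
}}}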

== Cross-project normalization ==

If an application has both CPU and GPU versions,
then the version normalization mechanism uses the CPU
version as a "sanity check" to limit the credit granted to GPU jobs.

Suppose a project has an app with only a GPU version,
so there's no CPU version to act as a sanity check.
If we grant credit based only on GPU peak speed,
the project will grant much more credit per GPU hour than other projects,
violating limited project neutrality.

A solution to this: if an app has only GPU versions,
then for each version V we let
S(V) be the average scaling factor
for that GPU type among projects that do have both CPU and GPU versions.
This factor is obtained from a central BOINC server.
V's jobs are then scaled by S(V) as above.

Notes:

* Projects will run a periodic script to update the scaling factors.
* Rather than GPU type, we'll probably use plan class,
since e.g. the average efficiency of CUDA 2.3 apps may be different
than that of CUDA 2.1 apps.
* Initially we'll obtain scaling factors from large projects
that have both GPU and CPU apps (e.g., SETI@home).
Eventually we'll use an average (weighted by work done) over multiple projects
(see below).
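
A hedged sketch of this fallback (C++; the lookup table, plan class names, and values are assumptions made up for illustration):
{{{
#include <map>
#include <string>

// Published average scaling factors per plan class, fetched periodically
// from a central BOINC server (entries and values here are made up).
std::map<std::string, double> published_scale = {
    {"cuda23", 0.12},
    {"ati",    0.10},
};

// If an app has no CPU version, use the published factor for the
// version's plan class instead of a locally computed S(V).
double gpu_only_scale(const std::string& plan_class) {
    auto it = published_scale.find(plan_class);
    // Unknown plan class: leave jobs unscaled (a project might choose differently).
    return (it != published_scale.end()) ? it->second : 1.0;
}
}}}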

== Host normalization ==

If hosts are sent jobs for a given app uniformly,
then, for that app,
they should get the same average granted credit per job.
To ensure this, for each application A we maintain the average VNPFC^mean^(A),
and for each host H we maintain VNPFC^mean^(H, A).
The '''claimed FLOPS''' for a given job J is then

F = VNPFC(J) * (VNPFC^mean^(A)/VNPFC^mean^(H, A))

and the claimed credit (in Cobblestones) is

C = F*100/86400e9

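A hedged C++ sketch putting these two formulas together (the parameter names are illustrative):
{{{
// Claimed FLOPS and claimed credit (Cobblestones) for a job J.
// Per the formula above, 86400e9 FLOPs (one day at 1 GFLOPS) yields 100 Cobblestones.
double claimed_credit(
    double vnpfc_j,         // VNPFC(J)
    double vnpfc_mean_app,  // VNPFC_mean(A)
    double vnpfc_mean_host  // VNPFC_mean(H, A)
) {
    double f = vnpfc_j * (vnpfc_mean_app / vnpfc_mean_host);  // claimed FLOPS
    return f * 100 / 86400e9;                                 // claimed credit
}
}}}
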
There are some cases where hosts are not sent jobs uniformly:
* job-size matching (smaller jobs sent to slower hosts)
* GPUGrid.net's scheme for sending some (presumably larger)
jobs to GPUs with more processors.
In these cases average credit per job must differ between hosts,
according to the types of jobs that are sent to them.

We can accommodate this by dividing
each sample in the computation of VNPFC^mean^ by WU.rsc_fpops_est
(in fact, there's no reason not to always do this).

Notes:
* The host normalization mechanism reduces the claimed credit of hosts
that are less efficient than average,
and increases the claimed credit of hosts that are more efficient
than average.
* VNPFC^mean^ is averaged over jobs, not hosts.

== Computing averages ==

We need to compute averages carefully because

* The quantities being averaged may gradually change over time
(e.g. average job size may change,
app version efficiency may change as new versions are deployed)
and we need to track this.
* A given sample may be wildly off,
and we can't let this mess up the average.
* Averages should be weighted by job size.

In addition, we may as well maintain the variance of the quantities,
although the current system doesn't use it.

The code that does all this is
[http://boinc.berkeley.edu/trac/browser/trunk/boinc/lib/average.h here].
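
A minimal sketch of the kind of running average this implies (C++; an exponentially weighted mean with simple outlier clamping, not the actual code in average.h):
{{{
#include <algorithm>

// Exponentially weighted running average with crude outlier protection.
// This is only an illustration of the requirements listed above.
struct RunningAverage {
    double mean = 0;
    int n = 0;

    void update(double sample, double alpha = 0.01) {
        if (n > 10) {
            // After a warm-up period, clamp wild samples to a multiple of the mean.
            sample = std::min(sample, 10 * mean);
        }
        if (n == 0) {
            mean = sample;
        } else {
            mean = (1 - alpha) * mean + alpha * sample;  // tracks gradual change
        }
        n++;
    }
};
}}}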

== Cross-project scaling factors ==

We'll have a script that publishes a project's
accounting data (see Implementation).
The BOINC web site will collect these from a few big projects
and publish the averages.

== Replication and cheating ==

Host normalization mostly eliminates the incentive to cheat
by claiming excessive credit
(i.e., by falsifying benchmark scores or elapsed time).
An exaggerated claim will increase VNPFC^mean^(H, A),
causing subsequent claimed credit to be scaled down proportionately.
This means that no special cheat-prevention scheme
is needed when single replication is used;
granted credit = claimed credit.

For jobs that are replicated, granted credit should be
set to the min of the valid results
(min is used instead of average to remove the incentive
for cherry-picking; see below).
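
A hedged sketch of granting for replicated jobs (C++; illustrative only):
{{{
#include <algorithm>
#include <vector>

// For a replicated job, grant the minimum of the valid replicas' claims.
// Using min (rather than the average) removes the payoff from inflating one claim.
double granted_credit(const std::vector<double>& valid_claims) {
    if (valid_claims.empty()) return 0;
    return *std::min_element(valid_claims.begin(), valid_claims.end());
}

// With single replication, granted credit = claimed credit, so no special
// case is needed beyond host normalization.
}}}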

However, there are still some possible forms of cheating.

* One-time cheats (like claiming 1e304) can be prevented by
capping VNPFC(J) at some multiple (say, 10) of VNPFC^mean^(A).
* Cherry-picking: suppose an application has two types of jobs,
which run for 1 second and 1 hour respectively.
Clients can figure out which is which, e.g. by running a job for 2 seconds
and seeing if it's exited.
Suppose a client systematically refuses the 1-hour jobs
(e.g., by reporting a crash or never reporting them).
Its VNPFC^mean^(H, A) will quickly decrease,
and soon it will be getting several thousand times more credit
per unit of actual work than other hosts!
Countermeasure:
whenever a job errors out, times out, or fails to validate,
set the host's error rate back to the initial default,
and set its VNPFC^mean^(H, A) to VNPFC^mean^(A) for all apps A.
This puts the host in a state where several dozen of its
subsequent jobs will be replicated.

== Trickle credit ==


== Job runtime estimates ==

Unrelated to the credit proposal, but in a similar spirit.
The server will maintain ET^mean^(H, V), the statistics of
job runtimes (normalized by wu.rsc_fpops_est) per
host and application version.

The server's estimate of a job's runtime is then

R(J, H) = wu.rsc_fpops_est * ET^mean^(H, V)

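A hedged sketch of this estimate (C++; the parameter names follow the formula above, everything else is illustrative):
{{{
// ET_mean(H, V) is a running average of (elapsed time / wu.rsc_fpops_est)
// for this host and app version; the estimate is just the product below.
double estimated_runtime(double rsc_fpops_est, double et_mean_host_version) {
    return rsc_fpops_est * et_mean_host_version;  // R(J, H)
}
}}}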

== Error rate, host punishment, and turnaround time estimation ==

Unrelated to the credit proposal, but in a similar spirit.

Due to hardware problems (e.g. a malfunctioning GPU)
a host may have a 100% error rate for one app version
and a 0% error rate for another.
The same is true of turnaround time.

The host punishment mechanism is designed to
deal with malfunctioning hosts.
For each host the server maintains '''max_results_day'''.
This is initialized to a project-specified value (e.g. 200)
and scaled by the number of CPUs and/or GPUs.
It's decremented if the client reports a crash
(but not if the job was aborted).
It's doubled when a successful (but not necessarily valid)
result is received.
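
A hedged C++ sketch of this update rule; the names, the device-count scaling, and the cap on growth are assumptions made for illustration:
{{{
#include <algorithm>

// Hypothetical illustration of the max_results_day mechanism described above.
struct HostQuota {
    double init_quota;        // initial value, scaled by device count
    double max_results_day;
};

void init_quota(HostQuota& h, int ncpus, int ngpus) {
    h.init_quota = 200 * (ncpus + ngpus);   // 200 is a project-specified example
    h.max_results_day = h.init_quota;
}

void on_result_reported(HostQuota& h, bool crashed, bool aborted) {
    if (crashed && !aborted) {
        // Punish reported crashes, but not user aborts.
        h.max_results_day = std::max(1.0, h.max_results_day - 1);
    } else if (!crashed) {
        // Doubled on a successful (not necessarily valid) result;
        // capping at the initial value is an assumption, not stated above.
        h.max_results_day = std::min(h.init_quota, h.max_results_day * 2);
    }
}
}}}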

So we'll move the "error_rate", "turnaround_time"
and "max_results_day" fields from the host table to host_app_version.

== Cherry picking ==

Suppose an application has a mix of long and short jobs.
If a client intentionally discards
(or aborts, or reports errors from) the long jobs,
but completes the short jobs,
its host scaling factor will become large,
and it will get excessive credit for the short jobs.
This is called "cherry picking".

Note: the host punishment mechanism
doesn't deal effectively with cherry picking.

We propose the following mechanism to deal with cherry picking:

* For each (host, app version), maintain "host_scale_time".
This is the earliest time at which host scaling will be applied.
It is set to now+X, where X is the app's delay bound,
both initially and each time a job times out or errors out.

The idea is to apply the host scaling factor
only if there's solid evidence that the host is NOT cherry picking.
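
A hedged C++ sketch of the idea (the struct and function names are hypothetical):
{{{
#include <ctime>

// Hypothetical per-(host, app version) state for the cherry-picking defense.
struct HostAppVersion {
    time_t host_scale_time;   // earliest time at which host scaling may be applied
};

// Called initially, and whenever one of this host's jobs times out or errors out.
void penalize(HostAppVersion& hav, time_t now, double delay_bound) {
    hav.host_scale_time = now + (time_t)delay_bound;   // now + X
}

// Apply the host scaling factor only if enough trouble-free time has passed.
bool apply_host_scaling(const HostAppVersion& hav, time_t now) {
    return now >= hav.host_scale_time;
}
}}}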

== Implementation ==

=== Database changes ===

New table '''host_app''':
{{{
int host_id;
int app_id;
int vnpfc_n;
double vnpfc_sum;
double vnpfc_exp_avg;
}}}

New table '''host_app_version''':
{{{
int host_id;
int app_version_id;
int et_n;
double et_sum;
double et_exp_avg;
// some variable for recent error rate,
// replacing host.error_rate and host.max_results_day
// make sure that a host w/ 1 good and 1 bad GPU gets few GPU jobs
}}}

New fields in '''app_version''':
{{{