= New credit system design =

== Peak FLOPS and efficiency ==

BOINC estimates the peak FLOPS of each processor.
For CPUs, this is the Whetstone benchmark score.
For GPUs, it's given by a manufacturer-supplied formula.

However, other factors affect application performance.
For example, applications access memory,
and the speed of a host's memory system is not reflected
in its Whetstone score.
So a given job might take the same amount of CPU time
on a 1 GFLOPS host as on a 10 GFLOPS host.
The "efficiency" of an application running on a given host
is the ratio of actual FLOPS to peak FLOPS.

GPUs typically have a much higher (50-100X) peak FLOPS than CPUs.
However, application efficiency is typically lower
(very roughly, 10% for GPUs, 50% for CPUs).

Notes:

* The peak FLOPS of a device is its single or double precision peak,
whichever is higher.
Differentiating between single and double precision would unnecessarily
complicate things, and the distinction will disappear soon anyway.

== Credit system goals ==

Some possible goals in designing a credit system:

* Device neutrality: similar jobs should get similar credit
regardless of what processor or GPU they run on.

* Project neutrality: different projects should grant
about the same amount of credit per day for a given processor.

It's easy to show that these two goals can't both be satisfied simultaneously.

== The first credit system ==

In the first iteration of BOINC's credit system,
"claimed credit" was defined as
{{{
C1 = H.whetstone * J.cpu_time
}}}
There were then various schemes for taking the
average or min claimed credit of the replicas of a job,
and using that as the "granted credit".

We call this system "Peak-FLOPS-based" because
it's based on the CPU's peak performance.

The problem with this system is that, for a given app version,
efficiency can vary widely between hosts.
In the above example,
the 10 GFLOPS host would claim 10X as much credit,
and its owner would be upset when it was granted only a tenth of that.

Furthermore, the credits granted to a given host for a
series of identical jobs could vary widely,
depending on the host it was paired with by replication.
This seemed arbitrary and unfair to users.

== The second credit system ==

We then switched to the philosophy that
credit should be proportional to the number of FLOPs actually performed
by the application.
We added API calls to let applications report this.
We call this approach "Actual-FLOPs-based".

SETI@home's application allowed counting of FLOPs,
and the project adopted this system,
adding a scaling factor so that average credit per job
was the same as in the first credit system.

Not all projects could count FLOPs, however.
So SETI@home published their average credit per CPU second,
and other projects continued to use benchmark-based credit,
but multiplied it by a scaling factor to match SETI@home's average.

This system had several problems:

* It didn't address GPUs.
* Projects that couldn't count FLOPs still had device neutrality problems.
* It didn't prevent credit cheating when single replication was used.


== Goals of the new (third) credit system ==

* Completely automated - projects don't have to
change code, settings, etc.

* Device neutrality

* Limited project neutrality: different projects should grant
about the same amount of credit per CPU hour, averaged over hosts.
Projects with GPU apps should grant credit in proportion
to the efficiency of the apps.
(This means that projects with efficient GPU apps will
grant more credit on average. That's OK).

== Peak FLOP Count (PFC) ==

This system uses the Peak-FLOPS-based approach,
but addresses its problems in a new way.

When a job is issued to a host, the scheduler specifies usage(J,D),
J's usage of processing resource D:
how many CPUs and how many GPUs (possibly fractional).

If the job is finished in elapsed time T,
we define peak_flop_count(J), or PFC(J), as
{{{
PFC(J) = T * sum over devices D (usage(J, D) * peak_flop_rate(D))
}}}

Notes:

* We use elapsed time instead of actual device time (e.g., CPU time).
If a job uses a resource inefficiently
(e.g., a CPU job that does lots of disk I/O),
PFC() won't reflect this. That's OK.
The key thing is that BOINC reserved the device for the job,
whether or not the job used it efficiently.
* usage(J,D) may not be accurate; e.g., a GPU job may take
more or less CPU than the scheduler thinks it will.
Eventually we may switch to a scheme where the client
dynamically determines the CPU usage.
For now, though, we'll just use the scheduler's estimate.
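
For concreteness, here is a minimal C++ sketch of the PFC computation as defined above; the struct and function names are illustrative, not actual server code.

{{{
#include <vector>

// One entry per processing resource D used by the job.
struct ResourceUsage {
    double usage;           // usage(J, D): number of instances, possibly fractional
    double peak_flop_rate;  // peak_flop_rate(D), in FLOPS
};

// PFC(J) = T * sum over devices D (usage(J, D) * peak_flop_rate(D))
double peak_flop_count(double elapsed_time, const std::vector<ResourceUsage>& devices) {
    double peak_rate = 0;
    for (const ResourceUsage& d : devices) {
        peak_rate += d.usage * d.peak_flop_rate;
    }
    return elapsed_time * peak_rate;
}
}}}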

The granted credit for a job J is proportional to PFC(J),
but is normalized in the following ways:

== Cross-version normalization ==

If a given application has multiple versions (e.g., CPU and GPU versions),
the granted credit per job is adjusted
so that the average is the same for each version.
The adjustment is always downwards:
we maintain the average PFC^mean^(V) of PFC() for each app version V,
and let X be the minimum of these averages.
An app version V's jobs are then scaled by the factor

S(V) = (X/PFC^mean^(V))

The result for a given job J
is called "Version-Normalized Peak FLOP Count", or VNPFC(J):

VNPFC(J) = PFC(J) * (X/PFC^mean^(V))

Notes:
* This addresses the common situation
where an app's GPU version is much less efficient than the CPU version
(i.e. the ratio of actual FLOPs to peak FLOPs is much lower).
To a certain extent, this mechanism shifts the system
towards the "Actual FLOPs" philosophy,
since credit is granted based on the most efficient app version.
It's not exactly "Actual FLOPs", since the most efficient
version may not be 100% efficient.
* There are two sources of variance in PFC(V):
the variation in host efficiency,
and possibly the variation in job size.
If we have an ''a priori'' estimate of job size
(e.g., workunit.rsc_fpops_est)
we can normalize by this to reduce the variance,
and make PFC^mean^(V) converge more quickly.
* ''a posteriori'' estimates of job size may exist also
(e.g., an iteration count reported by the app),
but using them for anything introduces a new cheating risk,
so it's probably better not to.

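A minimal sketch of the scaling described above, assuming PFC^mean^(V) is already maintained for each version; names are illustrative:

{{{
#include <algorithm>
#include <vector>

// Running average of PFC() for one app version V.
struct AppVersionStats {
    double pfc_mean;
};

// S(V) = X / PFC_mean(V), where X is the minimum PFC_mean over all versions
// of the app. Since X <= PFC_mean(V), the adjustment is always downwards.
double version_scale(const AppVersionStats& v, const std::vector<AppVersionStats>& versions) {
    double x = versions.front().pfc_mean;
    for (const AppVersionStats& w : versions) {
        x = std::min(x, w.pfc_mean);
    }
    return x / v.pfc_mean;
}

// VNPFC(J) = PFC(J) * S(V)
double vnpfc(double pfc_j, const AppVersionStats& v, const std::vector<AppVersionStats>& versions) {
    return pfc_j * version_scale(v, versions);
}
}}}
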
== Cross-project normalization ==

If an application has both CPU and GPU versions,
then the version normalization mechanism uses the CPU
version as a "sanity check" to limit the credit granted to GPU jobs.

Suppose a project has an app with only a GPU version,
so there's no CPU version to act as a sanity check.
If we grant credit based only on GPU peak speed,
the project will grant much more credit per GPU hour than other projects,
violating limited project neutrality.

A solution to this: if an app has only GPU versions,
then for each version V we let
S(V) be the average scaling factor
for that GPU type among projects that do have both CPU and GPU versions.
This factor is obtained from a central BOINC server.
V's jobs are then scaled by S(V) as above.

Notes:

* Projects will run a periodic script to update the scaling factors.
* Rather than GPU type, we'll probably use plan class,
since e.g. the average efficiency of CUDA 2.3 apps may be different
than that of CUDA 2.1 apps.
* Initially we'll obtain scaling factors from large projects
that have both GPU and CPU apps (e.g., SETI@home).
Eventually we'll use an average (weighted by work done) over multiple projects
(see below).

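The eventual multi-project average might be computed roughly as follows (a sketch, assuming each participating project exports its scale factor for a plan class together with the amount of work the estimate is based on):

{{{
#include <vector>

// What one project might export for a given plan class (hypothetical record).
struct ProjectScaleSample {
    double scale_factor;  // that project's S(V) for the plan class
    double work_done;     // work the estimate is based on
};

// Average of the scale factors, weighted by work done.
double cross_project_scale(const std::vector<ProjectScaleSample>& samples) {
    double num = 0, denom = 0;
    for (const ProjectScaleSample& s : samples) {
        num += s.scale_factor * s.work_done;
        denom += s.work_done;
    }
    return denom > 0 ? num / denom : 1;
}
}}}
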
== Host normalization ==

Assuming that hosts are sent jobs for a given app uniformly,
then, for that app,
hosts should get the same average granted credit per job.
To ensure this, for each application A we maintain the average VNPFC^mean^(A),
and for each host H we maintain VNPFC^mean^(H, A).
The '''claimed FLOPS''' for a given job J is then

F = VNPFC(J) * (VNPFC^mean^(A)/VNPFC^mean^(H, A))

and the claimed credit (in Cobblestones) is

C = F*100/86400e9

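In code form, the host normalization and the conversion to Cobblestones amount to the following (a sketch; variable names are illustrative):

{{{
// F = VNPFC(J) * (VNPFC_mean(A) / VNPFC_mean(H, A))
// C = F * 100 / 86400e9
// (100 Cobblestones per day of computing at a peak rate of 1e9 FLOPS)
double claimed_credit(
    double vnpfc_j,              // VNPFC(J)
    double vnpfc_mean_app,       // VNPFC_mean(A), averaged over all hosts
    double vnpfc_mean_host_app   // VNPFC_mean(H, A), for this host
) {
    double f = vnpfc_j * (vnpfc_mean_app / vnpfc_mean_host_app);
    return f * 100 / 86400e9;
}
}}}
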
There are some cases where hosts are not sent jobs uniformly:
* job-size matching (smaller jobs sent to slower hosts)
* GPUGrid.net's scheme for sending some (presumably larger)
jobs to GPUs with more processors.
In these cases average credit per job must differ between hosts,
according to the types of jobs that are sent to them.

This can be done by dividing
each sample in the computation of VNPFC^mean^ by WU.rsc_fpops_est
(in fact, there's no reason not to always do this).

Notes:
* The host normalization mechanism reduces the claimed credit of hosts
that are less efficient than average,
and increases the claimed credit of hosts that are more efficient
than average.
* VNPFC^mean^ is averaged over jobs, not hosts.

== Computing averages ==

We need to compute averages carefully because

* The quantities being averaged may gradually change over time
(e.g. average job size may change,
app version efficiency may change as new versions are deployed)
and we need to track this.
* A given sample may be wildly off,
and we can't let this mess up the average.
* Averages should be weighted by job size.

In addition, we may as well maintain the variance of the quantities,
although the current system doesn't use it.

The code that does all this is
[http://boinc.berkeley.edu/trac/browser/trunk/boinc/lib/average.h here].

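A rough sketch of the kind of average this implies: exponential decay so that old samples fade, a cap on individual samples so outliers can't dominate, and samples normalized by job size. The constants here are arbitrary placeholders; the real code is in the file linked above.

{{{
#include <algorithm>

// Simplified exponentially-weighted running average.
// Samples are assumed to be normalized by job size
// (e.g., PFC(J) / wu.rsc_fpops_est) before being passed in.
struct RunningAverage {
    double avg = 0;
    double n = 0;   // effective number of samples

    void update(double sample) {
        if (n > 10) {
            // don't let a wildly-off sample mess up the average
            sample = std::min(sample, 10 * avg);
        }
        // weight of a new sample; bounded below so the average
        // keeps tracking gradual changes over time
        double alpha = std::max(1.0 / (n + 1), 0.01);
        avg += alpha * (sample - avg);
        n += 1;
    }
};
}}}
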
== Cross-project scaling factors ==

We'll have a script that publishes a project's
accounting data (see Implementation).
The BOINC web site will collect these from a few big projects
and publish the averages.

== Replication and cheating ==

Host normalization mostly eliminates the incentive to cheat
by claiming excessive credit
(i.e., by falsifying benchmark scores or elapsed time).
An exaggerated claim will increase VNPFC^mean^(H, A),
causing subsequent claimed credit to be scaled down proportionately.
This means that no special cheat-prevention scheme
is needed for single replication;
granted credit = claimed credit.

For jobs that are replicated, granted credit should be
set to the min of the valid results
(min is used instead of average to remove the incentive
for cherry-picking; see below).

However, there are still some possible forms of cheating.

* One-time cheats (like claiming 1e304) can be prevented by
capping VNPFC(J) at some multiple (say, 10) of VNPFC^mean^(A).
* Cherry-picking: suppose an application has two types of jobs,
which run for 1 second and 1 hour respectively.
Clients can figure out which is which, e.g. by running a job for 2 seconds
and seeing if it has exited.
Suppose a client systematically refuses the 1-hour jobs
(e.g., by reporting a crash or never reporting them).
Its VNPFC^mean^(H, A) will quickly decrease,
and soon it will be getting several thousand times more credit
per unit of actual work than other hosts!
Countermeasure (sketched below):
whenever a job errors out, times out, or fails to validate,
set the host's error rate back to the initial default,
and set its VNPFC^mean^(H, A) to VNPFC^mean^(A) for all apps A.
This puts the host in a state where several dozen of its
subsequent jobs will be replicated.

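A sketch of the two countermeasures above (the cap on one-time cheats and the reset on failure); names and the choice of multiple are illustrative:

{{{
#include <algorithm>

// One-time cheats: cap a job's VNPFC at a multiple of the app's average.
double capped_vnpfc(double vnpfc_j, double vnpfc_mean_app) {
    return std::min(vnpfc_j, 10 * vnpfc_mean_app);
}

// Per-(host, app) state relevant to cheating.
struct HostAppStats {
    double vnpfc_mean;   // VNPFC_mean(H, A)
    double error_rate;
};

// Cherry-picking countermeasure: called when one of the host's jobs
// errors out, times out, or fails to validate.
void reset_host_stats(HostAppStats& h, double vnpfc_mean_app, double initial_error_rate) {
    h.error_rate = initial_error_rate;   // forces replication of its next jobs
    h.vnpfc_mean = vnpfc_mean_app;       // removes its inflated credit scale
}
}}}
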
== Trickle credit ==

CPDN breaks jobs into segments,
sends a trickle-up message for each segment,
and grants credit for each completed segment.
In this case,
the trickle message handlers should not grant a fixed amount of credit.
Instead, the trickle-up messages should contain
an "incremental elapsed time" field.

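For example, a trickle handler could grant credit for a segment from the reported incremental elapsed time, applying the same peak-FLOPS and scaling logic as for whole jobs (a sketch, not existing handler code):

{{{
// Credit for one segment of a CPDN-style job.
// peak_flop_rate: sum over devices of usage(J,D) * peak_flop_rate(D), as in PFC()
// scale:          product of the version and host scaling factors
double segment_credit(double incremental_elapsed_time, double peak_flop_rate, double scale) {
    double pfc = incremental_elapsed_time * peak_flop_rate;   // segment's peak FLOP count
    return pfc * scale * 100 / 86400e9;                       // convert to Cobblestones
}
}}}
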
== Job runtime estimates ==

Unrelated to the credit proposal, but in a similar spirit.
The server will maintain ET^mean^(H, V), the statistics of
job runtimes (normalized by wu.rsc_fpops_est) per
host and application version.

The server's estimate of a job's runtime is then

R(J, H) = wu.rsc_fpops_est * ET^mean^(H, V)

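A sketch of how ET^mean^(H, V) could be maintained and used (illustrative names; the real statistics would live in host_app_version, see Implementation):

{{{
// Statistics of normalized job runtimes for one (host, app version).
struct HostVersionRuntime {
    double et_mean = 0;   // average of elapsed_time / wu.rsc_fpops_est
    int n = 0;

    void update(double elapsed_time, double rsc_fpops_est) {
        double sample = elapsed_time / rsc_fpops_est;
        et_mean += (sample - et_mean) / (n + 1);
        n++;
    }
};

// R(J, H) = wu.rsc_fpops_est * ET_mean(H, V)
double estimated_runtime(double rsc_fpops_est, const HostVersionRuntime& s) {
    return rsc_fpops_est * s.et_mean;
}
}}}
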
== Error rate, host punishment, and turnaround time estimation ==

Unrelated to the credit proposal, but in a similar spirit.

Due to hardware problems (e.g. a malfunctioning GPU)
a host may have a 100% error rate for one app version
and a 0% error rate for another.
The same can be true of turnaround time.

So we'll move the "error_rate" and "turnaround_time"
fields from the host table to host_app_version.

The host punishment mechanism is designed to deal with malfunctioning hosts.
For each host the server maintains '''max_results_day'''.
This is initialized to a project-specified value (e.g. 200)
and scaled by the number of CPUs and/or GPUs.
It's decremented if the client reports a crash
(but not if the job was aborted).
It's doubled when a successful (but not necessarily valid)
result is received.

This should also be per-app-version,
so we'll move "max_results_day" from the host table to host_app_version.

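A sketch of the quota bookkeeping this describes, kept per (host, app version); names are illustrative, and the floor of 1 and the cap on doubling are assumptions, not stated above:

{{{
#include <algorithm>

struct HostAppVersionQuota {
    int init_quota;        // project-specified value (e.g. 200), scaled by device count
    int max_results_day;

    // client reported a crash (not an abort)
    void on_crash() {
        max_results_day = std::max(1, max_results_day - 1);
    }
    // a successful (not necessarily valid) result was received
    void on_success() {
        // assumption: don't let the quota grow past its initial value
        max_results_day = std::min(2 * max_results_day, init_quota);
    }
};
}}}
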
== Cherry picking ==

Suppose an application has a mix of long and short jobs.
If a client intentionally discards
(or aborts, or reports errors from) the long jobs,
but completes the short jobs,
its host scaling factor will become large,
and it will get excessive credit for the short jobs.
This is called "cherry picking".

The host punishment mechanism
doesn't deal effectively with cherry picking.

We propose the following mechanism to deal with cherry picking
(sketched in code below):

* For each (host, app version) maintain "host_scale_time".
This is the earliest time at which host scaling will be applied.
* For each (host, app version) maintain "scale_probation"
(initially true).
* When sending a job to a host,
if scale_probation is true,
set host_scale_time to now+X, where X is the app's delay bound.
* When a job is successfully validated,
and now > host_scale_time,
set scale_probation to false.
* If a job times out or errors out,
set scale_probation to true,
max the scale factor with 1,
and set host_scale_time to now+X.
* When computing claimed credit for a job,
if now < host_scale_time, don't use the host scale factor.

The idea is to apply the host scaling factor
only if there's solid evidence that the host is NOT cherry picking.

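A sketch of the probation logic from the list above (field and function names are hypothetical):

{{{
#include <algorithm>

// Per-(host, app version) cherry-picking defense.
struct HostScaling {
    double host_scale_time = 0;   // earliest time at which host scaling applies
    bool scale_probation = true;
    double scale_factor = 1;      // VNPFC_mean(A) / VNPFC_mean(H, A)

    void on_job_sent(double now, double delay_bound) {
        if (scale_probation) host_scale_time = now + delay_bound;
    }
    void on_job_validated(double now) {
        if (now > host_scale_time) scale_probation = false;
    }
    void on_job_failed(double now, double delay_bound) {   // timeout or error
        scale_probation = true;
        scale_factor = std::max(scale_factor, 1.0);        // "max the scale factor with 1"
        host_scale_time = now + delay_bound;
    }
    // Apply the scale factor only once the probation window has passed.
    double effective_scale(double now) const {
        return (now < host_scale_time) ? 1.0 : scale_factor;
    }
};
}}}
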
Because this mechanism is punitive to hosts
that experience actual failures,
we'll make it selectable on a per-application basis (default off).

In addition, to limit the extent of cheating
(in case the above mechanism is defeated somehow)
the host scaling factor will be capped at a
project-wide config parameter (default, say, 3).

== Implementation ==

=== Database changes ===

New table '''host_app''':
{{{
int host_id;
int app_id;
int vnpfc_n;
double vnpfc_sum;
double vnpfc_exp_avg;
}}}

New table '''host_app_version''':
{{{
int host_id;
int app_version_id;
int et_n;
double et_sum;
double et_exp_avg;
// some variable for recent error rate,
// replacing host.error_rate and host.max_results_day
// make sure that a host w/ 1 good and 1 bad GPU gets few GPU jobs
}}}

New fields in '''app_version''':
{{{