= New credit system design =

== Peak FLOPS and efficiency ==

BOINC estimates the peak FLOPS of each processor.
For CPUs, this is the Whetstone benchmark score.
For GPUs, it's given by a manufacturer-supplied formula.

However, other factors affect application performance.
For example, applications access memory,
and the speed of a host's memory system is not reflected
in its Whetstone score.
So a given job might take the same amount of CPU time
on a 1 GFLOPS host as on a 10 GFLOPS host.
The "efficiency" of an application running on a given host
is the ratio of actual FLOPS to peak FLOPS.

GPUs typically have a much higher (50-100X) peak FLOPS than CPUs.
However, application efficiency is typically lower
(very roughly, 10% for GPUs, 50% for CPUs).
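
As a toy example (a sketch, not BOINC code; the numbers just restate
the figures above):
{{{
#include <cstdio>

// Efficiency = actual FLOPS / peak FLOPS for an app running on a host.
double efficiency(double actual_flops, double peak_flops) {
    return actual_flops / peak_flops;
}

int main() {
    // A GPU with 1000 GFLOPS peak sustaining 100 GFLOPS is 10% efficient;
    // a CPU with 10 GFLOPS peak sustaining 5 GFLOPS is 50% efficient.
    printf("GPU: %.0f%%\n", 100 * efficiency(100e9, 1000e9));
    printf("CPU: %.0f%%\n", 100 * efficiency(5e9, 10e9));
    return 0;
}
}}}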

Notes:

 * The peak FLOPS of a device is its single or double precision rate,
   whichever is higher.
   Differentiating between single and double would unnecessarily
   complicate things, and the distinction will disappear soon anyway.

== Credit system goals ==

Some possible goals in designing a credit system:

 * Device neutrality: similar jobs should get similar credit
   regardless of what processor or GPU they run on.

 * Project neutrality: different projects should grant
   about the same amount of credit per day for a given processor.

It's easy to show that both goals can't be satisfied simultaneously.

== The first credit system ==

In the first iteration of BOINC's credit system,
"claimed credit" was defined as
{{{
C1 = H.whetstone * J.cpu_time
}}}
There were then various schemes for taking the
average or min claimed credit of the replicas of a job,
and using that as the "granted credit".

We call this system "Peak-FLOPS-based" because
it's based on the CPU's peak performance.

The problem with this system is that, for a given app version,
efficiency can vary widely between hosts.
In the above example,
the 10 GFLOPS host would claim 10X as much credit,
and its owner would be upset when it was granted only a tenth of that.

Furthermore, the credits granted to a given host for a
series of identical jobs could vary widely,
depending on the host it was paired with by replication.
This seemed arbitrary and unfair to users.

== The second credit system ==

We then switched to the philosophy that
credit should be proportional to number of FLOPs actually performed
by the application.
We added API calls to let applications report this.
We call this approach "Actual-FLOPs-based".

SETI@home's application allowed counting of FLOPs,
and they adopted this system,
adding a scaling factor so that average credit per job
was the same as the first credit system.

Not all projects could count FLOPs, however.
So SETI@home published their average credit per CPU second,
and other projects continued to use benchmark-based credit,
but multiplied it by a scaling factor to match SETI@home's average.

This system had several problems:

 * It didn't address GPUs.
 * Projects that couldn't count FLOPs still had device neutrality problems.
 * It didn't prevent credit cheating when single replication was used.


== Goals of the new (third) credit system ==

 * Completely automated - projects don't have to
   change code, settings, etc.

 * Device neutrality

 * Limited project neutrality: different projects should grant
   about the same amount of credit per CPU hour, averaged over hosts.
   Projects with GPU apps should grant credit in proportion
   to the efficiency of the apps.
   (This means that projects with efficient GPU apps will
   grant more credit on average.  That's OK).

== Peak FLOP Count (PFC) ==

This system uses the Peak-FLOPS-based approach,
but addresses its problems in a new way.

When a job is issued to a host, the scheduler specifies usage(J,D),
J's usage of processing resource D:
how many CPUs and how many GPUs (possibly fractional).

If the job is finished in elapsed time T,
we define peak_flop_count(J), or PFC(J) as
{{{
PFC(J) = T * sum over devices D (usage(J, D) * peak_flop_rate(D))
}}}
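
As a concrete sketch (illustrative only; the struct and function names
are assumptions, not BOINC's actual code):
{{{
#include <vector>

// Illustrative sketch of the formula above.
// One entry per processing resource (CPU, a GPU type, ...) used by job J.
struct ResourceUsage {
    double usage;           // usage(J, D): number of devices used (may be fractional)
    double peak_flop_rate;  // peak_flop_rate(D), in FLOPS
};

// PFC(J) = T * sum over devices D of usage(J, D) * peak_flop_rate(D)
double peak_flop_count(double elapsed_time, const std::vector<ResourceUsage>& usages) {
    double peak_flops = 0;
    for (const ResourceUsage& u : usages) {
        peak_flops += u.usage * u.peak_flop_rate;
    }
    return elapsed_time * peak_flops;
}
}}}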

Notes:

 * We use elapsed time instead of actual device time (e.g., CPU time).
   If a job uses a resource inefficiently
   (e.g., a CPU job that does lots of disk I/O)
   PFC() won't reflect this.  That's OK.
   The key thing is that BOINC reserved the device for the job,
   whether or not the job used it efficiently.
 * usage(J,D) may not be accurate; e.g., a GPU job may take
   more or less CPU than the scheduler thinks it will.
   Eventually we may switch to a scheme where the client
   dynamically determines the CPU usage.
   For now, though, we'll just use the scheduler's estimate.

The granted credit for a job J is proportional to PFC(J),
but is normalized in the following ways:

== Cross-version normalization ==

If a given application has multiple versions (e.g., CPU and GPU versions)
the granted credit per job is adjusted
so that the average is the same for each version.
The adjustment is always downwards:
we maintain the average PFC^mean^(V) of PFC() for each app version V,
and find the minimum X of these averages.
An app version V's jobs are then scaled by the factor

 S(V) = (X/PFC^mean^(V))


The result for a given job J
is called "Version-Normalized Peak FLOP Count", or VNPFC(J):

 VNPFC(J) = PFC(J) * (X/PFC^mean^(V))
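
A minimal sketch of this scaling, assuming PFC^mean^(V) is already
maintained for each version (illustrative names, not BOINC's actual code):
{{{
#include <vector>
#include <algorithm>

// Illustrative sketch; assumes at least one app version.
// pfc_means[i] is PFC^mean^(V) for version i; the scale factor is
// S(V) = X / PFC^mean^(V), where X is the smallest of the means.
std::vector<double> version_scale_factors(const std::vector<double>& pfc_means) {
    double x = *std::min_element(pfc_means.begin(), pfc_means.end());
    std::vector<double> s;
    for (double m : pfc_means) {
        s.push_back(x / m);   // always <= 1: the adjustment is downwards
    }
    return s;
}

// VNPFC(J) = PFC(J) * S(V), where V is the version that processed J
double vnpfc(double pfc_j, double s_v) {
    return pfc_j * s_v;
}
}}}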

Notes:
 * This addresses the common situation
   where an app's GPU version is much less efficient than the CPU version
   (i.e. the ratio of actual FLOPs to peak FLOPs is much less).
   To a certain extent, this mechanism shifts the system
   towards the "Actual FLOPs" philosophy,
   since credit is granted based on the most efficient app version.
   It's not exactly "Actual FLOPs", since the most efficient
   version may not be 100% efficient.
 * There are two sources of variance in the PFC() samples for a given version V:
   the variation in host efficiency,
   and possibly the variation in job size.
   If we have an ''a priori'' estimate of job size
   (e.g., workunit.rsc_fpops_est)
   we can normalize by this to reduce the variance,
   and make PFC^mean^(V) converge more quickly.
 * ''a posteriori'' estimates of job size may exist also
   (e.g., an iteration count reported by the app)
   but using this for anything introduces a new cheating risk,
   so it's probably better not to.


== Cross-project normalization ==

If an application has both CPU and GPU versions,
then the version normalization mechanism uses the CPU
version as a "sanity check" to limit the credit granted to GPU jobs.

Suppose a project has an app with only a GPU version,
so there's no CPU version to act as a sanity check.
If we grant credit based only on GPU peak speed,
the project will grant much more credit per GPU hour than other projects,
violating limited project neutrality.

A solution to this: if an app has only GPU versions,
then for each version V we let
S(V) be the average scaling factor
for that GPU type among projects that do have both CPU and GPU versions.
This factor is obtained from a central BOINC server.
V's jobs are then scaled by S(V) as above.
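
A sketch of how an imported factor might be applied (illustrative only;
keying by plan class anticipates the notes below, and the fallback of 1.0
when no exported data exists is an assumption):
{{{
#include <map>
#include <string>

// Illustrative sketch: for an app with only GPU versions, look up the
// exported scale factor for the version's plan class (e.g. "cuda23"),
// as published by projects that have both CPU and GPU versions.
double gpu_only_scale_factor(
    const std::map<std::string, double>& exported_scale_factors,
    const std::string& plan_class
) {
    auto it = exported_scale_factors.find(plan_class);
    if (it == exported_scale_factors.end()) {
        return 1.0;        // no data yet: leave PFC unscaled (assumption)
    }
    return it->second;     // S(V): jobs of this version are scaled by this
}
}}}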

Notes:

 * Projects will run a periodic script to update the scaling factors.
 * Rather than GPU type, we'll probably use plan class,
   since e.g. the average efficiency of CUDA 2.3 apps may be different
   than that of CUDA 2.1 apps.
 * Initially we'll obtain scaling factors from large projects
   that have both GPU and CPU apps (e.g., SETI@home).
   Eventually we'll use an average (weighted by work done) over multiple projects
   (see below).

== Host normalization ==

Assuming that hosts are sent jobs for a given app uniformly,
then, for that app,
hosts should get the same average granted credit per job.
To ensure this, for each application A we maintain the average VNPFC^mean^(A),
and for each host H we maintain VNPFC^mean^(H, A).
The '''claimed FLOPS''' for a given job J is then

 F = VNPFC(J) * (VNPFC^mean^(A)/VNPFC^mean^(H, A))

and the claimed credit (in Cobblestones) is

 C = F*100/86400e9
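
As a sketch (illustrative names; the two means are assumed to be
maintained as described under "Computing averages" below):
{{{
// Illustrative sketch of host normalization and the Cobblestone conversion.
// vnpfc_mean_app  = VNPFC^mean^(A)    (average over all hosts' jobs for app A)
// vnpfc_mean_host = VNPFC^mean^(H, A) (average over this host's jobs for A)
double claimed_credit(double vnpfc_j, double vnpfc_mean_app, double vnpfc_mean_host) {
    // claimed FLOPS: scale the job's VNPFC by how the app-wide average
    // compares to this host's average
    double f = vnpfc_j * (vnpfc_mean_app / vnpfc_mean_host);

    // Cobblestones: 100 units of credit per day (86400 s) of computing
    // at 1 GFLOPS, i.e. per 86400e9 FLOPs
    return f * 100 / 86400e9;
}
}}}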

There are some cases where hosts are not sent jobs uniformly:
 * job-size matching (smaller jobs sent to slower hosts)
 * GPUGrid.net's scheme for sending some (presumably larger)
   jobs to GPUs with more processors.
In these cases average credit per job must differ between hosts,
according to the types of jobs that are sent to them.

This can be done by dividing
each sample in the computation of VNPFC^mean^ by WU.rsc_fpops_est
(in fact, there's no reason not to always do this).

Notes:
 * The host normalization mechanism reduces the claimed credit of hosts
   that are less efficient than average,
   and increases the claimed credit of hosts that are more efficient
   than average.
 * VNPFC^mean^ is averaged over jobs, not hosts.

== Computing averages ==

We need to compute averages carefully because

 * The quantities being averaged may gradually change over time
   (e.g. average job size may change,
   app version efficiency may change as new versions are deployed)
   and we need to track this.
 * A given sample may be wildly off,
   and we can't let this mess up the average.
 * Averages should be weighted by job size.

In addition, we may as well maintain the variance of the quantities,
although the current system doesn't use it.

The code that does all this is
[http://boinc.berkeley.edu/trac/browser/trunk/boinc/lib/average.h here].
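
For orientation, a simplified sketch of such an average (this is not the
code linked above; the smoothing constant and the outlier cap are
assumptions, and weighting by job size is left to the caller):
{{{
// Simplified sketch only; the real code is in lib/average.h and differs in detail.
struct RunningAverage {
    int    n = 0;        // number of samples
    double sum = 0;      // plain sum of samples
    double exp_avg = 0;  // exponentially-weighted average

    void update(double sample, double alpha = 0.01) {
        // guard against wildly-off samples once we have an estimate
        if (n > 0 && sample > 10 * exp_avg) {
            sample = 10 * exp_avg;
        }
        n++;
        sum += sample;
        if (n == 1) {
            exp_avg = sample;
        } else {
            // recent samples count more, so the average tracks gradual
            // changes in job size and app version efficiency
            exp_avg = alpha * sample + (1 - alpha) * exp_avg;
        }
    }
};
}}}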

== Cross-project scaling factors ==

We'll have a script that publishes a project's
accounting data (see Implementation).
The BOINC web site will collect these from a few big projects
and publish the averages.

== Replication and cheating ==

Host normalization mostly eliminates the incentive to cheat
by claiming excessive credit
(i.e., by falsifying benchmark scores or elapsed time).
An exaggerated claim will increase VNPFC^mean^(H, A),
causing subsequent claimed credit to be scaled down proportionately.
This means that no special cheat-prevention scheme
is needed for single replications;
granted credit = claimed credit.

For jobs that are replicated, granted credit should be
set to the min of the valid results
(min is used instead of average to remove the incentive
for cherry-picking, see below).

However, there are still some possible forms of cheating.

 * One-time cheats (like claiming 1e304) can be prevented by
   capping VNPFC(J) at some multiple (say, 10) of VNPFC^mean^(A);
   a sketch of this cap appears after this list.
 * Cherry-picking: suppose an application has two types of jobs,
  which run for 1 second and 1 hour respectively.
  Clients can figure out which is which, e.g. by running a job for 2 seconds
  and seeing if it's exited.
  Suppose a client systematically refuses the 1 hour jobs
  (e.g., by reporting a crash or never reporting them).
  Its VNPFC^mean^(H, A) will quickly decrease,
  and soon it will be getting several thousand times more credit
  per actual work than other hosts!
  Countermeasure:
  whenever a job errors out, times out, or fails to validate,
  set the host's error rate back to the initial default,
  and set its VNPFC^mean^(H, A) to VNPFC^mean^(A) for all apps A.
  This puts the host into a state where several dozen of its
  subsequent jobs will be replicated.
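
A sketch of the one-time-cheat cap mentioned in the list above (the
function name is illustrative; the factor 10 is the example from the text):
{{{
// Cap a job's VNPFC at 10 times the app-wide mean before using it.
double capped_vnpfc(double vnpfc_j, double vnpfc_mean_app) {
    double cap = 10 * vnpfc_mean_app;
    return (vnpfc_j > cap) ? cap : vnpfc_j;
}
}}}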

== Trickle credit ==


== Job runtime estimates ==

Unrelated to the credit proposal, but in a similar spirit.
The server will maintain ET^mean^(H, V), the statistics of
job runtimes (normalized by wu.rsc_fpops_est) per
host and application version.

The server's estimate of a job's runtime is then

 R(J, H) = wu.rsc_fpops_est * ET^mean^(H, V)
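
A sketch of maintaining and using ET^mean^(H, V) (illustrative names; a
simple running mean stands in here for the averaging described earlier):
{{{
// ET^mean^(H, V) is the average of elapsed_time / wu.rsc_fpops_est over
// this host's completed jobs for app version V.
struct HostVersionRuntime {
    int    n = 0;
    double et_mean = 0;   // ET^mean^(H, V), in seconds per estimated FLOP

    // called when this host finishes a job for this app version
    void on_job_completed(double elapsed_time, double rsc_fpops_est) {
        double sample = elapsed_time / rsc_fpops_est;
        n++;
        et_mean += (sample - et_mean) / n;   // incremental running mean
    }

    // R(J, H) = wu.rsc_fpops_est * ET^mean^(H, V)
    double estimated_runtime(double rsc_fpops_est) const {
        return rsc_fpops_est * et_mean;
    }
};
}}}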


== Error rate, host punishment, and turnaround time estimation ==

Unrelated to the credit proposal, but in a similar spirit.

Due to hardware problems (e.g. a malfunctioning GPU)
a host may have a 100% error rate for one app version
and a 0% error rate for another.
The same is true of turnaround time.

The host punishment mechanism is designed to
deal with malfunctioning hosts.
For each host the server maintains '''max_results_day'''.
This is initialized to a project-specified value (e.g. 200)
and scaled by the number of CPUs and/or GPUs.
It's decremented if the client reports a crash
(but not if the job was aborted).
It's doubled when a successful (but not necessarily valid)
result is received.

So we'll move the "error_rate", "turnaround_time"
and "max_results_day" fields from the host table to host_app_version.

== Cherry picking ==

Suppose an application has a mix of long and short jobs.
If a client intentionally discards
(or aborts, or reports errors from) the long jobs,
but completes the short jobs,
its host scaling factor will become large,
and it will get excessive credit for the short jobs.
This is called "cherry picking".

Note: the host punishment mechanism
doesn't deal effectively with cherry picking.
We propose the following mechanism to deal with cherry picking:

 * For each (host, app version), maintain "host_scale_time".
   This is the earliest time at which host scaling will be applied.
   This is set initially, and each time a job times out
   or errors out, to now+X, where X is the app's delay bound.

The idea is to apply the host scaling factor
only if there's solid evidence that the host is NOT cherry picking.
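
A sketch of the proposal (illustrative names; returning a scale of 1.0
while scaling is suspended is an assumption):
{{{
#include <ctime>

struct HostScaling {
    time_t host_scale_time = 0;   // earliest time at which host scaling applies

    // called initially, and whenever a job for this (host, app version)
    // times out or errors out; delay_bound is the app's delay bound in seconds
    void suspend_scaling(time_t now, double delay_bound) {
        host_scale_time = now + (time_t)delay_bound;
    }

    // apply the host scaling factor only if there has been no recent
    // timeout or error, i.e. no evidence of cherry picking
    double effective_scale(time_t now, double host_scale_factor) const {
        return (now >= host_scale_time) ? host_scale_factor : 1.0;
    }
};
}}}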

== Implementation ==

=== Database changes ===

New table '''host_app''':
{{{
int    host_id;
int    app_id;
int    vnpfc_n;        // number of VNPFC samples for this (host, app)
double vnpfc_sum;      // sum of those samples
double vnpfc_exp_avg;  // exponentially-weighted average, i.e. VNPFCmean(H, A)
}}}

New table '''host_app_version''':
{{{
int    host_id;
int    app_version_id;
int    et_n;           // number of elapsed-time samples for this (host, app version)
double et_sum;         // sum of those samples (normalized by wu.rsc_fpops_est)
double et_exp_avg;     // exponentially-weighted average, i.e. ETmean(H, V)
// some variable for recent error rate,
// replacing host.error_rate and host.max_results_day
// make sure that a host w/ 1 good and 1 bad GPU gets few GPU jobs
}}}

New fields in '''app_version''':
{{{
int    pfc_n;              // number of PFC samples for this app version
double pfc_sum;            // sum of those samples
double pfc_exp_avg;        // exponentially-weighted average, i.e. PFCmean(V)
double pfc_scaling_factor; // S(V), the cross-version scaling factor
}}}

New fields in '''app''':
{{{
int    vnpfc_n;        // number of VNPFC samples for this app
double vnpfc_sum;      // sum of those samples
double vnpfc_exp_avg;  // exponentially-weighted average, i.e. VNPFCmean(A)
}}}

=== New request message fields ===

=== New reply message fields ===

=== Scheduler changes ===

=== Client changes ===

=== Validator changes ===

=== Server APIs for computing and granting credit ===

== Compatibility ==