The problem with this system is that,
for a given app version, efficiency can vary widely between hosts.
In the above example, the 10 GFLOPS host would claim 10X as much credit,
== ''A priori'' job size estimates and bounds ==

Projects supply estimates of the FLOPs used by a job
(wu.rsc_fpops_est)
and a limit on FLOPs, after which the job will be aborted
(wu.rsc_fpops_bound).

Previously, inaccuracy of rsc_fpops_est caused problems.
The new system still uses rsc_fpops_est,
but its primary purpose is now to indicate the relative size of jobs.
Averages of job sizes are normalized by rsc_fpops_est,
and if rsc_fpops_est is correlated with actual size,
these averages will converge more quickly.

We'll denote workunit.rsc_fpops_est as E(J).

Notes:

 * ''A posteriori'' estimates of job size may exist also,
   e.g., an iteration count reported by the app.
   They aren't cheat-proof, and we don't use them.
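To illustrate why normalizing by E(J) helps, here is a small sketch with hypothetical numbers: jobs on one host vary 10x in size, so raw elapsed times vary 10x, but elapsed time divided by E(J) is nearly constant, and its running average converges in far fewer samples.

```python
# Hypothetical jobs on a single host: (rsc_fpops_est, elapsed seconds).
# The values are illustrative, not from a real BOINC project.
jobs = [
    (1e12, 101.0),
    (5e12, 498.0),
    (10e12, 1003.0),
]

raw = [t for _, t in jobs]                    # raw elapsed times
normalized = [t / est for est, t in jobs]     # times normalized by E(J)

def spread(xs):
    """Relative spread of a sample: (max - min) / mean."""
    m = sum(xs) / len(xs)
    return (max(xs) - min(xs)) / m

# The normalized statistic has a much smaller relative spread,
# so averages of it converge more quickly.
print(spread(raw) > 10 * spread(normalized))  # True
```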
For jobs done by anonymous platform apps,
the server knows the devices involved and can estimate PFC.
It maintains host_app_version records for anonymous platform,
and it keeps track of PFC and elapsed time statistics there.
There are separate records per resource type.
The app_version_id encodes the app ID and the resource type
(-2 for CPU, -3 for NVIDIA GPU, -4 for ATI).

If min_avg_pfc(A) is defined and
PFC^mean^(H, V) is above a sample threshold,
we normalize PFC by the factor

 min_avg_pfc(A)/PFC^mean^(H, V)

Otherwise the claimed PFC is

 min_avg_pfc(A)*E(J)

If min_avg_pfc(A) is not defined, the claimed PFC is

 wu.rsc_fpops_est
== Summary ==

Given a validated job J, we compute

 * the "claimed PFC" F
 * a flag "approx" that is true if F
   is an approximation and may not be comparable
   with other instances of the job

The algorithm:

 pfc = peak FLOP count(J)
 approx = true
 if pfc > wu.rsc_fpops_bound
     if min_avg_pfc(A) is defined
         F = min_avg_pfc(A) * E(J)
     else
         F = wu.rsc_fpops_est
 else
     if job is anonymous platform
         hav = host_app_version record
         if min_avg_pfc(A) is defined
             if hav.pfc.n > threshold
                 approx = false
                 F = pfc * (min_avg_pfc(A) / hav.pfc.avg)
             else
                 F = min_avg_pfc(A) * E(J)
         else
             F = wu.rsc_fpops_est
     else
         F = pfc
         if Scale(V) is defined
             F *= Scale(V)
         if Scale(H, V) is defined and (H, V) is not on scale probation
             F *= Scale(H, V)
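The algorithm above can be sketched as a runnable function. The function and parameter names, and the sample threshold value, are illustrative placeholders, not BOINC's actual server identifiers.

```python
SAMPLE_THRESHOLD = 10  # assumed value of the sample-count threshold

def claimed_pfc(pfc, rsc_fpops_bound, rsc_fpops_est,
                min_avg_pfc=None, anonymous=False,
                hav_pfc_n=0, hav_pfc_avg=None,
                scale_v=None, scale_hv=None, on_probation=False):
    """Return (F, approx) for a validated job J.

    pfc:           peak FLOP count of the job
    rsc_fpops_est: the a priori size estimate E(J)
    """
    approx = True
    if pfc > rsc_fpops_bound:
        # Job exceeded its FLOPs bound; fall back to a priori estimates.
        if min_avg_pfc is not None:
            F = min_avg_pfc * rsc_fpops_est
        else:
            F = rsc_fpops_est
    elif anonymous:
        if min_avg_pfc is not None:
            if hav_pfc_n > SAMPLE_THRESHOLD:
                # Enough samples: normalize by min_avg_pfc(A)/PFC^mean(H,V).
                approx = False
                F = pfc * (min_avg_pfc / hav_pfc_avg)
            else:
                F = min_avg_pfc * rsc_fpops_est
        else:
            F = rsc_fpops_est
    else:
        # Normal case: apply version and host scales where defined.
        F = pfc
        if scale_v is not None:
            F *= scale_v
        if scale_hv is not None and not on_probation:
            F *= scale_hv
    return F, approx

# Normal-case job: both scales defined, host not on scale probation.
F, approx = claimed_pfc(1e12, 1e13, 1e12, scale_v=0.5, scale_hv=0.8)
```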
The claimed credit of a job (in Cobblestones) is

 C = F*200/86400e9

If replication is not used, this is the granted credit.

If replication is used,
we take the set of instances for which approx is false.
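As a sanity check on the conversion factor, a 1 GFLOPS host running for one day (86400 seconds) performs 86400e9 FLOPs, and by the definition of the Cobblestone it should earn 200 credits:

```python
def claimed_credit(F):
    """Convert claimed FLOPs F to Cobblestones: C = F*200/86400e9."""
    return F * 200 / 86400e9

one_day_at_1_gflops = 1e9 * 86400  # FLOPs done in a day at 1 GFLOPS
print(claimed_credit(one_day_at_1_gflops))  # 200.0
```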
the version normalization mechanism figures out
which version is most efficient and uses that to reduce
the credit granted to less-efficient versions.

If a project has an app with only a GPU version,
there's no CPU version for comparison.