Changes between Version 21 and Version 22 of CreditNew

Timestamp: Nov 16, 2009, 4:43:53 PM (16 years ago)
Author: davea
Comment: --

CreditNew

-                      v21
+                      v22
 and sets their scaling factor based on the above.
 == Replication and cheating ==
+== Cheat prevention ==
 Host normalization mostly eliminates the incentive to cheat
 …
 An exaggerated claim will increase VNPFC*(H,A),
 causing subsequent claimed credit to be scaled down proportionately.
 This means that no special cheat-prevention scheme
 is needed for single replications;
+granted credit = claimed credit.
+For jobs that are replicated, granted credit should be
+set to the min of the valid results
+(min is used instead of average to remove the incentive
+for cherry-picking, see below).
+However, there are still some possible forms of cheating.
+ * One-time cheats (like claiming 1e304) can be prevented by
+   capping VNPFC(J) at some multiple (say, 10) of VNPFC^mean^(A).
+ * Cherry-picking: suppose an application has two types of jobs,
+  which run for 1 second and 1 hour respectively.
+  Clients can figure out which is which, e.g. by running a job for 2 seconds
+  and seeing if it's exited.
+  Suppose a client systematically refuses the 1 hour jobs
+  (e.g., by reporting a crash or never reporting them).
+  Its VNPFC^mean^(H, A) will quickly decrease,
+  and soon it will be getting several thousand times more credit
+  per actual work than other hosts!
+  Countermeasure:
+  whenever a job errors out, times out, or fails to validate,
+  set the host's error rate back to the initial default,
+  and set its VNPFC^mean^(H, A) to VNPFC^mean^(A) for all apps A.
+  This puts the host into a state where its next several dozen
+  jobs will be replicated.
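The cherry-picking countermeasure above can be sketched as follows. This is a minimal illustration, not the BOINC server code: `HostState`, `reset_after_failure`, and the value of `DEFAULT_ERROR_RATE` are hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass, field

# Assumed initial default; the real value is defined by the server, not by this text.
DEFAULT_ERROR_RATE = 0.1


@dataclass
class HostState:
    """Hypothetical per-host record holding the fields the countermeasure touches."""
    error_rate: float = DEFAULT_ERROR_RATE
    # Running mean of the host's VNPFC per app, i.e. VNPFC^mean^(H, A), keyed by app name.
    vnpfc_mean: dict = field(default_factory=dict)


def reset_after_failure(host, app_vnpfc_mean):
    """Apply the reset when a job errors out, times out, or fails to validate.

    app_vnpfc_mean maps each app name to its project-wide VNPFC^mean^(A).
    """
    # Back to the initial error rate, so the host's next jobs are replicated again.
    host.error_rate = DEFAULT_ERROR_RATE
    # Discard the host's own averages for all apps in favour of the app-wide means.
    for app, mean in app_vnpfc_mean.items():
        host.vnpfc_mean[app] = mean
```

Because the reset covers every app, a host cannot keep an artificially low average in one app by failing jobs only in another.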
+== Error rate, host punishment, and turnaround time estimation ==
+Unrelated to the credit proposal, but in a similar spirit.
+Due to hardware problems (e.g. a malfunctioning GPU)
+a host may have a 100% error rate for one app version
+and a 0% error rate for another.
+The same applies to turnaround time.
+So we'll move the "error_rate" and "turnaround_time"
+fields from the host table to host_app_version.
+The host punishment mechanism is designed to deal with malfunctioning hosts.
+For each host the server maintains '''max_results_day'''.
+This is initialized to a project-specified value (e.g. 200)
+and scaled by the number of CPUs and/or GPUs.
+It's decremented if the client reports a crash
+(but not if the job was aborted).
+It's doubled when a successful (but not necessarily valid)
+result is received.
+This should also be per-app-version,
+so we'll move "max_results_day" from the host table to host_app_version.
+in this case, granted credit = claimed credit.
+For jobs that are replicated,
+granted credit is set to:
+ * if the host with the larger claim is on scale probation, granted = smaller
+ * if larger > 2*smaller, granted = 1.5*smaller
+ * else granted = (larger+smaller)/2
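The three rules above can be written as a small function. This is a sketch that assumes exactly two valid results per replicated job; the function name and the probation flags are hypothetical and only serve the illustration.

```python
def granted_credit(claim_a, claim_b, a_on_probation=False, b_on_probation=False):
    """Return the granted credit for a job with two valid results."""
    # Work out which claim is larger and whether that host is on scale probation.
    if claim_a >= claim_b:
        larger, smaller, larger_on_probation = claim_a, claim_b, a_on_probation
    else:
        larger, smaller, larger_on_probation = claim_b, claim_a, b_on_probation

    if larger_on_probation:
        # The larger claim is not trusted: grant the smaller one.
        return smaller
    if larger > 2 * smaller:
        # Claims disagree by more than a factor of two: limit the grant.
        return 1.5 * smaller
    # Otherwise grant the average of the two claims.
    return (larger + smaller) / 2
```

For example, claims of 10 and 12 give 11, while claims of 10 and 30 give 15 rather than the average of 20.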
+However, two kinds of cheating still have to be dealt with:
+=== One-time cheats ===
+For example, claiming a PFC of 1e304.
+This can be minimized by
+capping VNPFC(J) at some multiple (say, 20) of VNPFC^mean^(A).
+If the cap is applied, the host's error rate is reset to its initial value,
+so it won't do single replication for a while,
+and scale_probation (see below) is set to true.
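A minimal sketch of the cap, under stated assumptions: `Host`, `cap_vnpfc`, and the value of `INITIAL_ERROR_RATE` are hypothetical, and the multiple of 20 is the example figure from the text rather than a fixed constant.

```python
from dataclasses import dataclass

CAP_MULTIPLE = 20          # example multiple taken from the text
INITIAL_ERROR_RATE = 0.1   # assumed initial value; the real one is set by the server


@dataclass
class Host:
    """Hypothetical host record with the two fields the cap affects."""
    error_rate: float = INITIAL_ERROR_RATE
    scale_probation: bool = False


def cap_vnpfc(vnpfc_job, vnpfc_mean_app, host):
    """Cap VNPFC(J) at CAP_MULTIPLE * VNPFC^mean^(A) and flag the host if the cap applies."""
    cap = CAP_MULTIPLE * vnpfc_mean_app
    if vnpfc_job > cap:
        # Reset the error rate so the host loses single replication for a while,
        # and put it on scale probation.
        host.error_rate = INITIAL_ERROR_RATE
        host.scale_probation = True
        return cap
    return vnpfc_job
```

With an app mean of 2.0, a claim of 1e304 is cut to 40.0 and the host is flagged, while a claim of 3.0 passes through unchanged.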
 == Cherry picking ==
 …
 In this case segments play the role of jobs in the credit-related DB fields.
+== Error rate, host punishment, and turnaround time estimation ==
+Unrelated to the credit proposal, but in a similar spirit.
+Due to hardware problems (e.g. a malfunctioning GPU)
+a host may have a 100% error rate for one app version
+and a 0% error rate for another.
+The same applies to turnaround time.
+So we'll move the "error_rate" and "turnaround_time"
+fields from the host table to host_app_version.
+The host punishment mechanism is designed to deal with malfunctioning hosts.
+For each host the server maintains '''max_results_day'''.
+This is initialized to a project-specified value (e.g. 200)
+and scaled by the number of CPUs and/or GPUs.
+It's decremented if the client reports a crash
+(but not if the job was aborted).
+It's doubled when a successful (but not necessarily valid)
+result is received.
+This should also be per-app-version,
+so we'll move "max_results_day" from the host table to host_app_version.
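The max_results_day bookkeeping described above might look like the sketch below. The class and method names are hypothetical. Two details are assumptions not stated in the text: the quota never drops below 1, and doubling never raises it above its initial scaled value. One such object would be kept per (host, app version) pair once the field moves to host_app_version.

```python
class ResultQuota:
    """Hypothetical daily result quota for one (host, app version) pair."""

    def __init__(self, project_max=200, n_processors=1):
        # Initial value: the project-specified limit scaled by the number of CPUs/GPUs.
        self.ceiling = project_max * n_processors
        self.max_results_day = self.ceiling

    def on_crash(self):
        # The client reported a crash: decrement, with an assumed floor of 1.
        self.max_results_day = max(1, self.max_results_day - 1)

    def on_abort(self):
        # Aborted jobs do not count against the host.
        pass

    def on_success(self):
        # A successful (not necessarily valid) result: double,
        # up to the assumed ceiling of the initial value.
        self.max_results_day = min(self.ceiling, self.max_results_day * 2)
```

A malfunctioning host loses quota one job at a time, but recovers quickly once it starts returning successful results.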
 == Job runtime estimates ==