Changes between Version 19 and Version 20 of CreditNew

Timestamp:: Nov 16, 2009, 1:03:49 PM (16 years ago)
Author:: davea
Comment:: --

CreditNew

-                      v19
+                      v20
   subsequent jobs will be replicated.
+== Error rate, host punishment, and turnaround time estimation ==
+Unrelated to the credit proposal, but in a similar spirit.
+Due to hardware problems (e.g. a malfunctioning GPU)
+a host may have a 100% error rate for one app version
+and a 0% error rate for another.
+The same applies to turnaround time.
+So we'll move the "error_rate" and "turnaround_time"
+fields from the host table to host_app_version.
+The host punishment mechanism is designed to deal with malfunctioning hosts.
+For each host the server maintains '''max_results_day'''.
+This is initialized to a project-specified value (e.g. 200)
+and scaled by the number of CPUs and/or GPUs.
+It's decremented if the client reports a crash
+(but not if the job was aborted).
+It's doubled when a successful (but not necessarily valid)
+result is received.
+This should also be per-app-version,
+so we'll move "max_results_day" from the host table to host_app_version.
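The quota rules above can be sketched as follows. This is a minimal illustration, not the server's actual code: the `HostAppVersion` record, the function names, the floor of 1, and the cap at the project-specified value are all assumptions. The text only says the value is decremented on a crash, doubled on success, and scaled by the number of CPUs and/or GPUs.

```python
from dataclasses import dataclass

# Project-specified starting value; 200 is the example given in the text.
PROJECT_MAX_RESULTS_DAY = 200


@dataclass
class HostAppVersion:
    """Hypothetical per-(host, app version) record."""
    max_results_day: int = PROJECT_MAX_RESULTS_DAY


def on_result_reported(hav, outcome):
    """Update the quota; outcome is 'success', 'crash', or 'aborted'."""
    if outcome == "crash":
        # Assumption: never drop below 1, so the host can still recover.
        hav.max_results_day = max(1, hav.max_results_day - 1)
    elif outcome == "success":
        # Assumption: doubling is capped at the project-specified value.
        hav.max_results_day = min(PROJECT_MAX_RESULTS_DAY,
                                  hav.max_results_day * 2)
    # An aborted job leaves the quota unchanged.


def daily_limit(hav, n_processors):
    """Effective daily limit, scaled by the number of CPUs and/or GPUs."""
    return hav.max_results_day * n_processors
```

Because the record is keyed by (host, app version), a broken GPU version can be throttled to one job per day while the CPU version on the same host keeps its full quota.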
+== Cherry picking ==
+Suppose an application has a mix of long and short jobs.
+If a client intentionally discards
+(or aborts, or reports errors from) the long jobs,
+but completes the short jobs,
+its host scaling factor will become large,
+and it will get excessive credit for the short jobs.
+This is called "cherry picking".
+The host punishment mechanism
+doesn't deal effectively with cherry picking.
+We propose the following mechanism to deal with cherry picking:
+ * For each (host, app version) maintain "host_scale_time".
+   This is the earliest time at which host scaling will be applied.
+ * For each (host, app version) maintain "scale_probation"
+   (initially true).
+ * When sending a job to a host,
+   if scale_probation is true,
+   set host_scale_time to now+X, where X is the app's delay bound.
+ * When a job is successfully validated,
+   and now > host_scale_time,
+   set scale_probation to false.
+ * If a job times out or errors out,
+   set scale_probation to true,
+   max the scale factor with 1,
+   and set host_scale_time to now+X.
+ * When computing claimed credit for a job,
+   if now < host_scale_time, don't use the host scale factor.
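The rules in the list above amount to a small state machine per (host, app version). The sketch below is one possible reading, not the actual implementation: the `HostAppVersion` record and the function names are hypothetical, times are plain numbers in seconds, and applying the scale factor by multiplication is an assumption. The rule "max the scale factor with 1" is implemented literally as `max(scale_factor, 1)`.

```python
from dataclasses import dataclass


@dataclass
class HostAppVersion:
    """Hypothetical per-(host, app version) record."""
    scale_factor: float = 1.0
    scale_probation: bool = True   # initially true, as in the text
    host_scale_time: float = 0.0   # earliest time at which scaling applies


def on_job_sent(hav, now, delay_bound):
    # While on probation, every dispatch pushes the scaling start time out
    # by the app's delay bound (X in the text).
    if hav.scale_probation:
        hav.host_scale_time = now + delay_bound


def on_job_validated(hav, now):
    # A validated job after host_scale_time ends probation.
    if now > hav.host_scale_time:
        hav.scale_probation = False


def on_job_failed(hav, now, delay_bound):
    # Timeout or error: back on probation, scaling deferred again.
    hav.scale_probation = True
    hav.scale_factor = max(hav.scale_factor, 1.0)  # literal reading of the text
    hav.host_scale_time = now + delay_bound


def claimed_credit(hav, now, raw_credit):
    # Before host_scale_time the host scale factor is not used.
    if now < hav.host_scale_time:
        return raw_credit
    return raw_credit * hav.scale_factor  # assumption: scaling is a multiply
```

In this reading, a host that drops long jobs keeps re-entering probation, so its inflated scale factor is never applied to the short jobs it does complete.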
+The idea is to apply the host scaling factor
+only if there's solid evidence that the host is NOT cherry picking.
+Because this mechanism is punitive to hosts
+that experience actual failures,
+we'll make it selectable on a per-application basis (default off).
+In addition, to limit the extent of cheating
+(in case the above mechanism is defeated somehow)
+the host scaling factor will be min'd with a
+project-wide config parameter (default, say, 3).
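As a small illustration of the cap described above (the names `MAX_HOST_SCALE` and `capped_scale` are hypothetical, and 3 is only the tentative default mentioned in the text):

```python
# Hypothetical project-wide config parameter; the text suggests a default of about 3.
MAX_HOST_SCALE = 3.0


def capped_scale(host_scale, max_scale=MAX_HOST_SCALE):
    """Return the host scaling factor, limited to the project-wide maximum."""
    return min(host_scale, max_scale)
```

Even if a host manages to inflate its scaling factor to, say, 12, its claimed credit would be multiplied by at most the configured maximum.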
 == Trickle credit ==