== Error rate, host punishment, and turnaround time estimation ==

This is unrelated to the credit proposal, but in a similar spirit.

Due to hardware problems (e.g. a malfunctioning GPU)
a host may have a 100% error rate for one app version
and a 0% error rate for another.
The same holds for turnaround time.

So we'll move the "error_rate" and "turnaround_time"
fields from the host table to host_app_version.

The host punishment mechanism is designed to deal with malfunctioning hosts.
For each host the server maintains '''max_results_day'''.
This is initialized to a project-specified value (e.g. 200)
and scaled by the number of CPUs and/or GPUs.
It's decremented if the client reports a crash
(but not if the job was aborted).
It's doubled when a successful (but not necessarily valid)
result is received.

This should also be per-app-version,
so we'll move "max_results_day" from the host table to host_app_version.

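As a rough illustration of the per-app-version quota described above, here is a minimal Python sketch. It is not the scheduler's actual code. The class name `HostAppVersion`, the functions `initial_quota` and `on_result`, and the outcome strings are hypothetical. Three behaviors are assumptions the text does not state: scaling is treated as multiplication by the instance count, the quota never drops below 1, and doubling is capped at a caller-supplied limit.

```python
from dataclasses import dataclass


@dataclass
class HostAppVersion:
    """Hypothetical per-(host, app version) record; field names follow the text."""
    host_id: int
    app_version_id: int
    max_results_day: int
    error_rate: float = 0.0       # moved here from the host table
    turnaround_time: float = 0.0  # moved here from the host table


def initial_quota(project_default: int, ninstances: int) -> int:
    # Assumption: "scaled by the number of CPUs and/or GPUs" means multiplied.
    return project_default * max(1, ninstances)


def on_result(hav: HostAppVersion, outcome: str, quota_cap: int) -> None:
    """Update the quota for one reported result."""
    if outcome == "crash":
        # Decrement on a reported crash; the floor of 1 is an assumption.
        hav.max_results_day = max(1, hav.max_results_day - 1)
    elif outcome == "success":
        # Double on a successful (not necessarily valid) result;
        # the cap is an assumption, not stated in the text.
        hav.max_results_day = min(quota_cap, hav.max_results_day * 2)
    # outcome == "aborted": the quota is left unchanged.


# A broken GPU version is throttled while the CPU version is unaffected.
gpu = HostAppVersion(host_id=1, app_version_id=10,
                     max_results_day=initial_quota(200, 1))
cpu = HostAppVersion(host_id=1, app_version_id=20,
                     max_results_day=initial_quota(200, 4))
for _ in range(5):
    on_result(gpu, "crash", quota_cap=200)
```

Keeping the counter on the (host, app version) record means that crashes from one app version reduce only that version's quota.
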
== Cherry picking ==

Suppose an application has a mix of long and short jobs.
If a client intentionally discards
(or aborts, or reports errors from) the long jobs,
but completes the short jobs,
its host scaling factor will become large,
and it will get excessive credit for the short jobs.
This is called "cherry picking".

The host punishment mechanism
doesn't deal effectively with cherry picking.

We propose the following mechanism to deal with cherry picking:

 * For each (host, app version) maintain "host_scale_time".
   This is the earliest time at which host scaling will be applied.
 * For each (host, app version) maintain "scale_probation"
   (initially true).
 * When sending a job to a host,
   if scale_probation is true,
   set host_scale_time to now+X, where X is the app's delay bound.
 * When a job is successfully validated,
   and now > host_scale_time,
   set scale_probation to false.
 * If a job times out or errors out,
   set scale_probation to true,
   max the scale factor with 1,
   and set host_scale_time to now+X.
 * When computing claimed credit for a job,
   if now < host_scale_time, don't use the host scale factor.

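The rules above can be sketched as a small state machine. This is a minimal Python illustration under stated assumptions, not the server's implementation. The class `ScaleState`, the function names, and the `raw_credit` parameter are hypothetical. Applying the scale factor by multiplication is an assumption. The failure step follows the text literally ("max the scale factor with 1").

```python
from dataclasses import dataclass


@dataclass
class ScaleState:
    """Hypothetical per-(host, app version) state; field names follow the text."""
    scale_factor: float = 1.0
    scale_probation: bool = True   # initially true
    host_scale_time: float = 0.0   # earliest time host scaling is applied


def on_send(s: ScaleState, now: float, delay_bound: float) -> None:
    # While on probation, each dispatched job pushes host_scale_time to now+X.
    if s.scale_probation:
        s.host_scale_time = now + delay_bound


def on_validated(s: ScaleState, now: float) -> None:
    # A validated job ends probation only after host_scale_time has passed.
    if now > s.host_scale_time:
        s.scale_probation = False


def on_failure(s: ScaleState, now: float, delay_bound: float) -> None:
    # Timeout or error: back on probation, and scaling is suspended again.
    s.scale_probation = True
    s.scale_factor = max(s.scale_factor, 1.0)  # literal reading of the text
    s.host_scale_time = now + delay_bound


def claimed_credit(s: ScaleState, raw_credit: float, now: float) -> float:
    # Assumption: host scaling multiplies the unscaled claim.
    if now < s.host_scale_time:
        return raw_credit
    return raw_credit * s.scale_factor


s = ScaleState(scale_factor=2.5)
on_send(s, now=0.0, delay_bound=100.0)
early = claimed_credit(s, 10.0, now=50.0)    # scaling not yet applied
on_validated(s, now=170.0)                   # probation ends
later = claimed_credit(s, 10.0, now=180.0)   # scaling applied
on_failure(s, now=200.0, delay_bound=100.0)  # scaling suspended until t=300
```

In this sketch a host on probation gets unscaled credit until it has gone a full delay bound without a new dispatch resetting the clock, and any timeout or error restarts that wait.
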
The idea is to apply the host scaling factor
only if there's solid evidence that the host is NOT cherry picking.

Because this mechanism is punitive to hosts
that experience actual failures,
we'll make it selectable on a per-application basis (default off).

In addition, to limit the extent of cheating
(in case the above mechanism is defeated somehow)
the host scaling factor will be capped at a
project-wide config parameter (default, say, 3).

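The cap amounts to taking a minimum. A minimal Python sketch follows; the function name `capped_scale_factor` and the parameter name `max_host_scale` are hypothetical, and the default of 3 is the tentative value suggested above.

```python
def capped_scale_factor(scale_factor: float, max_host_scale: float = 3.0) -> float:
    """Limit the host scaling factor to a project-wide maximum.

    max_host_scale is a hypothetical name for the project-wide config parameter.
    """
    return min(scale_factor, max_host_scale)


inflated = capped_scale_factor(7.5)  # a cherry-picked factor is cut to the cap
normal = capped_scale_factor(1.2)    # an ordinary factor passes through
```

With this cap, a host that defeats the probation mechanism still gains at most a factor of `max_host_scale` over unscaled credit.
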