| | 301 | == Error rate, host punishment, and turnaround time estimation == |
| | 302 | |
| | 303 | This is unrelated to the credit proposal, but in a similar spirit. |
| | 304 | |
| | 305 | Due to hardware problems (e.g. a malfunctioning GPU) |
| | 306 | a host may have a 100% error rate for one app version |
| | 307 | and a 0% error rate for another. |
| | 308 | The same applies to turnaround time. |
| | 309 | |
| | 310 | So we'll move the "error_rate" and "turnaround_time" |
| | 311 | fields from the host table to host_app_version. |
| | 312 | |
| | 313 | The host punishment mechanism is designed to deal with malfunctioning hosts. |
| | 314 | For each host the server maintains '''max_results_day'''. |
| | 315 | This is initialized to a project-specified value (e.g. 200) |
| | 316 | and scaled by the number of CPUs and/or GPUs. |
| | 317 | It's decremented if the client reports a crash |
| | 318 | (but not if the job was aborted). |
| | 319 | It's doubled when a successful (but not necessarily valid) |
| | 320 | result is received. |
| | 321 | |
| | 322 | This should also be per-app-version, |
| | 323 | so we'll move "max_results_day" from the host table to host_app_version. |
| | 324 | |
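The per-app-version quota rule described above can be sketched as follows. This is a minimal illustration, not the server's actual code: the class name `HostAppVersion`, the helpers `initial_quota` and `on_result`, and the outcome strings are hypothetical. The floor of 1 and the cap at the initial quota are assumptions, since the text does not state bounds, and the text also does not say how large the decrement is (1 is assumed here).

```python
from dataclasses import dataclass


@dataclass
class HostAppVersion:
    """One record per (host, app version); field names follow the text."""
    host_id: int
    app_version_id: int
    max_results_day: int
    # Placeholders: the text does not give initial values for these fields.
    error_rate: float = 0.0
    turnaround_time: float = 0.0


def initial_quota(project_quota: int, n_devices: int) -> int:
    """Project-specified value scaled by the number of CPUs and/or GPUs."""
    return project_quota * max(1, n_devices)


def on_result(rec: HostAppVersion, outcome: str, quota_cap: int) -> None:
    """Update the daily quota after a reported result.

    outcome is one of "crash", "aborted", "success" (hypothetical labels).
    """
    if outcome == "crash":
        # Assumption: decrement by 1 and never go below 1.
        rec.max_results_day = max(1, rec.max_results_day - 1)
    elif outcome == "success":
        # Assumption: doubling is capped at the initial quota.
        rec.max_results_day = min(quota_cap, rec.max_results_day * 2)
    # "aborted" leaves the quota unchanged, as the text specifies.
```

Because the record is keyed by app version, a host with a broken GPU loses quota only for its GPU app version, while its CPU app version keeps its full quota.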
| | 325 | == Cherry picking == |
| | 326 | |
| | 327 | Suppose an application has a mix of long and short jobs. |
| | 328 | If a client intentionally discards |
| | 329 | (or aborts, or reports errors from) the long jobs, |
| | 330 | but completes the short jobs, |
| | 331 | its host scaling factor will become large, |
| | 332 | and it will get excessive credit for the short jobs. |
| | 333 | This is called "cherry picking". |
| | 334 | |
| | 335 | The host punishment mechanism |
| | 336 | doesn't deal effectively with cherry picking. |
| | 337 | |
| | 338 | We propose the following mechanism to deal with cherry picking: |
| | 339 | |
| | 340 | * For each (host, app version) maintain "host_scale_time". |
| | 341 | This is the earliest time at which host scaling will be applied. |
| | 342 | * For each (host, app version) maintain "scale_probation" |
| | 343 | (initially true). |
| | 344 | * When sending a job to a host, |
| | 345 | if scale_probation is true, |
| | 346 | set host_scale_time to now+X, where X is the app's delay bound. |
| | 347 | * When a job is successfully validated, |
| | 348 | if now > host_scale_time, |
| | 349 | set scale_probation to false. |
| | 350 | * If a job times out or errors out, |
| | 351 | set scale_probation to true, |
| | 352 | max the scale factor with 1, |
| | 353 | and set host_scale_time to now+X. |
| | 354 | * When computing claimed credit for a job, |
| | 355 | if now < host_scale_time, don't use the host scale factor. |
| | 356 | |
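The rules in the list above can be sketched as a small state machine per (host, app version). This is an illustration under stated assumptions, not the proposed implementation: the class and function names are hypothetical, the initial values of `scale_factor` and `host_scale_time` are assumed, "max the scale factor with 1" is read as `max(scale_factor, 1)`, and "using" the host scale factor is assumed to mean multiplying the unscaled credit by it.

```python
from dataclasses import dataclass


@dataclass
class ScaleState:
    """Per-(host, app version) state for the cherry-picking check."""
    scale_factor: float = 1.0      # assumed initial value
    host_scale_time: float = 0.0   # earliest time scaling may be applied
    scale_probation: bool = True   # initially true, per the text


def on_job_sent(s: ScaleState, now: float, delay_bound: float) -> None:
    # While on probation, every dispatch pushes the scaling time out by X.
    if s.scale_probation:
        s.host_scale_time = now + delay_bound


def on_job_validated(s: ScaleState, now: float) -> None:
    # A validated job after host_scale_time ends probation.
    if now > s.host_scale_time:
        s.scale_probation = False


def on_job_failed(s: ScaleState, now: float, delay_bound: float) -> None:
    # Timeout or error: back on probation, scaling postponed by X.
    s.scale_probation = True
    s.scale_factor = max(s.scale_factor, 1.0)  # assumed reading of the text
    s.host_scale_time = now + delay_bound


def claimed_credit(s: ScaleState, now: float, unscaled: float) -> float:
    # Before host_scale_time the host scale factor is not used.
    if now < s.host_scale_time:
        return unscaled
    return unscaled * s.scale_factor  # assumption: scaling is multiplicative
```

With a delay bound of 100, a host with scale factor 2.5 that receives a job at time 0 claims unscaled credit until time 100. A failure at time 200 postpones scaling again until time 300.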
| | 357 | The idea is to apply the host scaling factor |
| | 358 | only if there's solid evidence that the host is NOT cherry picking. |
| | 359 | |
| | 360 | Because this mechanism is punitive to hosts |
| | 361 | that experience actual failures, |
| | 362 | we'll make it selectable on a per-application basis (default off). |
| | 363 | |
| | 364 | In addition, to limit the extent of cheating |
| | 365 | (in case the above mechanism is defeated somehow) |
| | 366 | the host scaling factor will be min'd with a |
| | 367 | project-wide config parameter (default, say, 3). |
| | 368 | |
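The project-wide limit amounts to a single clamp. In this sketch the function name `capped_scale` and the parameter name `max_host_scale` are hypothetical; the default of 3 is the example value suggested above.

```python
def capped_scale(host_scale_factor: float, max_host_scale: float = 3.0) -> float:
    """Limit the host scaling factor to the project-wide maximum."""
    return min(host_scale_factor, max_host_scale)
```

Under this cap, a host that inflated its scaling factor to 7.5 would have its credit scaled by at most 3.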